honeyDueAPI/deploy-k3s/scripts/03-deploy.sh
Trey t 139a990ebc
fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics
vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.
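
The failure mode can be illustrated with a toy model, offered as a sketch rather than a description of the real controller: treat each egress rule as an allowed destination port, then compare the port the pod dials (443) with the port the policy engine sees after DNAT (6443). Both rule lists are illustrative assumptions; the commit does not show the old rules, and real rules also match on destination CIDR.

```shell
#!/usr/bin/env bash
# Toy model only: real NetworkPolicy rules also match on peer CIDR/selector.

# port_allowed PORT "SPACE SEPARATED ALLOWED PORTS"
# Succeeds when PORT appears in the allowed list.
port_allowed() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# describe LABEL PORT "ALLOWED PORTS": print the verdict for one connection.
describe() {
  if port_allowed "$2" "$3"; then
    printf '%s: port %s allowed\n' "$1" "$2"
  else
    printf '%s: port %s dropped\n' "$1" "$2"
  fi
}

DIALED_PORT=443      # what the pod connects to (ClusterIP 10.43.0.1:443)
EVALUATED_PORT=6443  # what k3s evaluates after kube-proxy's DNAT

# Hypothetical rule sets, for illustration only.
OLD_EGRESS_PORTS="53 443"
NEW_EGRESS_PORTS="53 443 6443 8080"

describe "old rules, dialed port" "${DIALED_PORT}" "${OLD_EGRESS_PORTS}"
describe "old rules, evaluated port" "${EVALUATED_PORT}" "${OLD_EGRESS_PORTS}"
describe "new rules, evaluated port" "${EVALUATED_PORT}" "${NEW_EGRESS_PORTS}"
```

Under these assumed rules the dialed port looks permitted, yet the connection is dropped because only the evaluated port counts.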

Changes:

- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).

- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.

- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target — automatic recovery from future
    stale-SD incidents

- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.

- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.

- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so admin client cert can't be committed by accident.
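
The exec liveness probe described for vmagent.yaml could look roughly like the following sketch. It assumes vmagent listens on its default port 8429, that wget is available in the container image, and that /api/v1/targets returns Prometheus-style JSON with a "health" field per target. The probe actually committed in the manifest may differ.

```shell
#!/usr/bin/env bash
# Sketch of a liveness check: healthy only if at least one scrape target is up.

# has_healthy_target reads a /api/v1/targets JSON body on stdin and succeeds
# when at least one target reports "health":"up".
has_healthy_target() {
  grep -Eq '"health"[[:space:]]*:[[:space:]]*"up"'
}

# probe fetches the target list from the local vmagent and applies the check.
# TARGETS_URL is a hypothetical override; the default URL is an assumption.
probe() {
  local url="${TARGETS_URL:-http://127.0.0.1:8429/api/v1/targets}"
  wget -qO- "${url}" | has_healthy_target
}
```

If the fetch fails or no target is up, probe returns non-zero and the kubelet would restart the container. Matching the raw JSON with grep avoids depending on jq inside the image.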

Verified via kubectl --dry-run on all three modified manifests.
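
A check along these lines can be scripted. The sketch below assumes client-side validation (--dry-run=client); the commit does not say which dry-run mode was used, and the manifest paths in the usage comment are guessed from the file names listed above.

```shell
#!/usr/bin/env bash
# dry_run_manifests FILE...
# Runs a dry-run apply for each manifest so nothing is persisted. Checks every
# file and returns non-zero if any of them fails.
dry_run_manifests() {
  local file rc=0
  for file in "$@"; do
    if kubectl apply --dry-run=client -f "${file}" >/dev/null; then
      printf 'ok: %s\n' "${file}"
    else
      printf 'invalid: %s\n' "${file}" >&2
      rc=1
    fi
  done
  return "${rc}"
}

# Example (paths are assumptions):
#   dry_run_manifests manifests/network-policies.yaml \
#     manifests/observability/kube-state-metrics.yaml \
#     manifests/observability/vmagent.yaml
```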

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 00:30:11 -05:00

239 lines
9.5 KiB
Bash
Executable File

#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=_config.sh
source "${SCRIPT_DIR}/_config.sh"
REPO_DIR="$(cd "${DEPLOY_DIR}/.." && pwd)"
NAMESPACE="honeydue"
MANIFESTS="${DEPLOY_DIR}/manifests"
log() { printf '[deploy] %s\n' "$*"; }
warn() { printf '[deploy][warn] %s\n' "$*" >&2; }
die() { printf '[deploy][error] %s\n' "$*" >&2; exit 1; }
# --- Parse arguments ---
SKIP_BUILD=false
DEPLOY_TAG=""
while (( $# > 0 )); do
  case "$1" in
    --skip-build) SKIP_BUILD=true; shift ;;
    --tag)
      [[ -n "${2:-}" ]] || die "--tag requires a value"
      DEPLOY_TAG="$2"; shift 2 ;;
    -h|--help)
      cat <<'EOF'
Usage: ./scripts/03-deploy.sh [OPTIONS]

Options:
  --skip-build    Skip Docker build/push, use existing images
  --tag <tag>     Image tag (default: git short SHA)
  -h, --help      Show this help
EOF
      exit 0 ;;
    *) die "Unknown argument: $1" ;;
  esac
done
# --- Prerequisites ---
command -v kubectl >/dev/null 2>&1 || die "Missing: kubectl"
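# docker is only needed when images are built and pushed.
if [[ "${SKIP_BUILD}" == "false" ]]; then
  command -v docker >/dev/null 2>&1 || die "Missing: docker"
fi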
if [[ -z "${DEPLOY_TAG}" ]]; then
  DEPLOY_TAG="$(git -C "${REPO_DIR}" rev-parse --short HEAD 2>/dev/null || echo "latest")"
fi
# --- Read registry config ---
REGISTRY_SERVER="$(cfg_require registry.server "Container registry server")"
REGISTRY_NS="$(cfg_require registry.namespace "Registry namespace")"
REGISTRY_USER="$(cfg_require registry.username "Registry username")"
REGISTRY_TOKEN="$(cfg_require registry.token "Registry token")"
REGISTRY_PREFIX="${REGISTRY_SERVER%/}/${REGISTRY_NS#/}"
API_IMAGE="${REGISTRY_PREFIX}/honeydue-api:${DEPLOY_TAG}"
WORKER_IMAGE="${REGISTRY_PREFIX}/honeydue-worker:${DEPLOY_TAG}"
ADMIN_IMAGE="${REGISTRY_PREFIX}/honeydue-admin:${DEPLOY_TAG}"
WEB_IMAGE="${REGISTRY_PREFIX}/honeydue-web:${DEPLOY_TAG}"
# The web client lives in a sibling repo. Resolve its path once.
WEB_REPO_DIR="$(cd "${REPO_DIR}/../honeyDueAPI-Web" 2>/dev/null && pwd || echo "")"
# NEXT_PUBLIC_* is baked into client bundles at build time, so API/PostHog
# URLs must be passed as build-args — setting them at pod runtime has no
# effect on already-bundled JS.
API_DOMAIN="$(cfg_require domains.api "API domain")"
ADMIN_API_URL="https://${API_DOMAIN}"
WEB_API_URL="https://${API_DOMAIN}/api"
# PostHog keys for the web client are optional — read from operator shell
# env so they never land in a committed file. Empty disables analytics.
: "${NEXT_PUBLIC_POSTHOG_KEY:=}"
: "${NEXT_PUBLIC_POSTHOG_HOST:=}"
# --- Build and push ---
if [[ "${SKIP_BUILD}" == "false" ]]; then
  log "Logging in to ${REGISTRY_SERVER}..."
  printf '%s' "${REGISTRY_TOKEN}" | docker login "${REGISTRY_SERVER}" -u "${REGISTRY_USER}" --password-stdin >/dev/null

  # k3s nodes are linux/amd64 (Hetzner CX). Force the build platform so
  # local arm64 Macs don't push images that crash with "exec format error".
  BUILD_PLATFORM="linux/amd64"

  log "Building API image: ${API_IMAGE} (${BUILD_PLATFORM})"
  docker build --platform "${BUILD_PLATFORM}" --target api -t "${API_IMAGE}" "${REPO_DIR}"

  log "Building Worker image: ${WORKER_IMAGE} (${BUILD_PLATFORM})"
  docker build --platform "${BUILD_PLATFORM}" --target worker -t "${WORKER_IMAGE}" "${REPO_DIR}"

  log "Building Admin image: ${ADMIN_IMAGE} (${BUILD_PLATFORM}, NEXT_PUBLIC_API_URL=${ADMIN_API_URL})"
  docker build --platform "${BUILD_PLATFORM}" --target admin \
    --build-arg "NEXT_PUBLIC_API_URL=${ADMIN_API_URL}" \
    -t "${ADMIN_IMAGE}" "${REPO_DIR}"

  if [[ -n "${WEB_REPO_DIR}" && -f "${WEB_REPO_DIR}/Dockerfile" ]]; then
    log "Building Web image: ${WEB_IMAGE} (${BUILD_PLATFORM}, NEXT_PUBLIC_API_URL=${WEB_API_URL})"
    docker build --platform "${BUILD_PLATFORM}" \
      --build-arg "NEXT_PUBLIC_API_URL=${WEB_API_URL}" \
      --build-arg "NEXT_PUBLIC_POSTHOG_KEY=${NEXT_PUBLIC_POSTHOG_KEY}" \
      --build-arg "NEXT_PUBLIC_POSTHOG_HOST=${NEXT_PUBLIC_POSTHOG_HOST}" \
      -t "${WEB_IMAGE}" "${WEB_REPO_DIR}"
  else
    warn "honeyDueAPI-Web sibling repo not found at ${WEB_REPO_DIR:-<unset>}; skipping web build"
  fi

  log "Pushing images..."
  docker push "${API_IMAGE}"
  docker push "${WORKER_IMAGE}"
  docker push "${ADMIN_IMAGE}"
  [[ -n "${WEB_REPO_DIR}" && -f "${WEB_REPO_DIR}/Dockerfile" ]] && docker push "${WEB_IMAGE}"

  # Also tag and push :latest
  docker tag "${API_IMAGE}" "${REGISTRY_PREFIX}/honeydue-api:latest"
  docker tag "${WORKER_IMAGE}" "${REGISTRY_PREFIX}/honeydue-worker:latest"
  docker tag "${ADMIN_IMAGE}" "${REGISTRY_PREFIX}/honeydue-admin:latest"
  docker push "${REGISTRY_PREFIX}/honeydue-api:latest"
  docker push "${REGISTRY_PREFIX}/honeydue-worker:latest"
  docker push "${REGISTRY_PREFIX}/honeydue-admin:latest"
  if [[ -n "${WEB_REPO_DIR}" && -f "${WEB_REPO_DIR}/Dockerfile" ]]; then
    docker tag "${WEB_IMAGE}" "${REGISTRY_PREFIX}/honeydue-web:latest"
    docker push "${REGISTRY_PREFIX}/honeydue-web:latest"
  fi
else
  warn "Skipping build. Using images for tag: ${DEPLOY_TAG}"
fi
# --- Generate and apply ConfigMap from config.yaml ---
log "Generating env from config.yaml..."
ENV_FILE="$(mktemp)"
trap 'rm -f "${ENV_FILE}"' EXIT
generate_env > "${ENV_FILE}"
log "Creating ConfigMap..."
kubectl create configmap honeydue-config \
  --namespace="${NAMESPACE}" \
  --from-env-file="${ENV_FILE}" \
  --dry-run=client -o yaml | kubectl apply -f -
# --- Apply manifests ---
log "Applying manifests..."
kubectl apply -f "${MANIFESTS}/namespace.yaml"
# NetworkPolicies first — default-deny-all + per-app allow rules.
# These MUST be applied; without them the cluster falls back to default-allow
# (worse posture) AND the vmagent egress rule for :6443 (which fixes a k3s
# post-DNAT enforcement quirk for k8s API discovery) is missing.
# See deploy-k3s/RUNBOOK.md ("vmagent SD broken on fresh deploy").
kubectl apply -f "${MANIFESTS}/network-policies.yaml"
kubectl apply -f "${MANIFESTS}/redis/"
kubectl apply -f "${MANIFESTS}/ingress/"
# --- Run migrations BEFORE rolling api/worker ---
#
# goose-based migration Job. We delete any prior Job (Jobs are immutable —
# applying a duplicate name otherwise fails), apply a fresh one with the new
# api image (which includes /usr/local/bin/goose and /app/migrations), and
# block until it succeeds. A failure aborts the deploy before any new app
# pod sees a stale schema.
log "Running database migrations (goose Job)..."
kubectl delete job honeydue-migrate -n "${NAMESPACE}" --ignore-not-found --wait=true >/dev/null
sed "s|image: IMAGE_PLACEHOLDER|image: ${API_IMAGE}|" "${MANIFESTS}/migrate/job.yaml" | kubectl apply -f -
if ! kubectl wait --namespace="${NAMESPACE}" --for=condition=complete --timeout=10m job/honeydue-migrate; then
  warn "migration Job did not complete (failed or timed out); see logs:"
  kubectl logs -n "${NAMESPACE}" job/honeydue-migrate --tail=200 || true
  die "migrations did not complete cleanly; aborting deploy"
fi
log "Migrations applied; proceeding with api/worker rollout"
# Apply deployments with image substitution
sed "s|image: IMAGE_PLACEHOLDER|image: ${API_IMAGE}|" "${MANIFESTS}/api/deployment.yaml" | kubectl apply -f -
kubectl apply -f "${MANIFESTS}/api/service.yaml"
kubectl apply -f "${MANIFESTS}/api/hpa.yaml"
sed "s|image: IMAGE_PLACEHOLDER|image: ${WORKER_IMAGE}|" "${MANIFESTS}/worker/deployment.yaml" | kubectl apply -f -
sed "s|image: IMAGE_PLACEHOLDER|image: ${ADMIN_IMAGE}|" "${MANIFESTS}/admin/deployment.yaml" | kubectl apply -f -
kubectl apply -f "${MANIFESTS}/admin/service.yaml"
if [[ -d "${MANIFESTS}/web" ]]; then
  sed "s|image: IMAGE_PLACEHOLDER|image: ${WEB_IMAGE}|" "${MANIFESTS}/web/deployment.yaml" | kubectl apply -f -
  kubectl apply -f "${MANIFESTS}/web/service.yaml"
fi
# Observability — vmagent scrapes api Pods :8000/metrics + kube-state-metrics
# :8080/metrics and remote-writes everything to obs.88oakapps.com. The bearer
# token comes from deploy/prod.env so it stays out of the repo; the manifest
# holds TOKEN_PLACEHOLDER. kube-state-metrics provides the kube_* metrics
# Grafana panels need to count pods, deployments, etc.
if [[ -d "${MANIFESTS}/observability" ]]; then
  # kube-state-metrics: no secrets, plain apply
  kubectl apply -f "${MANIFESTS}/observability/kube-state-metrics.yaml"

  # vmagent: needs the bearer-token substitution.
  # prod.env lives at the repo's deploy/ dir (sibling of deploy-k3s/), not
  # under deploy-k3s/. It's gitignored; the operator copies values there once.
  OBS_TOKEN="$(grep -E '^OBS_INGEST_TOKEN=' "${REPO_DIR}/deploy/prod.env" 2>/dev/null | cut -d= -f2- || true)"
  if [[ -z "${OBS_TOKEN}" ]]; then
    warn "OBS_INGEST_TOKEN not found in deploy/prod.env; skipping vmagent apply"
  else
    # Escape \, & and | so sed inserts the token literally: & means "the
    # matched text" in a replacement, and | is the delimiter used below.
    OBS_TOKEN_ESCAPED="$(printf '%s' "${OBS_TOKEN}" | sed -e 's/[\\&|]/\\&/g')"
    sed "s|TOKEN_PLACEHOLDER|${OBS_TOKEN_ESCAPED}|" "${MANIFESTS}/observability/vmagent.yaml" | kubectl apply -f -
  fi
fi
# --- Wait for rollouts ---
log "Waiting for rollouts..."
kubectl rollout status deployment/redis -n "${NAMESPACE}" --timeout=120s
kubectl rollout status deployment/api -n "${NAMESPACE}" --timeout=300s
kubectl rollout status deployment/worker -n "${NAMESPACE}" --timeout=300s
kubectl rollout status deployment/admin -n "${NAMESPACE}" --timeout=300s
if [[ -d "${MANIFESTS}/web" ]]; then
  kubectl rollout status deployment/web -n "${NAMESPACE}" --timeout=300s
fi
if kubectl -n "${NAMESPACE}" get deployment vmagent >/dev/null 2>&1; then
  kubectl rollout status deployment/vmagent -n "${NAMESPACE}" --timeout=120s
fi
# --- Done ---
log ""
log "Deploy completed successfully."
log "Tag: ${DEPLOY_TAG}"
log "Images:"
log " API: ${API_IMAGE}"
log " Worker: ${WORKER_IMAGE}"
log " Admin: ${ADMIN_IMAGE}"
[[ -d "${MANIFESTS}/web" ]] && log " Web: ${WEB_IMAGE}"
log ""
log "Run ./scripts/04-verify.sh to check cluster health."
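
One caveat with the migration step above: `kubectl wait --for=condition=complete` returns early only on success, so a Job that ends in the Failed condition keeps the script blocked until the 10-minute timeout expires. The sketch below is one possible alternative and is not part of the script. It polls the Job's conditions and returns as soon as either Complete or Failed is true. The function name, polling interval, and return codes are assumptions.

```shell
#!/usr/bin/env bash
# wait_for_job NAMESPACE JOB [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 when the Job completes, 1 when it fails, 2 on timeout.
# The timeout is approximate: it counts sleep time, not kubectl call time.
wait_for_job() {
  local ns="$1" job="$2" timeout_s="${3:-600}" interval_s="${4:-5}"
  local waited=0 conditions
  while [ "${waited}" -lt "${timeout_s}" ]; do
    # Types of all conditions whose status is True, e.g. "Complete" or "Failed".
    conditions="$(kubectl get job "${job}" -n "${ns}" \
      -o jsonpath='{.status.conditions[?(@.status=="True")].type}' 2>/dev/null || true)"
    case " ${conditions} " in
      *" Complete "*) return 0 ;;
      *" Failed "*) return 1 ;;
    esac
    sleep "${interval_s}"
    waited=$((waited + interval_s))
  done
  return 2
}
```

In the deploy script this could stand in for the `kubectl wait` condition, for example `if ! wait_for_job "${NAMESPACE}" honeydue-migrate 600; then ...`, so that a failed migration aborts the deploy within one polling interval instead of after the full timeout.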