Trey t e448ec66dc docs(runbook): rewrite for OVH BHS cluster + Tier-3 observability TODOs

Brings the runbook in line with the 2026-06-03 Hetzner → OVH cutover:

- Section 1-5: topology, machines (3x OVH VPS-1 BHS), software versions,
  network/firewall, DNS, filesystem layout — all reflect the live OVH
  install instead of the historical Hetzner setup.
- Section 6: canonical install-from-clean-boxes procedure (the literal
  commands run on 2026-06-03), so anyone can stand up a backup cluster
  by following along.
- Section 9: keeps existing gotchas (vmagent NetPol, token-blown-away,
  healthy-but-empty) and adds four new ones discovered during the OVH
  build: rbac.yaml not in 03-deploy.sh, namespace label missing from api
  metrics (use service="api"), cluster-label collision when two clusters
  push concurrently, worker double-firing on cutover.
- Section 11.1: enumerates Tier-3 observability gaps surfaced while
  building the honeydue-eli5-overview dashboard (node-exporter not
  deployed, Traefik metrics off, push success counters absent, worker
  /metrics endpoint absent, cache hit rate uninstrumented, APNs latency
  uninstrumented).
- Section 12: dated audit trail of cluster changes.

Pure documentation; no code or manifest changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-03 09:34:35 -05:00

honeyDue k3s Cluster — Operations Runbook

Living document for the honeyDue production cluster. Add entries when you hit something non-obvious so future-you (or your replacement) doesn't have to rediscover it.

Last full revision: 2026-06-03 (Hetzner → OVH BHS cutover; the OVH cluster has been the sole production cluster from that date forward). For pre-OVH history, see MIGRATION_NOTES.md (Swarm → k3s migration on Hetzner, 2026-04-24).

1. Topology and inventory

Hosting


Provider	OVHcloud (us.ovhcloud.com)
Datacenter	BHS — Beauharnois, Quebec, Canada
Plan	VPS-1 × 3 (~$6.46/mo each, ~$19/mo total)
Node spec	4 vCPU (Intel Haswell, shared), 7.6 GB RAM, 75 GB NVMe
Public bandwidth	400 Mbps per node, unlimited traffic
Private network	None. Nodes have public IPv4 + IPv6 only; inter-node traffic crosses the public internet (pod traffic is encrypted by the flannel WireGuard backend; see §3)

Nodes

SSH alias	Kubernetes node name	Public IPv4	Public IPv6	Roles
`ovhcloud1`	`vps-1624d691`	`51.81.83.33`	`2604:2dc0:101:200::5a9a`	control-plane, etcd, redis-pinned
`ovhcloud2`	`vps-c0f51be2`	`51.81.87.86`	`2604:2dc0:101:200::30d4`	control-plane, etcd
`ovhcloud3`	`vps-dbca24c7`	`51.81.85.248`	`2604:2dc0:101:200::450f`	control-plane, etcd

The cluster is all-control-plane (workloads schedule on the same nodes that run etcd and the API server). vps-1624d691 carries the honeydue/redis=true label so the Redis Deployment's nodeSelector binds there; the Redis PVC (local-path, host-pinned) lives on that node's disk.

SSH access

~/.ssh/config entries (operator workstation):

Host ovhcloud1
    HostName 51.81.83.33
    Port 22
    User ubuntu
    IdentityFile ~/.ssh/ovhcloud
    IdentitiesOnly yes
Host ovhcloud2
    HostName 51.81.87.86
    Port 22
    User ubuntu
    IdentityFile ~/.ssh/ovhcloud
    IdentitiesOnly yes
Host ovhcloud3
    HostName 51.81.85.248
    Port 22
    User ubuntu
    IdentityFile ~/.ssh/ovhcloud
    IdentitiesOnly yes

ubuntu has passwordless sudo (/etc/sudoers.d/90-cloud-init-users from OVH's cloud-init).

kubectl access

export KUBECONFIG=/Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go/deploy-k3s/kubeconfig
kubectl get nodes

The deploy-k3s/kubeconfig file (mode 0600, gitignored) is the OVH cluster's admin kubeconfig with server: https://51.81.83.33:6443. A stale Hetzner copy lives next to it as kubeconfig.hetzner.bak for historical reference; the Hetzner cluster is powered off and that file's API server is unreachable.

To refresh from the cluster (if the local copy is lost or rotated):

ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
  > deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig

The k3s API at :6443 is open to the public internet (token-protected).

2. Software

Kernel-level


OS	Ubuntu 26.04 LTS (set by OVH's VPS-1 image)
Kernel	`7.0.0-14-generic`
Init	systemd
Container runtime	containerd 2.2.2 (bundled with k3s)
Firewall	`ufw` (per-node, configured at install — see §3)
Other host packages	`fail2ban` (SSH brute-force protection, default jail), `unattended-upgrades` (security updates), `open-iscsi` (k3s prereq for some storage backends), `curl`

Kubernetes


Distribution	k3s
Version	`v1.34.6+k3s1` (pinned in `config.yaml:cluster.k3s_version`)
Control plane	3-node HA, embedded etcd (no external Postgres backing store)
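CNI / networking	flannel with WireGuard-native backend (`--flannel-backend=wireguard-native`). Encrypts pod-to-pod traffic, which matters because nodes only have public IPs (no private network). etcd peer traffic does not ride the flannel tunnel; it runs node-to-node on the host network and relies on k3s's own etcd TLS. ~3-5% CPU overhead under load.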
Service LB	klipper-lb (default k3s `servicelb`). The `svclb-traefik` DaemonSet binds host ports `:80` and `:443` on each node and forwards to the Traefik Service. Not the DaemonSet-w/-hostNetwork Traefik pattern used on the old Hetzner cluster — see §10 Differences from MIGRATION_NOTES.
Ingress controller	Traefik (k3s default), single-replica Deployment, exposed via klipper-lb
DNS	CoreDNS (k3s default)
Secrets encryption	Enabled (`--secrets-encryption`); Secret objects are AES-CBC encrypted at rest in etcd
kubeconfig perms	`0600` (`--write-kubeconfig-mode=0600`)
Cloud controller	Disabled (`--disable-cloud-controller`) — no provider integration on OVH
Misc	`--node-ip` / `--node-external-ip` / `--advertise-address` all set to each node's public IPv4. TLS SANs cover all 3 IPs so any IP can serve the API.

Application stack (in cluster, `honeydue` namespace)

Deployment	Replicas	Image (digest-pinned)	Notes
`api`	3	`gitea.treytartt.com/admin/honeydue-api@sha256:34fde6...`	Go REST API on `:8000`, exposes `/metrics`
`web`	3	`gitea.treytartt.com/admin/honeydue-web@sha256:8c62cf...`	Next.js, server-side proxy to api
`admin`	1	`gitea.treytartt.com/admin/honeydue-admin@sha256:b81263...`	Next.js admin panel, gated behind Traefik basic-auth
`worker`	1	`gitea.treytartt.com/admin/honeydue-worker@sha256:fe1f5e...`	Asynq scheduler + Redis-backed jobs (singleton — must not run as >1 replica or every cron fires N×)
`redis`	1	`redis:7-alpine@sha256:6ab0b6...`	Pinned to `vps-1624d691` via `honeydue/redis=true`. PVC `redis-data` (local-path, 5 Gi). Password-auth required.
`vmagent`	1	`victoriametrics/vmagent@sha256:...` (default tag)	Scrapes api `/metrics` + kube-state-metrics; remote-writes to obs.88oakapps.com
`kube-state-metrics`	1	`kube-state-metrics@sha256:...`	In `kube-system`, scraped by vmagent for `kube_*` cluster-state metrics
`alloy-logs` (DaemonSet)	3 (1/node)	`grafana/alloy@sha256:...`	Tails `/var/log/pods/*` and ships to Loki at obs.88oakapps.com

The Asynq scheduler inside worker registers these cron jobs:

Cron	Job	Notes
`0 * * * *`	Smart reminder check (per-user hour)	Default user hour: 14:00 UTC
`0 * * * *`	Daily digest check (per-user hour)	Default user hour: 03:00 UTC
`0 10 * * *`	Onboarding emails	10:00 UTC
`0 3 * * *`	Reminder log cleanup	03:00 UTC
`30 * * * *`	Pending uploads cleanup	xx:30 every hour

External dependencies

Service	Endpoint	Purpose	Failure mode
Neon Postgres	`ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech:5432`	App data. Pooler endpoint (transaction-mode PgBouncer in front of Neon compute) so connections stay warm.	api / worker pods crash-loop with `dial tcp: connection refused`. Health endpoint returns `postgres: error`.
Backblaze B2 (S3-compatible)	`s3.us-east-005.backblazeb2.com` (bucket `honeyDueProd`)	User uploads (photos, PDFs, completion attachments)	Upload routes return 5xx; reads of cached/static files still work.
Cloudflare	`myhoneydue.com` zone	DNS + TLS termination + edge cache + DDoS	Traffic stops reaching origin. Direct-IP requests over plain HTTP (`http://51.81.x.x/api/health/` with the `Host` header set) still work for diagnostics; see §6.10.
obs.88oakapps.com	Operator-run Grafana + VictoriaMetrics + Loki	Metrics & logs	vmagent + alloy-logs back off and retry. No app-side impact.
Apple APNs	`api.push.apple.com:443` (production)	iOS push notifications	Push fails; circuit breaker opens; failure logged. App functionality unaffected.
Fastmail SMTP	`smtp.fastmail.com:587`	Transactional emails (verification, recovery, digests)	Email send fails in the worker; logged; user reset/digest flow degrades.
Gitea registry	`gitea.treytartt.com`	Container image registry	Deploys can't pull. Existing pods keep running on cached images.

3. Network and firewall

Per-node `ufw` configuration

Applied during install (same on all 3 nodes):

default deny incoming
default allow outgoing
allow 22/tcp                  (SSH, world)
allow 80/tcp                  (HTTP via Cloudflare, world — see GAP-1)
allow 443/tcp                 (HTTPS, same — GAP-1)
allow 6443/tcp                (k3s API, world, token-protected)
allow 2379:2380/tcp from <other 2 OVH IPs>   (etcd client + peer)
allow 10250/tcp from <other 2 OVH IPs>       (kubelet)
allow 51820/udp from <other 2 OVH IPs>       (WireGuard tunnel)
allow 8472/udp  from <other 2 OVH IPs>       (VXLAN, defense-in-depth fallback)

To inspect: ssh ovhcloudN sudo ufw status numbered.
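
One way to confirm a node still carries every rule in the list above is to compare the expected ports against the firewall's own report. The helper below is a minimal sketch and not part of the repo: the name `missing_ufw_rules` is hypothetical, and it assumes the plain `ufw status` layout (rule lines starting with `port/proto`), not the `numbered` variant.

```shell
# Hypothetical helper: reads `ufw status` output on stdin and prints every
# expected port/proto from the rule list above that has no ALLOW line.
# It checks that a rule exists, not which source IPs it is restricted to.
missing_ufw_rules() {
  local status rule
  status="$(cat)"
  for rule in 22/tcp 80/tcp 443/tcp 6443/tcp 2379:2380/tcp 10250/tcp 51820/udp 8472/udp; do
    printf '%s\n' "$status" | grep -Eq "^${rule}[[:space:]]+ALLOW" || echo "$rule"
  done
}
```

Usage: `ssh ovhcloud1 sudo ufw status | missing_ufw_rules` prints nothing when all eight rules are present.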

Cluster networking

Pod CIDR: 10.42.0.0/16 (default k3s)
Service CIDR: 10.43.0.0/16 (default k3s)
Flannel backend: WireGuard-native. Each node hosts a flannel-wg interface on UDP 51820 and tunnels pod traffic to peers. Verify: ssh ovhcloudN ip -d link show flannel-wg.

Traefik ingress flow

Cloudflare → node:80/443 (public)
  → klipper-lb svclb-traefik DaemonSet pod (hostPort:80/443)
  → Traefik Service (ClusterIP 10.43.245.127:80/443)
  → Traefik Deployment pod (single replica)
  → matches Ingress host rule (api.myhoneydue.com etc.)
  → routes to backend Service (api / web / admin)
  → backend Pod

The default Traefik install lives in kube-system and is managed by k3s's bundled HelmChart. No HelmChartConfig override is applied on OVH (unlike Hetzner; see §10).

4. DNS configuration (Cloudflare)

The myhoneydue.com zone in Cloudflare has these public records. All hostnames are proxied (orange cloud) — required by the cloudflare-only Traefik middleware which 403s any non-CF source IP.

Host	Type	Values	Proxy
`api.myhoneydue.com`	A × 3	`51.81.83.33`, `51.81.87.86`, `51.81.85.248`	Proxied
`app.myhoneydue.com`	A × 3	(same trio)	Proxied
`admin.myhoneydue.com`	A × 3	(same trio)	Proxied
`myhoneydue.com` (apex `@`)	A × 3	(same trio)	Proxied

Cloudflare round-robins among the 3 origins; klipper-lb on whichever node CF hits forwards to Traefik, and Traefik routes by Host header. The net effect is per-request load balancing of ingress across the 3 nodes, with no central LB.

SSL/TLS mode: Flexible (CF terminates TLS at the edge; origin is plain HTTP on :80). Upgrading to Full (strict) is on the deferred list — would need an origin certificate provisioned to cloudflare-origin-cert secret and Traefik configured for TLS termination.

5. Filesystem layout (`deploy-k3s/`)

deploy-k3s/
├── config.yaml                 # Single config source (gitignored; contains tokens)
├── config.yaml.example         # Template
├── kubeconfig                  # OVH admin kubeconfig (gitignored, 0600)
├── kubeconfig.hetzner.bak      # Old Hetzner kubeconfig (unreachable, kept for history)
├── kubeconfig.tunnel           # Optional: localhost-pointing copy for SSH-tunnel use
├── secrets/
│   ├── README.md
│   ├── postgres_password.txt   # Neon DB password
│   ├── secret_key.txt          # 32+ char app-token signing secret
│   ├── email_host_password.txt # Fastmail SMTP app password
│   ├── fcm_server_key.txt      # FCM server key (currently unused — Android push disabled)
│   ├── apns_auth_key.p8        # APNs auth key (binary)
│   ├── cloudflare-origin.crt   # Origin certificate (currently unused — CF Flexible)
│   └── cloudflare-origin.key
│   (all gitignored except README.md)
├── manifests/
│   ├── namespace.yaml
│   ├── network-policies.yaml   # default-deny + per-app egress/ingress (13 NetPols total)
│   ├── rbac.yaml               # api/worker/admin/web/redis ServiceAccounts (NOT applied by 03-deploy.sh; manual once)
│   ├── pod-disruption-budgets.yaml  # api-pdb, web-pdb, worker-pdb (NOT applied by 03-deploy.sh; manual once)
│   ├── traefik-helmchartconfig.yaml # Hetzner-only DaemonSet+hostNetwork override (do NOT apply on OVH; we use default klipper-lb)
│   ├── kyverno-verify-images.yaml   # Operator-gated policy (do NOT apply blindly — see file comment)
│   ├── api/{deployment,service,hpa}.yaml
│   ├── worker/deployment.yaml
│   ├── admin/{deployment,service}.yaml
│   ├── web/{deployment,service}.yaml
│   ├── redis/{deployment,service,pvc}.yaml
│   ├── ingress/{middleware,ingress-simple}.yaml
│   ├── migrate/job.yaml        # goose migration Job (image-subbed at deploy time)
│   ├── observability/{kube-state-metrics,vmagent,alloy-logs}.yaml
│   └── kratos/                 # Ory Kratos identity service (NOT yet deployed; gated on operator OIDC setup)
└── scripts/
    ├── _config.sh              # Sourced by all scripts: cfg(), generate_env(), generate_cluster_config()
    ├── 01-provision-cluster.sh # Hetzner-Cloud-specific (uses hetzner-k3s CLI) — DO NOT RUN ON OVH
    ├── 02-setup-secrets.sh     # Creates honeydue-secrets etc. from secrets/ + config.yaml; kubeconfig-driven
    ├── 03-deploy.sh            # Build + push + apply manifests + roll deployments; kubeconfig-driven
    ├── 04-verify.sh            # Post-deploy health + security checks; kubeconfig-driven
    └── rollback.sh             # `kubectl rollout undo` across all deployments

The deploy/prod.env file (sibling to deploy-k3s/, gitignored) holds observability + admin credentials that 02-setup-secrets.sh and 03-deploy.sh read but never display:

OBS_INGEST_URL       (https://obs.88oakapps.com/api/v1/write)
OBS_TRACES_URL       (https://obs.88oakapps.com/v1/traces)
OBS_INGEST_TOKEN     (bearer token for VM + Loki + traces — all use same token)
GRAFANA_URL          (https://grafana.88oakapps.com)
GRAFANA_ADMIN_USER   (admin)
GRAFANA_ADMIN_PASSWORD
ADMIN_EMAIL / ADMIN_PASSWORD (in-app admin login)

6. Install from clean boxes — the truthful procedure

This is what we ran on 2026-06-03 to stand up the live cluster, exactly. If you ever rebuild from zero this is the canonical sequence. Total wall-clock: ~12 min for cluster bootstrap; ~10 min for workloads.

6.1 Prerequisites

3 fresh Ubuntu VPS instances (any provider with public IPv4, ≥4 GB RAM, ≥40 GB disk)
~/.ssh/config entries (ovhcloud1/2/3) pointing at them, with passwordless sudo
Local kubectl and curl
The repo's deploy-k3s/secrets/ populated (or the ability to copy live secrets from another running cluster — see §7.2)
deploy/prod.env populated with obs token + Grafana creds

6.2 Per-node OS hardening + firewall (all 3 in parallel)

For each ovhcloudN, over SSH:

export DEBIAN_FRONTEND=noninteractive
sudo apt-get update -qq
sudo apt-get install -y -qq fail2ban unattended-upgrades open-iscsi curl ufw
sudo systemctl enable --now iscsid fail2ban
sudo dpkg-reconfigure -f noninteractive -plow unattended-upgrades

sudo ufw --force reset
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 6443/tcp
SELF=$(hostname -I | awk '{print $1}')
for peer in 51.81.83.33 51.81.87.86 51.81.85.248; do
  [ "$peer" = "$SELF" ] && continue
  sudo ufw allow from "$peer" to any port 2379:2380 proto tcp
  sudo ufw allow from "$peer" to any port 10250        proto tcp
  sudo ufw allow from "$peer" to any port 51820        proto udp
  sudo ufw allow from "$peer" to any port 8472         proto udp
done
sudo ufw --force enable

Watch ordering: allow 22/tcp MUST precede ufw enable. Established SSH sessions normally survive the enable, but a misordered script locks you out of fresh logins.

6.3 Install k3s on `ovhcloud1` (the init node)

ssh ovhcloud1 'curl -sfL https://get.k3s.io | \
  INSTALL_K3S_VERSION=v1.34.6+k3s1 \
  sh -s - server \
    --cluster-init \
    --node-ip=51.81.83.33 \
    --node-external-ip=51.81.83.33 \
    --advertise-address=51.81.83.33 \
    --flannel-backend=wireguard-native \
    --flannel-external-ip \
    --secrets-encryption \
    --write-kubeconfig-mode=0600 \
    --tls-san=51.81.83.33 \
    --tls-san=51.81.87.86 \
    --tls-san=51.81.85.248 \
    --disable-cloud-controller'

Wait for sudo k3s kubectl get nodes to show this node Ready (~2-5 s). Read the cluster token:

ssh ovhcloud1 'sudo cat /var/lib/rancher/k3s/server/node-token'

6.4 Join `ovhcloud2`, then `ovhcloud3` (sequential)

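Join the servers one at a time: etcd adds each new member while the existing ones hold quorum, and concurrent joins can be rejected or leave a member half-synced, especially on slow networks. Replace <TOKEN> with the value from 6.3.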

For ovhcloud2:

ssh ovhcloud2 'curl -sfL https://get.k3s.io | \
  INSTALL_K3S_VERSION=v1.34.6+k3s1 \
  K3S_TOKEN=<TOKEN> \
  sh -s - server \
    --server=https://51.81.83.33:6443 \
    --node-ip=51.81.87.86 \
    --node-external-ip=51.81.87.86 \
    --advertise-address=51.81.87.86 \
    --flannel-backend=wireguard-native \
    --flannel-external-ip \
    --secrets-encryption \
    --write-kubeconfig-mode=0600 \
    --tls-san=51.81.83.33 --tls-san=51.81.87.86 --tls-san=51.81.85.248 \
    --disable-cloud-controller'

Then identical for ovhcloud3 with --node-ip=51.81.85.248, --node-external-ip=51.81.85.248, and --advertise-address=51.81.85.248. After each, wait for kubectl get nodes to show the new node Ready before proceeding.
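
The "wait for Ready before proceeding" step can be scripted so the next join never starts early. This is a sketch under assumptions, not something from the repo: `wait_nodes_ready` and the `KUBECTL` override variable are hypothetical names, and the 5-second poll interval is arbitrary.

```shell
# Hypothetical helper: poll until at least N nodes report STATUS "Ready".
# Usage: wait_nodes_ready <count> [timeout_seconds]
# Set KUBECTL to run the check elsewhere, e.g. before the kubeconfig is pulled:
#   KUBECTL="ssh ovhcloud1 sudo k3s kubectl" wait_nodes_ready 2
wait_nodes_ready() {
  local want="$1" timeout="${2:-180}" waited=0 ready
  while [ "$waited" -lt "$timeout" ]; do
    # Column 2 of `kubectl get nodes --no-headers` is the node STATUS.
    ready="$(${KUBECTL:-kubectl} get nodes --no-headers 2>/dev/null \
      | awk '$2 == "Ready" { n++ } END { print n + 0 }')"
    if [ "$ready" -ge "$want" ]; then
      echo "$ready node(s) Ready"
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "timed out waiting for $want Ready node(s)" >&2
  return 1
}
```

Call it with 2 after joining ovhcloud2 and with 3 after joining ovhcloud3. A cordoned node shows `Ready,SchedulingDisabled` and is deliberately not counted.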

6.5 Pull kubeconfig to the operator workstation

ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
  > deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig
kubectl get nodes -o wide       # All 3 Ready, INTERNAL-IP = public IP

6.6 Label the redis node

kubectl label node vps-1624d691 honeydue/redis=true --overwrite

(Use whichever k8s node name corresponds to ovhcloud1. The Redis Deployment's nodeSelector binds to this label.)

6.7 Bootstrap manifests NOT applied by `03-deploy.sh`

These must be applied manually on a fresh cluster, before running 03-deploy.sh, or the workload pods will never be created:

kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl apply -f deploy-k3s/manifests/pod-disruption-budgets.yaml

rbac.yaml creates the 5 ServiceAccounts (api, worker, admin, web, redis) referenced by the Deployment manifests. Without these, ReplicaSets hang on FailedCreate: error looking up service account and pods never start. Symptom on first deploy: kubectl get deploy shows 0 up-to-date across the board with no pod activity — see §9 Gotchas.

Do NOT apply traefik-helmchartconfig.yaml (Hetzner-only — see §10) or kyverno-verify-images.yaml (gated on operator Kyverno install).
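
Before running 03-deploy.sh it can help to confirm the ServiceAccounts are really there. The check below is a sketch, not something the deploy scripts do today; `missing_service_accounts` is a hypothetical name, and the account list is the five names given above.

```shell
# Hypothetical preflight: print each ServiceAccount that rbac.yaml should have
# created but that is missing from the namespace. Prints nothing when all exist.
# Usage: missing_service_accounts [namespace]   (defaults to honeydue)
missing_service_accounts() {
  local ns="${1:-honeydue}" sa
  for sa in api worker admin web redis; do
    kubectl -n "$ns" get serviceaccount "$sa" >/dev/null 2>&1 || echo "$sa"
  done
}
```

Empty output means the deploy can go ahead; any name printed means rbac.yaml still needs to be applied.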

6.8 Seed secrets

Two paths; pick whichever fits your situation:

Path A — clean install from local files (the original design):

KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/02-setup-secrets.sh

Requires deploy-k3s/secrets/ to contain real postgres_password.txt, secret_key.txt, email_host_password.txt, fcm_server_key.txt, apns_auth_key.p8, cloudflare-origin.crt, cloudflare-origin.key. The script reads config.yaml for registry.*, redis.password, admin.basic_auth_*, and storage.b2_*.

Path B — clone live secrets from another running cluster (what we actually did during the migration; useful if secrets/ is empty or you want exact-byte equivalence):

HETZNER=$(pwd)/deploy-k3s/kubeconfig.hetzner.bak   # or any kubeconfig with the secrets
OVH=$(pwd)/deploy-k3s/kubeconfig
kubectl --kubeconfig=$OVH apply -f deploy-k3s/manifests/namespace.yaml
for S in honeydue-secrets honeydue-apns-key gitea-credentials cloudflare-origin-cert admin-basic-auth; do
  kubectl --kubeconfig=$HETZNER -n honeydue get secret $S -o json \
    | python3 -c "
import json, sys
d = json.load(sys.stdin)
m = d['metadata']
for k in ('uid','resourceVersion','creationTimestamp','generation','managedFields','ownerReferences','selfLink'):
    m.pop(k, None)
m.pop('annotations', None)
print(json.dumps(d))" \
    | kubectl --kubeconfig=$OVH apply -f -
done

After either path, verify:

kubectl -n honeydue get secrets
# Expect: admin-basic-auth, cloudflare-origin-cert, gitea-credentials,
#         honeydue-apns-key, honeydue-secrets

6.9 Deploy workloads

KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig \
  ./deploy-k3s/scripts/03-deploy.sh --skip-build --tag latest

--skip-build skips Docker build + push, deploys whatever's already in the registry at the named tag. Use this when migrating between clusters to guarantee both run identical bits.
Without flags it builds the api / worker / admin / web images from the local repo HEAD and pushes to gitea.treytartt.com first.
The script applies (in order): namespace, network-policies (13 of them), redis, ingress, then runs the goose migration Job (blocking on success), then api / worker / admin / web Deployments, then observability (kube-state-metrics, vmagent, alloy-logs).
It does NOT apply: rbac.yaml, pod-disruption-budgets.yaml, traefik-helmchartconfig.yaml, kyverno-verify-images.yaml. The first two must be applied manually (see §6.7); the latter two are Hetzner-only or operator-gated.
It does NOT apply: anything under kratos/ (skipped until kratos-secrets exists, which requires real OIDC client IDs).

6.10 Verify

KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/04-verify.sh

Expect: all deployments READY=desired, 13 NetworkPolicies, 7 ServiceAccounts (api, worker, admin, web, redis, vmagent, alloy-logs), 3 PDBs, cloudflare-only middleware present, in-cluster /api/health/ returns 200.

External smoke test (bypasses DNS and Cloudflare by hitting each node IP directly with the Host header set; the api /health/ route is exempt from the cloudflare-only middleware, so direct-IP works for diagnostics):

for IP in 51.81.83.33 51.81.87.86 51.81.85.248; do
  curl -s -o /dev/null -w "$IP -> %{http_code}\n" \
    -H 'Host: api.myhoneydue.com' http://$IP/api/health/
done
# All three should return 200.
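
For use in scripts, the same loop can return a failing exit code instead of relying on someone reading the output. This is a sketch: `smoke_test` is a hypothetical name and the 10-second timeout is an arbitrary choice.

```shell
# Hypothetical wrapper around the loop above: prints one line per node and
# returns non-zero if any node answers with something other than 200.
# Usage: smoke_test <ip> [<ip> ...]
smoke_test() {
  local failed=0 ip code
  for ip in "$@"; do
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
      -H 'Host: api.myhoneydue.com' "http://${ip}/api/health/")"
    echo "$ip -> $code"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}
```

Usage: `smoke_test 51.81.83.33 51.81.87.86 51.81.85.248 || echo "at least one node is unhealthy"`.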

6.11 DNS cutover (if migrating)

In the Cloudflare dashboard for myhoneydue.com, set the 4 hostnames in §4 to the OVH IPs and keep proxied. Effective propagation ~30 s to 5 min through the Cloudflare proxy.

If you have a previous cluster, scale its worker to 0 before flipping to avoid scheduled-job double-fires:

KUBECONFIG=<previous>    kubectl -n honeydue scale deploy/worker --replicas=0
# (cut DNS)
KUBECONFIG=<new>         kubectl -n honeydue scale deploy/worker --replicas=1

Run the two scale commands back-to-back around the DNS change. Worker work is mostly scheduled (hourly+), so a brief gap is harmless; overlap would cause duplicate emails.
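
A small guard can verify that no more than one worker is running across both clusters during the swap. This is a sketch under assumptions: the function names are hypothetical, and it reads the desired replica count (`.spec.replicas`), so pods that are still terminating are not detected.

```shell
# Hypothetical guard: sum the desired worker replicas across the given
# kubeconfig files. Usage: total_worker_replicas <kubeconfig> [<kubeconfig> ...]
total_worker_replicas() {
  local total=0 kc n
  for kc in "$@"; do
    n="$(kubectl --kubeconfig="$kc" -n honeydue get deploy worker \
      -o jsonpath='{.spec.replicas}')" || return 2
    total=$((total + ${n:-0}))
  done
  echo "$total"
}

# Succeeds only if at most one worker replica is desired in total.
assert_single_worker() {
  local total
  total="$(total_worker_replicas "$@")" || return 2
  [ "$total" -le 1 ]
}
```

Usage: `assert_single_worker <previous> <new> || echo "two workers would run; crons will double-fire"`.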

7. Day-to-day operations

Common kubectl one-liners

export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig

# Cluster state
kubectl get nodes -o wide
kubectl -n honeydue get pods
kubectl -n honeydue get deploy
kubectl top nodes
kubectl -n honeydue top pods

# Tail logs
kubectl -n honeydue logs deploy/api -f --tail=50
kubectl -n honeydue logs -l app.kubernetes.io/name=api -f --tail=20
stern -n honeydue api               # if stern is installed (multi-pod)

# Restart a deployment (no image change, picks up ConfigMap changes)
kubectl -n honeydue rollout restart deploy/api

# Rollback one revision
kubectl -n honeydue rollout undo deploy/api

# Scale (worker MUST stay at 0 or 1)
kubectl -n honeydue scale deploy/api --replicas=4

# Get into a pod
kubectl -n honeydue exec -it deploy/api -- sh

Redeploy after code changes

KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/03-deploy.sh

Builds images from local HEAD, tags with the git short SHA, pushes to Gitea, runs goose up (idempotent), rolls api/worker/admin/web. Total: ~3-5 min when images change.

To deploy without rebuilding (pin to a specific tag):

./deploy-k3s/scripts/03-deploy.sh --skip-build --tag <tag>    # e.g. latest, as in §6.9

Migrations

Goose migrations live in migrations/. New file pattern:

make migrate-new name=add_foo_column     # generates migrations/YYYYMMDDHHMMSS_add_foo_column.sql
# Edit the file with -- +goose Up / -- +goose Down sections

03-deploy.sh runs a one-shot Job (manifests/migrate/job.yaml) that executes goose up against Neon (direct compute endpoint, not pooler — see file comment). The Job blocks api/worker rollout and aborts the deploy on failure. No app pod runs AutoMigrate; api/worker startup verifies goose_db_version is current and refuses to boot on mismatch.

Grafana

URL: https://grafana.88oakapps.com (creds in deploy/prod.env)

Three dashboards in the honeyDue folder:

UID	Title	Use
`honeydue-eli5-overview`	honeyDue — Overview (ELI5)	Single-screen at-a-glance health: pods up, crashes, errors, RPS, latency, Postgres, memory, top endpoints, push failures, worker activity, recent error logs. Created 2026-06-03.
`honeydue-red`	honeyDue API — RED	Rate/Errors/Duration cuts (legacy)
`honeydue-logs`	honeyDue — Production Logs	Live log explorer

For the ELI5 dashboard's queries, api-side metrics use service="api", NOT namespace="honeydue". vmagent's scrape config drops the namespace label from api metrics — only service, pod, node, job, plus the metric's own labels (route, method, status, etc.) survive. Queries that filter on namespace="honeydue" for api metrics silently match nothing.

kubectl tunnel (if 6443 is firewalled to your IP)

Currently 6443 is open WAN-side (matching the previous Hetzner posture). If you tighten that to operator-IPs-only and your IP changes, use an SSH tunnel:

ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \
    -i ~/.ssh/ovhcloud \
    -L 127.0.0.1:6443:127.0.0.1:6443 \
    ubuntu@51.81.83.33

cp deploy-k3s/kubeconfig deploy-k3s/kubeconfig.tunnel
sed -i.bak 's|https://51.81.83.33:6443|https://127.0.0.1:6443|' deploy-k3s/kubeconfig.tunnel
export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"

8. Disaster recovery

"I lost the kubeconfig"

ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
  > deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig

If ovhcloud1 is down but ovhcloud2 or 3 is up, swap host and IP — the TLS SAN covers all three.
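
The host-and-IP swap is easy to get wrong under pressure, so a parameterized version may help. This is a sketch, not a script from the repo: `rewrite_server` and `fetch_kubeconfig` are hypothetical names, and writing to a temporary file first is an added safety measure so that a failed fetch never overwrites a working kubeconfig.

```shell
# Rewrite the loopback API address in a k3s.yaml stream to a node's public IP.
rewrite_server() {
  sed "s|server: https://127.0.0.1:6443|server: https://$1:6443|"
}

# Hypothetical wrapper.
# Usage: fetch_kubeconfig <ssh-alias> <public-ip> [output-path]
fetch_kubeconfig() {
  local out="${3:-deploy-k3s/kubeconfig}" tmp
  tmp="${out}.tmp"
  # The subshell's umask keeps the temp file private from the moment it exists.
  ( umask 077; ssh "$1" 'sudo cat /etc/rancher/k3s/k3s.yaml' | rewrite_server "$2" > "$tmp" )
  # Only replace the existing kubeconfig if something was actually fetched.
  if [ -s "$tmp" ]; then
    mv "$tmp" "$out" && chmod 600 "$out"
  else
    rm -f "$tmp"
    echo "fetch from $1 produced no output; $out left untouched" >&2
    return 1
  fi
}
```

Usage when ovhcloud1 is down: `fetch_kubeconfig ovhcloud2 51.81.87.86`.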

"A node is unresponsive"

kubectl drain vps-XXX --ignore-daemonsets --delete-emptydir-data
# Reboot via OVH manager or:
ssh ovhcloudN sudo reboot
# Wait for Ready, then:
kubectl uncordon vps-XXX

The cluster tolerates 1 node down (etcd quorum 2/3). With 2 down, etcd loses quorum and the API server stops accepting writes.
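
The 2/3 figure follows from majority quorum: an n-member etcd cluster needs floor(n/2) + 1 members up. The arithmetic, as a short illustration (the function names are made up for this example):

```shell
# Majority quorum for an n-member etcd cluster: floor(n / 2) + 1.
etcd_quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that can fail while quorum still holds.
etcd_fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 3 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerates=$(etcd_fault_tolerance "$n")"
done
# members=1 quorum=1 tolerates=0
# members=3 quorum=2 tolerates=1
# members=5 quorum=3 tolerates=2
```

A fourth node would not help: quorum rises to 3, so the cluster still tolerates only one failure.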

"etcd quorum lost (2+ nodes dead)"

Bring nodes back online if possible. If not:

ssh ovhcloud1 'sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<latest>'

k3s takes automatic etcd snapshots every 12h, keeping 5. List with:

ssh ovhcloud1 sudo ls -la /var/lib/rancher/k3s/server/db/snapshots/

This is destructive — workload state since the snapshot is lost, but Neon (actual app data) is unaffected.
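
The restore command is one step of a longer sequence. The sketch below only prints a plan for review and executes nothing. The surrounding steps (stop k3s everywhere first, wipe the etcd data directory on the other servers before they rejoin) are an assumption based on the upstream k3s backup and restore documentation, not something this runbook has recorded from a real incident, so check them against the docs for the pinned k3s version before relying on them. `print_restore_plan` is a hypothetical name.

```shell
# Hypothetical dry-run helper: prints the restore sequence and runs nothing.
# Usage: print_restore_plan <snapshot-file-name>
print_restore_plan() {
  local snap node
  snap="/var/lib/rancher/k3s/server/db/snapshots/$1"
  echo "# 1. Stop k3s on every server so no stale etcd member is running"
  for node in ovhcloud1 ovhcloud2 ovhcloud3; do
    echo "ssh $node sudo systemctl stop k3s"
  done
  echo "# 2. Restore the snapshot on the first server, then start it"
  echo "ssh ovhcloud1 sudo k3s server --cluster-reset --cluster-reset-restore-path=$snap"
  echo "ssh ovhcloud1 sudo systemctl start k3s"
  echo "# 3. Wipe the old etcd data on the other servers and let them rejoin"
  for node in ovhcloud2 ovhcloud3; do
    echo "ssh $node sudo rm -rf /var/lib/rancher/k3s/server/db"
    echo "ssh $node sudo systemctl start k3s"
  done
}
```

Step 3 also deletes the local snapshot copies on ovhcloud2 and ovhcloud3 (they live under the same db/ directory), so copy anything needed off those nodes first.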

"I have to rebuild the whole cluster from scratch"

Provision 3 fresh boxes, then follow exactly the sequence in §6. End-to-end is ~30 min. The dependencies that make this possible:

Stays put through rebuild	Where
Application data	Neon Postgres (managed)
User uploads	Backblaze B2 (managed)
Container images	`gitea.treytartt.com` (self-hosted, but not on the OVH cluster)
Operator secrets	`deploy-k3s/secrets/` + `config.yaml` + `deploy/prod.env` on the operator workstation (gitignored)
DNS	Cloudflare control panel

If gitea.treytartt.com is on the same OVH cluster, you have a circular dependency — rebuilding requires images you can't pull until the cluster is up. Currently Gitea is NOT in the honeyDue cluster (separate Hetzner-era host), so this isn't a problem today, but worth flagging if that ever changes.

"Cutover back to Hetzner / failover to a backup cluster"

There is no warm standby today. Bringing up a second cluster is the same §6 procedure on different hardware, then a Cloudflare DNS swap. The worker-swap dance is critical:

KUBECONFIG=<current>  kubectl -n honeydue scale deploy/worker --replicas=0
# (Update Cloudflare DNS to new cluster's IPs — proxied)
KUBECONFIG=<new>      kubectl -n honeydue scale deploy/worker --replicas=1

9. Known gotchas

9.1 First-deploy "0 up-to-date" across all Deployments

Symptoms: kubectl get deploy shows READY 0/N, UP-TO-DATE 0 for api/worker/admin/web/redis. kubectl get events shows FailedCreate: error looking up service account honeydue/<name>: serviceaccount "..." not found.

Cause: rbac.yaml (ServiceAccounts) is NOT applied by 03-deploy.sh. On a fresh cluster the SAs don't exist; the ReplicaSet controller can't create pods.

Fix:

kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl -n honeydue rollout restart deploy/api deploy/worker deploy/admin deploy/web deploy/redis

This was hit during the 2026-06-03 OVH bootstrap. The permanent fix is to add kubectl apply -f rbac.yaml to 03-deploy.sh between the namespace and network-policies apply; until that lands, follow §6.7 on every fresh cluster.

9.2 vmagent SD broken on fresh deploy ("0 pods up" in Grafana)

Symptoms:

Grafana panels using kube_* metrics or up{job=...} show 0
vmagent logs: dial tcp 10.43.0.1:443: connect: connection refused every ~30 s
Direct test from a pod also refused

Cause: k3s's NetworkPolicy controller evaluates egress rules after kube-proxy's DNAT (not before, contrary to spec). Pod-to-kubernetes-Service (10.43.0.1:443) gets DNAT'd to <node_ip>:6443, then the policy check runs. Without an explicit egress rule for :6443, the packet is rejected.

The allow-egress-from-vmagent NetPol in network-policies.yaml includes both rules:

- to:
    - ipBlock: { cidr: 10.43.0.0/16 }
  ports:
    - { port: 443, protocol: TCP }
- to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except: [10.42.0.0/16]
  ports:
    - { port: 6443, protocol: TCP }

If this happens, confirm that network-policies.yaml was applied:

kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443

Counter-evidence that confirms diagnosis: kube-state-metrics in kube-system works fine (no NetPols in that namespace).

9.3 vmagent appears healthy but no data in Grafana

vmagent's /-/healthy returns 200 as long as the process is alive and remote-write is TCP-functional; it doesn't check that scrapes are actually succeeding. To cover that gap, the liveness probe in vmagent.yaml queries /api/v1/targets and fails the pod if no target is up. After ~3 failures (~3 min), kubelet restarts it.

If vmagent runs for weeks but Grafana is empty, the probe was disabled or the exec command broke.
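
To check by hand what the probe is meant to check, the targets JSON can be tested for at least one healthy target. This is a sketch and not the probe command from vmagent.yaml, which is not reproduced in this runbook; `any_target_up` is a hypothetical name, and the port in the usage line (8429, vmagent's default) is an assumption about the manifest.

```shell
# Hypothetical check: reads the JSON from vmagent's /api/v1/targets on stdin
# and succeeds only if at least one target reports "health":"up".
any_target_up() {
  grep -Eq '"health"[[:space:]]*:[[:space:]]*"up"'
}
```

Usage, assuming the default port: run `kubectl -n honeydue port-forward deploy/vmagent 8429:8429` in one terminal, then `curl -s http://127.0.0.1:8429/api/v1/targets | any_target_up && echo "scrapes OK"` in another.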

9.4 vmagent bearer token destroyed by direct `kubectl apply`

The committed vmagent.yaml has bearer_token: TOKEN_PLACEHOLDER. The real token is sed-substituted at deploy time by 03-deploy.sh. Applying the file directly:

kubectl apply -f deploy-k3s/manifests/observability/vmagent.yaml   # WRONG

overwrites the Secret with the literal TOKEN_PLACEHOLDER, after which remote writes fail with 401. To restore without a full redeploy:

OBS_TOKEN_B64=$(kubectl -n honeydue get secret honeydue-secrets \
                  -o jsonpath='{.data.OBS_INGEST_TOKEN}')
kubectl -n honeydue patch secret vmagent-remote-write --type=json \
  -p="[{\"op\":\"replace\",\"path\":\"/data/bearer_token\",\"value\":\"${OBS_TOKEN_B64}\"}]"
kubectl -n honeydue rollout restart deploy/vmagent

Or just re-run ./deploy-k3s/scripts/03-deploy.sh — the sed handles it.
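
A wrapper that refuses to apply any manifest still containing the placeholder would prevent this mistake. This is a sketch, not an existing script; `safe_apply` is a hypothetical name.

```shell
# Hypothetical guard: refuse to apply a manifest that still contains the
# unsubstituted TOKEN_PLACEHOLDER marker; otherwise apply it as usual.
# Usage: safe_apply <manifest.yaml>
safe_apply() {
  if grep -q 'TOKEN_PLACEHOLDER' "$1"; then
    echo "refusing to apply $1: TOKEN_PLACEHOLDER not substituted; use 03-deploy.sh" >&2
    return 1
  fi
  kubectl apply -f "$1"
}
```

`safe_apply deploy-k3s/manifests/observability/vmagent.yaml` fails fast on the committed file, while manifests without the marker are applied as usual.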

9.5 Dashboard queries: api metrics need `service="api"` not `namespace="honeydue"`

vmagent's scrape config (vmagent-config ConfigMap) explicitly chooses which Kubernetes pod-metadata labels to copy onto each scraped series. Namespace isn't one of them. Labels you can use on api-side metrics:

service (literal "api")
job (literal "api")
pod (the api pod name)
node (the k8s node name)
cluster (vmagent external_label, currently "honeydue-k3s")
environment (vmagent external_label, currently "prod")
Plus each metric's own labels (method, route, status for HTTP; etc.)

kube_* metrics from kube-state-metrics DO carry namespace natively (KSM publishes it as a label, vmagent passes it through). Loki streams have namespace because alloy-logs explicitly relabels it. So the rule is:

| Metric prefix | Use |
| --- | --- |
| `kube_*` | `namespace="honeydue"` |
| `http_`, `gorm_`, `go_`, `process_` (api) | `service="api"` |
| Loki logs `{...}` | `namespace="honeydue"` |
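The metric half of the rule can be written down as a small lookup. This is an illustrative sketch only: `selector_for` is a hypothetical helper, and `kube_pod_info` and `go_goroutines` are just example metric names.

```shell
#!/bin/sh
# Hypothetical helper: prints the label selector to use for a metric name,
# following the prefix rule in the table above.
selector_for() {
  case "$1" in
    kube_*)                        printf '%s\n' 'namespace="honeydue"' ;;
    http_*|gorm_*|go_*|process_*)  printf '%s\n' 'service="api"' ;;
    *)                             return 1 ;;  # not covered by the rule
  esac
}

# Build a full selector for two example metric names.
for m in kube_pod_info go_goroutines; do
  printf '%s{%s}\n' "$m" "$(selector_for "$m")"
done
```

The loop prints `kube_pod_info{namespace="honeydue"}` and `go_goroutines{service="api"}`.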

9.6 Cluster-label collision when two clusters run together

Both Hetzner and OVH vmagents push as cluster=honeydue-k3s, environment=prod (same external_labels). During the migration overlap this made dashboards sum both clusters' data. The simplest narrowing during overlap is by node name pattern (node=~"vps-.*" for OVH, node=~"ubuntu-.*" for Hetzner). If you ever bring up a backup cluster long-term, change one cluster's external_labels.cluster to something distinct (e.g. honeydue-ovh vs. honeydue-backup).
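During an overlap, the node-name narrowing can be kept in one place so every ad-hoc query uses the same matcher. A minimal sketch, assuming the name patterns noted above still hold; `node_matcher` is a hypothetical helper and `go_goroutines` is only an example metric.

```shell
#!/bin/sh
# Hypothetical helper: prints the node-name matcher for one cluster during
# an overlap, using the name patterns noted above.
node_matcher() {
  case "$1" in
    ovh)     printf '%s\n' 'node=~"vps-.*"' ;;
    hetzner) printf '%s\n' 'node=~"ubuntu-.*"' ;;
    *)       echo "unknown cluster: $1" >&2; return 1 ;;
  esac
}

# Example: an api-side goroutine count restricted to the OVH nodes.
printf 'sum(go_goroutines{service="api",%s})\n' "$(node_matcher ovh)"
```

This is a stopgap for the overlap window; distinct external_labels.cluster values remove the need for it.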

9.7 Worker double-firing scheduled jobs

If two worker Deployments run concurrently (e.g. two clusters both pointing at the same Neon DB), Asynq schedulers each fire crons independently — users get duplicate emails. Workaround: scale all-but-one worker to 0. This is the exact mechanic used during cutovers (§6.11).
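The workaround can be sketched as a dry-run generator that prints the scale-down command for every cluster except the one that keeps its worker. Assumptions, none confirmed by this runbook: the Deployment is named `worker` in the `honeydue` namespace, the helper name `scale_down_other_workers` is hypothetical, and the kubeconfig paths are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) the scale-down command for every
# kubeconfig except the one that should keep its worker running.
# Assumes the Deployment is named "worker" in the honeydue namespace.
scale_down_other_workers() {
  keep="$1"; shift
  for kc in "$@"; do
    if [ "$kc" != "$keep" ]; then
      printf 'kubectl --kubeconfig %s -n honeydue scale deploy/worker --replicas=0\n' "$kc"
    fi
  done
}

# Keep the first kubeconfig's worker; print the command for the other one.
scale_down_other_workers deploy-k3s/kubeconfig \
  deploy-k3s/kubeconfig deploy-k3s/kubeconfig.hetzner.bak
```

Printing rather than executing leaves a chance to confirm which cluster is about to lose its worker.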

9.8 Node kubeconfig mode

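/etc/rancher/k3s/k3s.yaml on each node is mode 0600 because we install with --write-kubeconfig-mode=0600. This matches what k3s writes by default (the widely copied 0644 is an opt-in loosening), and pinning it explicitly was intentional. Don't loosen it without coordinating. The flip side is that any non-root tooling on the node that expects to read the file (none today) will fail with permission denied.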
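A quick drift check for the file mode, as a minimal sketch. The helper name `kubeconfig_mode_ok` is hypothetical, and it relies on GNU stat's `-c` option, which is assumed to be available on the nodes.

```shell
#!/bin/sh
# Hypothetical check: succeeds if the file's permission bits are exactly 600.
# Uses GNU stat (-c); assumed to be present on the Linux nodes.
kubeconfig_mode_ok() {
  [ "$(stat -c '%a' "$1")" = "600" ]
}

# On a node:
#   kubeconfig_mode_ok /etc/rancher/k3s/k3s.yaml || echo "mode drifted from 0600"
```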

10. Differences from MIGRATION_NOTES.md (Hetzner-era)

MIGRATION_NOTES.md documents the Swarm → k3s migration on Hetzner (2026-04-24). Most of it still applies, with these OVH-specific deltas:

| What MIGRATION_NOTES says | What OVH actually has |
| --- | --- |
| `hetzner-k3s` provisioner | Manual k3s install (§6) |
| Hetzner Load Balancer (not used) → Cloudflare round-robin | Same: Cloudflare round-robin (§4) |
| Traefik as DaemonSet + hostNetwork via HelmChartConfig | Traefik default Deployment + klipper-lb svclb DaemonSet. The `traefik-helmchartconfig.yaml` file is NOT applied on OVH. |
| `servicelb` disabled (`--disable=servicelb`) | `servicelb` enabled (we didn't pass `--disable=servicelb`). This is what makes klipper-lb work. |
| sysctl `net.ipv4.ip_unprivileged_port_start=0` for hostNetwork Traefik | Not needed; klipper-lb proxies the port binding instead |
| UFW rules between 3 Hetzner IPs | UFW rules between 3 OVH IPs (51.81.83.33, 51.81.87.86, 51.81.85.248) |
| Kubeconfig at `~/.kube/honeydue-k3s.yaml` | Kubeconfig at `deploy-k3s/kubeconfig` |
| TLS at origin: not configured (CF Flexible) | Same: CF Flexible. `cloudflare-origin-cert` Secret exists (carried over) but Ingress doesn't reference it. |

11. Outstanding follow-ups (deferred, not blocking)

1. No warm standby / rollback cluster. OVH is solo production. An OVH outage is a real outage; recovery time = §6 procedure (~30 min). User plans to bring a second cluster up as a target.
2. UFW allows 80/443 from world. Hetzner had a network-layer Cloudflare-IP allowlist on these ports. OVH currently relies on the L7 cloudflare-only Traefik middleware, which protects admin but NOT api / web / apex. Those routes have to stay publicly reachable through Cloudflare, but with the origin ports open to the world an attacker can hit the node IPs directly and DDoS them, bypassing Cloudflare entirely. Fix: add ufw allow rules restricting 80/tcp and 443/tcp to Cloudflare's published IP ranges (~22 IPv4 prefixes from https://www.cloudflare.com/ips-v4/).
3. Cloudflare TLS Flexible → Full(strict). Origin certs exist as a Secret but the Ingress doesn't terminate TLS. Upgrading to Full(strict) requires Traefik configured with the cert, an HTTPS entrypoint, and an Ingress tls: block.
4. rbac.yaml + pod-disruption-budgets.yaml should be in 03-deploy.sh. They're currently bootstrap-only. Adding them is idempotent and prevents the §9.1 footgun.
5. Push notification metrics are log-derived, not counters. Successes aren't logged or counted, so there is no denominator. Proper Prometheus instrumentation (~15 lines in internal/push/client.go) would give a real success/failure ratio.
6. Worker has no /metrics endpoint. cmd/worker/main.go serves :6060 for healthz only. Registering Asynq's Prometheus collector (metrics.NewQueueMetricsCollector from asynq/x/metrics), adding a ServiceMonitor, and uncommenting the worker job stanza in the vmagent-config ConfigMap would give real queue depth and job latency.
7. Ory Kratos. Manifests exist (manifests/kratos/) but the deploy is gated on operator-side prerequisites (Neon kratos database, auth.myhoneydue.com DNS, real Apple+Google OIDC clients, Kratos image tag pinned). Until kratos-secrets exists, 03-deploy.sh silently skips the Kratos apply.
8. Hetzner cluster fully retired? config.yaml's nodes: block describes OVH; the backup kubeconfig is at kubeconfig.hetzner.bak. The boxes themselves are operator-managed.
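The UFW allowlist fix described above could be generated rather than typed by hand. A minimal sketch: `cf_ufw_rules` is a hypothetical helper that only prints rules, and the exact rule shape is an assumption to review against the nodes' existing UFW policy before applying.

```shell
#!/bin/sh
# Hypothetical generator: reads CIDR prefixes on stdin (one per line) and
# prints, without running, one ufw rule per prefix for ports 80 and 443.
cf_ufw_rules() {
  # The "|| [ -n ... ]" keeps a final line that has no trailing newline.
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    [ -n "$cidr" ] || continue   # skip blank lines
    printf 'ufw allow proto tcp from %s to any port 80,443\n' "$cidr"
  done
}

# Review the output before applying it on each node:
#   curl -fsS https://www.cloudflare.com/ips-v4/ | cf_ufw_rules
# The existing world-open 80/443 allow rules would also need removing
# afterwards, otherwise the narrower rules have no effect.
```

Generating from the published list keeps the rules reproducible when Cloudflare changes its ranges.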

11.1 Dashboard observability gaps (raised 2026-06-03 during dashboard build)

Surfaced while building the honeydue-eli5-overview Grafana dashboard. Each needs code or infra changes to expose; none blocks today's operations.

node-exporter not deployed. No node-level metrics today (node_filesystem_avail_bytes, node_memory_*, node_load1, etc.). The dashboard's pod-level memory/CPU panels are app-process only — a node running out of disk would silently fail the cluster before any dashboard signal showed it. Highest-priority Tier-3 item. Fix: deploy node-exporter as a DaemonSet (~50 lines of YAML), add a scrape stanza to vmagent-config, add a Node disk free stat panel.
Traefik metrics not enabled. Traefik can expose /metrics with traefik_entrypoint_requests_total + traefik_service_request_duration_seconds, giving edge-level visibility into requests that never reached api pods (404s, redirects, middleware blocks). Enable via a HelmChartConfig override that sets metrics.prometheus.entryPoint=metrics, adds a :9100 entryPoint, and adds a scrape stanza. Skipped today to avoid Traefik restart risk; safe additive change when ready.
Push notification success/failure counters (already §11 item 5). Add prometheus.NewCounterVec in internal/push/client.go with labels platform={ios,android}, outcome={success,failed,breaker_open,disabled}, incremented at every Send/SendActionable branch. Replaces the log-derived "Push failures" stat on the dashboard with a real success rate.
Worker queue / job metrics (already §11 item 6). Asynq has a built-in Prometheus exporter (asynq/x/metrics). Wire it into the worker's :6060 health server (a single healthMux.Handle line) and uncomment the worker scrape stanza in vmagent-config. Surfaces queue depth, retry count, processing time per task type.
Cache hit / miss rate. internal/services/cache_service.go has no counters. Add a Counter with labels {operation=get|set, result=hit|miss} around the cache wrapper. ~10 lines. Useful once real traffic flows to verify the ETag and Redis caches are paying their keep.
APNs send-latency histogram. Wrap internal/push/apns.go::Send in a prometheus.NewHistogramVec keyed on outcome. Tells you when Apple's gateway is slow (which correlates with their incident page).
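As each gap above is closed, the new endpoint can be spot-checked before any dashboard panel is built on it. A minimal sketch: `has_metric` is a hypothetical helper that scans Prometheus text exposition, and `METRICS_URL` stands in for whichever /metrics endpoint was just enabled.

```shell
#!/bin/sh
# Hypothetical helper: reads Prometheus text exposition on stdin and succeeds
# if at least one sample line for the named metric is present.
# "# HELP" / "# TYPE" comment lines do not count as samples.
has_metric() {
  grep -Eq "^$1(\{| )"
}

# After enabling an exporter (METRICS_URL is a placeholder):
#   curl -fsS "$METRICS_URL" | has_metric node_filesystem_avail_bytes \
#     || echo "metric still missing"
```

Matching on the full name followed by `{` or a space avoids false positives from metrics that merely share a prefix.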

12. Audit trail

| Date | Change |
| --- | --- |
| 2026-04-24 | Initial k3s cluster on Hetzner (Swarm → k3s migration); see MIGRATION_NOTES.md |
| 2026-04-25 | `config.yaml` reconstructed from live ConfigMap (original file lost) |
| 2026-05-15 | Audit fixes: Redis auth required, admin basic auth, secrets-encryption flag |
| 2026-05-16 | `02-setup-secrets.sh` started carrying B2 credentials (was a manifest/script drift) |
| 2026-06-02 | Kratos scaffolding committed (not deployed) |
| 2026-06-03 | Hetzner → OVH BHS cutover. New 3-node cluster on 51.81.83.33, .87.86, .85.248. DNS cut on Cloudflare. Hetzner kubeconfig moved to `.bak`. Grafana `honeydue-eli5-overview` dashboard created. Hetzner cluster powered off later same day. |
| 2026-06-03 | Dashboard build-out: extended `honeydue-eli5-overview` to 22 panels covering Tier-1 (HTTP status, CPU per pod, goroutines, top slow) and Tier-2 (GC, network I/O, pod uptime, top 5xx) signals. Surfaced Tier-3 instrumentation gaps in §11.1. |
