Trey t e448ec66dc docs(runbook): rewrite for OVH BHS cluster + Tier-3 observability TODOs
Brings the runbook in line with the 2026-06-03 Hetzner → OVH cutover:

- Section 1-5: topology, machines (3x OVH VPS-1 BHS), software versions,
  network/firewall, DNS, filesystem layout — all reflect the live OVH
  install instead of the historical Hetzner setup.
- Section 6: canonical install-from-clean-boxes procedure (the literal
  commands run on 2026-06-03), so anyone can stand up a backup cluster
  by following along.
- Section 9: keeps existing gotchas (vmagent NetPol, token-blown-away,
  healthy-but-empty) and adds four new ones discovered during the OVH
  build: rbac.yaml not in 03-deploy.sh, namespace label missing from api
  metrics (use service="api"), cluster-label collision when two clusters
  push concurrently, worker double-firing on cutover.
- Section 11.1: enumerates Tier-3 observability gaps surfaced while
  building the honeydue-eli5-overview dashboard (node-exporter not
  deployed, Traefik metrics off, push success counters absent, worker
  /metrics endpoint absent, cache hit rate uninstrumented, APNs latency
  uninstrumented).
- Section 12: dated audit trail of cluster changes.

Pure documentation; no code or manifest changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-03 09:34:35 -05:00


honeyDue k3s Cluster — Operations Runbook

Living document for the honeyDue production cluster. Add entries when you hit something non-obvious so future-you (or your replacement) doesn't have to rediscover it.

Last full revision: 2026-06-03 (Hetzner → OVH BHS cutover; cluster solo production from that date forward). For pre-OVH history, see MIGRATION_NOTES.md (Swarm → k3s migration on Hetzner, 2026-04-24).


1. Topology and inventory

Hosting

Provider: OVHcloud (us.ovhcloud.com)
Datacenter: BHS (Beauharnois, Quebec, Canada)
Plan: VPS-1 × 3 (~$6.46/mo each, ~$19/mo total)
Node spec: 4 vCPU (Intel Haswell, shared), 7.6 GB RAM, 75 GB NVMe
Public bandwidth: 400 Mbps per node, unlimited traffic
Private network: None. Nodes have public IPv4 + IPv6 only; inter-node traffic crosses the public internet (pod traffic is encrypted by the flannel WireGuard backend; see §3)

Nodes

| SSH alias | Kubernetes node name | Public IPv4 | Public IPv6 | Roles |
|---|---|---|---|---|
| ovhcloud1 | vps-1624d691 | 51.81.83.33 | 2604:2dc0:101:200::5a9a | control-plane, etcd, redis-pinned |
| ovhcloud2 | vps-c0f51be2 | 51.81.87.86 | 2604:2dc0:101:200::30d4 | control-plane, etcd |
| ovhcloud3 | vps-dbca24c7 | 51.81.85.248 | 2604:2dc0:101:200::450f | control-plane, etcd |

The cluster is all-control-plane (workloads schedule on the same nodes that run etcd and the API server). vps-1624d691 carries the honeydue/redis=true label so the Redis Deployment's nodeSelector binds there; the Redis PVC (local-path, host-pinned) lives on that node's disk.

SSH access

~/.ssh/config entries (operator workstation):

Host ovhcloud1
    HostName 51.81.83.33
    Port 22
    User ubuntu
    IdentityFile ~/.ssh/ovhcloud
    IdentitiesOnly yes
Host ovhcloud2
    HostName 51.81.87.86
    Port 22
    User ubuntu
    IdentityFile ~/.ssh/ovhcloud
    IdentitiesOnly yes
Host ovhcloud3
    HostName 51.81.85.248
    Port 22
    User ubuntu
    IdentityFile ~/.ssh/ovhcloud
    IdentitiesOnly yes

ubuntu has passwordless sudo (/etc/sudoers.d/90-cloud-init-users from OVH's cloud-init).

kubectl access

export KUBECONFIG=/Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go/deploy-k3s/kubeconfig
kubectl get nodes

The deploy-k3s/kubeconfig file (mode 0600, gitignored) is the OVH cluster's admin kubeconfig with server: https://51.81.83.33:6443. A stale Hetzner copy lives next to it as kubeconfig.hetzner.bak for historical reference; the Hetzner cluster is powered off and that file's API server is unreachable.

To refresh from the cluster (if the local copy is lost or rotated):

ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
  > deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig

The k3s API at :6443 is open to the public internet (token-protected).


2. Software

Kernel-level

OS: Ubuntu 26.04 LTS (set by OVH's VPS-1 image)
Kernel: 7.0.0-14-generic
Init: systemd
Container runtime: containerd 2.2.2 (bundled with k3s)
Firewall: ufw (per-node, configured at install; see §3)
Other host packages: fail2ban (SSH brute-force protection, default jail), unattended-upgrades (security updates), open-iscsi (k3s prereq for some storage backends), curl

Kubernetes

Distribution: k3s
Version: v1.34.6+k3s1 (pinned in config.yaml:cluster.k3s_version)
Control plane: 3-node HA, embedded etcd (no external Postgres backing store)
CNI / networking: flannel with WireGuard-native backend (--flannel-backend=wireguard-native). Encrypts pod-to-pod traffic between nodes, which matters because nodes only have public IPs (no private network). etcd peer traffic runs directly between the node IPs and is protected by etcd's own TLS rather than by the tunnel. ~3-5% CPU overhead under load.
Service LB: klipper-lb (default k3s servicelb). The svclb-traefik DaemonSet binds host ports :80 and :443 on each node and forwards to the Traefik Service. This is not the DaemonSet-with-hostNetwork Traefik pattern used on the old Hetzner cluster; see §10 Differences from MIGRATION_NOTES.
Ingress controller: Traefik (k3s default), single-replica Deployment, exposed via klipper-lb
DNS: CoreDNS (k3s default)
Secrets encryption: Enabled (--secrets-encryption); Secret objects are AES-CBC encrypted at rest in etcd
kubeconfig perms: 0600 (--write-kubeconfig-mode=0600)
Cloud controller: Disabled (--disable-cloud-controller); no provider integration on OVH
Misc: --node-ip / --node-external-ip / --advertise-address are all set to each node's public IPv4. TLS SANs cover all 3 IPs so any IP can serve the API.

Application stack (in cluster, honeydue namespace)

| Deployment | Replicas | Image (digest-pinned) | Notes |
|---|---|---|---|
| api | 3 | gitea.treytartt.com/admin/honeydue-api@sha256:34fde6... | Go REST API on :8000, exposes /metrics |
| web | 3 | gitea.treytartt.com/admin/honeydue-web@sha256:8c62cf... | Next.js, server-side proxy to api |
| admin | 1 | gitea.treytartt.com/admin/honeydue-admin@sha256:b81263... | Next.js admin panel, gated behind Traefik basic-auth |
| worker | 1 | gitea.treytartt.com/admin/honeydue-worker@sha256:fe1f5e... | Asynq scheduler + Redis-backed jobs (singleton: must not run as >1 replica or every cron fires N×) |
| redis | 1 | redis:7-alpine@sha256:6ab0b6... | Pinned to vps-1624d691 via honeydue/redis=true. PVC redis-data (local-path, 5 Gi). Password-auth required. |
| vmagent | 1 | victoriametrics/vmagent@sha256:... (default tag) | Scrapes api /metrics + kube-state-metrics; remote-writes to obs.88oakapps.com |
| kube-state-metrics | 1 | kube-state-metrics@sha256:... | In kube-system, scraped by vmagent for kube_* cluster-state metrics |
| alloy-logs (DaemonSet) | 3 (1/node) | grafana/alloy@sha256:... | Tails /var/log/pods/* and ships to Loki at obs.88oakapps.com |

The Asynq scheduler inside worker registers these cron jobs:

| Cron | Job | Notes |
|---|---|---|
| 0 * * * * | Smart reminder check (per-user hour) | Default user hour: 14:00 UTC |
| 0 * * * * | Daily digest check (per-user hour) | Default user hour: 03:00 UTC |
| 0 10 * * * | Onboarding emails | 10:00 UTC |
| 0 3 * * * | Reminder log cleanup | 03:00 UTC |
| 30 * * * * | Pending uploads cleanup | xx:30 every hour |

External dependencies

| Service | Endpoint | Purpose | Failure mode |
|---|---|---|---|
| Neon Postgres | ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech:5432 | App data. Pooler endpoint (transaction-mode PgBouncer in front of Neon compute) so connections stay warm. | api / worker pods crash-loop with dial tcp: connection refused. Health endpoint returns postgres: error. |
| Backblaze B2 (S3-compatible) | s3.us-east-005.backblazeb2.com (bucket honeyDueProd) | User uploads (photos, PDFs, completion attachments) | Upload routes return 5xx; reads of cached/static files still work. |
| Cloudflare | myhoneydue.com zone | DNS + TLS termination + edge cache + DDoS | Traffic stops reaching origin. Direct https://51.81.x.x still works for diagnostics. |
| obs.88oakapps.com | Operator-run Grafana + VictoriaMetrics + Loki | Metrics & logs | vmagent + alloy-logs back off and retry. No app-side impact. |
| Apple APNs | api.push.apple.com:443 (production) | iOS push notifications | Push fails; circuit breaker opens; failure logged. App functionality unaffected. |
| Fastmail SMTP | smtp.fastmail.com:587 | Transactional emails (verification, recovery, digests) | Email send fails in the worker; logged; user reset/digest flow degrades. |
| Gitea registry | gitea.treytartt.com | Container image registry | Deploys can't pull. Existing pods keep running on cached images. |

3. Network and firewall

Per-node ufw configuration

Applied during install (same on all 3 nodes):

default deny incoming
default allow outgoing
allow 22/tcp                  (SSH, world)
allow 80/tcp                  (HTTP via Cloudflare, world — see GAP-1)
allow 443/tcp                 (HTTPS, same — GAP-1)
allow 6443/tcp                (k3s API, world, token-protected)
allow 2379:2380/tcp from <other 2 OVH IPs>   (etcd client + peer)
allow 10250/tcp from <other 2 OVH IPs>       (kubelet)
allow 51820/udp from <other 2 OVH IPs>       (WireGuard tunnel)
allow 8472/udp  from <other 2 OVH IPs>       (VXLAN, defense-in-depth fallback)

To inspect: ssh ovhcloudN sudo ufw status numbered.

Cluster networking

  • Pod CIDR: 10.42.0.0/16 (default k3s)
  • Service CIDR: 10.43.0.0/16 (default k3s)
  • Flannel backend: WireGuard-native. Each node hosts a flannel-wg interface on UDP 51820 and tunnels pod traffic to peers. Verify: ssh ovhcloudN ip -d link show flannel-wg.

Traefik ingress flow

Cloudflare → node:80/443 (public)
  → klipper-lb svclb-traefik DaemonSet pod (hostPort:80/443)
  → Traefik Service (ClusterIP 10.43.245.127:80/443)
  → Traefik Deployment pod (single replica)
  → matches Ingress host rule (api.myhoneydue.com etc.)
  → routes to backend Service (api / web / admin)
  → backend Pod

The default Traefik install lives in kube-system and is managed by k3s's HelmChart. No HelmChartConfig override is applied on OVH (unlike Hetzner; see §10).


4. DNS configuration (Cloudflare)

The myhoneydue.com zone in Cloudflare has these public records. All hostnames are proxied (orange cloud) — required by the cloudflare-only Traefik middleware which 403s any non-CF source IP.

| Host | Type | Values | Proxy |
|---|---|---|---|
| api.myhoneydue.com | A × 3 | 51.81.83.33, 51.81.87.86, 51.81.85.248 | Proxied |
| app.myhoneydue.com | A × 3 | (same trio) | Proxied |
| admin.myhoneydue.com | A × 3 | (same trio) | Proxied |
| myhoneydue.com (apex @) | A × 3 | (same trio) | Proxied |

Cloudflare round-robins among the 3 origins, klipper-lb on whichever node Cloudflare hits forwards to Traefik, and Traefik routes by Host header. The net effect is per-request ingress load balancing across the 3 nodes with no central load balancer.

SSL/TLS mode: Flexible (CF terminates TLS at the edge; origin is plain HTTP on :80). Upgrading to Full (strict) is on the deferred list — would need an origin certificate provisioned to cloudflare-origin-cert secret and Traefik configured for TLS termination.


5. Filesystem layout (deploy-k3s/)

deploy-k3s/
├── config.yaml                 # Single config source (gitignored; contains tokens)
├── config.yaml.example         # Template
├── kubeconfig                  # OVH admin kubeconfig (gitignored, 0600)
├── kubeconfig.hetzner.bak      # Old Hetzner kubeconfig (unreachable, kept for history)
├── kubeconfig.tunnel           # Optional: localhost-pointing copy for SSH-tunnel use
├── secrets/
│   ├── README.md
│   ├── postgres_password.txt   # Neon DB password
│   ├── secret_key.txt          # 32+ char app-token signing secret
│   ├── email_host_password.txt # Fastmail SMTP app password
│   ├── fcm_server_key.txt      # FCM server key (currently unused — Android push disabled)
│   ├── apns_auth_key.p8        # APNs auth key (binary)
│   ├── cloudflare-origin.crt   # Origin certificate (currently unused — CF Flexible)
│   └── cloudflare-origin.key
│   (all gitignored except README.md)
├── manifests/
│   ├── namespace.yaml
│   ├── network-policies.yaml   # default-deny + per-app egress/ingress (13 NetPols total)
│   ├── rbac.yaml               # api/worker/admin/web/redis ServiceAccounts (NOT applied by 03-deploy.sh; manual once)
│   ├── pod-disruption-budgets.yaml  # api-pdb, web-pdb, worker-pdb (NOT applied by 03-deploy.sh; manual once)
│   ├── traefik-helmchartconfig.yaml # Hetzner-only DaemonSet+hostNetwork override (do NOT apply on OVH; we use default klipper-lb)
│   ├── kyverno-verify-images.yaml   # Operator-gated policy (do NOT apply blindly — see file comment)
│   ├── api/{deployment,service,hpa}.yaml
│   ├── worker/deployment.yaml
│   ├── admin/{deployment,service}.yaml
│   ├── web/{deployment,service}.yaml
│   ├── redis/{deployment,service,pvc}.yaml
│   ├── ingress/{middleware,ingress-simple}.yaml
│   ├── migrate/job.yaml        # goose migration Job (image-subbed at deploy time)
│   ├── observability/{kube-state-metrics,vmagent,alloy-logs}.yaml
│   └── kratos/                 # Ory Kratos identity service (NOT yet deployed; gated on operator OIDC setup)
└── scripts/
    ├── _config.sh              # Sourced by all scripts: cfg(), generate_env(), generate_cluster_config()
    ├── 01-provision-cluster.sh # Hetzner-Cloud-specific (uses hetzner-k3s CLI) — DO NOT RUN ON OVH
    ├── 02-setup-secrets.sh     # Creates honeydue-secrets etc. from secrets/ + config.yaml; kubeconfig-driven
    ├── 03-deploy.sh            # Build + push + apply manifests + roll deployments; kubeconfig-driven
    ├── 04-verify.sh            # Post-deploy health + security checks; kubeconfig-driven
    └── rollback.sh             # `kubectl rollout undo` across all deployments

The deploy/prod.env file (sibling to deploy-k3s/, gitignored) holds observability and admin credentials that 02-setup-secrets.sh and 03-deploy.sh read but never display:

OBS_INGEST_URL       (https://obs.88oakapps.com/api/v1/write)
OBS_TRACES_URL       (https://obs.88oakapps.com/v1/traces)
OBS_INGEST_TOKEN     (bearer token for VM + Loki + traces — all use same token)
GRAFANA_URL          (https://grafana.88oakapps.com)
GRAFANA_ADMIN_USER   (admin)
GRAFANA_ADMIN_PASSWORD
ADMIN_EMAIL / ADMIN_PASSWORD (in-app admin login)

6. Install from clean boxes — the truthful procedure

This is what we ran on 2026-06-03 to stand up the live cluster, exactly. If you ever rebuild from zero this is the canonical sequence. Total wall-clock: ~12 min for cluster bootstrap; ~10 min for workloads.

6.1 Prerequisites

  • 3 fresh Ubuntu VPS instances (any provider with public IPv4, ≥4 GB RAM, ≥40 GB disk)
  • ~/.ssh/config entries (ovhcloud1/2/3) pointing at them, with passwordless sudo
  • Local kubectl and curl
  • The repo's deploy-k3s/secrets/ populated (or the ability to copy live secrets from another running cluster — see §7.2)
  • deploy/prod.env populated with obs token + Grafana creds

6.2 Per-node OS hardening + firewall (all 3 in parallel)

For each ovhcloudN, over SSH:

export DEBIAN_FRONTEND=noninteractive
sudo apt-get update -qq
sudo apt-get install -y -qq fail2ban unattended-upgrades open-iscsi curl ufw
sudo systemctl enable --now iscsid fail2ban
sudo dpkg-reconfigure -f noninteractive -plow unattended-upgrades

sudo ufw --force reset
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 6443/tcp
SELF=$(hostname -I | awk '{print $1}')
for peer in 51.81.83.33 51.81.87.86 51.81.85.248; do
  [ "$peer" = "$SELF" ] && continue
  sudo ufw allow from "$peer" to any port 2379:2380 proto tcp
  sudo ufw allow from "$peer" to any port 10250        proto tcp
  sudo ufw allow from "$peer" to any port 51820        proto udp
  sudo ufw allow from "$peer" to any port 8472         proto udp
done
sudo ufw --force enable

Watch ordering: allow 22/tcp MUST precede ufw enable. Existing SSH sessions survive (ufw only affects new connections), but a misordered script locks you out of fresh logins.
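To review the peer-scoped rules before touching a live firewall, the loop above can be wrapped in a dry-run helper. This is a minimal sketch: the function name peer_ufw_rules is hypothetical (it is not part of deploy-k3s/scripts/), and it only prints the commands rather than running them.

```shell
# Print (do not run) the peer-scoped ufw commands for one node.
# Args: this node's public IPv4, followed by every cluster node IPv4.
# The ports are the ones listed in this runbook's ufw section.
peer_ufw_rules() {
  self="$1"; shift
  for peer in "$@"; do
    # A node never needs a rule that allows traffic from itself.
    [ "$peer" = "$self" ] && continue
    echo "ufw allow from $peer to any port 2379:2380 proto tcp"
    echo "ufw allow from $peer to any port 10250 proto tcp"
    echo "ufw allow from $peer to any port 51820 proto udp"
    echo "ufw allow from $peer to any port 8472 proto udp"
  done
}
```

For example, peer_ufw_rules 51.81.83.33 51.81.83.33 51.81.87.86 51.81.85.248 prints eight commands for ovhcloud1, four per peer. Comparing that output against sudo ufw status numbered is a quick way to spot a missing or stale peer rule.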

6.3 Install k3s on ovhcloud1 (the init node)

ssh ovhcloud1 'curl -sfL https://get.k3s.io | \
  INSTALL_K3S_VERSION=v1.34.6+k3s1 \
  sh -s - server \
    --cluster-init \
    --node-ip=51.81.83.33 \
    --node-external-ip=51.81.83.33 \
    --advertise-address=51.81.83.33 \
    --flannel-backend=wireguard-native \
    --flannel-external-ip \
    --secrets-encryption \
    --write-kubeconfig-mode=0600 \
    --tls-san=51.81.83.33 \
    --tls-san=51.81.87.86 \
    --tls-san=51.81.85.248 \
    --disable-cloud-controller'

Wait for sudo k3s kubectl get nodes to show this node Ready (~2-5 s). Read the cluster token:

ssh ovhcloud1 'sudo cat /var/lib/rancher/k3s/server/node-token'

6.4 Join ovhcloud2, then ovhcloud3 (sequential)

Joining etcd one node at a time avoids split-brain on slow networks. Replace <TOKEN> with the value from 6.3.

For ovhcloud2:

ssh ovhcloud2 'curl -sfL https://get.k3s.io | \
  INSTALL_K3S_VERSION=v1.34.6+k3s1 \
  K3S_TOKEN=<TOKEN> \
  sh -s - server \
    --server=https://51.81.83.33:6443 \
    --node-ip=51.81.87.86 \
    --node-external-ip=51.81.87.86 \
    --advertise-address=51.81.87.86 \
    --flannel-backend=wireguard-native \
    --flannel-external-ip \
    --secrets-encryption \
    --write-kubeconfig-mode=0600 \
    --tls-san=51.81.83.33 --tls-san=51.81.87.86 --tls-san=51.81.85.248 \
    --disable-cloud-controller'

Then run the identical command for ovhcloud3 with --node-ip, --node-external-ip, and --advertise-address all set to 51.81.85.248. After each join, wait for kubectl get nodes to show the new node Ready before proceeding.
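Because the join command differs between nodes only in the three per-node address flags, the flag list can be generated from the node's IP. The sketch below assumes a hypothetical helper name, k3s_join_flags; the flag values themselves are the ones used in §6.3 and §6.4.

```shell
# Print the k3s server flags for a joining node, one per line.
# Arg: the joining node's public IPv4.
# The init-node URL and the TLS SAN list are the values from this runbook.
k3s_join_flags() {
  node_ip="$1"
  printf '%s\n' \
    "--server=https://51.81.83.33:6443" \
    "--node-ip=${node_ip}" \
    "--node-external-ip=${node_ip}" \
    "--advertise-address=${node_ip}" \
    "--flannel-backend=wireguard-native" \
    "--flannel-external-ip" \
    "--secrets-encryption" \
    "--write-kubeconfig-mode=0600" \
    "--tls-san=51.81.83.33" \
    "--tls-san=51.81.87.86" \
    "--tls-san=51.81.85.248" \
    "--disable-cloud-controller"
}
```

Running k3s_join_flags 51.81.85.248 shows the full list for ovhcloud3, which makes it easy to confirm that all three address flags carry the same IP before pasting them into the installer command.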

6.5 Pull kubeconfig to the operator workstation

ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
  > deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig
kubectl get nodes -o wide       # All 3 Ready, INTERNAL-IP = public IP

6.6 Label the redis node

kubectl label node vps-1624d691 honeydue/redis=true --overwrite

(Use whichever k8s node name corresponds to ovhcloud1. The Redis Deployment's nodeSelector binds to this label.)

6.7 Bootstrap manifests NOT applied by 03-deploy.sh

These must be applied manually on a fresh cluster before running 03-deploy.sh; without rbac.yaml, pods are never created (they fail before scheduling is even attempted):

kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl apply -f deploy-k3s/manifests/pod-disruption-budgets.yaml

rbac.yaml creates the 5 ServiceAccounts (api, worker, admin, web, redis) referenced by the Deployment manifests. Without these, ReplicaSets hang on FailedCreate: error looking up service account and pods never start. Symptom on first deploy: kubectl get deploy shows 0 up-to-date across the board with no pod activity — see §9 Gotchas.
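A pre-flight check can catch the missing ServiceAccounts before the first deploy. The following is a sketch only: missing_service_accounts is a hypothetical helper, and it works on a plain list of names so it can be tried without a cluster.

```shell
# Read ServiceAccount names on stdin (one per line) and print each of the
# five accounts rbac.yaml is expected to create that is not in the list.
# Prints nothing when all five are present.
missing_service_accounts() {
  present="$(cat)"
  for sa in api worker admin web redis; do
    printf '%s\n' "$present" | grep -qxF "$sa" || echo "$sa"
  done
}
```

Against a live cluster the input could come from kubectl -n honeydue get sa -o name | sed 's|^serviceaccount/||'. Any name printed means rbac.yaml still needs to be applied.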

Do NOT apply traefik-helmchartconfig.yaml (Hetzner-only — see §10) or kyverno-verify-images.yaml (gated on operator Kyverno install).

6.8 Seed secrets

Two paths; pick whichever fits your situation:

Path A — clean install from local files (the original design):

KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/02-setup-secrets.sh

Requires deploy-k3s/secrets/ to contain real postgres_password.txt, secret_key.txt, email_host_password.txt, fcm_server_key.txt, apns_auth_key.p8, cloudflare-origin.crt, cloudflare-origin.key. The script reads config.yaml for registry.*, redis.password, admin.basic_auth_*, and storage.b2_*.

Path B — clone live secrets from another running cluster (what we actually did during the migration; useful if secrets/ is empty or you want exact-byte equivalence):

HETZNER=$(pwd)/deploy-k3s/kubeconfig.hetzner.bak   # or any kubeconfig with the secrets
OVH=$(pwd)/deploy-k3s/kubeconfig
kubectl --kubeconfig=$OVH apply -f deploy-k3s/manifests/namespace.yaml
for S in honeydue-secrets honeydue-apns-key gitea-credentials cloudflare-origin-cert admin-basic-auth; do
  kubectl --kubeconfig=$HETZNER -n honeydue get secret $S -o json \
    | python3 -c "
import json, sys
d = json.load(sys.stdin)
m = d['metadata']
for k in ('uid','resourceVersion','creationTimestamp','generation','managedFields','ownerReferences','selfLink'):
    m.pop(k, None)
m.pop('annotations', None)
print(json.dumps(d))" \
    | kubectl --kubeconfig=$OVH apply -f -
done

After either path, verify:

kubectl -n honeydue get secrets
# Expect: admin-basic-auth, cloudflare-origin-cert, gitea-credentials,
#         honeydue-apns-key, honeydue-secrets

6.9 Deploy workloads

KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig \
  ./deploy-k3s/scripts/03-deploy.sh --skip-build --tag latest
  • --skip-build skips Docker build + push, deploys whatever's already in the registry at the named tag. Use this when migrating between clusters to guarantee both run identical bits.
  • Without flags it builds the api / worker / admin / web images from the local repo HEAD and pushes to gitea.treytartt.com first.
  • The script applies (in order): namespace, network-policies (13 of them), redis, ingress, then runs the goose migration Job (blocking on success), then api / worker / admin / web Deployments, then observability (kube-state-metrics, vmagent, alloy-logs).
  • It does NOT apply: rbac.yaml, pod-disruption-budgets.yaml, traefik-helmchartconfig.yaml, kyverno-verify-images.yaml. The first two must be applied manually (see §6.7); the latter two are Hetzner-only or operator-gated.
  • It does NOT apply: anything under kratos/ (skipped until kratos-secrets exists, which requires real OIDC client IDs).

6.10 Verify

KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/04-verify.sh

Expect: all deployments READY=desired, 13 NetworkPolicies, 7 ServiceAccounts (api, worker, admin, web, redis, vmagent, alloy-logs), 3 PDBs, cloudflare-only middleware present, in-cluster /api/health/ returns 200.

External smoke test (sends the real Host header to each node IP; the api /health/ route is exempt from the cloudflare-only middleware, so direct-IP requests work for diagnostics):

for IP in 51.81.83.33 51.81.87.86 51.81.85.248; do
  curl -s -o /dev/null -w "$IP -> %{http_code}\n" \
    -H 'Host: api.myhoneydue.com' http://$IP/api/health/
done
# All three should return 200.

6.11 DNS cutover (if migrating)

In the Cloudflare dashboard for myhoneydue.com, set the 4 hostnames in §4 to the OVH IPs and keep proxied. Effective propagation ~30 s to 5 min through the Cloudflare proxy.

If you have a previous cluster, scale its worker to 0 before flipping to avoid scheduled-job double-fires:

KUBECONFIG=<previous>    kubectl -n honeydue scale deploy/worker --replicas=0
# (cut DNS)
KUBECONFIG=<new>         kubectl -n honeydue scale deploy/worker --replicas=1

Run the two scale commands back-to-back, with the DNS change in between. Worker jobs are mostly scheduled (hourly or less often), so a brief gap is harmless; an overlap would cause duplicate emails.
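The hand-off order matters more than the exact commands, so it can help to print the plan and read it before running anything. This sketch assumes a hypothetical helper, worker_handoff_plan, and adds one step the procedure above does not have: waiting for the old worker pod to terminate. The pod label app.kubernetes.io/name=worker is an assumption by analogy with the api label used elsewhere in this runbook.

```shell
# Print the worker hand-off commands in the order they must run.
# Args: kubeconfig path of the previous cluster, kubeconfig path of the new one.
worker_handoff_plan() {
  prev="$1"; new="$2"
  echo "KUBECONFIG=$prev kubectl -n honeydue scale deploy/worker --replicas=0"
  # Assumed label selector; adjust to whatever worker/deployment.yaml sets.
  echo "KUBECONFIG=$prev kubectl -n honeydue wait --for=delete pod -l app.kubernetes.io/name=worker --timeout=120s"
  echo "# update Cloudflare DNS to the new cluster's IPs (keep proxied)"
  echo "KUBECONFIG=$new kubectl -n honeydue scale deploy/worker --replicas=1"
}
```

Printing instead of executing keeps the DNS step, which is manual, visible between the two scale commands.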


7. Day-to-day operations

Common kubectl one-liners

export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig

# Cluster state
kubectl get nodes -o wide
kubectl -n honeydue get pods
kubectl -n honeydue get deploy
kubectl top nodes
kubectl -n honeydue top pods

# Tail logs
kubectl -n honeydue logs deploy/api -f --tail=50
kubectl -n honeydue logs -l app.kubernetes.io/name=api -f --tail=20
stern -n honeydue api               # if stern is installed (multi-pod)

# Restart a deployment (no image change, picks up ConfigMap changes)
kubectl -n honeydue rollout restart deploy/api

# Rollback one revision
kubectl -n honeydue rollout undo deploy/api

# Scale (worker MUST stay at 0 or 1)
kubectl -n honeydue scale deploy/api --replicas=4

# Get into a pod
kubectl -n honeydue exec -it deploy/api -- sh

Redeploy after code changes

KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/03-deploy.sh

Builds images from local HEAD, tags with the git short SHA, pushes to Gitea, runs goose up (idempotent), rolls api/worker/admin/web. Total: ~3-5 min when images change.

To deploy without rebuilding (pin to a specific tag):

./deploy-k3s/scripts/03-deploy.sh --skip-build --tag <tag>    # e.g. latest, as in §6.9

Migrations

Goose migrations live in migrations/. New file pattern:

make migrate-new name=add_foo_column     # generates migrations/YYYYMMDDHHMMSS_add_foo_column.sql
# Edit the file with -- +goose Up / -- +goose Down sections

03-deploy.sh runs a one-shot Job (manifests/migrate/job.yaml) that executes goose up against Neon (direct compute endpoint, not pooler — see file comment). The Job blocks api/worker rollout and aborts the deploy on failure. No app pod runs AutoMigrate; api/worker startup verifies goose_db_version is current and refuses to boot on mismatch.

Grafana

URL: https://grafana.88oakapps.com (creds in deploy/prod.env)

Three dashboards in the honeyDue folder:

| UID | Title | Use |
|---|---|---|
| honeydue-eli5-overview | honeyDue — Overview (ELI5) | Single-screen at-a-glance health: pods up, crashes, errors, RPS, latency, Postgres, memory, top endpoints, push failures, worker activity, recent error logs. Created 2026-06-03. |
| honeydue-red | honeyDue API — RED | Rate/Errors/Duration cuts (legacy) |
| honeydue-logs | honeyDue — Production Logs | Live log explorer |

For the ELI5 dashboard's queries, api-side metrics use service="api", NOT namespace="honeydue". vmagent's scrape config drops the namespace label from api metrics — only service, pod, node, job, plus the metric's own labels (route, method, status, etc.) survive. Queries that filter on namespace="honeydue" for api metrics silently match nothing.

kubectl tunnel (if 6443 is firewalled to your IP)

Currently 6443 is open WAN-side (matching the previous Hetzner posture). If you tighten that to operator-IPs-only and your IP changes, use an SSH tunnel:

ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \
    -i ~/.ssh/ovhcloud \
    -L 127.0.0.1:6443:127.0.0.1:6443 \
    ubuntu@51.81.83.33

cp deploy-k3s/kubeconfig deploy-k3s/kubeconfig.tunnel
sed -i.bak 's|https://51.81.83.33:6443|https://127.0.0.1:6443|' deploy-k3s/kubeconfig.tunnel
export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"

8. Disaster recovery

"I lost the kubeconfig"

ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
  > deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig

If ovhcloud1 is down but ovhcloud2 or 3 is up, swap host and IP — the TLS SAN covers all three.
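The host-and-IP swap can be parameterized so the same recovery works from any surviving node. This is a sketch under assumptions: both function names are hypothetical, and fetch_kubeconfig writes to a temporary file first so a failed SSH does not clobber a good local copy.

```shell
# Rewrite the loopback API endpoint in a k3s kubeconfig (read from stdin)
# so it points at a node's public IP. Pure text filter; no cluster access.
rewrite_kubeconfig_server() {
  ip="$1"
  sed "s|server: https://127.0.0.1:6443|server: https://${ip}:6443|"
}

# Fetch the admin kubeconfig from a node and install it locally.
# Args: SSH alias, that node's public IPv4, output path (optional).
fetch_kubeconfig() {
  alias_name="$1"; ip="$2"; out="${3:-deploy-k3s/kubeconfig}"
  tmp="${out}.tmp"
  # umask 077 makes the temp file 0600 from the moment it is created.
  ( umask 077
    ssh "$alias_name" 'sudo cat /etc/rancher/k3s/k3s.yaml' \
      | rewrite_kubeconfig_server "$ip" > "$tmp" )
  # Only replace the existing file if something was actually fetched.
  if [ -s "$tmp" ]; then
    mv "$tmp" "$out"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

With ovhcloud1 down, fetch_kubeconfig ovhcloud2 51.81.87.86 would produce a kubeconfig pointing at the second node.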

"A node is unresponsive"

kubectl drain vps-XXX --ignore-daemonsets --delete-emptydir-data
# Reboot via OVH manager or:
ssh ovhcloudN sudo reboot
# Wait for Ready, then:
kubectl uncordon vps-XXX

The cluster tolerates 1 node down (etcd quorum 2/3). With 2 down, etcd loses quorum and the API server stops accepting writes.
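The 2/3 figure follows from etcd's majority rule: a cluster of n members needs floor(n/2) + 1 of them to agree. A small sketch of that arithmetic (the function names are illustrative, not part of any tool):

```shell
# Members required for etcd quorum in a cluster of $1 members.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while quorum is still held.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For this cluster, etcd_quorum 3 gives 2 and etcd_fault_tolerance 3 gives 1. A fourth node would raise quorum to 3 without raising tolerance, which is why odd member counts are the usual choice.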

"etcd quorum lost (2+ nodes dead)"

Bring nodes back online if possible. If not:

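# The k3s service must be stopped before a cluster reset; restart it once the reset completes.
ssh ovhcloud1 'sudo systemctl stop k3s && sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<latest>'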

k3s takes automatic etcd snapshots every 12h, keeping 5. List with:

ssh ovhcloud1 sudo ls -la /var/lib/rancher/k3s/server/db/snapshots/

This is destructive — workload state since the snapshot is lost, but Neon (actual app data) is unaffected.
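Choosing the snapshot for the <latest> placeholder can be scripted by modification time. A minimal sketch, assuming a hypothetical helper latest_snapshot that runs on the node itself with enough privilege to read the snapshot directory (hence the sudo in the listing command above):

```shell
# Print the path of the most recently modified entry in a snapshot directory.
# Arg: the directory to inspect. Returns non-zero if it is empty or unreadable.
latest_snapshot() {
  dir="$1"
  # ls -t sorts newest first; snapshot names contain no newlines.
  newest="$(ls -t "$dir" 2>/dev/null | head -n 1)"
  [ -n "$newest" ] && printf '%s/%s\n' "$dir" "$newest"
}
```

It is worth eyeballing the result against the full listing before restoring, since the newest snapshot is not always the one wanted (for example, if it was taken after the incident began).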

"I have to rebuild the whole cluster from scratch"

Provision 3 fresh boxes, then exactly the sequence in §6. End-to-end is ~30 min. The dependencies that make this possible:

| Stays put through rebuild | Where |
|---|---|
| Application data | Neon Postgres (managed) |
| User uploads | Backblaze B2 (managed) |
| Container images | gitea.treytartt.com (self-hosted, but not on the OVH cluster) |
| Operator secrets | deploy-k3s/secrets/ + config.yaml + deploy/prod.env on the operator workstation (gitignored) |
| DNS | Cloudflare control panel |

If gitea.treytartt.com is on the same OVH cluster, you have a circular dependency — rebuilding requires images you can't pull until the cluster is up. Currently Gitea is NOT in the honeyDue cluster (separate Hetzner-era host), so this isn't a problem today, but worth flagging if that ever changes.

"Cutover back to Hetzner / failover to a backup cluster"

There is no warm standby today. Bringing up a second cluster is the same §6 procedure on different hardware, then a Cloudflare DNS swap. The worker-swap dance is critical:

KUBECONFIG=<current>  kubectl -n honeydue scale deploy/worker --replicas=0
# (Update Cloudflare DNS to new cluster's IPs — proxied)
KUBECONFIG=<new>      kubectl -n honeydue scale deploy/worker --replicas=1

9. Known gotchas

9.1 First-deploy "0 up-to-date" across all Deployments

Symptoms: kubectl get deploy shows READY 0/N, UP-TO-DATE 0 for api/worker/admin/web/redis. kubectl get events shows FailedCreate: error looking up service account honeydue/<name>: serviceaccount "..." not found.

Cause: rbac.yaml (ServiceAccounts) is NOT applied by 03-deploy.sh. On a fresh cluster the SAs don't exist; the ReplicaSet controller can't create pods.

Fix:

kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl -n honeydue rollout restart deploy/api deploy/worker deploy/admin deploy/web deploy/redis

This was hit during the 2026-06-03 OVH bootstrap. The permanent fix is to add kubectl apply -f rbac.yaml to 03-deploy.sh between the namespace and network-policies applies; until that lands, follow §6.7 on every fresh cluster.

9.2 vmagent SD broken on fresh deploy ("0 pods up" in Grafana)

Symptoms:

  • Grafana panels using kube_* metrics or up{job=...} show 0
  • vmagent logs: dial tcp 10.43.0.1:443: connect: connection refused every ~30 s
  • Direct test from a pod also refused

Cause: k3s's NetworkPolicy controller evaluates egress rules after kube-proxy's DNAT (not before, contrary to spec). Pod-to-kubernetes-Service (10.43.0.1:443) gets DNAT'd to <node_ip>:6443, then the policy check runs. Without an explicit egress rule for :6443, the packet is rejected.

The allow-egress-from-vmagent NetPol in network-policies.yaml includes both rules:

- to:
    - ipBlock: { cidr: 10.43.0.0/16 }
  ports:
    - { port: 443, protocol: TCP }
- to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except: [10.42.0.0/16]
  ports:
    - { port: 6443, protocol: TCP }

If this happens, confirm that network-policies.yaml was applied:

kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443

Corroborating evidence: kube-state-metrics in kube-system works fine (no NetPols in that namespace).

9.3 vmagent appears healthy but no data in Grafana

vmagent's /-/healthy returns 200 as long as the process is alive and remote-write is TCP-functional. It doesn't check that scrapes are actually succeeding. The liveness probe in vmagent.yaml queries /api/v1/targets and fails the pod if no target is up. After ~3 failures (~3 min), kubelet recycles it.

If vmagent runs for weeks but Grafana is empty, the probe was disabled or the exec command broke.
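To check by hand what the liveness probe is meant to check, the targets response can be reduced to a count of healthy targets. This sketch makes two assumptions: that the /api/v1/targets payload marks each target with a "health":"up" field in the Prometheus-compatible shape, and that a grep-based count is close enough for a spot check. The helper name is hypothetical and is not the command used in vmagent.yaml.

```shell
# Count targets reported as up in a targets JSON payload read from stdin.
# Tolerates optional whitespace after the colon. Prints 0 when none are up.
count_up_targets() {
  grep -Eo '"health"[[:space:]]*:[[:space:]]*"up"' | wc -l | tr -d ' '
}
```

The payload has to be fetched from inside the cluster or through a port-forward to the vmagent pod; a result of 0 while /-/healthy still returns 200 is the healthy-but-empty state described above.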

9.4 vmagent bearer token destroyed by direct kubectl apply

The committed vmagent.yaml has bearer_token: TOKEN_PLACEHOLDER. The real token is sed-substituted at deploy time by 03-deploy.sh. Applying the file directly:

kubectl apply -f deploy-k3s/manifests/observability/vmagent.yaml   # WRONG

overwrites the Secret with the literal TOKEN_PLACEHOLDER, and remote writes then fail with 401. To restore without a full redeploy:

OBS_TOKEN_B64=$(kubectl -n honeydue get secret honeydue-secrets \
                  -o jsonpath='{.data.OBS_INGEST_TOKEN}')
kubectl -n honeydue patch secret vmagent-remote-write --type=json \
  -p="[{\"op\":\"replace\",\"path\":\"/data/bearer_token\",\"value\":\"${OBS_TOKEN_B64}\"}]"
kubectl -n honeydue rollout restart deploy/vmagent

Or just re-run ./deploy-k3s/scripts/03-deploy.sh — the sed handles it.

9.5 Dashboard queries: api metrics need service="api" not namespace="honeydue"

vmagent's scrape config (vmagent-config ConfigMap) explicitly chooses which Kubernetes pod-metadata labels to copy onto each scraped series. Namespace isn't one of them. Labels you can use on api-side metrics:

  • service (literal "api")
  • job (literal "api")
  • pod (the api pod name)
  • node (the k8s node name)
  • cluster (vmagent external_label, currently "honeydue-k3s")
  • environment (vmagent external_label, currently "prod")
  • Plus each metric's own labels (method, route, status for HTTP; etc.)

kube_* metrics from kube-state-metrics DO carry namespace natively (KSM publishes it as a label, vmagent passes it through). Loki streams have namespace because alloy-logs explicitly relabels it. So the rule is:

| Metric prefix | Use |
|---|---|
| kube_* | namespace="honeydue" |
| http_*, gorm_*, go_*, process_* (api) | service="api" |
| Loki logs {...} | namespace="honeydue" |
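
As a rough illustration of the rule above, the helper below picks the label matcher from a metric's prefix. It is a sketch, not code from the repository: selectorFor and scoped are hypothetical names, and the metric names used with it are examples only.

```go
package main

import "strings"

// selectorFor returns the label matcher to use for a metric name,
// following the prefix table above. The second result is false when
// the prefix is not covered by the table.
func selectorFor(metric string) (string, bool) {
	if strings.HasPrefix(metric, "kube_") {
		// kube-state-metrics series carry namespace natively.
		return `namespace="honeydue"`, true
	}
	for _, p := range []string{"http_", "gorm_", "go_", "process_"} {
		if strings.HasPrefix(metric, p) {
			// api-side series have no namespace label; use service.
			return `service="api"`, true
		}
	}
	return "", false
}

// scoped appends the matching selector to a metric name, or returns
// the name unchanged when the table has no rule for it.
func scoped(metric string) string {
	sel, ok := selectorFor(metric)
	if !ok {
		return metric
	}
	return metric + "{" + sel + "}"
}
```

For example, scoped("kube_pod_info") yields kube_pod_info{namespace="honeydue"}, while an api-side series such as a hypothetical http_requests_total gets {service="api"}.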

9.6 Cluster-label collision when two clusters run together

Both Hetzner and OVH vmagents push as cluster=honeydue-k3s, environment=prod (same external_labels). During the migration overlap this made dashboards sum both clusters' data. The simplest narrowing during overlap is by node name pattern (node=~"vps-.*" for OVH, node=~"ubuntu-.*" for Hetzner). If you ever bring up a backup cluster long-term, change one cluster's external_labels.cluster to something distinct (e.g. honeydue-ovh vs. honeydue-backup).

9.7 Worker double-firing scheduled jobs

If two worker Deployments run concurrently (e.g. two clusters both pointing at the same Neon DB), Asynq schedulers each fire crons independently — users get duplicate emails. Workaround: scale all-but-one worker to 0. This is the exact mechanic used during cutovers (§6.11).

9.8 Node kubeconfig mode

/etc/rancher/k3s/k3s.yaml on each node is mode 0600 because we install with --write-kubeconfig-mode=0600. Tightening from k3s default (0644) was intentional. Don't change without coordinating — any tooling on the node that expects to read it (none today) will break.


10. Differences from MIGRATION_NOTES.md (Hetzner-era)

MIGRATION_NOTES.md documents the Swarm → k3s migration on Hetzner (2026-04-24). Most of it still applies, with these OVH-specific deltas:

| What MIGRATION_NOTES says | What OVH actually has |
|---|---|
| hetzner-k3s provisioner | Manual k3s install (§6) |
| Hetzner Load Balancer (not used) → Cloudflare round-robin | Same: Cloudflare round-robin (§4) |
| Traefik as DaemonSet + hostNetwork via HelmChartConfig | Traefik default Deployment + klipper-lb svclb DaemonSet. The traefik-helmchartconfig.yaml file is NOT applied on OVH. |
| servicelb disabled (--disable=servicelb) | servicelb enabled (we didn't pass --disable=servicelb). This is what makes klipper-lb work. |
| sysctl net.ipv4.ip_unprivileged_port_start=0 for hostNetwork Traefik | Not needed; klipper-lb proxies the port binding instead |
| UFW rules between 3 Hetzner IPs | UFW rules between 3 OVH IPs (51.81.83.33, 51.81.87.86, 51.81.85.248) |
| Kubeconfig at ~/.kube/honeydue-k3s.yaml | Kubeconfig at deploy-k3s/kubeconfig |
| TLS at origin: not configured (CF Flexible) | Same: CF Flexible. cloudflare-origin-cert Secret exists (carried over) but Ingress doesn't reference it. |

11. Outstanding follow-ups (deferred, not blocking)

  1. No warm standby / rollback cluster. OVH is solo production. An OVH outage is a real outage; recovery time = §6 procedure (~30 min). User plans to bring a second cluster up as a target.
  2. UFW allows 80/443 from world. Hetzner had a network-layer Cloudflare-IP allowlist on these ports. OVH currently relies on the L7 cloudflare-only Traefik middleware, which protects admin but NOT api / web / apex. Those routes must stay publicly reachable through Cloudflare, but with the ports open to the world anyone can hit the origin IPs directly, bypassing Cloudflare, which makes them trivially DDoSable. Fix: add ufw allow rules restricting 80/tcp and 443/tcp to Cloudflare's published IP ranges (~22 IPv4 prefixes from https://www.cloudflare.com/ips-v4/).
  3. Cloudflare TLS Flexible → Full(strict). Origin certs exist as Secret but Ingress doesn't terminate TLS. Upgrading to Full(strict) requires Traefik configured with the cert + an HTTPS entrypoint + Ingress tls: block.
  4. rbac.yaml + pod-disruption-budgets.yaml should be in 03-deploy.sh. They're currently bootstrap-only. Adding them is idempotent and prevents the §9.1 footgun.
  5. Push notification metrics are log-derived, not counters. Successes aren't logged or counted. Proper Prometheus instrumentation (~15 lines in internal/push/client.go) would give a real success/failure ratio.
  6. Worker has no /metrics endpoint. cmd/worker/main.go serves :6060 for healthz only. Adding Asynq's metrics.NewPrometheusExporter() + a ServiceMonitor + uncommenting the worker job stanza in vmagent-config ConfigMap would give real queue depth and job latency.
  7. Ory Kratos. Manifests exist (manifests/kratos/) but the deploy is gated on operator-side prerequisites (Neon kratos database, auth.myhoneydue.com DNS, real Apple+Google OIDC clients, Kratos image tag pinned). Until kratos-secrets exists, 03-deploy.sh silently skips the Kratos apply.
  8. Hetzner cluster fully retired? config.yaml nodes: block describes OVH; the backup kubeconfig is at kubeconfig.hetzner.bak. Boxes themselves are operator-managed.
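
For follow-up 2, the allowlist rules could be generated rather than typed by hand. The sketch below validates a list of CIDR prefixes and emits one ufw command per prefix. It is an assumption-laden illustration: ufwAllowRules is a hypothetical helper, the caller is assumed to have fetched the prefix list from Cloudflare's published page, and the existing world-open allow rules for 80/443 would still have to be removed separately for the allowlist to take effect.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// ufwAllowRules turns a list of CIDR prefixes into ufw commands that
// allow TCP 80 and 443 from each prefix. Blank entries are skipped;
// any malformed prefix aborts with an error so that a bad list never
// produces a partial rule set.
func ufwAllowRules(cidrs []string) ([]string, error) {
	rules := make([]string, 0, len(cidrs))
	for _, raw := range cidrs {
		c := strings.TrimSpace(raw)
		if c == "" {
			continue
		}
		_, ipnet, err := net.ParseCIDR(c)
		if err != nil {
			return nil, fmt.Errorf("invalid prefix %q: %w", c, err)
		}
		// ipnet.String() normalizes the prefix to its network address.
		rules = append(rules, fmt.Sprintf(
			"ufw allow proto tcp from %s to any port 80,443", ipnet.String()))
	}
	return rules, nil
}
```

Feeding it the lines of the published list (split on newlines) gives commands to review and run on each node. Failing on the first invalid prefix is deliberate: a half-applied allowlist is harder to reason about than none.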

11.1 Dashboard observability gaps (raised 2026-06-03 during dashboard build)

Surfaced while building the honeydue-eli5-overview Grafana dashboard. Each needs code or infra changes to expose; none blocks today's operations.

  1. node-exporter not deployed. No node-level metrics today (node_filesystem_avail_bytes, node_memory_*, node_load1, etc.). The dashboard's pod-level memory/CPU panels are app-process only — a node running out of disk would silently fail the cluster before any dashboard signal showed it. Highest-priority Tier-3 item. Fix: deploy node-exporter as a DaemonSet (~50 lines of YAML), add a scrape stanza to vmagent-config, add a Node disk free stat panel.
  2. Traefik metrics not enabled. Traefik can expose /metrics with traefik_entrypoint_requests_total + traefik_service_request_duration_seconds, giving edge-level visibility into requests that never reached api pods (404s, redirects, middleware blocks). Enable via a HelmChartConfig override that sets metrics.prometheus.entryPoint=metrics + adds a :9100 entryPoint + a scrape stanza. Skipped today to avoid Traefik restart risk; safe additive change when ready.
  3. Push notification success/failure counters (already #5). Add prometheus.NewCounterVec in internal/push/client.go with labels platform={ios,android}, outcome={success,failed,breaker_open,disabled}. Increments at every Send/SendActionable branch. Replaces the log-derived "Push failures" stat on the dashboard with a real success rate.
  4. Worker queue / job metrics (already #6). Asynq has a built-in Prometheus exporter (asynq/x/metrics). Wire it into the worker's :6060 health server (a single healthMux.Handle line) and uncomment the worker scrape stanza in vmagent-config. Surfaces queue depth, retry count, processing time per task type.
  5. Cache hit / miss rate. internal/services/cache_service.go has no counters. Add a Counter with labels {operation=get|set, result=hit|miss} around the cache wrapper. ~10 lines. Useful once real traffic flows to verify the ETag and Redis caches are paying their keep.
  6. APNs send-latency histogram. Wrap internal/push/apns.go::Send in a prometheus.NewHistogramVec keyed on outcome. Tells you when Apple's gateway is slow (which correlates with their incident page).

12. Audit trail

| Date | Change |
|---|---|
| 2026-04-24 | Initial k3s cluster on Hetzner (Swarm → k3s migration); see MIGRATION_NOTES.md |
| 2026-04-25 | config.yaml reconstructed from live ConfigMap (original file lost) |
| 2026-05-15 | Audit fixes: Redis auth required, admin basic auth, secrets-encryption flag |
| 2026-05-16 | 02-setup-secrets.sh started carrying B2 credentials (was a manifest/script drift) |
| 2026-06-02 | Kratos scaffolding committed (not deployed) |
| 2026-06-03 | Hetzner → OVH BHS cutover. New 3-node cluster on 51.81.83.33, .87.86, .85.248. DNS cut on Cloudflare. Hetzner kubeconfig moved to .bak. Grafana honeydue-eli5-overview dashboard created. Hetzner cluster powered off later same day. |
| 2026-06-03 | Dashboard build-out: extended honeydue-eli5-overview to 22 panels covering Tier-1 (HTTP status, CPU per pod, goroutines, top slow) and Tier-2 (GC, network I/O, pod uptime, top 5xx) signals. Surfaced Tier-3 instrumentation gaps in §11.1. |