honeyDue k3s Cluster — Operations Runbook
Living document for the honeyDue production cluster. Add entries when you hit something non-obvious so future-you (or your replacement) doesn't have to rediscover it.
Last full revision: 2026-06-03 (Hetzner → OVH BHS cutover; OVH has been the
sole production cluster from that date forward). For pre-OVH history, see
MIGRATION_NOTES.md (Swarm → k3s migration on Hetzner, 2026-04-24).
1. Topology and inventory
Hosting
| Provider | OVHcloud (us.ovhcloud.com) |
| Datacenter | BHS — Beauharnois, Quebec, Canada |
| Plan | VPS-1 × 3 (~$6.46/mo each, ~$19/mo total) |
| Node spec | 4 vCPU (Intel Haswell, shared), 7.6 GB RAM, 75 GB NVMe |
| Public bandwidth | 400 Mbps per node, unlimited traffic |
| Private network | None. Nodes have public IPv4 + IPv6 only; inter-node traffic crosses the public internet (encrypted by flannel WireGuard backend — see §3) |
Nodes
| SSH alias | Kubernetes node name | Public IPv4 | Public IPv6 | Roles |
|---|---|---|---|---|
| ovhcloud1 | vps-1624d691 | 51.81.83.33 | 2604:2dc0:101:200::5a9a | control-plane, etcd, redis-pinned |
| ovhcloud2 | vps-c0f51be2 | 51.81.87.86 | 2604:2dc0:101:200::30d4 | control-plane, etcd |
| ovhcloud3 | vps-dbca24c7 | 51.81.85.248 | 2604:2dc0:101:200::450f | control-plane, etcd |
The cluster is all-control-plane (workloads schedule on the same nodes that
run etcd and the API server). vps-1624d691 carries the
honeydue/redis=true label so the Redis Deployment's nodeSelector binds
there; the Redis PVC (local-path, host-pinned) lives on that node's disk.
SSH access
~/.ssh/config entries (operator workstation):
Host ovhcloud1
HostName 51.81.83.33
Port 22
User ubuntu
IdentityFile ~/.ssh/ovhcloud
IdentitiesOnly yes
Host ovhcloud2
HostName 51.81.87.86
Port 22
User ubuntu
IdentityFile ~/.ssh/ovhcloud
IdentitiesOnly yes
Host ovhcloud3
HostName 51.81.85.248
Port 22
User ubuntu
IdentityFile ~/.ssh/ovhcloud
IdentitiesOnly yes
ubuntu has passwordless sudo (/etc/sudoers.d/90-cloud-init-users from OVH's
cloud-init).
kubectl access
export KUBECONFIG=/Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go/deploy-k3s/kubeconfig
kubectl get nodes
The deploy-k3s/kubeconfig file (mode 0600, gitignored) is the OVH cluster's
admin kubeconfig with server: https://51.81.83.33:6443. A stale Hetzner copy
lives next to it as kubeconfig.hetzner.bak for historical reference; the
Hetzner cluster is powered off and that file's API server is unreachable.
To refresh from the cluster (if the local copy is lost or rotated):
ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
| sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
> deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
The k3s API at :6443 is open to the public internet (token-protected).
2. Software
Kernel-level
| OS | Ubuntu 26.04 LTS (set by OVH's VPS-1 image) |
| Kernel | 7.0.0-14-generic |
| Init | systemd |
| Container runtime | containerd 2.2.2 (bundled with k3s) |
| Firewall | ufw (per-node, configured at install — see §3) |
| Other host packages | fail2ban (SSH brute-force protection, default jail), unattended-upgrades (security updates), open-iscsi (k3s prereq for some storage backends), curl |
Kubernetes
| Distribution | k3s |
| Version | v1.34.6+k3s1 (pinned in config.yaml:cluster.k3s_version) |
| Control plane | 3-node HA, embedded etcd (no external Postgres backing store) |
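| CNI / networking | flannel with WireGuard-native backend (--flannel-backend=wireguard-native). Encrypts cross-node pod-to-pod traffic, which would otherwise cross the public internet in the clear because nodes only have public IPs (no private network). etcd peer traffic runs node-to-node on 2379-2380 outside the tunnel and is protected by k3s's own TLS. ~3-5% CPU overhead under load. |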
| Service LB | klipper-lb (default k3s servicelb). The svclb-traefik DaemonSet binds host ports :80 and :443 on each node and forwards to the Traefik Service. Not the DaemonSet-w/-hostNetwork Traefik pattern used on the old Hetzner cluster — see §10 Differences from MIGRATION_NOTES. |
| Ingress controller | Traefik (k3s default), single-replica Deployment, exposed via klipper-lb |
| DNS | CoreDNS (k3s default) |
| Secrets encryption | Enabled (--secrets-encryption); etcd values are AES-CBC encrypted at rest |
| kubeconfig perms | 0600 (--write-kubeconfig-mode=0600) |
| Cloud controller | Disabled (--disable-cloud-controller) — no provider integration on OVH |
| Misc | --node-ip / --node-external-ip / --advertise-address all set to each node's public IPv4. TLS SANs cover all 3 IPs so any IP can serve the API. |
Application stack (in cluster, honeydue namespace)
| Deployment | Replicas | Image (digest-pinned) | Notes |
|---|---|---|---|
| api | 3 | gitea.treytartt.com/admin/honeydue-api@sha256:34fde6... | Go REST API on :8000, exposes /metrics |
| web | 3 | gitea.treytartt.com/admin/honeydue-web@sha256:8c62cf... | Next.js, server-side proxy to api |
| admin | 1 | gitea.treytartt.com/admin/honeydue-admin@sha256:b81263... | Next.js admin panel, gated behind Traefik basic-auth |
| worker | 1 | gitea.treytartt.com/admin/honeydue-worker@sha256:fe1f5e... | Asynq scheduler + Redis-backed jobs (singleton: must not run as >1 replica or every cron fires N×) |
| redis | 1 | redis:7-alpine@sha256:6ab0b6... | Pinned to vps-1624d691 via honeydue/redis=true. PVC redis-data (local-path, 5 Gi). Password-auth required. |
| vmagent | 1 | victoriametrics/vmagent@sha256:... (default tag) | Scrapes api /metrics + kube-state-metrics; remote-writes to obs.88oakapps.com |
| kube-state-metrics | 1 | kube-state-metrics@sha256:... | In kube-system, scraped by vmagent for kube_* cluster-state metrics |
| alloy-logs (DaemonSet) | 3 (1/node) | grafana/alloy@sha256:... | Tails /var/log/pods/* and ships to Loki at obs.88oakapps.com |
The Asynq scheduler inside worker registers these cron jobs:
| Cron | Job | Notes |
|---|---|---|
| 0 * * * * | Smart reminder check (per-user hour) | Default user hour: 14:00 UTC |
| 0 * * * * | Daily digest check (per-user hour) | Default user hour: 03:00 UTC |
| 0 10 * * * | Onboarding emails | 10:00 UTC |
| 0 3 * * * | Reminder log cleanup | 03:00 UTC |
| 30 * * * * | Pending uploads cleanup | xx:30 every hour |
External dependencies
| Service | Endpoint | Purpose | Failure mode |
|---|---|---|---|
| Neon Postgres | ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech:5432 | App data. Pooler endpoint (transaction-mode PgBouncer in front of Neon compute) so connections stay warm. | api / worker pods crash-loop with dial tcp: connection refused. Health endpoint returns postgres: error. |
| Backblaze B2 (S3-compatible) | s3.us-east-005.backblazeb2.com (bucket honeyDueProd) | User uploads (photos, PDFs, completion attachments) | Upload routes return 5xx; reads of cached/static files still work. |
| Cloudflare | myhoneydue.com zone | DNS + TLS termination + edge cache + DDoS | Traffic stops reaching origin. Direct https://51.81.x.x still works for diagnostics. |
| obs.88oakapps.com | Operator-run Grafana + VictoriaMetrics + Loki | Metrics & logs | vmagent + alloy-logs back off and retry. No app-side impact. |
| Apple APNs | api.push.apple.com:443 (production) | iOS push notifications | Push fails; circuit breaker opens; failure logged. App functionality unaffected. |
| Fastmail SMTP | smtp.fastmail.com:587 | Transactional emails (verification, recovery, digests) | Email send fails in the worker; logged; user reset/digest flow degrades. |
| Gitea registry | gitea.treytartt.com | Container image registry | Deploys can't pull. Existing pods keep running on cached images. |
3. Network and firewall
Per-node ufw configuration
Applied during install (same on all 3 nodes):
default deny incoming
default allow outgoing
allow 22/tcp (SSH, world)
allow 80/tcp (HTTP via Cloudflare, world — see GAP-1)
allow 443/tcp (HTTPS, same — GAP-1)
allow 6443/tcp (k3s API, world, token-protected)
allow 2379:2380/tcp from <other 2 OVH IPs> (etcd client + peer)
allow 10250/tcp from <other 2 OVH IPs> (kubelet)
allow 51820/udp from <other 2 OVH IPs> (WireGuard tunnel)
allow 8472/udp from <other 2 OVH IPs> (VXLAN, defense-in-depth fallback)
To inspect: ssh ovhcloudN sudo ufw status numbered.
Cluster networking
- Pod CIDR: 10.42.0.0/16 (default k3s)
- Service CIDR: 10.43.0.0/16 (default k3s)
- Flannel backend: WireGuard-native. Each node hosts a flannel-wg interface on UDP 51820 and tunnels pod traffic to peers. Verify: ssh ovhcloudN ip -d link show flannel-wg.
Traefik ingress flow
Cloudflare → node:80/443 (public)
→ klipper-lb svclb-traefik DaemonSet pod (hostPort:80/443)
→ Traefik Service (ClusterIP 10.43.245.127:80/443)
→ Traefik Deployment pod (single replica)
→ matches Ingress host rule (api.myhoneydue.com etc.)
→ routes to backend Service (api / web / admin)
→ backend Pod
Traefik is the stock k3s install: it lives in kube-system and is managed by
k3s's bundled HelmChart. No HelmChartConfig override is applied on OVH
(unlike Hetzner; see §10).
4. DNS configuration (Cloudflare)
The myhoneydue.com zone in Cloudflare has these public records. All
hostnames are proxied (orange cloud) — required by the cloudflare-only
Traefik middleware which 403s any non-CF source IP.
| Host | Type | Values | Proxy |
|---|---|---|---|
| api.myhoneydue.com | A × 3 | 51.81.83.33, 51.81.87.86, 51.81.85.248 | Proxied |
| app.myhoneydue.com | A × 3 | (same trio) | Proxied |
| admin.myhoneydue.com | A × 3 | (same trio) | Proxied |
| myhoneydue.com (apex @) | A × 3 | (same trio) | Proxied |
Cloudflare round-robins among the 3 origins. klipper-lb on whichever node Cloudflare hits forwards to Traefik, and Traefik routes by Host header. Ingress is therefore effectively load-balanced per request across the 3 nodes, with no central LB.
SSL/TLS mode: Flexible (CF terminates TLS at the edge; origin is plain
HTTP on :80). Upgrading to Full (strict) is on the deferred list — would
need an origin certificate provisioned to cloudflare-origin-cert secret and
Traefik configured for TLS termination.
5. Filesystem layout (deploy-k3s/)
deploy-k3s/
├── config.yaml # Single config source (gitignored; contains tokens)
├── config.yaml.example # Template
├── kubeconfig # OVH admin kubeconfig (gitignored, 0600)
├── kubeconfig.hetzner.bak # Old Hetzner kubeconfig (unreachable, kept for history)
├── kubeconfig.tunnel # Optional: localhost-pointing copy for SSH-tunnel use
├── secrets/
│ ├── README.md
│ ├── postgres_password.txt # Neon DB password
│ ├── secret_key.txt # 32+ char app-token signing secret
│ ├── email_host_password.txt # Fastmail SMTP app password
│ ├── fcm_server_key.txt # FCM server key (currently unused — Android push disabled)
│ ├── apns_auth_key.p8 # APNs auth key (binary)
│ ├── cloudflare-origin.crt # Origin certificate (currently unused — CF Flexible)
│ └── cloudflare-origin.key
│ (all gitignored except README.md)
├── manifests/
│ ├── namespace.yaml
│ ├── network-policies.yaml # default-deny + per-app egress/ingress (13 NetPols total)
│ ├── rbac.yaml # api/worker/admin/web/redis ServiceAccounts (NOT applied by 03-deploy.sh; manual once)
│ ├── pod-disruption-budgets.yaml # api-pdb, web-pdb, worker-pdb (NOT applied by 03-deploy.sh; manual once)
│ ├── traefik-helmchartconfig.yaml # Hetzner-only DaemonSet+hostNetwork override (do NOT apply on OVH; we use default klipper-lb)
│ ├── kyverno-verify-images.yaml # Operator-gated policy (do NOT apply blindly — see file comment)
│ ├── api/{deployment,service,hpa}.yaml
│ ├── worker/deployment.yaml
│ ├── admin/{deployment,service}.yaml
│ ├── web/{deployment,service}.yaml
│ ├── redis/{deployment,service,pvc}.yaml
│ ├── ingress/{middleware,ingress-simple}.yaml
│ ├── migrate/job.yaml # goose migration Job (image-subbed at deploy time)
│ ├── observability/{kube-state-metrics,vmagent,alloy-logs}.yaml
│ └── kratos/ # Ory Kratos identity service (NOT yet deployed; gated on operator OIDC setup)
└── scripts/
├── _config.sh # Sourced by all scripts: cfg(), generate_env(), generate_cluster_config()
├── 01-provision-cluster.sh # Hetzner-Cloud-specific (uses hetzner-k3s CLI) — DO NOT RUN ON OVH
├── 02-setup-secrets.sh # Creates honeydue-secrets etc. from secrets/ + config.yaml; kubeconfig-driven
├── 03-deploy.sh # Build + push + apply manifests + roll deployments; kubeconfig-driven
├── 04-verify.sh # Post-deploy health + security checks; kubeconfig-driven
└── rollback.sh # `kubectl rollout undo` across all deployments
The deploy/prod.env file (sibling to deploy-k3s/, gitignored) holds
observability + admin credentials that 02-setup-secrets.sh and 03-deploy.sh
read but never display:
OBS_INGEST_URL (https://obs.88oakapps.com/api/v1/write)
OBS_TRACES_URL (https://obs.88oakapps.com/v1/traces)
OBS_INGEST_TOKEN (bearer token for VM + Loki + traces — all use same token)
GRAFANA_URL (https://grafana.88oakapps.com)
GRAFANA_ADMIN_USER (admin)
GRAFANA_ADMIN_PASSWORD
ADMIN_EMAIL / ADMIN_PASSWORD (in-app admin login)
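Before running the scripts it can help to confirm that every key listed above is actually set. The following is a minimal sketch, not part of the repo: missing_env_keys is a hypothetical helper, and it assumes prod.env uses plain KEY=value lines (no export prefix, no quoting rules checked). It prints only key names, never values.

```shell
# Hypothetical helper (an assumption, not one of the repo's scripts):
# print each required key that is absent or has an empty value in an env file.
# Assumes plain KEY=value lines.
missing_env_keys() {
  env_file=$1
  shift
  for key in "$@"; do
    # A key counts as present only when it is followed by a non-empty value.
    grep -Eq "^${key}=.+" "$env_file" || echo "$key"
  done
}

# Example call (prints nothing when every key is set):
#   missing_env_keys deploy/prod.env OBS_INGEST_URL OBS_TRACES_URL \
#     OBS_INGEST_TOKEN GRAFANA_URL GRAFANA_ADMIN_USER GRAFANA_ADMIN_PASSWORD \
#     ADMIN_EMAIL ADMIN_PASSWORD
```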
6. Install from clean boxes — the truthful procedure
This is what we ran on 2026-06-03 to stand up the live cluster, exactly. If you ever rebuild from zero this is the canonical sequence. Total wall-clock: ~12 min for cluster bootstrap; ~10 min for workloads.
6.1 Prerequisites
- 3 fresh Ubuntu VPS instances (any provider with public IPv4, ≥4 GB RAM, ≥40 GB disk)
- ~/.ssh/config entries (ovhcloud1/2/3) pointing at them, with passwordless sudo
- Local kubectl and curl
- The repo's deploy-k3s/secrets/ populated (or the ability to copy live secrets from another running cluster; see §6.8, Path B)
- deploy/prod.env populated with obs token + Grafana creds
6.2 Per-node OS hardening + firewall (all 3 in parallel)
For each ovhcloudN, over SSH:
export DEBIAN_FRONTEND=noninteractive
sudo apt-get update -qq
sudo apt-get install -y -qq fail2ban unattended-upgrades open-iscsi curl ufw
sudo systemctl enable --now iscsid fail2ban
sudo dpkg-reconfigure -f noninteractive -plow unattended-upgrades
sudo ufw --force reset
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 6443/tcp
SELF=$(hostname -I | awk '{print $1}')
for peer in 51.81.83.33 51.81.87.86 51.81.85.248; do
[ "$peer" = "$SELF" ] && continue
sudo ufw allow from "$peer" to any port 2379:2380 proto tcp
sudo ufw allow from "$peer" to any port 10250 proto tcp
sudo ufw allow from "$peer" to any port 51820 proto udp
sudo ufw allow from "$peer" to any port 8472 proto udp
done
sudo ufw --force enable
Watch ordering: allow 22/tcp MUST precede ufw enable. Existing SSH
sessions survive (ufw only affects new connections), but a misordered script
locks you out of fresh logins.
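The peer loop above depends on SELF matching one of the three public IPs. As a hedged sketch, the helper below (print_peer_rules, a hypothetical name that is not in the repo) prints the peer-scoped rules for a given node IP without touching ufw, so the rule set can be reviewed before it is applied.

```shell
# Hypothetical dry-run helper: print (do not execute) the peer-scoped ufw
# rules that the 6.2 loop would add for the node whose IP is passed in.
print_peer_rules() {
  self_ip=$1
  for peer in 51.81.83.33 51.81.87.86 51.81.85.248; do
    # Skip the node's own address, exactly as the real loop does.
    [ "$peer" = "$self_ip" ] && continue
    echo "ufw allow from $peer to any port 2379:2380 proto tcp"
    echo "ufw allow from $peer to any port 10250 proto tcp"
    echo "ufw allow from $peer to any port 51820 proto udp"
    echo "ufw allow from $peer to any port 8472 proto udp"
  done
}

# Example: print_peer_rules "$(hostname -I | awk '{print $1}')"
```

A correct run prints 8 rules (4 per peer). Seeing 12 means the detected address is not one of the three node IPs, so the node would also add rules for its own public IP; that is worth investigating before enabling the firewall.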
6.3 Install k3s on ovhcloud1 (the init node)
ssh ovhcloud1 'curl -sfL https://get.k3s.io | \
INSTALL_K3S_VERSION=v1.34.6+k3s1 \
sh -s - server \
--cluster-init \
--node-ip=51.81.83.33 \
--node-external-ip=51.81.83.33 \
--advertise-address=51.81.83.33 \
--flannel-backend=wireguard-native \
--flannel-external-ip \
--secrets-encryption \
--write-kubeconfig-mode=0600 \
--tls-san=51.81.83.33 \
--tls-san=51.81.87.86 \
--tls-san=51.81.85.248 \
--disable-cloud-controller'
Wait for sudo k3s kubectl get nodes to show this node Ready (~2-5 s).
Read the cluster token:
ssh ovhcloud1 'sudo cat /var/lib/rancher/k3s/server/node-token'
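The "wait for Ready" step can be scripted rather than eyeballed. This is a sketch under assumptions: node_is_ready is a hypothetical helper, and it relies on the default column order of kubectl get nodes (NAME first, STATUS second).

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and succeed only if the named node's STATUS column starts with "Ready".
# "NotReady" does not match; "Ready,SchedulingDisabled" does.
node_is_ready() {
  awk -v node="$1" '$1 == node && $2 ~ /^Ready/ { ok = 1 } END { exit !ok }'
}

# Example polling loop, run on the init node:
#   until sudo k3s kubectl get nodes --no-headers | node_is_ready vps-1624d691; do
#     sleep 2
#   done
```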
6.4 Join ovhcloud2, then ovhcloud3 (sequential)
Joining etcd one node at a time avoids split-brain on slow networks.
Replace <TOKEN> with the value from 6.3.
For ovhcloud2:
ssh ovhcloud2 'curl -sfL https://get.k3s.io | \
INSTALL_K3S_VERSION=v1.34.6+k3s1 \
K3S_TOKEN=<TOKEN> \
sh -s - server \
--server=https://51.81.83.33:6443 \
--node-ip=51.81.87.86 \
--node-external-ip=51.81.87.86 \
--advertise-address=51.81.87.86 \
--flannel-backend=wireguard-native \
--flannel-external-ip \
--secrets-encryption \
--write-kubeconfig-mode=0600 \
--tls-san=51.81.83.33 --tls-san=51.81.87.86 --tls-san=51.81.85.248 \
--disable-cloud-controller'
Then repeat for ovhcloud3 with --node-ip, --node-external-ip, and
--advertise-address all set to 51.81.85.248. After each join, wait for
kubectl get nodes to show the new node Ready before proceeding.
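Since the join flags differ between nodes only in the node's own IP, they can be generated instead of hand-edited. The helper below is an illustrative sketch (join_flags is a hypothetical name); the flag values are copied from the ovhcloud2 command above.

```shell
# Hypothetical helper: print the k3s server flags for a joining node, one per
# line. Only the three address flags vary with the node's public IP.
join_flags() {
  node_ip=$1
  printf '%s\n' \
    "--server=https://51.81.83.33:6443" \
    "--node-ip=$node_ip" \
    "--node-external-ip=$node_ip" \
    "--advertise-address=$node_ip" \
    "--flannel-backend=wireguard-native" \
    "--flannel-external-ip" \
    "--secrets-encryption" \
    "--write-kubeconfig-mode=0600" \
    "--tls-san=51.81.83.33" \
    "--tls-san=51.81.87.86" \
    "--tls-san=51.81.85.248" \
    "--disable-cloud-controller"
}

# Example: join_flags 51.81.85.248 | tr '\n' ' '
# then paste the result after `sh -s - server` in the install command.
```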
6.5 Pull kubeconfig to the operator workstation
ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
| sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
> deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig
kubectl get nodes -o wide # All 3 Ready, INTERNAL-IP = public IP
6.6 Label the redis node
kubectl label node vps-1624d691 honeydue/redis=true --overwrite
(Use whichever k8s node name corresponds to ovhcloud1. The Redis
Deployment's nodeSelector binds to this label.)
6.7 Bootstrap manifests NOT applied by 03-deploy.sh
These must be applied manually on a fresh cluster, before running
03-deploy.sh, or the Deployments will fail to create any pods:
kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl apply -f deploy-k3s/manifests/pod-disruption-budgets.yaml
rbac.yaml creates the 5 ServiceAccounts (api, worker, admin, web,
redis) referenced by the Deployment manifests. Without these, ReplicaSets
hang on FailedCreate: error looking up service account and pods never
start. Symptom on first deploy: kubectl get deploy shows 0 up-to-date
across the board with no pod activity — see §9 Gotchas.
Do NOT apply traefik-helmchartconfig.yaml (Hetzner-only — see §10) or
kyverno-verify-images.yaml (gated on operator Kyverno install).
6.8 Seed secrets
Two paths; pick whichever fits your situation:
Path A — clean install from local files (the original design):
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/02-setup-secrets.sh
Requires deploy-k3s/secrets/ to contain real postgres_password.txt,
secret_key.txt, email_host_password.txt, fcm_server_key.txt,
apns_auth_key.p8, cloudflare-origin.crt, cloudflare-origin.key. The
script reads config.yaml for registry.*, redis.password,
admin.basic_auth_*, and storage.b2_*.
Path B — clone live secrets from another running cluster (what we
actually did during the migration; useful if secrets/ is empty or you want
exact-byte equivalence):
HETZNER=$(pwd)/deploy-k3s/kubeconfig.hetzner.bak # or any kubeconfig with the secrets
OVH=$(pwd)/deploy-k3s/kubeconfig
kubectl --kubeconfig=$OVH apply -f deploy-k3s/manifests/namespace.yaml
for S in honeydue-secrets honeydue-apns-key gitea-credentials cloudflare-origin-cert admin-basic-auth; do
  kubectl --kubeconfig=$HETZNER -n honeydue get secret $S -o json \
  | python3 -c "
import json, sys
d = json.load(sys.stdin)
m = d['metadata']
# Strip cluster-specific metadata so the object applies cleanly elsewhere.
for k in ('uid','resourceVersion','creationTimestamp','generation','managedFields','ownerReferences','selfLink'):
    m.pop(k, None)
m.pop('annotations', None)
print(json.dumps(d))" \
  | kubectl --kubeconfig=$OVH apply -f -
done
After either path, verify:
kubectl -n honeydue get secrets
# Expect: admin-basic-auth, cloudflare-origin-cert, gitea-credentials,
# honeydue-apns-key, honeydue-secrets
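The expected-secrets check can also be made mechanical. A minimal sketch, assuming the five names listed above are the complete required set; missing_secrets is a hypothetical helper, not a repo script.

```shell
# Hypothetical helper: read secret names on stdin (as printed by
# `kubectl -n honeydue get secrets -o name`, i.e. "secret/<name>") and print
# every expected name that is absent.
missing_secrets() {
  present=$(sed 's|^secret/||')
  for want in admin-basic-auth cloudflare-origin-cert gitea-credentials \
              honeydue-apns-key honeydue-secrets; do
    echo "$present" | grep -qx "$want" || echo "$want"
  done
}

# Example (prints nothing when all five exist):
#   kubectl -n honeydue get secrets -o name | missing_secrets
```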
6.9 Deploy workloads
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig \
./deploy-k3s/scripts/03-deploy.sh --skip-build --tag latest
- --skip-build skips Docker build + push, deploys whatever's already in the registry at the named tag. Use this when migrating between clusters to guarantee both run identical bits.
- Without flags it builds the api / worker / admin / web images from the local repo HEAD and pushes to gitea.treytartt.com first.
- The script applies (in order): namespace, network-policies (13 of them), redis, ingress, then runs the goose migration Job (blocking on success), then api / worker / admin / web Deployments, then observability (kube-state-metrics, vmagent, alloy-logs).
- It does NOT apply: rbac.yaml, pod-disruption-budgets.yaml, traefik-helmchartconfig.yaml, kyverno-verify-images.yaml. The first two must be applied manually (see §6.7); the latter two are Hetzner-only or operator-gated.
- It does NOT apply: anything under kratos/ (skipped until kratos-secrets exists, which requires real OIDC client IDs).
6.10 Verify
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/04-verify.sh
Expect: all deployments READY=desired, 13 NetworkPolicies, 7 ServiceAccounts
(api, worker, admin, web, redis, vmagent, alloy-logs), 3 PDBs, cloudflare-only
middleware present, in-cluster /api/health/ returns 200.
External smoke test. It bypasses DNS by sending the Host header straight to
each node IP; the api /health/ route is exempt from the cloudflare-only
middleware, so direct-IP requests work for diagnostics:
for IP in 51.81.83.33 51.81.87.86 51.81.85.248; do
curl -s -o /dev/null -w "$IP -> %{http_code}\n" \
-H 'Host: api.myhoneydue.com' http://$IP/api/health/
done
# All three should return 200.
6.11 DNS cutover (if migrating)
In the Cloudflare dashboard for myhoneydue.com, set the 4 hostnames in §4 to
the OVH IPs and keep proxied. Effective propagation ~30 s to 5 min through
the Cloudflare proxy.
If you have a previous cluster, scale its worker to 0 before flipping to avoid scheduled-job double-fires:
KUBECONFIG=<previous> kubectl -n honeydue scale deploy/worker --replicas=0
# (cut DNS)
KUBECONFIG=<new> kubectl -n honeydue scale deploy/worker --replicas=1
Run the DNS change and the final scale-up back-to-back. Worker jobs are mostly scheduled (hourly or less often), so a brief gap with no worker is harmless, whereas any overlap causes duplicate emails.
7. Day-to-day operations
Common kubectl one-liners
export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig
# Cluster state
kubectl get nodes -o wide
kubectl -n honeydue get pods
kubectl -n honeydue get deploy
kubectl top nodes
kubectl -n honeydue top pods
# Tail logs
kubectl -n honeydue logs deploy/api -f --tail=50
kubectl -n honeydue logs -l app.kubernetes.io/name=api -f --tail=20
stern -n honeydue api # if stern is installed (multi-pod)
# Restart a deployment (no image change, picks up ConfigMap changes)
kubectl -n honeydue rollout restart deploy/api
# Rollback one revision
kubectl -n honeydue rollout undo deploy/api
# Scale (worker MUST stay at 0 or 1)
kubectl -n honeydue scale deploy/api --replicas=4
# Get into a pod
kubectl -n honeydue exec -it deploy/api -- sh
Redeploy after code changes
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/03-deploy.sh
Builds images from local HEAD, tags with the git short SHA, pushes to Gitea,
runs goose up (idempotent), rolls api/worker/admin/web. Total: ~3-5 min
when images change.
To deploy without rebuilding (pin to a specific tag):
./deploy-k3s/scripts/03-deploy.sh --skip-build --tag <tag>   # e.g. latest, as in §6.9
Migrations
Goose migrations live in migrations/. New file pattern:
make migrate-new name=add_foo_column # generates migrations/YYYYMMDDHHMMSS_add_foo_column.sql
# Edit the file with -- +goose Up / -- +goose Down sections
03-deploy.sh runs a one-shot Job (manifests/migrate/job.yaml) that
executes goose up against Neon (direct compute endpoint, not pooler — see
file comment). The Job blocks api/worker rollout and aborts the deploy on
failure. No app pod runs AutoMigrate; api/worker startup verifies
goose_db_version is current and refuses to boot on mismatch.
Grafana
URL: https://grafana.88oakapps.com (creds in deploy/prod.env)
Three dashboards in the honeyDue folder:
| UID | Title | Use |
|---|---|---|
| honeydue-eli5-overview | honeyDue — Overview (ELI5) | Single-screen at-a-glance health: pods up, crashes, errors, RPS, latency, Postgres, memory, top endpoints, push failures, worker activity, recent error logs. Created 2026-06-03. |
| honeydue-red | honeyDue API — RED | Rate/Errors/Duration cuts (legacy) |
| honeydue-logs | honeyDue — Production Logs | Live log explorer |
For the ELI5 dashboard's queries, api-side metrics use service="api",
NOT namespace="honeydue". vmagent's scrape config drops the namespace
label from api metrics — only service, pod, node, job, plus the
metric's own labels (route, method, status, etc.) survive. Queries that
filter on namespace="honeydue" for api metrics silently match nothing.
kubectl tunnel (if 6443 is firewalled to your IP)
Currently 6443 is open WAN-side (matching the previous Hetzner posture).
If you tighten that to operator-IPs-only and your IP changes, use an SSH
tunnel:
ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \
-i ~/.ssh/ovhcloud \
-L 127.0.0.1:6443:127.0.0.1:6443 \
ubuntu@51.81.83.33
cp deploy-k3s/kubeconfig deploy-k3s/kubeconfig.tunnel
sed -i.bak 's|https://51.81.83.33:6443|https://127.0.0.1:6443|' deploy-k3s/kubeconfig.tunnel
export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"
8. Disaster recovery
"I lost the kubeconfig"
ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
| sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
> deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
If ovhcloud1 is down but ovhcloud2 or 3 is up, swap host and IP — the
TLS SAN covers all three.
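The "swap host and IP" fallback amounts to parameterizing the sed rewrite. A small sketch, with rewrite_server as a hypothetical helper name:

```shell
# Hypothetical helper: rewrite the loopback server address in a k3s.yaml read
# from stdin so it points at the node IP given as the first argument.
rewrite_server() {
  sed "s|server: https://127.0.0.1:6443|server: https://$1:6443|"
}

# Example, pulling from ovhcloud2 when ovhcloud1 is down:
#   ssh ovhcloud2 'sudo cat /etc/rancher/k3s/k3s.yaml' \
#     | rewrite_server 51.81.87.86 > deploy-k3s/kubeconfig
#   chmod 600 deploy-k3s/kubeconfig
```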
"A node is unresponsive"
kubectl drain vps-XXX --ignore-daemonsets --delete-emptydir-data
# Reboot via OVH manager or:
ssh ovhcloudN sudo reboot
# Wait for Ready, then:
kubectl uncordon vps-XXX
The cluster tolerates 1 node down (etcd quorum 2/3). With 2 down, etcd loses quorum and the API server stops accepting writes.
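The 2/3 figure follows from etcd's majority rule: a cluster of n members needs floor(n/2) + 1 of them to agree, so it survives floor((n-1)/2) failures. As a quick worked sketch (both function names are illustrative):

```shell
# Majority quorum for an etcd cluster of N members: floor(N/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while quorum still holds: floor((N-1)/2).
etcd_fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

# Example: etcd_quorum 3; etcd_fault_tolerance 3
```

Note that a fourth node would raise the quorum to 3 without raising the fault tolerance above 1, which is why odd member counts are the norm.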
"etcd quorum lost (2+ nodes dead)"
Bring nodes back online if possible. If not, stop the k3s service on the surviving node and restore from a snapshot:
ssh ovhcloud1 'sudo systemctl stop k3s && sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<latest>'
k3s takes automatic etcd snapshots every 12h, keeping 5. List with:
ssh ovhcloud1 sudo ls -la /var/lib/rancher/k3s/server/db/snapshots/
This is destructive — workload state since the snapshot is lost, but Neon (actual app data) is unaffected.
"I have to rebuild the whole cluster from scratch"
Provision 3 fresh boxes, then follow the sequence in §6 exactly. End-to-end is ~30 min. The dependencies that make this possible:
| Stays put through rebuild | Where |
|---|---|
| Application data | Neon Postgres (managed) |
| User uploads | Backblaze B2 (managed) |
| Container images | gitea.treytartt.com (self-hosted, but not on the OVH cluster) |
| Operator secrets | deploy-k3s/secrets/ + config.yaml + deploy/prod.env on the operator workstation (gitignored) |
| DNS | Cloudflare control panel |
If gitea.treytartt.com ever moved onto the OVH cluster, you would have a
circular dependency: rebuilding requires images you can't pull until the
cluster is up. Gitea is currently NOT in the honeyDue cluster (it runs on a
separate Hetzner-era host), so this isn't a problem today, but it is worth
flagging if that ever changes.
"Cutover back to Hetzner / failover to a backup cluster"
There is no warm standby today. Bringing up a second cluster is the same §6 procedure on different hardware, then a Cloudflare DNS swap. The worker-swap dance is critical:
KUBECONFIG=<current> kubectl -n honeydue scale deploy/worker --replicas=0
# (Update Cloudflare DNS to new cluster's IPs — proxied)
KUBECONFIG=<new> kubectl -n honeydue scale deploy/worker --replicas=1
9. Known gotchas
9.1 First-deploy "0 up-to-date" across all Deployments
Symptoms: kubectl get deploy shows READY 0/N, UP-TO-DATE 0 for
api/worker/admin/web/redis. kubectl get events shows
FailedCreate: error looking up service account honeydue/<name>: serviceaccount "..." not found.
Cause: rbac.yaml (ServiceAccounts) is NOT applied by 03-deploy.sh. On
a fresh cluster the SAs don't exist; the ReplicaSet controller can't create
pods.
Fix:
kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl -n honeydue rollout restart deploy/api deploy/worker deploy/admin deploy/web deploy/redis
This was hit during the 2026-06-03 OVH bootstrap. The permanent fix is to add
kubectl apply -f rbac.yaml to 03-deploy.sh between the namespace and
network-policies applies; until that lands, follow §6.7 on every fresh
cluster.
9.2 vmagent SD broken on fresh deploy ("0 pods up" in Grafana)
Symptoms:
- Grafana panels using kube_* metrics or up{job=...} show 0
- vmagent logs: dial tcp 10.43.0.1:443: connect: connection refused every ~30 s
- Direct test from a pod also refused
Cause: k3s's NetworkPolicy controller evaluates egress rules after
kube-proxy's DNAT (not before, contrary to spec). Pod-to-kubernetes-Service
(10.43.0.1:443) gets DNAT'd to <node_ip>:6443, then the policy check
runs. Without an explicit egress rule for :6443, the packet is rejected.
The allow-egress-from-vmagent NetPol in network-policies.yaml includes
both rules:
- to:
    - ipBlock: { cidr: 10.43.0.0/16 }
  ports:
    - { port: 443, protocol: TCP }
- to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except: [10.42.0.0/16]
  ports:
    - { port: 6443, protocol: TCP }
If this happens: confirm network-policies.yaml was applied:
kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443
Counter-evidence that confirms diagnosis: kube-state-metrics in
kube-system works fine (no NetPols in that namespace).
9.3 vmagent appears healthy but no data in Grafana
vmagent's /-/healthy returns 200 as long as the process is alive and
remote-write is TCP-functional. It doesn't check that scrapes are actually
succeeding. To cover that gap, the liveness probe in vmagent.yaml queries
/api/v1/targets and fails the pod if no target is up. After ~3 failures
(~3 min), kubelet restarts it.
If vmagent runs for weeks but Grafana is empty, the probe was disabled or the exec command broke.
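To check by hand what such a probe would see, the targets response can be inspected directly. This is only a sketch of the idea, not the probe command from vmagent.yaml: any_target_up is a hypothetical helper, and it assumes the Prometheus-compatible JSON shape in which each active target carries a "health" field.

```shell
# Hypothetical helper: read an /api/v1/targets JSON response on stdin and
# succeed only if at least one target reports "health":"up".
any_target_up() {
  grep -Eq '"health"[[:space:]]*:[[:space:]]*"up"'
}

# Example, assuming vmagent listens on its default port 8429 and the response
# is fetched from somewhere that can reach the pod:
#   curl -s http://127.0.0.1:8429/api/v1/targets | any_target_up \
#     || echo "no scrape target is up"
```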
9.4 vmagent bearer token destroyed by direct kubectl apply
The committed vmagent.yaml has bearer_token: TOKEN_PLACEHOLDER. The real
token is sed-substituted at deploy time by 03-deploy.sh. Applying the
file directly:
kubectl apply -f deploy-k3s/manifests/observability/vmagent.yaml # WRONG
overwrites the Secret with the literal TOKEN_PLACEHOLDER, and remote-write
requests then fail with 401. To restore without a full redeploy:
OBS_TOKEN_B64=$(kubectl -n honeydue get secret honeydue-secrets \
-o jsonpath='{.data.OBS_INGEST_TOKEN}')
kubectl -n honeydue patch secret vmagent-remote-write --type=json \
-p="[{\"op\":\"replace\",\"path\":\"/data/bearer_token\",\"value\":\"${OBS_TOKEN_B64}\"}]"
kubectl -n honeydue rollout restart deploy/vmagent
Or just re-run ./deploy-k3s/scripts/03-deploy.sh — the sed handles it.
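To tell whether the Secret has been clobbered before restoring it, the stored value can be decoded and compared with the placeholder. A minimal sketch; is_placeholder_token is a hypothetical helper, and the Secret name and key are the ones used in the patch command above.

```shell
# Hypothetical helper: succeed if the base64 string passed as the first
# argument decodes to the literal TOKEN_PLACEHOLDER.
is_placeholder_token() {
  [ "$(printf '%s' "$1" | base64 --decode 2>/dev/null)" = "TOKEN_PLACEHOLDER" ]
}

# Example:
#   current=$(kubectl -n honeydue get secret vmagent-remote-write \
#     -o jsonpath='{.data.bearer_token}')
#   is_placeholder_token "$current" && echo "token was overwritten; restore it"
```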
9.5 Dashboard queries: api metrics need service="api" not namespace="honeydue"
vmagent's scrape config (vmagent-config ConfigMap) explicitly chooses which
Kubernetes pod-metadata labels to copy onto each scraped series. Namespace
isn't one of them. Labels you can use on api-side metrics:
- service (literal "api")
- job (literal "api")
- pod (the api pod name)
- node (the k8s node name)
- cluster (vmagent external_label, currently "honeydue-k3s")
- environment (vmagent external_label, currently "prod")
- Plus each metric's own labels (method, route, status for HTTP; etc.)
kube_* metrics from kube-state-metrics DO carry namespace natively
(KSM publishes it as a label, vmagent passes it through). Loki streams have
namespace because alloy-logs explicitly relabels it. So the rule is:
| Metric prefix | Use |
|---|---|
| kube_* | namespace="honeydue" |
| http_*, gorm_*, go_*, process_* (api) | service="api" |
| Loki logs {...} | namespace="honeydue" |
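The rule for metrics can be written down as a small lookup, which is handy when templating dashboard queries. This is an illustrative sketch only: selector_for is a hypothetical helper, and it covers just the metric prefixes named in this section.

```shell
# Hypothetical helper: print the label selector to use for a metric name,
# following the rule above. Unknown prefixes are rejected rather than guessed.
selector_for() {
  case $1 in
    kube_*)
      echo 'namespace="honeydue"' ;;
    http_*|gorm_*|go_*|process_*)
      echo 'service="api"' ;;
    *)
      echo "no selector rule for metric: $1" >&2
      return 1 ;;
  esac
}

# Example: echo "go_goroutines{$(selector_for go_goroutines)}"
```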
9.6 Cluster-label collision when two clusters run together
Both Hetzner and OVH vmagents push as cluster=honeydue-k3s, environment=prod
(same external_labels). During the migration overlap this made dashboards
sum both clusters' data. The simplest narrowing during overlap is by node
name pattern (node=~"vps-.*" for OVH, node=~"ubuntu-.*" for Hetzner). If
you ever bring up a backup cluster long-term, change one cluster's
external_labels.cluster to something distinct (e.g. honeydue-ovh
vs. honeydue-backup).
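Changing the label could look like the sketch below. It assumes the vmagent config carries a single `cluster:` key under `external_labels`, in the usual Prometheus-style YAML layout; check the actual vmagent-config ConfigMap before relying on it. `relabel_cluster` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the value of every "cluster:" key read from
# stdin, preserving indentation. Assumes the key appears only once, under
# external_labels.
relabel_cluster() {
  new="$1"
  sed -E "s/^([[:space:]]*cluster:[[:space:]]*).*/\1${new}/"
}

# Usage (file name is illustrative):
#   relabel_cluster honeydue-backup < vmagent-scrape.yaml > vmagent-scrape.backup.yaml
```

Series written before the change keep the old label, so dashboards spanning the switch still need the node-name filter for the overlap window.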
9.7 Worker double-firing scheduled jobs
If two worker Deployments run concurrently (e.g. two clusters both pointing
at the same Neon DB), Asynq schedulers each fire crons independently — users
get duplicate emails. Workaround: scale all-but-one worker to 0. This is the
exact mechanic used during cutovers (§6.11).
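As a rough sketch of the workaround, the helper below prints (it does not run) one scale-down command per cluster that should stop firing crons. It assumes the worker Deployment is named `worker` in the `honeydue` namespace; the Deployment name is an assumption, and `scale_down_others` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: the first argument is the kubeconfig of the cluster
# that keeps its workers; every remaining kubeconfig gets a scale-to-zero
# command printed for review.
scale_down_others() {
  keep="$1"
  shift
  printf '# workers stay up on %s\n' "$keep"
  for kc in "$@"; do
    printf 'kubectl --kubeconfig %s -n honeydue scale deploy/worker --replicas=0\n' "$kc"
  done
}

# Usage: review the output, then pipe it to sh to apply.
#   scale_down_others deploy-k3s/kubeconfig /path/to/other/kubeconfig
```

Printing instead of executing keeps the operator in the loop during a cutover, when scaling the wrong cluster to zero would stop scheduled jobs entirely.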
9.8 Node kubeconfig mode
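/etc/rancher/k3s/k3s.yaml on each node is mode 0600 because we install
with --write-kubeconfig-mode=0600. Current k3s releases already default to
0600, so the flag pins the restrictive mode explicitly rather than tightening
it; the choice was intentional. Don't loosen it without coordinating. Any
tooling on the node that expects to read the file as a non-root user (none
today) will fail under this mode.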
10. Differences from MIGRATION_NOTES.md (Hetzner-era)
MIGRATION_NOTES.md documents the Swarm → k3s migration on Hetzner
(2026-04-24). Most of it still applies, with these OVH-specific deltas:
| What MIGRATION_NOTES says | What OVH actually has |
|---|---|
| hetzner-k3s provisioner | Manual k3s install (§6) |
| Hetzner Load Balancer (not used) → Cloudflare round-robin | Same — Cloudflare round-robin (§4) |
| Traefik as DaemonSet + hostNetwork via HelmChartConfig | Traefik default Deployment + klipper-lb svclb DaemonSet. The traefik-helmchartconfig.yaml file is NOT applied on OVH. |
| servicelb disabled (--disable=servicelb) | servicelb enabled (we didn't pass --disable=servicelb). This is what makes klipper-lb work. |
| sysctl net.ipv4.ip_unprivileged_port_start=0 for hostNetwork Traefik | Not needed; klipper-lb proxies the port binding instead |
| UFW rules between 3 Hetzner IPs | UFW rules between 3 OVH IPs (51.81.83.33, 51.81.87.86, 51.81.85.248) |
| Kubeconfig at ~/.kube/honeydue-k3s.yaml | Kubeconfig at deploy-k3s/kubeconfig |
| TLS at origin: not configured (CF Flexible) | Same — CF Flexible. cloudflare-origin-cert Secret exists (carried over) but Ingress doesn't reference it. |
11. Outstanding follow-ups (deferred, not blocking)
- No warm standby / rollback cluster. OVH is solo production. An OVH outage is a real outage; recovery time = §6 procedure (~30 min). User plans to bring a second cluster up as a target.
- UFW allows 80/443 from world. Hetzner had a network-layer Cloudflare-IP allowlist on these ports. OVH currently relies on the L7 cloudflare-only Traefik middleware, which protects admin but NOT api / web / apex (those routes have to be reachable from anywhere, but they're then trivially DDoSable bypassing Cloudflare). Fix: add ufw allow rules restricting 80/tcp and 443/tcp to Cloudflare's published IP ranges (~22 IPv4 prefixes from https://www.cloudflare.com/ips-v4/).
- Cloudflare TLS Flexible → Full(strict). Origin certs exist as Secret but Ingress doesn't terminate TLS. Upgrading to Full(strict) requires Traefik configured with the cert + an HTTPS entrypoint + Ingress tls: block.
- rbac.yaml + pod-disruption-budgets.yaml should be in 03-deploy.sh. They're currently bootstrap-only. Adding them is idempotent and prevents the §9.1 footgun.
- Push notification metrics are log-derived, not counters. Successes aren't logged or counted. Proper Prometheus instrumentation (~15 lines in internal/push/client.go) would give a real success/failure ratio.
- Worker has no /metrics endpoint. cmd/worker/main.go serves :6060 for healthz only. Adding Asynq's metrics.NewPrometheusExporter() + a ServiceMonitor + uncommenting the worker job stanza in vmagent-config ConfigMap would give real queue depth and job latency.
- Ory Kratos. Manifests exist (manifests/kratos/) but the deploy is gated on operator-side prerequisites (Neon kratos database, auth.myhoneydue.com DNS, real Apple+Google OIDC clients, Kratos image tag pinned). Until kratos-secrets exists, 03-deploy.sh silently skips the Kratos apply.
- Hetzner cluster fully retired? config.yaml nodes: block describes OVH; the bak kubeconfig is at kubeconfig.hetzner.bak. Boxes themselves are operator-managed.
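For the UFW follow-up above, the allow rules could be generated from Cloudflare's published list rather than typed by hand. This is a sketch under assumptions: `cf_ufw_rules` is a hypothetical helper, the rules are printed for review rather than executed, and the existing world-open 80/443 rules would still have to be removed separately for the restriction to take effect.

```shell
#!/bin/sh
# Hypothetical helper: read CIDR prefixes on stdin (one per line) and print
# one ufw rule per prefix covering both web ports. Blank lines and comment
# lines are skipped; a final line without a trailing newline is still handled.
cf_ufw_rules() {
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    case "$cidr" in
      ''|'#'*) continue ;;
    esac
    printf 'ufw allow proto tcp from %s to any port 80,443\n' "$cidr"
  done
}

# Usage (run on each node after reviewing the output):
#   curl -fsS https://www.cloudflare.com/ips-v4/ | cf_ufw_rules
```

Cloudflare's ranges change occasionally, so the generated rules would need to be refreshed when the published list is updated.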
11.1 Dashboard observability gaps (raised 2026-06-03 during dashboard build)
Surfaced while building the honeydue-eli5-overview Grafana dashboard. Each
needs code or infra changes to expose; none blocks today's operations.
- node-exporter not deployed. No node-level metrics today (node_filesystem_avail_bytes, node_memory_*, node_load1, etc.). The dashboard's pod-level memory/CPU panels are app-process only; a node running out of disk would silently fail the cluster before any dashboard signal showed it. Highest-priority Tier-3 item. Fix: deploy node-exporter as a DaemonSet (~50 lines of YAML), add a scrape stanza to vmagent-config, add a Node disk free stat panel.
- Traefik metrics not enabled. Traefik can expose /metrics with traefik_entrypoint_requests_total + traefik_service_request_duration_seconds, giving edge-level visibility into requests that never reached api pods (404s, redirects, middleware blocks). Enable via a HelmChartConfig override that sets metrics.prometheus.entryPoint=metrics + adds a :9100 entryPoint + a scrape stanza. Skipped today to avoid Traefik restart risk; safe additive change when ready.
- Push notification success/failure counters (already #5). Add prometheus.NewCounterVec in internal/push/client.go with labels platform={ios,android}, outcome={success,failed,breaker_open,disabled}. Increments at every Send/SendActionable branch. Replaces the log-derived "Push failures" stat on the dashboard with a real success rate.
- Worker queue / job metrics (already #6). Asynq has a built-in Prometheus exporter (asynq/x/metrics). Wire it into the worker's :6060 health server (a single healthMux.Handle line) and uncomment the worker scrape stanza in vmagent-config. Surfaces queue depth, retry count, processing time per task type.
- Cache hit / miss rate. internal/services/cache_service.go has no counters. Add a Counter with labels {operation=get|set, result=hit|miss} around the cache wrapper. ~10 lines. Useful once real traffic flows to verify the ETag and Redis caches are paying their keep.
- APNs send-latency histogram. Wrap internal/push/apns.go::Send in a prometheus.NewHistogramVec keyed on outcome. Tells you when Apple's gateway is slow (which correlates with their incident page).
12. Audit trail
| Date | Change |
|---|---|
| 2026-04-24 | Initial k3s cluster on Hetzner (Swarm → k3s migration) — see MIGRATION_NOTES.md |
| 2026-04-25 | config.yaml reconstructed from live ConfigMap (original file lost) |
| 2026-05-15 | Audit fixes: Redis auth required, admin basic auth, secrets-encryption flag |
| 2026-05-16 | 02-setup-secrets.sh started carrying B2 credentials (was a manifest/script drift) |
| 2026-06-02 | Kratos scaffolding committed (not deployed) |
| 2026-06-03 | Hetzner → OVH BHS cutover. New 3-node cluster on 51.81.83.33, .87.86, .85.248. DNS cut on Cloudflare. Hetzner kubeconfig moved to .bak. Grafana honeydue-eli5-overview dashboard created. Hetzner cluster powered off later same day. |
| 2026-06-03 | Dashboard build-out: extended honeydue-eli5-overview to 22 panels covering Tier-1 (HTTP status, CPU per pod, goroutines, top slow) and Tier-2 (GC, network I/O, pod uptime, top 5xx) signals. Surfaced Tier-3 instrumentation gaps in §11.1. |