From e448ec66dc8659cb0b49b190c211118cefa0f217 Mon Sep 17 00:00:00 2001 From: Trey t Date: Wed, 3 Jun 2026 09:34:35 -0500 Subject: [PATCH] docs(runbook): rewrite for OVH BHS cluster + Tier-3 observability TODOs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Brings the runbook in line with the 2026-06-03 Hetzner → OVH cutover: - Section 1-5: topology, machines (3x OVH VPS-1 BHS), software versions, network/firewall, DNS, filesystem layout — all reflect the live OVH install instead of the historical Hetzner setup. - Section 6: canonical install-from-clean-boxes procedure (the literal commands run on 2026-06-03), so anyone can stand up a backup cluster by following along. - Section 9: keeps existing gotchas (vmagent NetPol, token-blown-away, healthy-but-empty) and adds four new ones discovered during the OVH build: rbac.yaml not in 03-deploy.sh, namespace label missing from api metrics (use service="api"), cluster-label collision when two clusters push concurrently, worker double-firing on cutover. - Section 11.1: enumerates Tier-3 observability gaps surfaced while building the honeydue-eli5-overview dashboard (node-exporter not deployed, Traefik metrics off, push success counters absent, worker /metrics endpoint absent, cache hit rate uninstrumented, APNs latency uninstrumented). - Section 12: dated audit trail of cluster changes. Pure documentation; no code or manifest changes. Co-Authored-By: Claude Opus 4.7 --- deploy-k3s/RUNBOOK.md | 1086 +++++++++++++++++++++++++++++++++-------- 1 file changed, 895 insertions(+), 191 deletions(-) diff --git a/deploy-k3s/RUNBOOK.md b/deploy-k3s/RUNBOOK.md index 53c4b9b..0deffa7 100644 --- a/deploy-k3s/RUNBOOK.md +++ b/deploy-k3s/RUNBOOK.md @@ -1,74 +1,764 @@ -# k3s Cluster Operations Runbook +# honeyDue k3s Cluster — Operations Runbook -Living document for honeyDue k3s cluster operations. 
Add entries when you -hit something non-obvious so future-you (or your replacement) doesn't have -to rediscover it. +Living document for the honeyDue production cluster. Add entries when you hit +something non-obvious so future-you (or your replacement) doesn't have to +rediscover it. + +Last full revision: **2026-06-03** (Hetzner → OVH BHS cutover; cluster solo +production from that date forward). For pre-OVH history, see +`MIGRATION_NOTES.md` (Swarm → k3s migration on Hetzner, 2026-04-24). --- -## Deployment +## 1. Topology and inventory -The canonical deploy path is `deploy-k3s/scripts/03-deploy.sh`. It applies -everything in `deploy-k3s/manifests/` in the right order. +### Hosting -What it touches (in order): +| | | +|---|---| +| Provider | OVHcloud (us.ovhcloud.com) | +| Datacenter | BHS — Beauharnois, Quebec, Canada | +| Plan | VPS-1 × 3 (~$6.46/mo each, ~$19/mo total) | +| Node spec | 4 vCPU (Intel Haswell, shared), 7.6 GB RAM, 75 GB NVMe | +| Public bandwidth | 400 Mbps per node, unlimited traffic | +| Private network | **None.** Nodes have public IPv4 + IPv6 only; inter-node traffic crosses the public internet (encrypted by flannel WireGuard backend — see §3) | -1. `namespace.yaml` -2. `network-policies.yaml` — **all** NetPols including the vmagent ones -3. `redis/` -4. `ingress/` -5. `migrate/job.yaml` (with image substitution; blocks on success) -6. `api/deployment.yaml`, `api/service.yaml`, `api/hpa.yaml` (image-subbed) -7. `worker/deployment.yaml` (image-subbed) -8. `admin/deployment.yaml`, `admin/service.yaml` (image-subbed) -9. `web/deployment.yaml`, `web/service.yaml` (image-subbed; optional dir) -10. `observability/kube-state-metrics.yaml` -11. `observability/vmagent.yaml` (with `TOKEN_PLACEHOLDER` sed-substituted from `deploy/prod.env`) +### Nodes -If you add a new manifest, also add a `kubectl apply -f` line to -`03-deploy.sh` — there's no kustomization or `apply -R`. 
**A manifest -that exists in the repo but isn't applied by the script will silently -not deploy.** +| SSH alias | Kubernetes node name | Public IPv4 | Public IPv6 | Roles | +|---|---|---|---|---| +| `ovhcloud1` | `vps-1624d691` | `51.81.83.33` | `2604:2dc0:101:200::5a9a` | control-plane, etcd, redis-pinned | +| `ovhcloud2` | `vps-c0f51be2` | `51.81.87.86` | `2604:2dc0:101:200::30d4` | control-plane, etcd | +| `ovhcloud3` | `vps-dbca24c7` | `51.81.85.248` | `2604:2dc0:101:200::450f` | control-plane, etcd | -### Pre-deploy checklist +The cluster is **all-control-plane** (workloads schedule on the same nodes that +run etcd and the API server). `vps-1624d691` carries the +`honeydue/redis=true` label so the Redis Deployment's `nodeSelector` binds +there; the Redis PVC (`local-path`, host-pinned) lives on that node's disk. -- [ ] `deploy/prod.env` exists and contains `OBS_INGEST_TOKEN=...` - (otherwise vmagent gets skipped with a warning) -- [ ] `KUBECONFIG` points at the right cluster -- [ ] The Gitea image registry is reachable from k3s nodes -- [ ] Schema migrations in `migrations/` are tested locally first - (the deploy aborts if `honeydue-migrate` Job fails) +### SSH access + +`~/.ssh/config` entries (operator workstation): + +``` +Host ovhcloud1 + HostName 51.81.83.33 + Port 22 + User ubuntu + IdentityFile ~/.ssh/ovhcloud + IdentitiesOnly yes +Host ovhcloud2 + HostName 51.81.87.86 + Port 22 + User ubuntu + IdentityFile ~/.ssh/ovhcloud + IdentitiesOnly yes +Host ovhcloud3 + HostName 51.81.85.248 + Port 22 + User ubuntu + IdentityFile ~/.ssh/ovhcloud + IdentitiesOnly yes +``` + +`ubuntu` has passwordless sudo (`/etc/sudoers.d/90-cloud-init-users` from OVH's +cloud-init). + +### kubectl access + +```bash +export KUBECONFIG=/Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go/deploy-k3s/kubeconfig +kubectl get nodes +``` + +The `deploy-k3s/kubeconfig` file (mode 0600, gitignored) is the OVH cluster's +admin kubeconfig with `server: https://51.81.83.33:6443`. 
A stale Hetzner copy +lives next to it as `kubeconfig.hetzner.bak` for historical reference; the +Hetzner cluster is powered off and that file's API server is unreachable. + +To refresh from the cluster (if the local copy is lost or rotated): + +```bash +ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \ + | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \ + > deploy-k3s/kubeconfig +chmod 600 deploy-k3s/kubeconfig +``` + +The k3s API at `:6443` is open to the public internet (token-protected). --- -## Known gotchas +## 2. Software -### vmagent SD broken on fresh deploy ("0 pods up" in Grafana) +### Kernel-level + +| | | +|---|---| +| OS | Ubuntu 26.04 LTS (set by OVH's VPS-1 image) | +| Kernel | `7.0.0-14-generic` | +| Init | systemd | +| Container runtime | containerd 2.2.2 (bundled with k3s) | +| Firewall | `ufw` (per-node, configured at install — see §3) | +| Other host packages | `fail2ban` (SSH brute-force protection, default jail), `unattended-upgrades` (security updates), `open-iscsi` (k3s prereq for some storage backends), `curl` | + +### Kubernetes + +| | | +|---|---| +| Distribution | k3s | +| Version | **`v1.34.6+k3s1`** (pinned in `config.yaml:cluster.k3s_version`) | +| Control plane | 3-node HA, embedded etcd (no external Postgres backing store) | +| CNI / networking | flannel with **WireGuard-native backend** (`--flannel-backend=wireguard-native`). Encrypts pod-to-pod and etcd peer traffic because nodes only have public IPs (no private network). ~3-5% CPU overhead under load. | +| Service LB | klipper-lb (default k3s `servicelb`). The `svclb-traefik` DaemonSet binds host ports `:80` and `:443` on each node and forwards to the Traefik Service. **Not** the DaemonSet-w/-hostNetwork Traefik pattern used on the old Hetzner cluster — see §10 *Differences from MIGRATION_NOTES*. 
| +| Ingress controller | Traefik (k3s default), single-replica Deployment, exposed via klipper-lb | +| DNS | CoreDNS (k3s default) | +| Secrets encryption | Enabled (`--secrets-encryption`); etcd values are AES-CBC encrypted at rest | +| kubeconfig perms | `0600` (`--write-kubeconfig-mode=0600`) | +| Cloud controller | Disabled (`--disable-cloud-controller`) — no provider integration on OVH | +| Misc | `--node-ip` / `--node-external-ip` / `--advertise-address` all set to each node's public IPv4. TLS SANs cover all 3 IPs so any IP can serve the API. | + +### Application stack (in cluster, `honeydue` namespace) + +| Deployment | Replicas | Image (digest-pinned) | Notes | +|---|---:|---|---| +| `api` | 3 | `gitea.treytartt.com/admin/honeydue-api@sha256:34fde6...` | Go REST API on `:8000`, exposes `/metrics` | +| `web` | 3 | `gitea.treytartt.com/admin/honeydue-web@sha256:8c62cf...` | Next.js, server-side proxy to api | +| `admin` | 1 | `gitea.treytartt.com/admin/honeydue-admin@sha256:b81263...` | Next.js admin panel, gated behind Traefik basic-auth | +| `worker` | 1 | `gitea.treytartt.com/admin/honeydue-worker@sha256:fe1f5e...` | Asynq scheduler + Redis-backed jobs (singleton — must not run as >1 replica or every cron fires N×) | +| `redis` | 1 | `redis:7-alpine@sha256:6ab0b6...` | Pinned to `vps-1624d691` via `honeydue/redis=true`. PVC `redis-data` (local-path, 5 Gi). Password-auth required. 
| +| `vmagent` | 1 | `victoriametrics/vmagent@sha256:...` (default tag) | Scrapes api `/metrics` + kube-state-metrics; remote-writes to obs.88oakapps.com | +| `kube-state-metrics` | 1 | `kube-state-metrics@sha256:...` | In `kube-system`, scraped by vmagent for `kube_*` cluster-state metrics | +| `alloy-logs` (DaemonSet) | 3 (1/node) | `grafana/alloy@sha256:...` | Tails `/var/log/pods/*` and ships to Loki at obs.88oakapps.com | + +The Asynq scheduler inside `worker` registers these cron jobs: + +| Cron | Job | Notes | +|---|---|---| +| `0 * * * *` | Smart reminder check (per-user hour) | Default user hour: 14:00 UTC | +| `0 * * * *` | Daily digest check (per-user hour) | Default user hour: 03:00 UTC | +| `0 10 * * *` | Onboarding emails | 10:00 UTC | +| `0 3 * * *` | Reminder log cleanup | 03:00 UTC | +| `30 * * * *` | Pending uploads cleanup | xx:30 every hour | + +### External dependencies + +| Service | Endpoint | Purpose | Failure mode | +|---|---|---|---| +| Neon Postgres | `ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech:5432` | App data. Pooler endpoint (transaction-mode PgBouncer in front of Neon compute) so connections stay warm. | api / worker pods crash-loop with `dial tcp: connection refused`. Health endpoint returns `postgres: error`. | +| Backblaze B2 (S3-compatible) | `s3.us-east-005.backblazeb2.com` (bucket `honeyDueProd`) | User uploads (photos, PDFs, completion attachments) | Upload routes return 5xx; reads of cached/static files still work. | +| Cloudflare | `myhoneydue.com` zone | DNS + TLS termination + edge cache + DDoS | Traffic stops reaching origin. Direct `https://51.81.x.x` still works for diagnostics. | +| obs.88oakapps.com | Operator-run Grafana + VictoriaMetrics + Loki | Metrics & logs | vmagent + alloy-logs back off and retry. No app-side impact. | +| Apple APNs | `api.push.apple.com:443` (production) | iOS push notifications | Push fails; circuit breaker opens; failure logged. App functionality unaffected. 
| +| Fastmail SMTP | `smtp.fastmail.com:587` | Transactional emails (verification, recovery, digests) | Email send fails in the worker; logged; user reset/digest flow degrades. | +| Gitea registry | `gitea.treytartt.com` | Container image registry | Deploys can't pull. Existing pods keep running on cached images. | + +--- + +## 3. Network and firewall + +### Per-node `ufw` configuration + +Applied during install (same on all 3 nodes): + +``` +default deny incoming +default allow outgoing +allow 22/tcp (SSH, world) +allow 80/tcp (HTTP via Cloudflare, world — see GAP-1) +allow 443/tcp (HTTPS, same — GAP-1) +allow 6443/tcp (k3s API, world, token-protected) +allow 2379:2380/tcp from (etcd client + peer) +allow 10250/tcp from (kubelet) +allow 51820/udp from (WireGuard tunnel) +allow 8472/udp from (VXLAN, defense-in-depth fallback) +``` + +To inspect: `ssh ovhcloudN sudo ufw status numbered`. + +### Cluster networking + +- **Pod CIDR**: `10.42.0.0/16` (default k3s) +- **Service CIDR**: `10.43.0.0/16` (default k3s) +- **Flannel backend**: WireGuard-native. Each node hosts a `flannel-wg` interface on UDP 51820 and tunnels pod traffic to peers. Verify: `ssh ovhcloudN ip -d link show flannel-wg`. + +### Traefik ingress flow + +``` +Cloudflare → node:80/443 (public) + → klipper-lb svclb-traefik DaemonSet pod (hostPort:80/443) + → Traefik Service (ClusterIP 10.43.245.127:80/443) + → Traefik Deployment pod (single replica) + → matches Ingress host rule (api.myhoneydue.com etc.) + → routes to backend Service (api / web / admin) + → backend Pod +``` + +The Traefik default also lives in `kube-system` and is managed by k3s's +HelmChart. **No HelmChartConfig override is applied on OVH** (unlike Hetzner +— see §10). + +--- + +## 4. DNS configuration (Cloudflare) + +The `myhoneydue.com` zone in Cloudflare has these public records. **All +hostnames are proxied (orange cloud)** — required by the `cloudflare-only` +Traefik middleware which 403s any non-CF source IP. 
+ +| Host | Type | Values | Proxy | +|---|---|---|---| +| `api.myhoneydue.com` | A × 3 | `51.81.83.33`, `51.81.87.86`, `51.81.85.248` | Proxied | +| `app.myhoneydue.com` | A × 3 | (same trio) | Proxied | +| `admin.myhoneydue.com` | A × 3 | (same trio) | Proxied | +| `myhoneydue.com` (apex `@`) | A × 3 | (same trio) | Proxied | + +Cloudflare round-robins among the 3 origins, klipper-lb on whichever node CF +hits forwards to Traefik, and Traefik routes by Host header. Per-request, +effectively load-balanced across the 3 nodes for ingress, with no central LB. + +**SSL/TLS mode**: Flexible (CF terminates TLS at the edge; origin is plain +HTTP on `:80`). Upgrading to Full (strict) is on the deferred list — would +need an origin certificate provisioned to `cloudflare-origin-cert` secret and +Traefik configured for TLS termination. + +--- + +## 5. Filesystem layout (`deploy-k3s/`) + +``` +deploy-k3s/ +├── config.yaml # Single config source (gitignored; contains tokens) +├── config.yaml.example # Template +├── kubeconfig # OVH admin kubeconfig (gitignored, 0600) +├── kubeconfig.hetzner.bak # Old Hetzner kubeconfig (unreachable, kept for history) +├── kubeconfig.tunnel # Optional: localhost-pointing copy for SSH-tunnel use +├── secrets/ +│ ├── README.md +│ ├── postgres_password.txt # Neon DB password +│ ├── secret_key.txt # 32+ char app-token signing secret +│ ├── email_host_password.txt # Fastmail SMTP app password +│ ├── fcm_server_key.txt # FCM server key (currently unused — Android push disabled) +│ ├── apns_auth_key.p8 # APNs auth key (binary) +│ ├── cloudflare-origin.crt # Origin certificate (currently unused — CF Flexible) +│ └── cloudflare-origin.key +│ (all gitignored except README.md) +├── manifests/ +│ ├── namespace.yaml +│ ├── network-policies.yaml # default-deny + per-app egress/ingress (13 NetPols total) +│ ├── rbac.yaml # api/worker/admin/web/redis ServiceAccounts (NOT applied by 03-deploy.sh; manual once) +│ ├── pod-disruption-budgets.yaml # api-pdb, 
web-pdb, worker-pdb (NOT applied by 03-deploy.sh; manual once) +│ ├── traefik-helmchartconfig.yaml # Hetzner-only DaemonSet+hostNetwork override (do NOT apply on OVH; we use default klipper-lb) +│ ├── kyverno-verify-images.yaml # Operator-gated policy (do NOT apply blindly — see file comment) +│ ├── api/{deployment,service,hpa}.yaml +│ ├── worker/deployment.yaml +│ ├── admin/{deployment,service}.yaml +│ ├── web/{deployment,service}.yaml +│ ├── redis/{deployment,service,pvc}.yaml +│ ├── ingress/{middleware,ingress-simple}.yaml +│ ├── migrate/job.yaml # goose migration Job (image-subbed at deploy time) +│ ├── observability/{kube-state-metrics,vmagent,alloy-logs}.yaml +│ └── kratos/ # Ory Kratos identity service (NOT yet deployed; gated on operator OIDC setup) +└── scripts/ + ├── _config.sh # Sourced by all scripts: cfg(), generate_env(), generate_cluster_config() + ├── 01-provision-cluster.sh # Hetzner-Cloud-specific (uses hetzner-k3s CLI) — DO NOT RUN ON OVH + ├── 02-setup-secrets.sh # Creates honeydue-secrets etc. from secrets/ + config.yaml; kubeconfig-driven + ├── 03-deploy.sh # Build + push + apply manifests + roll deployments; kubeconfig-driven + ├── 04-verify.sh # Post-deploy health + security checks; kubeconfig-driven + └── rollback.sh # `kubectl rollout undo` across all deployments +``` + +The `deploy/prod.env` file (sibling to `deploy-k3s/`, gitignored) holds +observability + admin credentials that `02/03-deploy.sh` read but never +display: + +``` +OBS_INGEST_URL (https://obs.88oakapps.com/api/v1/write) +OBS_TRACES_URL (https://obs.88oakapps.com/v1/traces) +OBS_INGEST_TOKEN (bearer token for VM + Loki + traces — all use same token) +GRAFANA_URL (https://grafana.88oakapps.com) +GRAFANA_ADMIN_USER (admin) +GRAFANA_ADMIN_PASSWORD +ADMIN_EMAIL / ADMIN_PASSWORD (in-app admin login) +``` + +--- + +## 6. Install from clean boxes — the truthful procedure + +This is what we ran on 2026-06-03 to stand up the live cluster, exactly. 
If +you ever rebuild from zero this is the canonical sequence. Total wall-clock: +~12 min for cluster bootstrap; ~10 min for workloads. + +### 6.1 Prerequisites + +- 3 fresh Ubuntu VPS instances (any provider with public IPv4, ≥4 GB RAM, + ≥40 GB disk) +- `~/.ssh/config` entries (`ovhcloud1/2/3`) pointing at them, with + passwordless sudo +- Local `kubectl` and `curl` +- The repo's `deploy-k3s/secrets/` populated (or the ability to copy live + secrets from another running cluster — see §7.2) +- `deploy/prod.env` populated with obs token + Grafana creds + +### 6.2 Per-node OS hardening + firewall (all 3 in parallel) + +For each `ovhcloudN`, over SSH: + +```sh +export DEBIAN_FRONTEND=noninteractive +sudo apt-get update -qq +sudo apt-get install -y -qq fail2ban unattended-upgrades open-iscsi curl ufw +sudo systemctl enable --now iscsid fail2ban +sudo dpkg-reconfigure -f noninteractive -plow unattended-upgrades + +sudo ufw --force reset +sudo ufw default deny incoming +sudo ufw default allow outgoing +sudo ufw allow 22/tcp +sudo ufw allow 80/tcp +sudo ufw allow 443/tcp +sudo ufw allow 6443/tcp +SELF=$(hostname -I | awk '{print $1}') +for peer in 51.81.83.33 51.81.87.86 51.81.85.248; do + [ "$peer" = "$SELF" ] && continue + sudo ufw allow from "$peer" to any port 2379:2380 proto tcp + sudo ufw allow from "$peer" to any port 10250 proto tcp + sudo ufw allow from "$peer" to any port 51820 proto udp + sudo ufw allow from "$peer" to any port 8472 proto udp +done +sudo ufw --force enable +``` + +**Watch ordering:** `allow 22/tcp` MUST precede `ufw enable`. Existing SSH +sessions survive (`ufw` only affects new connections), but a misordered script +locks you out of fresh logins. 
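The peer-scoped part of the firewall loop in §6.2 can be reviewed before it touches a node. The following is a minimal sketch, not part of the repo: `peer_rules` is a hypothetical helper that only prints the `ufw` commands the loop would run for one node, so the output can be checked against the table in §3 first.

```sh
# Print (do not run) the peer-scoped ufw rules for one node.
# Hypothetical helper for review purposes; not shipped in deploy-k3s/scripts/.
# Usage: peer_rules SELF_IP PEER_IP...
peer_rules() {
  self="$1"; shift
  for peer in "$@"; do
    # A node never needs a rule for its own address.
    [ "$peer" = "$self" ] && continue
    echo "ufw allow from $peer to any port 2379:2380 proto tcp"  # etcd client + peer
    echo "ufw allow from $peer to any port 10250 proto tcp"      # kubelet
    echo "ufw allow from $peer to any port 51820 proto udp"      # WireGuard tunnel
    echo "ufw allow from $peer to any port 8472 proto udp"       # VXLAN fallback
  done
}
```

Calling `peer_rules 51.81.83.33 51.81.83.33 51.81.87.86 51.81.85.248` should print eight rules (four per remote peer) and none for the node's own address. Printing rather than executing keeps the `allow 22/tcp` ordering concern separate from the peer rules.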
+ +### 6.3 Install k3s on `ovhcloud1` (the init node) + +```sh +ssh ovhcloud1 'curl -sfL https://get.k3s.io | \ + INSTALL_K3S_VERSION=v1.34.6+k3s1 \ + sh -s - server \ + --cluster-init \ + --node-ip=51.81.83.33 \ + --node-external-ip=51.81.83.33 \ + --advertise-address=51.81.83.33 \ + --flannel-backend=wireguard-native \ + --flannel-external-ip \ + --secrets-encryption \ + --write-kubeconfig-mode=0600 \ + --tls-san=51.81.83.33 \ + --tls-san=51.81.87.86 \ + --tls-san=51.81.85.248 \ + --disable-cloud-controller' +``` + +Wait for `sudo k3s kubectl get nodes` to show this node Ready (~2-5 s). +Read the cluster token: + +```sh +ssh ovhcloud1 'sudo cat /var/lib/rancher/k3s/server/node-token' +``` + +### 6.4 Join `ovhcloud2`, then `ovhcloud3` (sequential) + +Joining etcd one node at a time avoids split-brain on slow networks. +Replace `` with the value from §6.3. + +For `ovhcloud2`: + +```sh +ssh ovhcloud2 'curl -sfL https://get.k3s.io | \ + INSTALL_K3S_VERSION=v1.34.6+k3s1 \ + K3S_TOKEN= \ + sh -s - server \ + --server=https://51.81.83.33:6443 \ + --node-ip=51.81.87.86 \ + --node-external-ip=51.81.87.86 \ + --advertise-address=51.81.87.86 \ + --flannel-backend=wireguard-native \ + --flannel-external-ip \ + --secrets-encryption \ + --write-kubeconfig-mode=0600 \ + --tls-san=51.81.83.33 --tls-san=51.81.87.86 --tls-san=51.81.85.248 \ + --disable-cloud-controller' +``` + +Then repeat for `ovhcloud3` with `--node-ip`, `--node-external-ip`, and +`--advertise-address` all set to `51.81.85.248`. After each join, wait for +`kubectl get nodes` to show the new node Ready before proceeding.
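The "wait for Ready before proceeding" step in §6.4 can be made explicit instead of eyeballing `kubectl get nodes`. This is a minimal sketch under assumptions: `wait_node_ready` is a hypothetical wrapper (not in the repo), the 180 s default timeout is arbitrary, and it presumes a `kubectl` that can already reach the cluster. During bootstrap, before §6.5, that probably means running the equivalent `sudo k3s kubectl wait ...` on `ovhcloud1`.

```sh
# Block until the named node reports Ready, or fail after the timeout.
# Hypothetical wrapper around `kubectl wait`; the default timeout is an assumption.
# Usage: wait_node_ready NODE_NAME [TIMEOUT]
wait_node_ready() {
  node="$1"
  timeout="${2:-180s}"
  # Exits non-zero if the Ready condition is not met within the timeout.
  kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout"
}
```

For example, `wait_node_ready vps-c0f51be2 && echo "safe to join ovhcloud3"` would gate the third join on the second node being Ready.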
+ +### 6.5 Pull kubeconfig to the operator workstation + +```sh +ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \ + | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \ + > deploy-k3s/kubeconfig +chmod 600 deploy-k3s/kubeconfig +export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig +kubectl get nodes -o wide # All 3 Ready, INTERNAL-IP = public IP +``` + +### 6.6 Label the redis node + +```sh +kubectl label node vps-1624d691 honeydue/redis=true --overwrite +``` + +(Use whichever k8s node name corresponds to `ovhcloud1`. The Redis +Deployment's `nodeSelector` binds to this label.) + +### 6.7 Bootstrap manifests NOT applied by `03-deploy.sh` + +These must be applied manually on a fresh cluster, **before** running +`03-deploy.sh`, or workloads will fail to schedule: + +```sh +kubectl apply -f deploy-k3s/manifests/rbac.yaml +kubectl apply -f deploy-k3s/manifests/pod-disruption-budgets.yaml +``` + +`rbac.yaml` creates the 5 ServiceAccounts (`api`, `worker`, `admin`, `web`, +`redis`) referenced by the Deployment manifests. Without these, ReplicaSets +hang on `FailedCreate: error looking up service account` and pods never +start. Symptom on first deploy: `kubectl get deploy` shows `0 up-to-date` +across the board with no pod activity — see §9 *Gotchas*. + +**Do NOT apply** `traefik-helmchartconfig.yaml` (Hetzner-only — see §10) or +`kyverno-verify-images.yaml` (gated on operator Kyverno install). + +### 6.8 Seed secrets + +Two paths; pick whichever fits your situation: + +**Path A — clean install from local files** (the original design): + +```sh +KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/02-setup-secrets.sh +``` + +Requires `deploy-k3s/secrets/` to contain real `postgres_password.txt`, +`secret_key.txt`, `email_host_password.txt`, `fcm_server_key.txt`, +`apns_auth_key.p8`, `cloudflare-origin.crt`, `cloudflare-origin.key`. The +script reads `config.yaml` for `registry.*`, `redis.password`, +`admin.basic_auth_*`, and `storage.b2_*`. 
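Because Path A depends on seven files being present, a preflight check can catch an empty `secrets/` directory before `02-setup-secrets.sh` runs. The sketch below is illustrative only: `check_secret_files` is a hypothetical helper, and treating an empty file as missing is an assumption about what "real" means here. It does not validate the file contents.

```sh
# Report any required secret file that is missing or empty.
# Hypothetical preflight helper; the file list is copied from the Path A text.
# Usage: check_secret_files DIR   (returns non-zero if anything is missing)
check_secret_files() {
  dir="$1"
  missing=0
  for f in postgres_password.txt secret_key.txt email_host_password.txt \
           fcm_server_key.txt apns_auth_key.p8 \
           cloudflare-origin.crt cloudflare-origin.key; do
    # -s is true only when the file exists and has a size greater than zero.
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is `check_secret_files deploy-k3s/secrets && ./deploy-k3s/scripts/02-setup-secrets.sh`, which skips the script when any file is absent.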
+ +**Path B — clone live secrets from another running cluster** (what we +actually did during the migration; useful if `secrets/` is empty or you want +exact-byte equivalence): + +```sh +HETZNER=$(pwd)/deploy-k3s/kubeconfig.hetzner.bak # or any kubeconfig with the secrets +OVH=$(pwd)/deploy-k3s/kubeconfig +kubectl --kubeconfig=$OVH apply -f deploy-k3s/manifests/namespace.yaml +for S in honeydue-secrets honeydue-apns-key gitea-credentials cloudflare-origin-cert admin-basic-auth; do + kubectl --kubeconfig=$HETZNER -n honeydue get secret $S -o json \ + | python3 -c " +import json, sys +d = json.load(sys.stdin) +m = d['metadata'] +for k in ('uid','resourceVersion','creationTimestamp','generation','managedFields','ownerReferences','selfLink'): + m.pop(k, None) +m.pop('annotations', None) +print(json.dumps(d))" \ + | kubectl --kubeconfig=$OVH apply -f - +done +``` + +After either path, verify: + +```sh +kubectl -n honeydue get secrets +# Expect: admin-basic-auth, cloudflare-origin-cert, gitea-credentials, +# honeydue-apns-key, honeydue-secrets +``` + +### 6.9 Deploy workloads + +```sh +KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig \ + ./deploy-k3s/scripts/03-deploy.sh --skip-build --tag latest +``` + +- `--skip-build` skips Docker build + push, deploys whatever's already in the + registry at the named tag. Use this when migrating between clusters to + guarantee both run identical bits. +- Without flags it builds the api / worker / admin / web images from the + local repo HEAD and pushes to `gitea.treytartt.com` first. +- The script applies (in order): namespace, network-policies (13 of them), + redis, ingress, then runs the goose migration Job (blocking on success), + then api / worker / admin / web Deployments, then observability + (kube-state-metrics, vmagent, alloy-logs). +- It does NOT apply: `rbac.yaml`, `pod-disruption-budgets.yaml`, + `traefik-helmchartconfig.yaml`, `kyverno-verify-images.yaml`. 
The first + two must be applied manually (see §6.7); the latter two are Hetzner-only + or operator-gated. +- It does NOT apply: anything under `kratos/` (skipped until + `kratos-secrets` exists, which requires real OIDC client IDs). + +### 6.10 Verify + +```sh +KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/04-verify.sh +``` + +Expect: all deployments `READY=desired`, 13 NetworkPolicies, 7 ServiceAccounts +(api, worker, admin, web, redis, vmagent, alloy-logs), 3 PDBs, cloudflare-only +middleware present, in-cluster `/api/health/` returns 200. + +External smoke test (DNS-aware, but the api `/health/` route is exempt from +the cloudflare-only middleware so direct-IP works for diagnostics): + +```sh +for IP in 51.81.83.33 51.81.87.86 51.81.85.248; do + curl -s -o /dev/null -w "$IP -> %{http_code}\n" \ + -H 'Host: api.myhoneydue.com' http://$IP/api/health/ +done +# All three should return 200. +``` + +### 6.11 DNS cutover (if migrating) + +In the Cloudflare dashboard for `myhoneydue.com`, set the 4 hostnames in §4 to +the OVH IPs and keep proxied. Effective propagation ~30 s to 5 min through +the Cloudflare proxy. + +If you have a previous cluster, **scale its worker to 0 before flipping** to +avoid scheduled-job double-fires: + +```sh +KUBECONFIG= kubectl -n honeydue scale deploy/worker --replicas=0 +# (cut DNS) +KUBECONFIG= kubectl -n honeydue scale deploy/worker --replicas=1 +``` + +Run those last two lines back-to-back. Worker work is mostly scheduled +(hourly+), so a brief gap is harmless; overlap would cause duplicate emails. + +--- + +## 7. 
Day-to-day operations + +### Common kubectl one-liners + +```sh +export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig + +# Cluster state +kubectl get nodes -o wide +kubectl -n honeydue get pods +kubectl -n honeydue get deploy +kubectl top nodes +kubectl -n honeydue top pods + +# Tail logs +kubectl -n honeydue logs deploy/api -f --tail=50 +kubectl -n honeydue logs -l app.kubernetes.io/name=api -f --tail=20 +stern -n honeydue api # if stern is installed (multi-pod) + +# Restart a deployment (no image change, picks up ConfigMap changes) +kubectl -n honeydue rollout restart deploy/api + +# Rollback one revision +kubectl -n honeydue rollout undo deploy/api + +# Scale (worker MUST stay at 0 or 1) +kubectl -n honeydue scale deploy/api --replicas=4 + +# Get into a pod +kubectl -n honeydue exec -it deploy/api -- sh +``` + +### Redeploy after code changes + +```sh +KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/03-deploy.sh +``` + +Builds images from local HEAD, tags with the git short SHA, pushes to Gitea, +runs `goose up` (idempotent), rolls api/worker/admin/web. Total: ~3-5 min +when images change. + +To deploy without rebuilding (pin to a specific tag): + +```sh +./deploy-k3s/scripts/03-deploy.sh --skip-build --tag +``` + +### Migrations + +Goose migrations live in `migrations/`. New file pattern: + +``` +make migrate-new name=add_foo_column # generates migrations/YYYYMMDDHHMMSS_add_foo_column.sql +# Edit the file with -- +goose Up / -- +goose Down sections +``` + +`03-deploy.sh` runs a one-shot Job (`manifests/migrate/job.yaml`) that +executes `goose up` against Neon (direct compute endpoint, not pooler — see +file comment). The Job blocks api/worker rollout and aborts the deploy on +failure. No app pod runs `AutoMigrate`; api/worker startup verifies +`goose_db_version` is current and refuses to boot on mismatch. 
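Since a failed migration Job aborts the whole deploy, it can help to sanity-check a new migration file locally first. The following is a rough sketch, not an existing repo tool: `lint_migration` is a hypothetical helper, and the filename pattern (14-digit timestamp, underscore, name, `.sql`) is inferred from the `YYYYMMDDHHMMSS_add_foo_column.sql` example above. It only checks the name and the presence of both goose annotations, not the SQL itself.

```sh
# Check one goose SQL migration for the expected name and annotations.
# Hypothetical lint helper; the filename pattern is an assumption.
# Usage: lint_migration PATH
lint_migration() {
  file="$1"
  name=$(basename "$file")
  # Expect a 14-digit timestamp prefix, e.g. 20260603093435_add_foo_column.sql
  if ! printf '%s\n' "$name" | grep -Eq '^[0-9]{14}_[A-Za-z0-9_]+\.sql$'; then
    echo "bad name: $name"
    return 1
  fi
  # -F matches the annotation literally; `--` stops option parsing.
  grep -qF -- '-- +goose Up' "$file"   || { echo "no Up section: $name"; return 1; }
  grep -qF -- '-- +goose Down' "$file" || { echo "no Down section: $name"; return 1; }
  echo "ok: $name"
}
```

Running it over every file, for instance `for f in migrations/*.sql; do lint_migration "$f" || break; done`, stops at the first file that lacks a Down section.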
+ +### Grafana + +URL: https://grafana.88oakapps.com (creds in `deploy/prod.env`) + +Three dashboards in the `honeyDue` folder: + +| UID | Title | Use | +|---|---|---| +| `honeydue-eli5-overview` | honeyDue — Overview (ELI5) | Single-screen at-a-glance health: pods up, crashes, errors, RPS, latency, Postgres, memory, top endpoints, push failures, worker activity, recent error logs. Created 2026-06-03. | +| `honeydue-red` | honeyDue API — RED | Rate/Errors/Duration cuts (legacy) | +| `honeydue-logs` | honeyDue — Production Logs | Live log explorer | + +For the ELI5 dashboard's queries, **api-side metrics use `service="api"`, +NOT `namespace="honeydue"`.** vmagent's scrape config drops the namespace +label from api metrics — only `service`, `pod`, `node`, `job`, plus the +metric's own labels (route, method, status, etc.) survive. Queries that +filter on `namespace="honeydue"` for api metrics silently match nothing. + +### kubectl tunnel (if 6443 is firewalled to your IP) + +Currently `6443` is open WAN-side (matching the previous Hetzner posture). +If you tighten that to operator-IPs-only and your IP changes, use an SSH +tunnel: + +```sh +ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \ + -i ~/.ssh/ovhcloud \ + -L 127.0.0.1:6443:127.0.0.1:6443 \ + ubuntu@51.81.83.33 + +cp deploy-k3s/kubeconfig deploy-k3s/kubeconfig.tunnel +sed -i.bak 's|https://51.81.83.33:6443|https://127.0.0.1:6443|' deploy-k3s/kubeconfig.tunnel +export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel" +``` + +--- + +## 8. Disaster recovery + +### "I lost the kubeconfig" + +```sh +ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \ + | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \ + > deploy-k3s/kubeconfig +chmod 600 deploy-k3s/kubeconfig +``` + +If `ovhcloud1` is down but `ovhcloud2` or `3` is up, swap host and IP — the +TLS SAN covers all three. 
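The "swap host and IP" step can be parameterized so the same pipeline works against any surviving node. This is a small sketch with a hypothetical function name (`point_kubeconfig_at`); it applies the same `sed` substitution as the recovery command above, with the target IP passed as an argument.

```sh
# Rewrite the loopback API address in a k3s kubeconfig read from stdin.
# Hypothetical helper; same substitution as the recovery command, parameterized.
# Usage: ... | point_kubeconfig_at NODE_IP
point_kubeconfig_at() {
  ip="$1"
  sed "s|server: https://127.0.0.1:6443|server: https://$ip:6443|"
}
```

With `ovhcloud1` down, the recovery would then look like `ssh ovhcloud2 'sudo cat /etc/rancher/k3s/k3s.yaml' | point_kubeconfig_at 51.81.87.86 > deploy-k3s/kubeconfig`, followed by the usual `chmod 600`.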
+ +### "A node is unresponsive" + +```sh +kubectl drain vps-XXX --ignore-daemonsets --delete-emptydir-data +# Reboot via OVH manager or: +ssh ovhcloudN sudo reboot +# Wait for Ready, then: +kubectl uncordon vps-XXX +``` + +The cluster tolerates 1 node down (etcd quorum 2/3). With 2 down, etcd +loses quorum and the API server stops accepting writes. + +### "etcd quorum lost (2+ nodes dead)" + +Bring nodes back online if possible. If not: + +```sh +ssh ovhcloud1 'sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/' +``` + +k3s takes automatic etcd snapshots every 12h, keeping 5. List with: + +```sh +ssh ovhcloud1 sudo ls -la /var/lib/rancher/k3s/server/db/snapshots/ +``` + +This is destructive — workload state since the snapshot is lost, but Neon +(actual app data) is unaffected. + +### "I have to rebuild the whole cluster from scratch" + +Provision 3 fresh boxes, then exactly the sequence in §6. End-to-end is +~30 min. The dependencies that make this possible: + +| Stays put through rebuild | Where | +|---|---| +| Application data | Neon Postgres (managed) | +| User uploads | Backblaze B2 (managed) | +| Container images | `gitea.treytartt.com` (self-hosted, but not on the OVH cluster) | +| Operator secrets | `deploy-k3s/secrets/` + `config.yaml` + `deploy/prod.env` on the operator workstation (gitignored) | +| DNS | Cloudflare control panel | + +If `gitea.treytartt.com` is on the same OVH cluster, you have a circular +dependency — rebuilding requires images you can't pull until the cluster is +up. Currently Gitea is NOT in the honeyDue cluster (separate Hetzner-era +host), so this isn't a problem today, but worth flagging if that ever +changes. + +### "Cutover back to Hetzner / failover to a backup cluster" + +There is **no warm standby today.** Bringing up a second cluster is the +same §6 procedure on different hardware, then a Cloudflare DNS swap. 
The +worker-swap dance is critical: + +```sh +KUBECONFIG= kubectl -n honeydue scale deploy/worker --replicas=0 +# (Update Cloudflare DNS to new cluster's IPs — proxied) +KUBECONFIG= kubectl -n honeydue scale deploy/worker --replicas=1 +``` + +--- + +## 9. Known gotchas + +### 9.1 First-deploy "0 up-to-date" across all Deployments + +**Symptoms:** `kubectl get deploy` shows `READY 0/N, UP-TO-DATE 0` for +api/worker/admin/web/redis. `kubectl get events` shows +`FailedCreate: error looking up service account honeydue/: serviceaccount "..." not found`. + +**Cause:** `rbac.yaml` (ServiceAccounts) is NOT applied by `03-deploy.sh`. On +a fresh cluster the SAs don't exist; the ReplicaSet controller can't create +pods. + +**Fix:** + +```sh +kubectl apply -f deploy-k3s/manifests/rbac.yaml +kubectl -n honeydue rollout restart deploy/api deploy/worker deploy/admin deploy/web deploy/redis +``` + +This was hit during the 2026-06-03 OVH bootstrap. Permanently fix by adding +`kubectl apply -f rbac.yaml` to `03-deploy.sh` between the namespace and +network-policies apply, but until that lands, follow §6.7 on every fresh +cluster. + +### 9.2 vmagent SD broken on fresh deploy ("0 pods up" in Grafana) **Symptoms:** - Grafana panels using `kube_*` metrics or `up{job=...}` show 0 -- vmagent logs: `dial tcp 10.43.0.1:443: connect: connection refused` - repeating every ~30s -- Direct test from a pod also refused: `kubectl -n honeydue exec deploy/vmagent - -- wget --no-check-certificate -qO- -T 3 https://10.43.0.1:443/livez` +- vmagent logs: `dial tcp 10.43.0.1:443: connect: connection refused` every ~30 s +- Direct test from a pod also refused -**Cause:** k3s's built-in NetworkPolicy controller evaluates egress rules -**after** kube-proxy's DNAT, not before (contrary to the k8s spec). -Traffic from a pod to the `kubernetes` Service (ClusterIP `10.43.0.1:443`) -gets DNAT'd to `:6443`, and **then** the policy check -runs. 
Without an explicit egress rule for `:6443`, the packet is rejected -with a TCP RST → "connection refused". +**Cause:** k3s's NetworkPolicy controller evaluates egress rules *after* +kube-proxy's DNAT (not before, contrary to spec). Pod-to-`kubernetes`-Service +(`10.43.0.1:443`) gets DNAT'd to `:6443`, *then* the policy check +runs. Without an explicit egress rule for `:6443`, the packet is rejected. The `allow-egress-from-vmagent` NetPol in `network-policies.yaml` includes both rules: ```yaml -# Pre-DNAT view (correct per spec; harmless if unused) - to: - ipBlock: { cidr: 10.43.0.0/16 } ports: - { port: 443, protocol: TCP } -# Post-DNAT path (what k3s NetPol enforcer actually sees) — REQUIRED - to: - ipBlock: cidr: 0.0.0.0/0 @@ -77,59 +767,40 @@ both rules: - { port: 6443, protocol: TCP } ``` -**If this happens on a fresh deploy:** confirm `network-policies.yaml` -was applied: -```bash -kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml +**If this happens:** confirm `network-policies.yaml` was applied: + +```sh +kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443 ``` -Look for the port-6443 egress rule. If missing, the apply step in -`03-deploy.sh` was skipped or the file was edited and the rule got -dropped. -**Counter-evidence that confirms diagnosis:** kube-state-metrics in -`kube-system` works fine, because `kube-system` has no NetPols. So if -ksm is healthy but workloads in `honeydue` can't reach the apiserver -ClusterIP, this gotcha is the cause. +Counter-evidence that confirms diagnosis: `kube-state-metrics` in +`kube-system` works fine (no NetPols in that namespace). ---- +### 9.3 vmagent appears healthy but no data in Grafana -### vmagent appears healthy but no data in Grafana +vmagent's `/-/healthy` returns 200 as long as the process is alive and +remote-write is TCP-functional. It doesn't check that scrapes are actually +*succeeding*. 
The liveness probe in `vmagent.yaml` queries `/api/v1/targets` +and fails the pod if no target is `up`. After ~3 failures (~3 min), kubelet +recycles it. -vmagent's `/-/healthy` endpoint returns 200 as long as the process is -alive and remote-write is functional (TCP-level) — it does **not** -check whether scrapes are succeeding. We saw this fail once: vmagent -was "healthy" for 17 days while having zero healthy targets due to a -broken k8s SD watch. +If vmagent runs for weeks but Grafana is empty, the probe was disabled or +the exec command broke. -The liveness probe in `vmagent.yaml` queries the agent's `/api/v1/targets` -endpoint and fails the pod if no target is in state `up`. After 3 -consecutive failures (~3 min), kubelet recycles the pod and SD restarts -clean. +### 9.4 vmagent bearer token destroyed by direct `kubectl apply` -**Verify it's working:** `kubectl -n honeydue describe pod -l app.kubernetes.io/name=vmagent` -should show `Liveness: exec [sh -c ...]`. If you ever see vmagent running -for weeks but no metrics in Grafana, the probe was disabled or the exec -command broke. +The committed `vmagent.yaml` has `bearer_token: TOKEN_PLACEHOLDER`. The real +token is `sed`-substituted at deploy time by `03-deploy.sh`. Applying the +file directly: ---- - -### vmagent's bearer token got blown away after `kubectl apply -f vmagent.yaml` - -The committed `vmagent.yaml` has `bearer_token: TOKEN_PLACEHOLDER`. The -real token is sed-substituted at deploy time by `03-deploy.sh`. If you -ever apply `vmagent.yaml` directly: - -```bash +```sh kubectl apply -f deploy-k3s/manifests/observability/vmagent.yaml # WRONG ``` -the Secret gets overwritten with the literal string `TOKEN_PLACEHOLDER` -and all remote-writes start returning 401 from obs.88oakapps.com. +overwrites the Secret with the literal `TOKEN_PLACEHOLDER` and remote-writes +401. To restore without a full redeploy: -**To restore without a full redeploy** (the safe inline path): - -```bash -export KUBECONFIG=... 
+```sh OBS_TOKEN_B64=$(kubectl -n honeydue get secret honeydue-secrets \ -o jsonpath='{.data.OBS_INGEST_TOKEN}') kubectl -n honeydue patch secret vmagent-remote-write --type=json \ @@ -137,126 +808,159 @@ kubectl -n honeydue patch secret vmagent-remote-write --type=json \ kubectl -n honeydue rollout restart deploy/vmagent ``` -The OBS token also lives in `honeydue-secrets.OBS_INGEST_TOKEN` because -the api pods use it for traces — same secret, same value. +Or just re-run `./deploy-k3s/scripts/03-deploy.sh` — the sed handles it. -**Or just re-run the deploy:** `./deploy-k3s/scripts/03-deploy.sh`. The -sed step handles the substitution correctly. +### 9.5 Dashboard queries: api metrics need `service="api"` not `namespace="honeydue"` + +vmagent's scrape config (`vmagent-config` ConfigMap) explicitly chooses which +Kubernetes pod-metadata labels to copy onto each scraped series. **Namespace +isn't one of them.** Labels you can use on api-side metrics: + +- `service` (literal `"api"`) +- `job` (literal `"api"`) +- `pod` (the api pod name) +- `node` (the k8s node name) +- `cluster` (vmagent external_label, currently `"honeydue-k3s"`) +- `environment` (vmagent external_label, currently `"prod"`) +- Plus each metric's own labels (`method`, `route`, `status` for HTTP; etc.) + +`kube_*` metrics from kube-state-metrics DO carry `namespace` natively +(KSM publishes it as a label, vmagent passes it through). Loki streams have +`namespace` because alloy-logs explicitly relabels it. So the rule is: + +| Metric prefix | Use | +|---|---| +| `kube_*` | `namespace="honeydue"` | +| `http_*`, `gorm_*`, `go_*`, `process_*` (api) | `service="api"` | +| Loki logs `{...}` | `namespace="honeydue"` | + +### 9.6 Cluster-label collision when two clusters run together + +Both Hetzner and OVH vmagents push as `cluster=honeydue-k3s, environment=prod` +(same external_labels). During the migration overlap this made dashboards +sum both clusters' data. 
The simplest narrowing during overlap is by node +name pattern (`node=~"vps-.*"` for OVH, `node=~"ubuntu-.*"` for Hetzner). If +you ever bring up a backup cluster long-term, change one cluster's +`external_labels.cluster` to something distinct (e.g. `honeydue-ovh` +vs. `honeydue-backup`). + +### 9.7 Worker double-firing scheduled jobs + +If two `worker` Deployments run concurrently (e.g. two clusters both pointing +at the same Neon DB), Asynq schedulers each fire crons independently — users +get duplicate emails. Workaround: scale all-but-one worker to 0. This is the +exact mechanic used during cutovers (§6.11). + +### 9.8 Node kubeconfig mode + +`/etc/rancher/k3s/k3s.yaml` on each node is mode `0600` because we install +with `--write-kubeconfig-mode=0600`. Tightening from k3s default (0644) was +intentional. Don't change without coordinating — any tooling on the node +that expects to read it (none today) will break. --- -### Node kubeconfig is world-readable +## 10. Differences from MIGRATION_NOTES.md (Hetzner-era) -`/etc/rancher/k3s/k3s.yaml` is mode `0644` per the `--write-kubeconfig-mode=644` -k3s install flag. Any process on the host (including any container that -mounts the host filesystem) can read full cluster-admin credentials. +`MIGRATION_NOTES.md` documents the Swarm → k3s migration on Hetzner +(2026-04-24). Most of it still applies, with these OVH-specific deltas: -This is intentional for the deploy user but worth knowing — any container -escape becomes immediate cluster-admin. Tracked as finding **F4** in -`k3_audit_5_12.md`. - -To tighten (if you ever turn this knob): change to `--write-kubeconfig-mode=600` -in the k3s install command, then re-fetch `deploy-k3s/kubeconfig`. 
+| What MIGRATION_NOTES says | What OVH actually has | +|---|---| +| `hetzner-k3s` provisioner | Manual k3s install (§6) | +| Hetzner Load Balancer (not used) → Cloudflare round-robin | Same — Cloudflare round-robin (§4) | +| Traefik as DaemonSet + hostNetwork via HelmChartConfig | Traefik default Deployment + klipper-lb svclb DaemonSet. The `traefik-helmchartconfig.yaml` file is **NOT applied** on OVH. | +| `servicelb` disabled (`--disable=servicelb`) | `servicelb` enabled (we didn't pass `--disable=servicelb`). This is what makes klipper-lb work. | +| sysctl `net.ipv4.ip_unprivileged_port_start=0` for hostNetwork Traefik | Not needed — klipper-lb proxies the port binding instead | +| UFW rules between 3 Hetzner IPs | UFW rules between 3 OVH IPs (51.81.83.33, 51.81.87.86, 51.81.85.248) | +| Kubeconfig at `~/.kube/honeydue-k3s.yaml` | Kubeconfig at `deploy-k3s/kubeconfig` | +| TLS at origin: not configured (CF Flexible) | Same — CF Flexible. `cloudflare-origin-cert` Secret exists (carried over) but Ingress doesn't reference it. | --- -## Common operations +## 11. Outstanding follow-ups (deferred, not blocking) -### Fetch a working kubectl tunnel (if `deploy-k3s/kubeconfig` is missing or stale) +1. **No warm standby / rollback cluster.** OVH is solo production. An OVH + outage is a real outage; recovery time = §6 procedure (~30 min). User + plans to bring a second cluster up as a target. +2. **UFW allows 80/443 from world.** Hetzner had a network-layer Cloudflare-IP + allowlist on these ports. OVH currently relies on the L7 + `cloudflare-only` Traefik middleware, which protects admin but NOT api / + web / apex (those routes have to be reachable from anywhere, but they're + then trivially DDoSable bypassing Cloudflare). Fix: add ufw allow rules + restricting `80/tcp` and `443/tcp` to Cloudflare's published IP ranges + (~22 IPv4 prefixes from https://www.cloudflare.com/ips-v4/). +3. 
**Cloudflare TLS Flexible → Full(strict).** Origin certs exist as Secret + but Ingress doesn't terminate TLS. Upgrading to Full(strict) requires + Traefik configured with the cert + an HTTPS entrypoint + Ingress + `tls:` block. +4. **`rbac.yaml` + `pod-disruption-budgets.yaml` should be in `03-deploy.sh`.** + They're currently bootstrap-only. Adding them is idempotent and prevents + the §9.1 footgun. +5. **Push notification metrics are log-derived, not counters.** Successes + aren't logged or counted. Proper Prometheus instrumentation (~15 lines in + `internal/push/client.go`) would give a real success/failure ratio. +6. **Worker has no `/metrics` endpoint.** `cmd/worker/main.go` serves `:6060` + for healthz only. Adding Asynq's `metrics.NewPrometheusExporter()` + a + ServiceMonitor + uncommenting the `worker` job stanza in + `vmagent-config` ConfigMap would give real queue depth and job latency. +7. **Ory Kratos.** Manifests exist (`manifests/kratos/`) but the deploy + is gated on operator-side prerequisites (Neon `kratos` database, + `auth.myhoneydue.com` DNS, real Apple+Google OIDC clients, Kratos image + tag pinned). Until `kratos-secrets` exists, `03-deploy.sh` silently + skips the Kratos apply. +8. **Hetzner cluster fully retired?** `config.yaml` `nodes:` block describes + OVH; the bak kubeconfig is at `kubeconfig.hetzner.bak`. Boxes themselves + are operator-managed. -```bash -ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /etc/rancher/k3s/k3s.yaml' \ - | sed 's|server: https://127.0.0.1:6443|server: https://178.104.247.152:6443|' \ - > deploy-k3s/kubeconfig -chmod 600 deploy-k3s/kubeconfig -``` +### 11.1 Dashboard observability gaps (raised 2026-06-03 during dashboard build) -If the public :6443 is firewalled from your IP (the default — only -Cloudflare ranges are allowed for app traffic; admin is locked down): +Surfaced while building the `honeydue-eli5-overview` Grafana dashboard. Each +needs code or infra changes to expose; none blocks today's operations.
-```bash -# SSH tunnel — leave running in another terminal -ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \ - -i ~/.ssh/hetzner \ - -L 127.0.0.1:6443:127.0.0.1:6443 \ - deploy@hetzner1 - -# Then use a kubeconfig pointing at localhost -cp deploy-k3s/kubeconfig deploy-k3s/kubeconfig.tunnel -sed -i.bak 's|https://178.104.247.152:6443|https://127.0.0.1:6443|' \ - deploy-k3s/kubeconfig.tunnel -export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel" -``` - -### Restore vmagent after a "0 targets" incident - -```bash -export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel" - -# 1. Confirm the diagnosis -kubectl -n honeydue logs deploy/vmagent --tail=20 | grep -i "connect: connection refused" - -# 2. Check the NetPol has the :6443 rule -kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443 - -# 3. If missing, re-apply -kubectl apply -f deploy-k3s/manifests/network-policies.yaml - -# 4. Restart vmagent -kubectl -n honeydue rollout restart deploy/vmagent - -# 5. Verify targets after ~60s -kubectl -n honeydue port-forward deploy/vmagent 8429:8429 & -curl -s http://localhost:8429/api/v1/targets \ - | python3 -c "import json,sys; d=json.load(sys.stdin); \ - a=d['data']['activeTargets']; \ - print(f'targets={len(a)} up={sum(1 for t in a if t[\"health\"]==\"up\")}')" -``` - -### Verify NetPols match the repo - -If you suspect drift between cluster and repo: - -```bash -diff <(kubectl -n honeydue get netpol -o name | sort) \ - <(grep -E '^\s*name: ' deploy-k3s/manifests/network-policies.yaml \ - | sed 's/.*name: /networkpolicy.networking.k8s.io\//' | sort) -``` - -Empty output = match. Any differences need investigation — either the -cluster has policies that aren't in repo (manual `kubectl apply` did it) -or repo has policies that didn't apply. +9. **node-exporter not deployed.** No node-level metrics today + (`node_filesystem_avail_bytes`, `node_memory_*`, `node_load1`, etc.). 
+ The dashboard's pod-level memory/CPU panels are app-process only — a + node running out of disk would silently fail the cluster before any + dashboard signal showed it. Highest-priority Tier-3 item. Fix: deploy + `node-exporter` as a DaemonSet (~50 lines of YAML), add a scrape stanza + to `vmagent-config`, add a `Node disk free` stat panel. +10. **Traefik metrics not enabled.** Traefik can expose `/metrics` with + `traefik_entrypoint_requests_total` + `traefik_service_request_duration_seconds`, + giving edge-level visibility into requests that never reached api + pods (404s, redirects, middleware blocks). Enable via a + HelmChartConfig override that sets `metrics.prometheus.entryPoint=metrics` + + adds a `:9100` entryPoint + a scrape stanza. Skipped today to avoid + Traefik restart risk; safe additive change when ready. +11. **Push notification success/failure counters** (already #5). Add + `prometheus.NewCounterVec` in `internal/push/client.go` with labels + `platform={ios,android}, outcome={success,failed,breaker_open,disabled}`. + Increments at every Send/SendActionable branch. Replaces the + log-derived "Push failures" stat on the dashboard with a real success + rate. +12. **Worker queue / job metrics** (already #6). Asynq has a built-in + Prometheus exporter (`asynq/x/metrics`). Wire it into the worker's + `:6060` health server (a single `healthMux.Handle` line) and + uncomment the worker scrape stanza in `vmagent-config`. Surfaces + queue depth, retry count, processing time per task type. +13. **Cache hit / miss rate.** `internal/services/cache_service.go` has + no counters. Add a Counter with labels `{operation=get|set, result=hit|miss}` + around the cache wrapper. ~10 lines. Useful once real traffic flows + to verify the ETag and Redis caches are paying their keep. +14. **APNs send-latency histogram.** Wrap `internal/push/apns.go::Send` + in a `prometheus.NewHistogramVec` keyed on outcome. 
Tells you when + Apple's gateway is slow (which correlates with their incident page). --- -## Disaster recovery notes +## 12. Audit trail -### "I have to redeploy the whole stack" - -The deploy path is designed to be re-runnable. From a fresh cluster: - -1. Install k3s on all 3 nodes (use existing `deploy-k3s/scripts/01-install-k3s.sh`) -2. Fetch a kubeconfig (see "Common operations" above) -3. Confirm `deploy/prod.env` has all required secrets: - - `POSTGRES_PASSWORD`, `SECRET_KEY`, `EMAIL_HOST_PASSWORD`, - `FCM_SERVER_KEY`, `B2_KEY_ID`, `B2_APP_KEY`, `OBS_INGEST_TOKEN`, - `OBS_TRACES_URL`, `REDIS_PASSWORD` (optional), `ADMIN_EMAIL`, `ADMIN_PASSWORD` -4. Run `./deploy-k3s/scripts/02-setup-secrets.sh` (creates `honeydue-secrets`) -5. Run `./deploy-k3s/scripts/03-deploy.sh` (deploys everything; sed-injects - the obs token into vmagent at apply time) -6. Verify: `kubectl -n honeydue get pods` should show all workloads Running - -### Post-redeploy verification checklist - -- [ ] `kubectl -n honeydue get netpol` shows **12 NetPols** (default-deny + - 6 egress + 5 ingress) -- [ ] `kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep 6443` - returns the rule (if missing → see "vmagent SD broken" gotcha) -- [ ] `kubectl -n kube-system get pod -l app.kubernetes.io/name=kube-state-metrics` - shows 1 Running pod -- [ ] `kubectl -n honeydue port-forward deploy/vmagent 8429:8429` + curl - `localhost:8429/api/v1/targets` shows 4+ targets, all `up` -- [ ] Grafana panel "pods up" in `honeydue` namespace populates within 60s - -If any of those fail, this runbook entry tells you exactly which gotcha -you hit. 
+| Date | Change | +|---|---| +| 2026-04-24 | Initial k3s cluster on Hetzner (Swarm → k3s migration) — see MIGRATION_NOTES.md | +| 2026-04-25 | `config.yaml` reconstructed from live ConfigMap (original file lost) | +| 2026-05-15 | Audit fixes: Redis auth required, admin basic auth, secrets-encryption flag | +| 2026-05-16 | `02-setup-secrets.sh` started carrying B2 credentials (was a manifest/script drift) | +| 2026-06-02 | Kratos scaffolding committed (not deployed) | +| 2026-06-03 | **Hetzner → OVH BHS cutover.** New 3-node cluster on 51.81.83.33, .87.86, .85.248. DNS cut on Cloudflare. Hetzner kubeconfig moved to `.bak`. Grafana `honeydue-eli5-overview` dashboard created. Hetzner cluster powered off later same day. | +| 2026-06-03 | Dashboard build-out: extended `honeydue-eli5-overview` to 22 panels covering Tier-1 (HTTP status, CPU per pod, goroutines, top slow) and Tier-2 (GC, network I/O, pod uptime, top 5xx) signals. Surfaced Tier-3 instrumentation gaps in §11.1. |