Trey t e448ec66dc docs(runbook): rewrite for OVH BHS cluster + Tier-3 observability TODOs
Brings the runbook in line with the 2026-06-03 Hetzner → OVH cutover:

- Section 1-5: topology, machines (3x OVH VPS-1 BHS), software versions,
  network/firewall, DNS, filesystem layout — all reflect the live OVH
  install instead of the historical Hetzner setup.
- Section 6: canonical install-from-clean-boxes procedure (the literal
  commands run on 2026-06-03), so anyone can stand up a backup cluster
  by following along.
- Section 9: keeps existing gotchas (vmagent NetPol, token-blown-away,
  healthy-but-empty) and adds four new ones discovered during the OVH
  build: rbac.yaml not in 03-deploy.sh, namespace label missing from api
  metrics (use service="api"), cluster-label collision when two clusters
  push concurrently, worker double-firing on cutover.
- Section 11.1: enumerates Tier-3 observability gaps surfaced while
  building the honeydue-eli5-overview dashboard (node-exporter not
  deployed, Traefik metrics off, push success counters absent, worker
  /metrics endpoint absent, cache hit rate uninstrumented, APNs latency
  uninstrumented).
- Section 12: dated audit trail of cluster changes.

Pure documentation; no code or manifest changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-03 09:34:35 -05:00

# honeyDue k3s Cluster — Operations Runbook
Living document for the honeyDue production cluster. Add entries when you hit
something non-obvious so future-you (or your replacement) doesn't have to
rediscover it.
Last full revision: **2026-06-03** (Hetzner → OVH BHS cutover; this cluster
has been the sole production cluster since that date). For pre-OVH history, see
`MIGRATION_NOTES.md` (Swarm → k3s migration on Hetzner, 2026-04-24).
---
## 1. Topology and inventory
### Hosting
| | |
|---|---|
| Provider | OVHcloud (us.ovhcloud.com) |
| Datacenter | BHS — Beauharnois, Quebec, Canada |
| Plan | VPS-1 × 3 (~$6.46/mo each, ~$19/mo total) |
| Node spec | 4 vCPU (Intel Haswell, shared), 7.6 GB RAM, 75 GB NVMe |
| Public bandwidth | 400 Mbps per node, unlimited traffic |
| Private network | **None.** Nodes have public IPv4 + IPv6 only; inter-node traffic crosses the public internet (encrypted by flannel WireGuard backend — see §3) |
### Nodes
| SSH alias | Kubernetes node name | Public IPv4 | Public IPv6 | Roles |
|---|---|---|---|---|
| `ovhcloud1` | `vps-1624d691` | `51.81.83.33` | `2604:2dc0:101:200::5a9a` | control-plane, etcd, redis-pinned |
| `ovhcloud2` | `vps-c0f51be2` | `51.81.87.86` | `2604:2dc0:101:200::30d4` | control-plane, etcd |
| `ovhcloud3` | `vps-dbca24c7` | `51.81.85.248` | `2604:2dc0:101:200::450f` | control-plane, etcd |
The cluster is **all-control-plane** (workloads schedule on the same nodes that
run etcd and the API server). `vps-1624d691` carries the
`honeydue/redis=true` label so the Redis Deployment's `nodeSelector` binds
there; the Redis PVC (`local-path`, host-pinned) lives on that node's disk.
### SSH access
`~/.ssh/config` entries (operator workstation):
```
Host ovhcloud1
  HostName 51.81.83.33
  Port 22
  User ubuntu
  IdentityFile ~/.ssh/ovhcloud
  IdentitiesOnly yes
Host ovhcloud2
  HostName 51.81.87.86
  Port 22
  User ubuntu
  IdentityFile ~/.ssh/ovhcloud
  IdentitiesOnly yes
Host ovhcloud3
  HostName 51.81.85.248
  Port 22
  User ubuntu
  IdentityFile ~/.ssh/ovhcloud
  IdentitiesOnly yes
```
`ubuntu` has passwordless sudo (`/etc/sudoers.d/90-cloud-init-users` from OVH's
cloud-init).
### kubectl access
```bash
export KUBECONFIG=/Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go/deploy-k3s/kubeconfig
kubectl get nodes
```
The `deploy-k3s/kubeconfig` file (mode 0600, gitignored) is the OVH cluster's
admin kubeconfig with `server: https://51.81.83.33:6443`. A stale Hetzner copy
lives next to it as `kubeconfig.hetzner.bak` for historical reference; the
Hetzner cluster is powered off and that file's API server is unreachable.
To refresh from the cluster (if the local copy is lost or rotated):
```bash
ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
| sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
> deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
```
The k3s API at `:6443` is open to the public internet (token-protected).
---
## 2. Software
### Kernel-level
| | |
|---|---|
| OS | Ubuntu 26.04 LTS (set by OVH's VPS-1 image) |
| Kernel | `7.0.0-14-generic` |
| Init | systemd |
| Container runtime | containerd 2.2.2 (bundled with k3s) |
| Firewall | `ufw` (per-node, configured at install — see §3) |
| Other host packages | `fail2ban` (SSH brute-force protection, default jail), `unattended-upgrades` (security updates), `open-iscsi` (k3s prereq for some storage backends), `curl` |
### Kubernetes
| | |
|---|---|
| Distribution | k3s |
| Version | **`v1.34.6+k3s1`** (pinned in `config.yaml:cluster.k3s_version`) |
| Control plane | 3-node HA, embedded etcd (no external Postgres backing store) |
| CNI / networking | flannel with **WireGuard-native backend** (`--flannel-backend=wireguard-native`). Encrypts pod-to-pod and etcd peer traffic because nodes only have public IPs (no private network). ~3-5% CPU overhead under load. |
| Service LB | klipper-lb (default k3s `servicelb`). The `svclb-traefik` DaemonSet binds host ports `:80` and `:443` on each node and forwards to the Traefik Service. **Not** the DaemonSet-w/-hostNetwork Traefik pattern used on the old Hetzner cluster — see §10 *Differences from MIGRATION_NOTES*. |
| Ingress controller | Traefik (k3s default), single-replica Deployment, exposed via klipper-lb |
| DNS | CoreDNS (k3s default) |
| Secrets encryption | Enabled (`--secrets-encryption`); etcd values are AES-CBC encrypted at rest |
| kubeconfig perms | `0600` (`--write-kubeconfig-mode=0600`) |
| Cloud controller | Disabled (`--disable-cloud-controller`) — no provider integration on OVH |
| Misc | `--node-ip` / `--node-external-ip` / `--advertise-address` all set to each node's public IPv4. TLS SANs cover all 3 IPs so any IP can serve the API. |
### Application stack (in cluster, `honeydue` namespace)
| Deployment | Replicas | Image (digest-pinned) | Notes |
|---|---:|---|---|
| `api` | 3 | `gitea.treytartt.com/admin/honeydue-api@sha256:34fde6...` | Go REST API on `:8000`, exposes `/metrics` |
| `web` | 3 | `gitea.treytartt.com/admin/honeydue-web@sha256:8c62cf...` | Next.js, server-side proxy to api |
| `admin` | 1 | `gitea.treytartt.com/admin/honeydue-admin@sha256:b81263...` | Next.js admin panel, gated behind Traefik basic-auth |
| `worker` | 1 | `gitea.treytartt.com/admin/honeydue-worker@sha256:fe1f5e...` | Asynq scheduler + Redis-backed jobs (singleton — must not run as >1 replica or every cron fires N×) |
| `redis` | 1 | `redis:7-alpine@sha256:6ab0b6...` | Pinned to `vps-1624d691` via `honeydue/redis=true`. PVC `redis-data` (local-path, 5 Gi). Password-auth required. |
| `vmagent` | 1 | `victoriametrics/vmagent@sha256:...` (default tag) | Scrapes api `/metrics` + kube-state-metrics; remote-writes to obs.88oakapps.com |
| `kube-state-metrics` | 1 | `kube-state-metrics@sha256:...` | In `kube-system`, scraped by vmagent for `kube_*` cluster-state metrics |
| `alloy-logs` (DaemonSet) | 3 (1/node) | `grafana/alloy@sha256:...` | Tails `/var/log/pods/*` and ships to Loki at obs.88oakapps.com |
The Asynq scheduler inside `worker` registers these cron jobs:
| Cron | Job | Notes |
|---|---|---|
| `0 * * * *` | Smart reminder check (per-user hour) | Default user hour: 14:00 UTC |
| `0 * * * *` | Daily digest check (per-user hour) | Default user hour: 03:00 UTC |
| `0 10 * * *` | Onboarding emails | 10:00 UTC |
| `0 3 * * *` | Reminder log cleanup | 03:00 UTC |
| `30 * * * *` | Pending uploads cleanup | xx:30 every hour |
### External dependencies
| Service | Endpoint | Purpose | Failure mode |
|---|---|---|---|
| Neon Postgres | `ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech:5432` | App data. Pooler endpoint (transaction-mode PgBouncer in front of Neon compute) so connections stay warm. | api / worker pods crash-loop with `dial tcp: connection refused`. Health endpoint returns `postgres: error`. |
| Backblaze B2 (S3-compatible) | `s3.us-east-005.backblazeb2.com` (bucket `honeyDueProd`) | User uploads (photos, PDFs, completion attachments) | Upload routes return 5xx; reads of cached/static files still work. |
| Cloudflare | `myhoneydue.com` zone | DNS + TLS termination + edge cache + DDoS | Traffic stops reaching origin. Direct `https://51.81.x.x` still works for diagnostics. |
| obs.88oakapps.com | Operator-run Grafana + VictoriaMetrics + Loki | Metrics & logs | vmagent + alloy-logs back off and retry. No app-side impact. |
| Apple APNs | `api.push.apple.com:443` (production) | iOS push notifications | Push fails; circuit breaker opens; failure logged. App functionality unaffected. |
| Fastmail SMTP | `smtp.fastmail.com:587` | Transactional emails (verification, recovery, digests) | Email send fails in the worker; logged; user reset/digest flow degrades. |
| Gitea registry | `gitea.treytartt.com` | Container image registry | Deploys can't pull. Existing pods keep running on cached images. |
---
## 3. Network and firewall
### Per-node `ufw` configuration
Applied during install (same on all 3 nodes):
```
default deny incoming
default allow outgoing
allow 22/tcp (SSH, world)
allow 80/tcp (HTTP via Cloudflare, world — see GAP-1)
allow 443/tcp (HTTPS, same — GAP-1)
allow 6443/tcp (k3s API, world, token-protected)
allow 2379:2380/tcp from <other 2 OVH IPs> (etcd client + peer)
allow 10250/tcp from <other 2 OVH IPs> (kubelet)
allow 51820/udp from <other 2 OVH IPs> (WireGuard tunnel)
allow 8472/udp from <other 2 OVH IPs> (VXLAN, defense-in-depth fallback)
```
To inspect: `ssh ovhcloudN sudo ufw status numbered`.
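To confirm a node still matches the rule list above, the expected ports can be compared against `ufw status` output. This is a minimal sketch, assuming the default `ufw status` column layout (`To`, `Action`, `From`); the function name `missing_ufw_rules` is hypothetical and not part of `deploy-k3s/`.

```sh
# Reads `ufw status` output on stdin and prints every expected port/proto
# that has no ALLOW rule. Hypothetical helper, not part of deploy-k3s/.
missing_ufw_rules() {
  rules=$(cat)
  for want in 22/tcp 80/tcp 443/tcp 6443/tcp 2379:2380/tcp 10250/tcp 51820/udp 8472/udp; do
    # A rule line starts with the port spec, then whitespace, then ALLOW.
    if ! printf '%s\n' "$rules" | grep -Eq "^${want}[[:space:]]+ALLOW"; then
      echo "$want"
    fi
  done
}

# Usage (from the operator workstation):
#   ssh ovhcloud1 sudo ufw status | missing_ufw_rules
```

Empty output means every expected port has at least one ALLOW rule. The check does not verify that the peer-only ports are restricted to the other two node IPs.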
### Cluster networking
- **Pod CIDR**: `10.42.0.0/16` (default k3s)
- **Service CIDR**: `10.43.0.0/16` (default k3s)
- **Flannel backend**: WireGuard-native. Each node hosts a `flannel-wg` interface on UDP 51820 and tunnels pod traffic to peers. Verify: `ssh ovhcloudN ip -d link show flannel-wg`.
### Traefik ingress flow
```
Cloudflare → node:80/443 (public)
  → klipper-lb svclb-traefik DaemonSet pod (hostPort:80/443)
  → Traefik Service (ClusterIP 10.43.245.127:80/443)
  → Traefik Deployment pod (single replica)
  → matches Ingress host rule (api.myhoneydue.com etc.)
  → routes to backend Service (api / web / admin)
  → backend Pod
```
The default Traefik install lives in `kube-system` and is managed by k3s's
bundled HelmChart. **No HelmChartConfig override is applied on OVH** (unlike
Hetzner; see §10).
---
## 4. DNS configuration (Cloudflare)
The `myhoneydue.com` zone in Cloudflare has these public records. **All
hostnames are proxied (orange cloud)** — required by the `cloudflare-only`
Traefik middleware which 403s any non-CF source IP.
| Host | Type | Values | Proxy |
|---|---|---|---|
| `api.myhoneydue.com` | A × 3 | `51.81.83.33`, `51.81.87.86`, `51.81.85.248` | Proxied |
| `app.myhoneydue.com` | A × 3 | (same trio) | Proxied |
| `admin.myhoneydue.com` | A × 3 | (same trio) | Proxied |
| `myhoneydue.com` (apex `@`) | A × 3 | (same trio) | Proxied |
Cloudflare round-robins among the 3 origins; klipper-lb on whichever node
receives the request forwards to Traefik, and Traefik routes by Host header.
The net effect is per-request ingress load balancing across the 3 nodes with
no central load balancer.
**SSL/TLS mode**: Flexible (CF terminates TLS at the edge; origin is plain
HTTP on `:80`). Upgrading to Full (strict) is on the deferred list — would
need an origin certificate provisioned to `cloudflare-origin-cert` secret and
Traefik configured for TLS termination.
---
## 5. Filesystem layout (`deploy-k3s/`)
```
deploy-k3s/
├── config.yaml # Single config source (gitignored; contains tokens)
├── config.yaml.example # Template
├── kubeconfig # OVH admin kubeconfig (gitignored, 0600)
├── kubeconfig.hetzner.bak # Old Hetzner kubeconfig (unreachable, kept for history)
├── kubeconfig.tunnel # Optional: localhost-pointing copy for SSH-tunnel use
├── secrets/
│ ├── README.md
│ ├── postgres_password.txt # Neon DB password
│ ├── secret_key.txt # 32+ char app-token signing secret
│ ├── email_host_password.txt # Fastmail SMTP app password
│ ├── fcm_server_key.txt # FCM server key (currently unused — Android push disabled)
│ ├── apns_auth_key.p8 # APNs auth key (binary)
│ ├── cloudflare-origin.crt # Origin certificate (currently unused — CF Flexible)
│ └── cloudflare-origin.key
│ (all gitignored except README.md)
├── manifests/
│ ├── namespace.yaml
│ ├── network-policies.yaml # default-deny + per-app egress/ingress (13 NetPols total)
│ ├── rbac.yaml # api/worker/admin/web/redis ServiceAccounts (NOT applied by 03-deploy.sh; manual once)
│ ├── pod-disruption-budgets.yaml # api-pdb, web-pdb, worker-pdb (NOT applied by 03-deploy.sh; manual once)
│ ├── traefik-helmchartconfig.yaml # Hetzner-only DaemonSet+hostNetwork override (do NOT apply on OVH; we use default klipper-lb)
│ ├── kyverno-verify-images.yaml # Operator-gated policy (do NOT apply blindly — see file comment)
│ ├── api/{deployment,service,hpa}.yaml
│ ├── worker/deployment.yaml
│ ├── admin/{deployment,service}.yaml
│ ├── web/{deployment,service}.yaml
│ ├── redis/{deployment,service,pvc}.yaml
│ ├── ingress/{middleware,ingress-simple}.yaml
│ ├── migrate/job.yaml # goose migration Job (image-subbed at deploy time)
│ ├── observability/{kube-state-metrics,vmagent,alloy-logs}.yaml
│ └── kratos/ # Ory Kratos identity service (NOT yet deployed; gated on operator OIDC setup)
└── scripts/
  ├── _config.sh # Sourced by all scripts: cfg(), generate_env(), generate_cluster_config()
  ├── 01-provision-cluster.sh # Hetzner-Cloud-specific (uses hetzner-k3s CLI); DO NOT RUN ON OVH
  ├── 02-setup-secrets.sh # Creates honeydue-secrets etc. from secrets/ + config.yaml; kubeconfig-driven
  ├── 03-deploy.sh # Build + push + apply manifests + roll deployments; kubeconfig-driven
  ├── 04-verify.sh # Post-deploy health + security checks; kubeconfig-driven
  └── rollback.sh # `kubectl rollout undo` across all deployments
```
The `deploy/prod.env` file (sibling to `deploy-k3s/`, gitignored) holds
observability + admin credentials that `02-setup-secrets.sh` and
`03-deploy.sh` read but never display:
```
OBS_INGEST_URL (https://obs.88oakapps.com/api/v1/write)
OBS_TRACES_URL (https://obs.88oakapps.com/v1/traces)
OBS_INGEST_TOKEN (bearer token for VM + Loki + traces — all use same token)
GRAFANA_URL (https://grafana.88oakapps.com)
GRAFANA_ADMIN_USER (admin)
GRAFANA_ADMIN_PASSWORD
ADMIN_EMAIL / ADMIN_PASSWORD (in-app admin login)
```
---
## 6. Install from clean boxes — the truthful procedure
This is what we ran on 2026-06-03 to stand up the live cluster, exactly. If
you ever rebuild from zero this is the canonical sequence. Total wall-clock:
~12 min for cluster bootstrap; ~10 min for workloads.
### 6.1 Prerequisites
- 3 fresh Ubuntu VPS instances (any provider with public IPv4, ≥4 GB RAM,
≥40 GB disk)
- `~/.ssh/config` entries (`ovhcloud1/2/3`) pointing at them, with
passwordless sudo
- Local `kubectl` and `curl`
- The repo's `deploy-k3s/secrets/` populated (or the ability to copy live
secrets from another running cluster — see §7.2)
- `deploy/prod.env` populated with obs token + Grafana creds
### 6.2 Per-node OS hardening + firewall (all 3 in parallel)
For each `ovhcloudN`, over SSH:
```sh
export DEBIAN_FRONTEND=noninteractive
sudo apt-get update -qq
sudo apt-get install -y -qq fail2ban unattended-upgrades open-iscsi curl ufw
sudo systemctl enable --now iscsid fail2ban
sudo dpkg-reconfigure -f noninteractive -plow unattended-upgrades
sudo ufw --force reset
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 6443/tcp
SELF=$(hostname -I | awk '{print $1}')
for peer in 51.81.83.33 51.81.87.86 51.81.85.248; do
  [ "$peer" = "$SELF" ] && continue
  sudo ufw allow from "$peer" to any port 2379:2380 proto tcp
  sudo ufw allow from "$peer" to any port 10250 proto tcp
  sudo ufw allow from "$peer" to any port 51820 proto udp
  sudo ufw allow from "$peer" to any port 8472 proto udp
done
sudo ufw --force enable
```
**Watch ordering:** `allow 22/tcp` MUST precede `ufw enable`. Existing SSH
sessions survive (`ufw` only affects new connections), but a misordered script
locks you out of fresh logins.
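The ordering rule can be checked mechanically before a hardening script is run on a node. The sketch below is a hypothetical lint (the name `ssh_allowed_before_enable` is not from the repo) that does a plain line match; it only passes when both the SSH allow and the enable are present and in the right order.

```sh
# Exits 0 if the script on stdin allows 22/tcp before enabling ufw, else 1.
# Hypothetical lint helper; it matches lines textually and does not
# understand shell control flow.
ssh_allowed_before_enable() {
  awk '
    /ufw allow 22\/tcp/     { if (!allow)  allow  = NR }
    /ufw (--force )?enable/ { if (!enable) enable = NR }
    END { exit !(allow && enable && allow < enable) }
  '
}

# Usage:
#   ssh_allowed_before_enable < harden-node.sh && echo "ordering ok"
```

`harden-node.sh` is a placeholder name for wherever the commands from this section are saved.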
### 6.3 Install k3s on `ovhcloud1` (the init node)
```sh
ssh ovhcloud1 'curl -sfL https://get.k3s.io | \
INSTALL_K3S_VERSION=v1.34.6+k3s1 \
sh -s - server \
--cluster-init \
--node-ip=51.81.83.33 \
--node-external-ip=51.81.83.33 \
--advertise-address=51.81.83.33 \
--flannel-backend=wireguard-native \
--flannel-external-ip \
--secrets-encryption \
--write-kubeconfig-mode=0600 \
--tls-san=51.81.83.33 \
--tls-san=51.81.87.86 \
--tls-san=51.81.85.248 \
--disable-cloud-controller'
```
Wait for `sudo k3s kubectl get nodes` to show this node Ready (~2-5 s).
Read the cluster token:
```sh
ssh ovhcloud1 'sudo cat /var/lib/rancher/k3s/server/node-token'
```
### 6.4 Join `ovhcloud2`, then `ovhcloud3` (sequential)
Joining etcd one node at a time avoids split-brain on slow networks.
Replace `<TOKEN>` with the value from 6.3.
For `ovhcloud2`:
```sh
ssh ovhcloud2 'curl -sfL https://get.k3s.io | \
INSTALL_K3S_VERSION=v1.34.6+k3s1 \
K3S_TOKEN=<TOKEN> \
sh -s - server \
--server=https://51.81.83.33:6443 \
--node-ip=51.81.87.86 \
--node-external-ip=51.81.87.86 \
--advertise-address=51.81.87.86 \
--flannel-backend=wireguard-native \
--flannel-external-ip \
--secrets-encryption \
--write-kubeconfig-mode=0600 \
--tls-san=51.81.83.33 --tls-san=51.81.87.86 --tls-san=51.81.85.248 \
--disable-cloud-controller'
```
Then run the same command for `ovhcloud3`, changing `--node-ip`,
`--node-external-ip`, and `--advertise-address` to `51.81.85.248`. After each
join, wait for `kubectl get nodes` to show the new node Ready before proceeding.
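Because three per-node flags must all carry the joining node's IP, it is easy to miss one when editing the command by hand. A minimal sketch that generates the server flags from a single IP argument follows; the function name `k3s_join_flags` is hypothetical, and the flag set is copied from the `ovhcloud2` command above.

```sh
# Prints the k3s server flags for a joining node, one per line, so that
# --node-ip, --node-external-ip and --advertise-address stay in sync.
# Hypothetical helper; flags mirror the ovhcloud2 join command.
k3s_join_flags() {
  ip="$1"
  printf '%s\n' \
    "--server=https://51.81.83.33:6443" \
    "--node-ip=$ip" \
    "--node-external-ip=$ip" \
    "--advertise-address=$ip" \
    "--flannel-backend=wireguard-native" \
    "--flannel-external-ip" \
    "--secrets-encryption" \
    "--write-kubeconfig-mode=0600" \
    "--tls-san=51.81.83.33" \
    "--tls-san=51.81.87.86" \
    "--tls-san=51.81.85.248" \
    "--disable-cloud-controller"
}

# Usage (replace <TOKEN> with the value from 6.3):
#   FLAGS=$(k3s_join_flags 51.81.85.248 | tr '\n' ' ')
#   ssh ovhcloud3 "curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.34.6+k3s1 K3S_TOKEN=<TOKEN> sh -s - server $FLAGS"
```

To block until the new node is Ready instead of polling by eye, `kubectl wait --for=condition=Ready node/<node-name> --timeout=180s` can be run from any machine with a working kubeconfig; the timeout value is an arbitrary choice.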
### 6.5 Pull kubeconfig to the operator workstation
```sh
ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
| sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
> deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig
kubectl get nodes -o wide # All 3 Ready, INTERNAL-IP = public IP
```
### 6.6 Label the redis node
```sh
kubectl label node vps-1624d691 honeydue/redis=true --overwrite
```
(Use whichever k8s node name corresponds to `ovhcloud1`. The Redis
Deployment's `nodeSelector` binds to this label.)
### 6.7 Bootstrap manifests NOT applied by `03-deploy.sh`
These must be applied manually on a fresh cluster, **before** running
`03-deploy.sh`, or the Deployments will never create their pods:
```sh
kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl apply -f deploy-k3s/manifests/pod-disruption-budgets.yaml
```
`rbac.yaml` creates the 5 ServiceAccounts (`api`, `worker`, `admin`, `web`,
`redis`) referenced by the Deployment manifests. Without these, ReplicaSets
hang on `FailedCreate: error looking up service account` and pods never
start. Symptom on first deploy: `kubectl get deploy` shows `0 up-to-date`
across the board with no pod activity — see §9 *Gotchas*.
**Do NOT apply** `traefik-helmchartconfig.yaml` (Hetzner-only — see §10) or
`kyverno-verify-images.yaml` (gated on operator Kyverno install).
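A preflight check for the ServiceAccounts can catch the missing-`rbac.yaml` case before `03-deploy.sh` runs. This is a sketch under the assumption that `kubectl get sa -o name` prints entries as `serviceaccount/<name>`; the function name `missing_service_accounts` is hypothetical.

```sh
# Reads `kubectl -n honeydue get sa -o name` output on stdin and prints the
# ServiceAccounts from rbac.yaml that are absent. Hypothetical helper.
missing_service_accounts() {
  have=$(cat)
  for sa in api worker admin web redis; do
    printf '%s\n' "$have" | grep -qx "serviceaccount/$sa" || echo "$sa"
  done
}

# Usage:
#   kubectl -n honeydue get sa -o name | missing_service_accounts
```

Empty output means all five ServiceAccounts exist and the deploy can proceed.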
### 6.8 Seed secrets
Two paths; pick whichever fits your situation:
**Path A — clean install from local files** (the original design):
```sh
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/02-setup-secrets.sh
```
Requires `deploy-k3s/secrets/` to contain real `postgres_password.txt`,
`secret_key.txt`, `email_host_password.txt`, `fcm_server_key.txt`,
`apns_auth_key.p8`, `cloudflare-origin.crt`, `cloudflare-origin.key`. The
script reads `config.yaml` for `registry.*`, `redis.password`,
`admin.basic_auth_*`, and `storage.b2_*`.
**Path B — clone live secrets from another running cluster** (what we
actually did during the migration; useful if `secrets/` is empty or you want
exact-byte equivalence):
```sh
HETZNER=$(pwd)/deploy-k3s/kubeconfig.hetzner.bak # or any kubeconfig with the secrets
OVH=$(pwd)/deploy-k3s/kubeconfig
kubectl --kubeconfig=$OVH apply -f deploy-k3s/manifests/namespace.yaml
for S in honeydue-secrets honeydue-apns-key gitea-credentials cloudflare-origin-cert admin-basic-auth; do
  kubectl --kubeconfig=$HETZNER -n honeydue get secret $S -o json \
    | python3 -c "
import json, sys
d = json.load(sys.stdin)
m = d['metadata']
for k in ('uid','resourceVersion','creationTimestamp','generation','managedFields','ownerReferences','selfLink'):
    m.pop(k, None)
m.pop('annotations', None)
print(json.dumps(d))" \
    | kubectl --kubeconfig=$OVH apply -f -
done
```
After either path, verify:
```sh
kubectl -n honeydue get secrets
# Expect: admin-basic-auth, cloudflare-origin-cert, gitea-credentials,
# honeydue-apns-key, honeydue-secrets
```
### 6.9 Deploy workloads
```sh
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig \
./deploy-k3s/scripts/03-deploy.sh --skip-build --tag latest
```
- `--skip-build` skips Docker build + push, deploys whatever's already in the
registry at the named tag. Use this when migrating between clusters to
guarantee both run identical bits.
- Without flags it builds the api / worker / admin / web images from the
local repo HEAD and pushes to `gitea.treytartt.com` first.
- The script applies (in order): namespace, network-policies (13 of them),
redis, ingress, then runs the goose migration Job (blocking on success),
then api / worker / admin / web Deployments, then observability
(kube-state-metrics, vmagent, alloy-logs).
- It does NOT apply: `rbac.yaml`, `pod-disruption-budgets.yaml`,
`traefik-helmchartconfig.yaml`, `kyverno-verify-images.yaml`. The first
two must be applied manually (see §6.7); the latter two are Hetzner-only
or operator-gated.
- It does NOT apply: anything under `kratos/` (skipped until
`kratos-secrets` exists, which requires real OIDC client IDs).
### 6.10 Verify
```sh
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/04-verify.sh
```
Expect: all deployments `READY=desired`, 13 NetworkPolicies, 7 ServiceAccounts
(api, worker, admin, web, redis, vmagent, alloy-logs), 3 PDBs, cloudflare-only
middleware present, in-cluster `/api/health/` returns 200.
External smoke test. It bypasses DNS by sending the `Host` header straight to
each node IP; this works because the api `/health/` route is exempt from the
cloudflare-only middleware:
```sh
for IP in 51.81.83.33 51.81.87.86 51.81.85.248; do
  curl -s -o /dev/null -w "$IP -> %{http_code}\n" \
    -H 'Host: api.myhoneydue.com' http://$IP/api/health/
done
# All three should return 200.
```
### 6.11 DNS cutover (if migrating)
In the Cloudflare dashboard for `myhoneydue.com`, set the 4 hostnames in §4 to
the OVH IPs and keep proxied. Effective propagation ~30 s to 5 min through
the Cloudflare proxy.
If you have a previous cluster, **scale its worker to 0 before flipping** to
avoid scheduled-job double-fires:
```sh
KUBECONFIG=<previous> kubectl -n honeydue scale deploy/worker --replicas=0
# (cut DNS)
KUBECONFIG=<new> kubectl -n honeydue scale deploy/worker --replicas=1
```
Run those last two lines back-to-back. Worker work is mostly scheduled
(hourly+), so a brief gap is harmless; overlap would cause duplicate emails.
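To make the overlap case harder to hit by accident, the second step can be guarded so the new worker only starts once the old Deployment is scaled to 0. A minimal sketch, assuming both kubeconfigs are reachable (it is not usable when the old cluster is already dead); the function name is hypothetical.

```sh
# Starts the worker on the new cluster only if the old cluster's worker
# Deployment is already scaled to 0. Hypothetical helper; it checks
# .spec.replicas, so a pod that is still terminating is not detected.
start_worker_if_old_stopped() {
  old_kubeconfig="$1"
  new_kubeconfig="$2"
  replicas=$(kubectl --kubeconfig="$old_kubeconfig" -n honeydue \
    get deploy/worker -o jsonpath='{.spec.replicas}') || return 1
  if [ "$replicas" != "0" ]; then
    echo "refusing: old worker still has $replicas replica(s)" >&2
    return 1
  fi
  kubectl --kubeconfig="$new_kubeconfig" -n honeydue scale deploy/worker --replicas=1
}

# Usage:
#   start_worker_if_old_stopped deploy-k3s/kubeconfig.hetzner.bak deploy-k3s/kubeconfig
```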
---
## 7. Day-to-day operations
### Common kubectl one-liners
```sh
export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig
# Cluster state
kubectl get nodes -o wide
kubectl -n honeydue get pods
kubectl -n honeydue get deploy
kubectl top nodes
kubectl -n honeydue top pods
# Tail logs
kubectl -n honeydue logs deploy/api -f --tail=50
kubectl -n honeydue logs -l app.kubernetes.io/name=api -f --tail=20
stern -n honeydue api # if stern is installed (multi-pod)
# Restart a deployment (no image change, picks up ConfigMap changes)
kubectl -n honeydue rollout restart deploy/api
# Rollback one revision
kubectl -n honeydue rollout undo deploy/api
# Scale (worker MUST stay at 0 or 1)
kubectl -n honeydue scale deploy/api --replicas=4
# Get into a pod
kubectl -n honeydue exec -it deploy/api -- sh
```
### Redeploy after code changes
```sh
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/03-deploy.sh
```
Builds images from local HEAD, tags with the git short SHA, pushes to Gitea,
runs `goose up` (idempotent), rolls api/worker/admin/web. Total: ~3-5 min
when images change.
To deploy without rebuilding (pin to a specific tag):
```sh
./deploy-k3s/scripts/03-deploy.sh --skip-build --tag <tag-or-latest>
```
### Migrations
Goose migrations live in `migrations/`. New file pattern:
```
make migrate-new name=add_foo_column # generates migrations/YYYYMMDDHHMMSS_add_foo_column.sql
# Edit the file with -- +goose Up / -- +goose Down sections
```
`03-deploy.sh` runs a one-shot Job (`manifests/migrate/job.yaml`) that
executes `goose up` against Neon (direct compute endpoint, not pooler — see
file comment). The Job blocks api/worker rollout and aborts the deploy on
failure. No app pod runs `AutoMigrate`; api/worker startup verifies
`goose_db_version` is current and refuses to boot on mismatch.
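Since a bad migration aborts the whole deploy, a quick local check of a new file can save a round trip. The sketch below assumes the filename pattern shown above and that every migration carries both the Up and Down annotations; the function name `check_migration` and the allowed character set in the name are assumptions, not repo conventions.

```sh
# Exits 0 if the file follows the YYYYMMDDHHMMSS_name.sql pattern and
# contains both goose annotations. Hypothetical pre-commit style check.
check_migration() {
  file="$1"
  base=$(basename "$file")
  printf '%s\n' "$base" | grep -Eq '^[0-9]{14}_[A-Za-z0-9_]+\.sql$' || {
    echo "bad filename: $base" >&2; return 1; }
  grep -qF -- '-- +goose Up' "$file" || {
    echo "missing '-- +goose Up': $base" >&2; return 1; }
  grep -qF -- '-- +goose Down' "$file" || {
    echo "missing '-- +goose Down': $base" >&2; return 1; }
}

# Usage:
#   for f in migrations/*.sql; do check_migration "$f" || exit 1; done
```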
### Grafana
URL: https://grafana.88oakapps.com (creds in `deploy/prod.env`)
Three dashboards in the `honeyDue` folder:
| UID | Title | Use |
|---|---|---|
| `honeydue-eli5-overview` | honeyDue — Overview (ELI5) | Single-screen at-a-glance health: pods up, crashes, errors, RPS, latency, Postgres, memory, top endpoints, push failures, worker activity, recent error logs. Created 2026-06-03. |
| `honeydue-red` | honeyDue API — RED | Rate/Errors/Duration cuts (legacy) |
| `honeydue-logs` | honeyDue — Production Logs | Live log explorer |
For the ELI5 dashboard's queries, **api-side metrics use `service="api"`,
NOT `namespace="honeydue"`.** vmagent's scrape config drops the namespace
label from api metrics — only `service`, `pod`, `node`, `job`, plus the
metric's own labels (route, method, status, etc.) survive. Queries that
filter on `namespace="honeydue"` for api metrics silently match nothing.
### kubectl tunnel (if 6443 is firewalled to your IP)
Currently `6443` is open WAN-side (matching the previous Hetzner posture).
If you tighten that to operator-IPs-only and your IP changes, use an SSH
tunnel:
```sh
ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \
-i ~/.ssh/ovhcloud \
-L 127.0.0.1:6443:127.0.0.1:6443 \
ubuntu@51.81.83.33
cp deploy-k3s/kubeconfig deploy-k3s/kubeconfig.tunnel
sed -i.bak 's|https://51.81.83.33:6443|https://127.0.0.1:6443|' deploy-k3s/kubeconfig.tunnel
export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"
```
---
## 8. Disaster recovery
### "I lost the kubeconfig"
```sh
ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
| sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
> deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
```
If `ovhcloud1` is down but `ovhcloud2` or `3` is up, swap host and IP — the
TLS SAN covers all three.
### "A node is unresponsive"
```sh
kubectl drain vps-XXX --ignore-daemonsets --delete-emptydir-data
# Reboot via OVH manager or:
ssh ovhcloudN sudo reboot
# Wait for Ready, then:
kubectl uncordon vps-XXX
```
The cluster tolerates 1 node down (etcd quorum 2/3). With 2 down, etcd
loses quorum and the API server stops accepting writes.
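The 2/3 figure comes from the usual majority rule: an n-member etcd cluster needs floor(n/2) + 1 members for quorum and tolerates the remainder failing. A small sketch of that arithmetic (function names are illustrative only):

```sh
# Quorum for an n-member etcd cluster is floor(n/2) + 1; the cluster
# tolerates n - quorum member failures.
etcd_quorum()    { echo $(( $1 / 2 + 1 )); }
etcd_tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

# Usage:
#   etcd_quorum 3      # prints 2
#   etcd_tolerated 3   # prints 1
```

This is also why growing to 4 nodes would not help: quorum rises to 3 and the cluster still tolerates only 1 failure.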
### "etcd quorum lost (2+ nodes dead)"
Bring nodes back online if possible. If not:
```sh
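ssh ovhcloud1 'sudo systemctl stop k3s && sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<latest>'
ssh ovhcloud1 'sudo systemctl start k3s' # once the reset reports success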
```
k3s takes automatic etcd snapshots every 12h, keeping 5. List with:
```sh
ssh ovhcloud1 sudo ls -la /var/lib/rancher/k3s/server/db/snapshots/
```
This is destructive — workload state since the snapshot is lost, but Neon
(actual app data) is unaffected.
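To fill in `<latest>` without reading the listing by eye, the newest file can be picked by modification time. A minimal sketch, to be run on the node itself because the directory is root-owned; the function name `latest_snapshot` is hypothetical.

```sh
# Prints the full path of the most recently modified file in a snapshot
# directory, or nothing if the directory is empty. Hypothetical helper.
latest_snapshot() {
  dir="${1:-/var/lib/rancher/k3s/server/db/snapshots}"
  newest=$(ls -t "$dir" | head -n 1)
  [ -n "$newest" ] && printf '%s/%s\n' "$dir" "$newest"
}

# Usage (as root on the node):
#   latest_snapshot
```

Modification time is used rather than the filename so the helper does not depend on any particular snapshot naming scheme.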
### "I have to rebuild the whole cluster from scratch"
Provision 3 fresh boxes, then exactly the sequence in §6. End-to-end is
~30 min. The dependencies that make this possible:
| Stays put through rebuild | Where |
|---|---|
| Application data | Neon Postgres (managed) |
| User uploads | Backblaze B2 (managed) |
| Container images | `gitea.treytartt.com` (self-hosted, but not on the OVH cluster) |
| Operator secrets | `deploy-k3s/secrets/` + `config.yaml` + `deploy/prod.env` on the operator workstation (gitignored) |
| DNS | Cloudflare control panel |
If `gitea.treytartt.com` is on the same OVH cluster, you have a circular
dependency — rebuilding requires images you can't pull until the cluster is
up. Currently Gitea is NOT in the honeyDue cluster (separate Hetzner-era
host), so this isn't a problem today, but worth flagging if that ever
changes.
### "Cutover back to Hetzner / failover to a backup cluster"
There is **no warm standby today.** Bringing up a second cluster is the
same §6 procedure on different hardware, then a Cloudflare DNS swap. The
worker-swap dance is critical:
```sh
KUBECONFIG=<current> kubectl -n honeydue scale deploy/worker --replicas=0
# (Update Cloudflare DNS to new cluster's IPs — proxied)
KUBECONFIG=<new> kubectl -n honeydue scale deploy/worker --replicas=1
```
---
## 9. Known gotchas
### 9.1 First-deploy "0 up-to-date" across all Deployments
**Symptoms:** `kubectl get deploy` shows `READY 0/N, UP-TO-DATE 0` for
api/worker/admin/web/redis. `kubectl get events` shows
`FailedCreate: error looking up service account honeydue/<name>: serviceaccount "..." not found`.
**Cause:** `rbac.yaml` (ServiceAccounts) is NOT applied by `03-deploy.sh`. On
a fresh cluster the SAs don't exist; the ReplicaSet controller can't create
pods.
**Fix:**
```sh
kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl -n honeydue rollout restart deploy/api deploy/worker deploy/admin deploy/web deploy/redis
```
This was hit during the 2026-06-03 OVH bootstrap. Permanently fix by adding
`kubectl apply -f rbac.yaml` to `03-deploy.sh` between the namespace and
network-policies apply, but until that lands, follow §6.7 on every fresh
cluster.
### 9.2 vmagent SD broken on fresh deploy ("0 pods up" in Grafana)
**Symptoms:**
- Grafana panels using `kube_*` metrics or `up{job=...}` show 0
- vmagent logs: `dial tcp 10.43.0.1:443: connect: connection refused` every ~30 s
- Direct test from a pod also refused
**Cause:** k3s's NetworkPolicy controller evaluates egress rules *after*
kube-proxy's DNAT (not before, contrary to spec). Pod-to-`kubernetes`-Service
(`10.43.0.1:443`) gets DNAT'd to `<node_ip>:6443`, *then* the policy check
runs. Without an explicit egress rule for `:6443`, the packet is rejected.
The `allow-egress-from-vmagent` NetPol in `network-policies.yaml` includes
both rules:
```yaml
- to:
- ipBlock: { cidr: 10.43.0.0/16 }
ports:
- { port: 443, protocol: TCP }
- to:
- ipBlock:
cidr: 0.0.0.0/0
except: [10.42.0.0/16]
ports:
- { port: 6443, protocol: TCP }
```
**If this happens:** confirm `network-policies.yaml` was applied:
```sh
kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443
```
Counter-evidence that confirms diagnosis: `kube-state-metrics` in
`kube-system` works fine (no NetPols in that namespace).
### 9.3 vmagent appears healthy but no data in Grafana
vmagent's `/-/healthy` returns 200 as long as the process is alive and
remote-write is TCP-functional. It doesn't check that scrapes are actually
*succeeding*. The liveness probe in `vmagent.yaml` queries `/api/v1/targets`
and fails the pod if no target is `up`. After ~3 failures (~3 min), kubelet
recycles it.
If vmagent runs for weeks but Grafana is empty, the probe was disabled or
the exec command broke.
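The same "is anything actually up" question can be asked by hand. The sketch below counts healthy targets in a `/api/v1/targets` response; it assumes the Prometheus-compatible JSON shape in which each active target carries `"health":"up"`, and it is not the probe command from `vmagent.yaml`. The function name is hypothetical.

```sh
# Counts targets reported as up in a /api/v1/targets JSON response on stdin.
# Assumes each active target is serialized with "health":"up"; this is a
# text match, not a JSON parser. Hypothetical helper.
count_up_targets() {
  grep -o '"health":[[:space:]]*"up"' | wc -l | tr -d ' '
}

# Usage (assumes vmagent listens on its default port 8429):
#   kubectl -n honeydue port-forward deploy/vmagent 8429:8429 &
#   curl -s http://127.0.0.1:8429/api/v1/targets | count_up_targets
```

A result of `0` while `/-/healthy` still returns 200 is the healthy-but-empty state described above.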
### 9.4 vmagent bearer token destroyed by direct `kubectl apply`
The committed `vmagent.yaml` has `bearer_token: TOKEN_PLACEHOLDER`. The real
token is `sed`-substituted at deploy time by `03-deploy.sh`. Applying the
file directly:
```sh
kubectl apply -f deploy-k3s/manifests/observability/vmagent.yaml # WRONG
```
overwrites the Secret with the literal `TOKEN_PLACEHOLDER`, after which
remote writes fail with 401. To restore without a full redeploy:
```sh
OBS_TOKEN_B64=$(kubectl -n honeydue get secret honeydue-secrets \
-o jsonpath='{.data.OBS_INGEST_TOKEN}')
kubectl -n honeydue patch secret vmagent-remote-write --type=json \
-p="[{\"op\":\"replace\",\"path\":\"/data/bearer_token\",\"value\":\"${OBS_TOKEN_B64}\"}]"
kubectl -n honeydue rollout restart deploy/vmagent
```
Or just re-run `./deploy-k3s/scripts/03-deploy.sh` — the sed handles it.
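To spot a clobbered Secret without waiting for dashboards to go empty, a check along these lines may help. It is a sketch: `is_placeholder_token` is a hypothetical helper, and it assumes the token is stored under the `bearer_token` key, as the patch path in the restore commands implies.

```sh
# Hypothetical helper: takes the base64 value of the Secret's bearer_token
# and succeeds if it decodes to the literal placeholder.
is_placeholder_token() {
  [ "$(printf '%s' "$1" | base64 -d)" = "TOKEN_PLACEHOLDER" ]
}

# Usage:
#   b64=$(kubectl -n honeydue get secret vmagent-remote-write \
#     -o jsonpath='{.data.bearer_token}')
#   is_placeholder_token "$b64" && echo "Secret holds the placeholder"
```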
### 9.5 Dashboard queries: api metrics need `service="api"` not `namespace="honeydue"`
vmagent's scrape config (`vmagent-config` ConfigMap) explicitly chooses which
Kubernetes pod-metadata labels to copy onto each scraped series. **Namespace
isn't one of them.** Labels you can use on api-side metrics:
- `service` (literal `"api"`)
- `job` (literal `"api"`)
- `pod` (the api pod name)
- `node` (the k8s node name)
- `cluster` (vmagent external_label, currently `"honeydue-k3s"`)
- `environment` (vmagent external_label, currently `"prod"`)
- Plus each metric's own labels (`method`, `route`, `status` for HTTP; etc.)
`kube_*` metrics from kube-state-metrics DO carry `namespace` natively
(KSM publishes it as a label, vmagent passes it through). Loki streams have
`namespace` because alloy-logs explicitly relabels it. So the rule is:
| Metric prefix | Use |
|---|---|
| `kube_*` | `namespace="honeydue"` |
| `http_*`, `gorm_*`, `go_*`, `process_*` (api) | `service="api"` |
| Loki logs `{...}` | `namespace="honeydue"` |
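The metric rows of the table can be encoded as a small lookup, shown here as a sketch. `selector_for` is a hypothetical helper, and the metric names in the usage lines are only examples of each prefix.

```sh
# Hypothetical helper: prints the label matcher to use for a metric name,
# following the table above. Unknown prefixes return non-zero.
selector_for() {
  case "$1" in
    kube_*) printf '%s\n' 'namespace="honeydue"' ;;
    http_*|gorm_*|go_*|process_*) printf '%s\n' 'service="api"' ;;
    *) return 1 ;;
  esac
}

# Usage: build a PromQL selector for a dashboard query.
#   m=go_goroutines; echo "${m}{$(selector_for "$m")}"
#   m=kube_pod_info; echo "${m}{$(selector_for "$m")}"
```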
### 9.6 Cluster-label collision when two clusters run together
Both Hetzner and OVH vmagents push as `cluster=honeydue-k3s, environment=prod`
(same external_labels). During the migration overlap this made dashboards
sum both clusters' data. The simplest narrowing during overlap is by node
name pattern (`node=~"vps-.*"` for OVH, `node=~"ubuntu-.*"` for Hetzner). If
you ever bring up a backup cluster long-term, change one cluster's
`external_labels.cluster` to something distinct (e.g. `honeydue-ovh`
vs. `honeydue-backup`).
### 9.7 Worker double-firing scheduled jobs
If two `worker` Deployments run concurrently (e.g. two clusters both pointing
at the same Neon DB), Asynq schedulers each fire crons independently — users
get duplicate emails. Workaround: scale all-but-one worker to 0. This is the
exact mechanic used during cutovers (§6.11).
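The workaround amounts to one `kubectl scale` per cluster that should stay quiet. A sketch, assuming the Deployment is named `worker` in the `honeydue` namespace and that each cluster has its own kubeconfig file; the helper name and the path in the usage line are illustrative.

```sh
# Illustrative helper: scale the worker Deployment to zero on the cluster
# addressed by the given kubeconfig, so only one scheduler keeps firing.
park_worker() {
  kubectl --kubeconfig "$1" -n honeydue scale deploy/worker --replicas=0
}

# Usage: park the worker on the cluster that should NOT run crons.
#   park_worker path/to/other-cluster-kubeconfig
```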
### 9.8 Node kubeconfig mode
`/etc/rancher/k3s/k3s.yaml` on each node is mode `0600` because we install
with `--write-kubeconfig-mode=0600`. This matches the k3s default (many
guides loosen it to 0644); pinning it explicitly was intentional. Don't
change it without coordinating: any tooling on the node that expects to
read it (none today) will break.
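A quick drift check for the mode, as a sketch. `has_mode` is a hypothetical helper and relies on GNU `stat`, so it is meant to run on the Linux node itself.

```sh
# Hypothetical helper: succeeds if the file's permission bits equal the
# expected octal mode (GNU stat syntax).
has_mode() {
  [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Usage on a node:
#   has_mode /etc/rancher/k3s/k3s.yaml 600 || echo "kubeconfig mode drifted"
```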
---
## 10. Differences from MIGRATION_NOTES.md (Hetzner-era)
`MIGRATION_NOTES.md` documents the Swarm → k3s migration on Hetzner
(2026-04-24). Most of it still applies, with these OVH-specific deltas:
| What MIGRATION_NOTES says | What OVH actually has |
|---|---|
| `hetzner-k3s` provisioner | Manual k3s install (§6) |
| Hetzner Load Balancer (not used) → Cloudflare round-robin | Same — Cloudflare round-robin (§4) |
| Traefik as DaemonSet + hostNetwork via HelmChartConfig | Traefik default Deployment + klipper-lb svclb DaemonSet. The `traefik-helmchartconfig.yaml` file is **NOT applied** on OVH. |
| `servicelb` disabled (`--disable=servicelb`) | `servicelb` enabled (we didn't pass `--disable=servicelb`). This is what makes klipper-lb work. |
| sysctl `net.ipv4.ip_unprivileged_port_start=0` for hostNetwork Traefik | Not needed — klipper-lb proxies the port binding instead |
| UFW rules between 3 Hetzner IPs | UFW rules between 3 OVH IPs (51.81.83.33, 51.81.87.86, 51.81.85.248) |
| Kubeconfig at `~/.kube/honeydue-k3s.yaml` | Kubeconfig at `deploy-k3s/kubeconfig` |
| TLS at origin: not configured (CF Flexible) | Same — CF Flexible. `cloudflare-origin-cert` Secret exists (carried over) but Ingress doesn't reference it. |
---
## 11. Outstanding follow-ups (deferred, not blocking)
1. **No warm standby / rollback cluster.** OVH is solo production. An OVH
outage is a real outage; recovery time = §6 procedure (~30 min). User
plans to bring a second cluster up as a target.
2. **UFW allows 80/443 from world.** Hetzner had a network-layer Cloudflare-IP
allowlist on these ports. OVH currently relies on the L7
`cloudflare-only` Traefik middleware, which protects admin but NOT api /
web / apex. Those routes must accept requests from any client, so anyone
who learns the origin IPs can bypass Cloudflare and hit the nodes directly,
which makes them trivially DDoSable. Fix: replace the world-open ufw rules
for `80/tcp` and `443/tcp` with allow rules limited to Cloudflare's published
IP ranges (~22 IPv4 prefixes from https://www.cloudflare.com/ips-v4/).
3. **Cloudflare TLS Flexible → Full(strict).** Origin certs exist as Secret
but Ingress doesn't terminate TLS. Upgrading to Full(strict) requires
Traefik configured with the cert + an HTTPS entrypoint + Ingress
`tls:` block.
4. **`rbac.yaml` + `pod-disruption-budgets.yaml` should be in `03-deploy.sh`.**
They're currently bootstrap-only. Adding them is idempotent and prevents
the §9.1 footgun.
5. **Push notification metrics are log-derived, not counters.** Successes
aren't logged or counted. Proper Prometheus instrumentation (~15 lines in
`internal/push/client.go`) would give a real success/failure ratio.
6. **Worker has no `/metrics` endpoint.** `cmd/worker/main.go` serves `:6060`
for healthz only. Registering Asynq's queue metrics collector
(`metrics.NewQueueMetricsCollector` from `asynq/x/metrics`), adding a
ServiceMonitor, and uncommenting the `worker` job stanza in the
`vmagent-config` ConfigMap would give real queue depth and job latency.
7. **Ory Kratos.** Manifests exist (`manifests/kratos/`) but the deploy
is gated on operator-side prerequisites (Neon `kratos` database,
`auth.myhoneydue.com` DNS, real Apple+Google OIDC clients, Kratos image
tag pinned). Until `kratos-secrets` exists, `03-deploy.sh` silently
skips the Kratos apply.
8. **Hetzner cluster fully retired?** The `config.yaml` `nodes:` block now
describes OVH; the backup kubeconfig is at `kubeconfig.hetzner.bak`. The
boxes themselves are operator-managed.
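For follow-up 2, the allowlist could be generated from Cloudflare's published list rather than typed by hand. The sketch below only prints `ufw` commands for review and does not run them. `cf_ufw_rules` is a hypothetical helper, and the existing world-open rules for 80/443 would still need to be removed separately.

```sh
# Hypothetical helper: reads one CIDR per line on stdin and prints one
# ufw allow rule for ports 80 and 443 per CIDR. Blank lines are skipped.
# Nothing is executed; review the output before piping it to a shell.
cf_ufw_rules() {
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    [ -n "$cidr" ] || continue
    printf 'ufw allow proto tcp from %s to any port 80,443\n' "$cidr"
  done
}

# Usage on a node:
#   curl -fsS https://www.cloudflare.com/ips-v4/ | cf_ufw_rules
```

Printing instead of executing keeps the change reviewable, which matters here because a wrong rule order can lock Cloudflare out of the origin.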
### 11.1 Dashboard observability gaps (raised 2026-06-03 during dashboard build)
Surfaced while building the `honeydue-eli5-overview` Grafana dashboard. Each
needs code or infra changes to expose; none blocks today's operations.
9. **node-exporter not deployed.** No node-level metrics today
(`node_filesystem_avail_bytes`, `node_memory_*`, `node_load1`, etc.).
The dashboard's pod-level memory/CPU panels are app-process only — a
node running out of disk would silently fail the cluster before any
dashboard signal showed it. Highest-priority Tier-3 item. Fix: deploy
`node-exporter` as a DaemonSet (~50 lines of YAML), add a scrape stanza
to `vmagent-config`, add a `Node disk free` stat panel.
10. **Traefik metrics not enabled.** Traefik can expose `/metrics` with
`traefik_entrypoint_requests_total` + `traefik_service_request_duration_seconds`,
giving edge-level visibility into requests that never reached api
pods (404s, redirects, middleware blocks). Enable via a
HelmChartConfig override that sets `metrics.prometheus.entryPoint=metrics`
+ adds a `:9100` entryPoint + a scrape stanza. Skipped today to avoid
Traefik restart risk; safe additive change when ready.
11. **Push notification success/failure counters** (already #5). Add
`prometheus.NewCounterVec` in `internal/push/client.go` with labels
`platform={ios,android}, outcome={success,failed,breaker_open,disabled}`.
Increments at every Send/SendActionable branch. Replaces the
log-derived "Push failures" stat on the dashboard with a real success
rate.
12. **Worker queue / job metrics** (already #6). Asynq has a built-in
Prometheus exporter (`asynq/x/metrics`). Wire it into the worker's
`:6060` health server (a single `healthMux.Handle` line) and
uncomment the worker scrape stanza in `vmagent-config`. Surfaces
queue depth, retry count, processing time per task type.
13. **Cache hit / miss rate.** `internal/services/cache_service.go` has
no counters. Add a Counter with labels `{operation=get|set, result=hit|miss}`
around the cache wrapper. ~10 lines. Useful once real traffic flows
to verify the ETag and Redis caches are paying their keep.
14. **APNs send-latency histogram.** Wrap `internal/push/apns.go::Send`
in a `prometheus.NewHistogramVec` keyed on outcome. Tells you when
Apple's gateway is slow (which correlates with their incident page).
---
## 12. Audit trail
| Date | Change |
|---|---|
| 2026-04-24 | Initial k3s cluster on Hetzner (Swarm → k3s migration) — see MIGRATION_NOTES.md |
| 2026-04-25 | `config.yaml` reconstructed from live ConfigMap (original file lost) |
| 2026-05-15 | Audit fixes: Redis auth required, admin basic auth, secrets-encryption flag |
| 2026-05-16 | `02-setup-secrets.sh` started carrying B2 credentials (was a manifest/script drift) |
| 2026-06-02 | Kratos scaffolding committed (not deployed) |
| 2026-06-03 | **Hetzner → OVH BHS cutover.** New 3-node cluster on 51.81.83.33, .87.86, .85.248. DNS cut on Cloudflare. Hetzner kubeconfig moved to `.bak`. Grafana `honeydue-eli5-overview` dashboard created. Hetzner cluster powered off later same day. |
| 2026-06-03 | Dashboard build-out: extended `honeydue-eli5-overview` to 22 panels covering Tier-1 (HTTP status, CPU per pod, goroutines, top slow) and Tier-2 (GC, network I/O, pod uptime, top 5xx) signals. Surfaced Tier-3 instrumentation gaps in §11.1. |