# honeyDue k3s Cluster — Operations Runbook

Living document for the honeyDue production cluster. Add entries when you hit
something non-obvious so future-you (or your replacement) doesn't have to
rediscover it.

Last full revision: **2026-06-03** (Hetzner → OVH BHS cutover; the OVH cluster
has been the sole production cluster from that date forward). For pre-OVH
history, see `MIGRATION_NOTES.md` (Swarm → k3s migration on Hetzner, 2026-04-24).

---

## 1. Topology and inventory

### Hosting

| Item | Value |
|---|---|
| Provider | OVHcloud (us.ovhcloud.com) |
| Datacenter | BHS — Beauharnois, Quebec, Canada |
| Plan | VPS-1 × 3 (~$6.46/mo each, ~$19/mo total) |
| Node spec | 4 vCPU (Intel Haswell, shared), 7.6 GB RAM, 75 GB NVMe |
| Public bandwidth | 400 Mbps per node, unlimited traffic |
| Private network | **None.** Nodes have public IPv4 + IPv6 only; inter-node traffic crosses the public internet (encrypted by flannel WireGuard backend — see §3) |

### Nodes

| SSH alias | Kubernetes node name | Public IPv4 | Public IPv6 | Roles |
|---|---|---|---|---|
| `ovhcloud1` | `vps-1624d691` | `51.81.83.33` | `2604:2dc0:101:200::5a9a` | control-plane, etcd, redis-pinned |
| `ovhcloud2` | `vps-c0f51be2` | `51.81.87.86` | `2604:2dc0:101:200::30d4` | control-plane, etcd |
| `ovhcloud3` | `vps-dbca24c7` | `51.81.85.248` | `2604:2dc0:101:200::450f` | control-plane, etcd |

The cluster is **all-control-plane** (workloads schedule on the same nodes that
run etcd and the API server). `vps-1624d691` carries the
`honeydue/redis=true` label so the Redis Deployment's `nodeSelector` binds
there; the Redis PVC (`local-path`, host-pinned) lives on that node's disk.

### SSH access

`~/.ssh/config` entries (operator workstation):

```
Host ovhcloud1
    HostName 51.81.83.33
    Port 22
    User ubuntu
    IdentityFile ~/.ssh/ovhcloud
    IdentitiesOnly yes
Host ovhcloud2
    HostName 51.81.87.86
    Port 22
    User ubuntu
    IdentityFile ~/.ssh/ovhcloud
    IdentitiesOnly yes
Host ovhcloud3
    HostName 51.81.85.248
    Port 22
    User ubuntu
    IdentityFile ~/.ssh/ovhcloud
    IdentitiesOnly yes
```

`ubuntu` has passwordless sudo (`/etc/sudoers.d/90-cloud-init-users` from OVH's
cloud-init).

### kubectl access

```bash
export KUBECONFIG=/Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go/deploy-k3s/kubeconfig
kubectl get nodes
```

The `deploy-k3s/kubeconfig` file (mode 0600, gitignored) is the OVH cluster's
admin kubeconfig with `server: https://51.81.83.33:6443`. A stale Hetzner copy
lives next to it as `kubeconfig.hetzner.bak` for historical reference; the
Hetzner cluster is powered off and that file's API server is unreachable.

To refresh from the cluster (if the local copy is lost or rotated):

```bash
ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
  > deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
```

The k3s API at `:6443` is open to the public internet (token-protected).

---

## 2. Software

### Kernel-level

| Component | Value |
|---|---|
| OS | Ubuntu 26.04 LTS (set by OVH's VPS-1 image) |
| Kernel | `7.0.0-14-generic` |
| Init | systemd |
| Container runtime | containerd 2.2.2 (bundled with k3s) |
| Firewall | `ufw` (per-node, configured at install — see §3) |
| Other host packages | `fail2ban` (SSH brute-force protection, default jail), `unattended-upgrades` (security updates), `open-iscsi` (k3s prereq for some storage backends), `curl` |

### Kubernetes

| Component | Value |
|---|---|
| Distribution | k3s |
| Version | **`v1.34.6+k3s1`** (pinned in `config.yaml:cluster.k3s_version`) |
| Control plane | 3-node HA, embedded etcd (no external Postgres backing store) |
| CNI / networking | flannel with **WireGuard-native backend** (`--flannel-backend=wireguard-native`). Encrypts pod-to-pod and etcd peer traffic because nodes only have public IPs (no private network). ~3-5% CPU overhead under load. |
| Service LB | klipper-lb (default k3s `servicelb`). The `svclb-traefik` DaemonSet binds host ports `:80` and `:443` on each node and forwards to the Traefik Service. **Not** the DaemonSet-with-hostNetwork Traefik pattern used on the old Hetzner cluster — see §10 *Differences from MIGRATION_NOTES*. |
| Ingress controller | Traefik (k3s default), single-replica Deployment, exposed via klipper-lb |
| DNS | CoreDNS (k3s default) |
| Secrets encryption | Enabled (`--secrets-encryption`); Secret values are AES-CBC encrypted at rest in etcd |
| kubeconfig perms | `0600` (`--write-kubeconfig-mode=0600`) |
| Cloud controller | Disabled (`--disable-cloud-controller`) — no provider integration on OVH |
| Misc | `--node-ip` / `--node-external-ip` / `--advertise-address` all set to each node's public IPv4. TLS SANs cover all 3 IPs so any IP can serve the API. |

### Application stack (in cluster, `honeydue` namespace)

| Deployment | Replicas | Image (digest-pinned) | Notes |
|---|---:|---|---|
| `api` | 3 | `gitea.treytartt.com/admin/honeydue-api@sha256:34fde6...` | Go REST API on `:8000`, exposes `/metrics` |
| `web` | 3 | `gitea.treytartt.com/admin/honeydue-web@sha256:8c62cf...` | Next.js, server-side proxy to api |
| `admin` | 1 | `gitea.treytartt.com/admin/honeydue-admin@sha256:b81263...` | Next.js admin panel, gated behind Traefik basic-auth |
| `worker` | 1 | `gitea.treytartt.com/admin/honeydue-worker@sha256:fe1f5e...` | Asynq scheduler + Redis-backed jobs (singleton — must not run as >1 replica or every cron fires N×) |
| `redis` | 1 | `redis:7-alpine@sha256:6ab0b6...` | Pinned to `vps-1624d691` via `honeydue/redis=true`. PVC `redis-data` (local-path, 5 Gi). Password-auth required. |
| `vmagent` | 1 | `victoriametrics/vmagent@sha256:...` (default tag) | Scrapes api `/metrics` + kube-state-metrics; remote-writes to obs.88oakapps.com |
| `kube-state-metrics` | 1 | `kube-state-metrics@sha256:...` | In `kube-system`, scraped by vmagent for `kube_*` cluster-state metrics |
| `alloy-logs` (DaemonSet) | 3 (1/node) | `grafana/alloy@sha256:...` | Tails `/var/log/pods/*` and ships to Loki at obs.88oakapps.com |

The Asynq scheduler inside `worker` registers these cron jobs:

| Cron | Job | Notes |
|---|---|---|
| `0 * * * *` | Smart reminder check (per-user hour) | Default user hour: 14:00 UTC |
| `0 * * * *` | Daily digest check (per-user hour) | Default user hour: 03:00 UTC |
| `0 10 * * *` | Onboarding emails | 10:00 UTC |
| `0 3 * * *` | Reminder log cleanup | 03:00 UTC |
| `30 * * * *` | Pending uploads cleanup | xx:30 every hour |

### External dependencies

| Service | Endpoint | Purpose | Failure mode |
|---|---|---|---|
| Neon Postgres | `ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech:5432` | App data. Pooler endpoint (transaction-mode PgBouncer in front of Neon compute) so connections stay warm. | api / worker pods crash-loop with `dial tcp: connection refused`. Health endpoint returns `postgres: error`. |
| Backblaze B2 (S3-compatible) | `s3.us-east-005.backblazeb2.com` (bucket `honeyDueProd`) | User uploads (photos, PDFs, completion attachments) | Upload routes return 5xx; reads of cached/static files still work. |
| Cloudflare | `myhoneydue.com` zone | DNS + TLS termination + edge cache + DDoS | Traffic stops reaching origin. Direct `https://51.81.x.x` still works for diagnostics. |
| obs.88oakapps.com | Operator-run Grafana + VictoriaMetrics + Loki | Metrics & logs | vmagent + alloy-logs back off and retry. No app-side impact. |
| Apple APNs | `api.push.apple.com:443` (production) | iOS push notifications | Push fails; circuit breaker opens; failure logged. App functionality unaffected. |
| Fastmail SMTP | `smtp.fastmail.com:587` | Transactional emails (verification, recovery, digests) | Email send fails in the worker; logged; user reset/digest flow degrades. |
| Gitea registry | `gitea.treytartt.com` | Container image registry | Deploys can't pull. Existing pods keep running on cached images. |

---

## 3. Network and firewall

### Per-node `ufw` configuration

Applied during install (same on all 3 nodes):

```
default deny incoming
default allow outgoing
allow 22/tcp                               (SSH, world)
allow 80/tcp                               (HTTP via Cloudflare, world — see GAP-1)
allow 443/tcp                              (HTTPS, same — GAP-1)
allow 6443/tcp                             (k3s API, world, token-protected)
allow 2379:2380/tcp from <other 2 OVH IPs> (etcd client + peer)
allow 10250/tcp from <other 2 OVH IPs>     (kubelet)
allow 51820/udp from <other 2 OVH IPs>     (WireGuard tunnel)
allow 8472/udp from <other 2 OVH IPs>      (VXLAN, defense-in-depth fallback)
```

To inspect: `ssh ovhcloudN sudo ufw status numbered`.

### Cluster networking

- **Pod CIDR**: `10.42.0.0/16` (default k3s)
- **Service CIDR**: `10.43.0.0/16` (default k3s)
- **Flannel backend**: WireGuard-native. Each node hosts a `flannel-wg` interface on UDP 51820 and tunnels pod traffic to peers. Verify: `ssh ovhcloudN ip -d link show flannel-wg`.

### Traefik ingress flow

```
Cloudflare → node:80/443 (public)
  → klipper-lb svclb-traefik DaemonSet pod (hostPort:80/443)
    → Traefik Service (ClusterIP 10.43.245.127:80/443)
      → Traefik Deployment pod (single replica)
        → matches Ingress host rule (api.myhoneydue.com etc.)
          → routes to backend Service (api / web / admin)
            → backend Pod
```

The default Traefik install lives in `kube-system` and is managed by k3s's
HelmChart. **No HelmChartConfig override is applied on OVH** (unlike Hetzner
— see §10).

---

## 4. DNS configuration (Cloudflare)

The `myhoneydue.com` zone in Cloudflare has these public records. **All
hostnames are proxied (orange cloud)**, which is required by the
`cloudflare-only` Traefik middleware: it returns 403 for any non-Cloudflare
source IP.

| Host | Type | Values | Proxy |
|---|---|---|---|
| `api.myhoneydue.com` | A × 3 | `51.81.83.33`, `51.81.87.86`, `51.81.85.248` | Proxied |
| `app.myhoneydue.com` | A × 3 | (same trio) | Proxied |
| `admin.myhoneydue.com` | A × 3 | (same trio) | Proxied |
| `myhoneydue.com` (apex `@`) | A × 3 | (same trio) | Proxied |

Cloudflare round-robins among the 3 origins, klipper-lb on whichever node CF
hits forwards to Traefik, and Traefik routes by Host header. The net effect is
per-request load balancing across the 3 nodes for ingress, with no central LB.

**SSL/TLS mode**: Flexible (CF terminates TLS at the edge; origin is plain
HTTP on `:80`). Upgrading to Full (strict) is on the deferred list; it would
need an origin certificate provisioned to the `cloudflare-origin-cert` secret
and Traefik configured for TLS termination.

---

## 5. Filesystem layout (`deploy-k3s/`)

```
deploy-k3s/
├── config.yaml                      # Single config source (gitignored; contains tokens)
├── config.yaml.example              # Template
├── kubeconfig                       # OVH admin kubeconfig (gitignored, 0600)
├── kubeconfig.hetzner.bak           # Old Hetzner kubeconfig (unreachable, kept for history)
├── kubeconfig.tunnel                # Optional: localhost-pointing copy for SSH-tunnel use
├── secrets/
│   ├── README.md
│   ├── postgres_password.txt        # Neon DB password
│   ├── secret_key.txt               # 32+ char app-token signing secret
│   ├── email_host_password.txt      # Fastmail SMTP app password
│   ├── fcm_server_key.txt           # FCM server key (currently unused — Android push disabled)
│   ├── apns_auth_key.p8             # APNs auth key (binary)
│   ├── cloudflare-origin.crt        # Origin certificate (currently unused — CF Flexible)
│   └── cloudflare-origin.key
│       (all gitignored except README.md)
├── manifests/
│   ├── namespace.yaml
│   ├── network-policies.yaml        # default-deny + per-app egress/ingress (13 NetPols total)
│   ├── rbac.yaml                    # api/worker/admin/web/redis ServiceAccounts (NOT applied by 03-deploy.sh; manual once)
│   ├── pod-disruption-budgets.yaml  # api-pdb, web-pdb, worker-pdb (NOT applied by 03-deploy.sh; manual once)
│   ├── traefik-helmchartconfig.yaml # Hetzner-only DaemonSet+hostNetwork override (do NOT apply on OVH; we use default klipper-lb)
│   ├── kyverno-verify-images.yaml   # Operator-gated policy (do NOT apply blindly — see file comment)
│   ├── api/{deployment,service,hpa}.yaml
│   ├── worker/deployment.yaml
│   ├── admin/{deployment,service}.yaml
│   ├── web/{deployment,service}.yaml
│   ├── redis/{deployment,service,pvc}.yaml
│   ├── ingress/{middleware,ingress-simple}.yaml
│   ├── migrate/job.yaml             # goose migration Job (image-subbed at deploy time)
│   ├── observability/{kube-state-metrics,vmagent,alloy-logs}.yaml
│   └── kratos/                      # Ory Kratos identity service (NOT yet deployed; gated on operator OIDC setup)
└── scripts/
    ├── _config.sh                   # Sourced by all scripts: cfg(), generate_env(), generate_cluster_config()
    ├── 01-provision-cluster.sh      # Hetzner-Cloud-specific (uses hetzner-k3s CLI) — DO NOT RUN ON OVH
    ├── 02-setup-secrets.sh          # Creates honeydue-secrets etc. from secrets/ + config.yaml; kubeconfig-driven
    ├── 03-deploy.sh                 # Build + push + apply manifests + roll deployments; kubeconfig-driven
    ├── 04-verify.sh                 # Post-deploy health + security checks; kubeconfig-driven
    └── rollback.sh                  # `kubectl rollout undo` across all deployments
```

The `deploy/prod.env` file (sibling to `deploy-k3s/`, gitignored) holds
observability + admin credentials that `02-setup-secrets.sh` and
`03-deploy.sh` read but never display:

```
OBS_INGEST_URL                (https://obs.88oakapps.com/api/v1/write)
OBS_TRACES_URL                (https://obs.88oakapps.com/v1/traces)
OBS_INGEST_TOKEN              (bearer token for VM + Loki + traces — all use same token)
GRAFANA_URL                   (https://grafana.88oakapps.com)
GRAFANA_ADMIN_USER            (admin)
GRAFANA_ADMIN_PASSWORD
ADMIN_EMAIL / ADMIN_PASSWORD  (in-app admin login)
```

---

## 6. Install from clean boxes — the truthful procedure

This is exactly what we ran on 2026-06-03 to stand up the live cluster. If
you ever rebuild from zero, this is the canonical sequence. Total wall-clock:
~12 min for cluster bootstrap; ~10 min for workloads.

### 6.1 Prerequisites

- 3 fresh Ubuntu VPS instances (any provider with public IPv4, ≥4 GB RAM,
  ≥40 GB disk)
- `~/.ssh/config` entries (`ovhcloud1/2/3`) pointing at them, with
  passwordless sudo
- Local `kubectl` and `curl`
- The repo's `deploy-k3s/secrets/` populated (or the ability to copy live
  secrets from another running cluster — see §7.2)
- `deploy/prod.env` populated with obs token + Grafana creds

### 6.2 Per-node OS hardening + firewall (all 3 in parallel)

For each `ovhcloudN`, over SSH:

```sh
export DEBIAN_FRONTEND=noninteractive
sudo apt-get update -qq
sudo apt-get install -y -qq fail2ban unattended-upgrades open-iscsi curl ufw
sudo systemctl enable --now iscsid fail2ban
sudo dpkg-reconfigure -f noninteractive -plow unattended-upgrades

sudo ufw --force reset
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 6443/tcp
SELF=$(hostname -I | awk '{print $1}')
for peer in 51.81.83.33 51.81.87.86 51.81.85.248; do
  [ "$peer" = "$SELF" ] && continue
  sudo ufw allow from "$peer" to any port 2379:2380 proto tcp
  sudo ufw allow from "$peer" to any port 10250 proto tcp
  sudo ufw allow from "$peer" to any port 51820 proto udp
  sudo ufw allow from "$peer" to any port 8472 proto udp
done
sudo ufw --force enable
```

**Watch ordering:** `allow 22/tcp` MUST precede `ufw enable`. Existing SSH
sessions survive (`ufw` only affects new connections), but a misordered script
locks you out of fresh logins.

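One way to review the peer-only rules before touching a node is to print them instead of applying them. The sketch below is illustrative only: `peer_rules` is a hypothetical helper, not something in the repo, and it assumes the node list and ports shown in the block above.

```sh
# Print (do not run) the peer-only ufw rules for one node.
# $1 = this node's public IPv4; remaining args = every node IP in the cluster.
# peer_rules is a hypothetical helper name, not part of deploy-k3s/scripts.
peer_rules() {
  self="$1"; shift
  for peer in "$@"; do
    # A node never needs a rule for its own address.
    [ "$peer" = "$self" ] && continue
    echo "ufw allow from $peer to any port 2379:2380 proto tcp"
    echo "ufw allow from $peer to any port 10250 proto tcp"
    echo "ufw allow from $peer to any port 51820 proto udp"
    echo "ufw allow from $peer to any port 8472 proto udp"
  done
}
```

For example, `peer_rules 51.81.83.33 51.81.83.33 51.81.87.86 51.81.85.248` prints eight lines (four per peer), which can be checked against `sudo ufw status numbered` after the real rules are applied.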
### 6.3 Install k3s on `ovhcloud1` (the init node)

```sh
ssh ovhcloud1 'curl -sfL https://get.k3s.io | \
  INSTALL_K3S_VERSION=v1.34.6+k3s1 \
  sh -s - server \
    --cluster-init \
    --node-ip=51.81.83.33 \
    --node-external-ip=51.81.83.33 \
    --advertise-address=51.81.83.33 \
    --flannel-backend=wireguard-native \
    --flannel-external-ip \
    --secrets-encryption \
    --write-kubeconfig-mode=0600 \
    --tls-san=51.81.83.33 \
    --tls-san=51.81.87.86 \
    --tls-san=51.81.85.248 \
    --disable-cloud-controller'
```

Wait for `sudo k3s kubectl get nodes` to show this node Ready (~2-5 s).
Read the cluster token:

```sh
ssh ovhcloud1 'sudo cat /var/lib/rancher/k3s/server/node-token'
```

### 6.4 Join `ovhcloud2`, then `ovhcloud3` (sequential)

Joining etcd one node at a time avoids split-brain on slow networks.
Replace `<TOKEN>` with the value from §6.3.

For `ovhcloud2`:

```sh
ssh ovhcloud2 'curl -sfL https://get.k3s.io | \
  INSTALL_K3S_VERSION=v1.34.6+k3s1 \
  K3S_TOKEN=<TOKEN> \
  sh -s - server \
    --server=https://51.81.83.33:6443 \
    --node-ip=51.81.87.86 \
    --node-external-ip=51.81.87.86 \
    --advertise-address=51.81.87.86 \
    --flannel-backend=wireguard-native \
    --flannel-external-ip \
    --secrets-encryption \
    --write-kubeconfig-mode=0600 \
    --tls-san=51.81.83.33 --tls-san=51.81.87.86 --tls-san=51.81.85.248 \
    --disable-cloud-controller'
```

Then repeat for `ovhcloud3`, changing only the three per-node address flags:
`--node-ip=51.81.85.248`, `--node-external-ip=51.81.85.248`, and
`--advertise-address=51.81.85.248`. After each join, wait for
`kubectl get nodes` to show the new node Ready before proceeding.

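To make the per-node difference explicit, the flag list can be generated from the node IP. This is a minimal sketch, assuming the flags shown in §6.3 and §6.4; `k3s_server_flags` is a hypothetical helper, and it deliberately leaves out `--cluster-init` (init node only) and `--server=` plus `K3S_TOKEN` (joining nodes only).

```sh
# Print the k3s server flags shared by all three nodes, one per line.
# Only the three address flags depend on $1 (the node's public IPv4).
# k3s_server_flags is a hypothetical helper, not part of the repo.
k3s_server_flags() {
  node_ip="$1"
  printf '%s\n' \
    "--node-ip=$node_ip" \
    "--node-external-ip=$node_ip" \
    "--advertise-address=$node_ip" \
    "--flannel-backend=wireguard-native" \
    "--flannel-external-ip" \
    "--secrets-encryption" \
    "--write-kubeconfig-mode=0600" \
    "--tls-san=51.81.83.33" \
    "--tls-san=51.81.87.86" \
    "--tls-san=51.81.85.248" \
    "--disable-cloud-controller"
}
```

Comparing the output for two IPs (for example with `diff`) should show exactly three differing lines, which is a quick check that a hand-edited join command did not drift from the init node's settings.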
### 6.5 Pull kubeconfig to the operator workstation

```sh
ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
  > deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig
kubectl get nodes -o wide   # All 3 Ready, INTERNAL-IP = public IP
```

### 6.6 Label the redis node

```sh
kubectl label node vps-1624d691 honeydue/redis=true --overwrite
```

(Use whichever k8s node name corresponds to `ovhcloud1`. The Redis
Deployment's `nodeSelector` binds to this label.)

### 6.7 Bootstrap manifests NOT applied by `03-deploy.sh`

These must be applied manually on a fresh cluster, **before** running
`03-deploy.sh`, or the Deployments will never create their pods:

```sh
kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl apply -f deploy-k3s/manifests/pod-disruption-budgets.yaml
```

`rbac.yaml` creates the 5 ServiceAccounts (`api`, `worker`, `admin`, `web`,
`redis`) referenced by the Deployment manifests. Without these, ReplicaSets
hang on `FailedCreate: error looking up service account` and pods never
start. Symptom on first deploy: `kubectl get deploy` shows `0 up-to-date`
across the board with no pod activity — see §9 *Gotchas*.

**Do NOT apply** `traefik-helmchartconfig.yaml` (Hetzner-only — see §10) or
`kyverno-verify-images.yaml` (gated on operator Kyverno install).

### 6.8 Seed secrets

Two paths; pick whichever fits your situation:

**Path A — clean install from local files** (the original design):

```sh
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/02-setup-secrets.sh
```

Requires `deploy-k3s/secrets/` to contain real `postgres_password.txt`,
`secret_key.txt`, `email_host_password.txt`, `fcm_server_key.txt`,
`apns_auth_key.p8`, `cloudflare-origin.crt`, `cloudflare-origin.key`. The
script reads `config.yaml` for `registry.*`, `redis.password`,
`admin.basic_auth_*`, and `storage.b2_*`.

**Path B — clone live secrets from another running cluster** (what we
actually did during the migration; useful if `secrets/` is empty or you want
exact-byte equivalence):

```sh
HETZNER=$(pwd)/deploy-k3s/kubeconfig.hetzner.bak   # or any kubeconfig with the secrets
OVH=$(pwd)/deploy-k3s/kubeconfig
kubectl --kubeconfig="$OVH" apply -f deploy-k3s/manifests/namespace.yaml
for S in honeydue-secrets honeydue-apns-key gitea-credentials cloudflare-origin-cert admin-basic-auth; do
  # Strip cluster-specific metadata so the Secret applies cleanly on the new cluster.
  kubectl --kubeconfig="$HETZNER" -n honeydue get secret "$S" -o json \
    | python3 -c "
import json, sys
d = json.load(sys.stdin)
m = d['metadata']
for k in ('uid','resourceVersion','creationTimestamp','generation','managedFields','ownerReferences','selfLink'):
    m.pop(k, None)
m.pop('annotations', None)
print(json.dumps(d))" \
    | kubectl --kubeconfig="$OVH" apply -f -
done
```

After either path, verify:

```sh
kubectl -n honeydue get secrets
# Expect: admin-basic-auth, cloudflare-origin-cert, gitea-credentials,
#         honeydue-apns-key, honeydue-secrets
```

### 6.9 Deploy workloads

```sh
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig \
  ./deploy-k3s/scripts/03-deploy.sh --skip-build --tag latest
```

- `--skip-build` skips Docker build + push, deploys whatever's already in the
  registry at the named tag. Use this when migrating between clusters to
  guarantee both run identical bits.
- Without flags it builds the api / worker / admin / web images from the
  local repo HEAD and pushes to `gitea.treytartt.com` first.
- The script applies (in order): namespace, network-policies (13 of them),
  redis, ingress, then runs the goose migration Job (blocking on success),
  then api / worker / admin / web Deployments, then observability
  (kube-state-metrics, vmagent, alloy-logs).
- It does NOT apply: `rbac.yaml`, `pod-disruption-budgets.yaml`,
  `traefik-helmchartconfig.yaml`, `kyverno-verify-images.yaml`. The first
  two must be applied manually (see §6.7); the latter two are Hetzner-only
  or operator-gated.
- It does NOT apply: anything under `kratos/` (skipped until
  `kratos-secrets` exists, which requires real OIDC client IDs).

### 6.10 Verify

```sh
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/04-verify.sh
```

Expect: all deployments `READY=desired`, 13 NetworkPolicies, 7 ServiceAccounts
(api, worker, admin, web, redis, vmagent, alloy-logs), 3 PDBs, cloudflare-only
middleware present, in-cluster `/api/health/` returns 200.

External smoke test. It hits each node IP directly with the `Host` header set,
so it does not depend on DNS; this works because the api `/api/health/` route
is exempt from the cloudflare-only middleware:

```sh
for IP in 51.81.83.33 51.81.87.86 51.81.85.248; do
  curl -s -o /dev/null -w "$IP -> %{http_code}\n" \
    -H 'Host: api.myhoneydue.com' http://$IP/api/health/
done
# All three should return 200.
```

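If the smoke test runs unattended (for example at the end of a rebuild), it helps to turn "all three should return 200" into an exit status. The following is a sketch under the assumption that the input is the `IP -> CODE` format printed by the loop above; `all_200` is a hypothetical name.

```sh
# Read "IP -> CODE" lines on stdin. Print every line whose code is not 200
# and return non-zero if any such line exists.
# Note: empty input passes, so pair this with a check that curl actually ran.
# all_200 is a hypothetical helper, not part of 04-verify.sh.
all_200() {
  bad=$(awk '$3 != "200"')
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad"
    return 1
  fi
}
```

Piping the loop's output into `all_200` makes the failing node visible in the output; curl reports `000` when it cannot connect at all, which this check also flags.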
### 6.11 DNS cutover (if migrating)

In the Cloudflare dashboard for `myhoneydue.com`, set the 4 hostnames in §4 to
the OVH IPs and keep them proxied. Effective propagation is ~30 s to 5 min
through the Cloudflare proxy.

If you have a previous cluster, **scale its worker to 0 before flipping** to
avoid scheduled-job double-fires:

```sh
KUBECONFIG=<previous> kubectl -n honeydue scale deploy/worker --replicas=0
# (cut DNS)
KUBECONFIG=<new> kubectl -n honeydue scale deploy/worker --replicas=1
```

Run the DNS cut and the scale-up back-to-back, right after the scale-down.
Worker work is mostly scheduled (hourly+), so a brief gap is harmless; overlap
would cause duplicate emails.

---

## 7. Day-to-day operations

### Common kubectl one-liners

```sh
export KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig

# Cluster state
kubectl get nodes -o wide
kubectl -n honeydue get pods
kubectl -n honeydue get deploy
kubectl top nodes
kubectl -n honeydue top pods

# Tail logs
kubectl -n honeydue logs deploy/api -f --tail=50
kubectl -n honeydue logs -l app.kubernetes.io/name=api -f --tail=20
stern -n honeydue api   # if stern is installed (multi-pod)

# Restart a deployment (no image change, picks up ConfigMap changes)
kubectl -n honeydue rollout restart deploy/api

# Rollback one revision
kubectl -n honeydue rollout undo deploy/api

# Scale (worker MUST stay at 0 or 1)
kubectl -n honeydue scale deploy/api --replicas=4

# Get into a pod
kubectl -n honeydue exec -it deploy/api -- sh
```

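Because the worker is a singleton, a small wrapper can refuse any replica count other than 0 or 1 before `kubectl` is ever called. This is only a sketch: `scale_worker` and the `KUBECTL` override variable are hypothetical and do not exist in the repo's scripts.

```sh
# KUBECTL can be pointed at a stub command or function for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Scale the worker Deployment, refusing anything other than 0 or 1 replicas.
# scale_worker is a hypothetical helper name.
scale_worker() {
  case "$1" in
    0|1)
      "$KUBECTL" -n honeydue scale deploy/worker --replicas="$1"
      ;;
    *)
      echo "refusing: worker must run 0 or 1 replicas, got '$1'" >&2
      return 1
      ;;
  esac
}
```

With the wrapper sourced, `scale_worker 0` and `scale_worker 1` behave like the plain command, while `scale_worker 2` exits non-zero without contacting the cluster.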
### Redeploy after code changes

```sh
KUBECONFIG=$(pwd)/deploy-k3s/kubeconfig ./deploy-k3s/scripts/03-deploy.sh
```

Builds images from local HEAD, tags with the git short SHA, pushes to Gitea,
runs `goose up` (idempotent), rolls api/worker/admin/web. Total: ~3-5 min
when images change.

To deploy without rebuilding (pin to a specific tag, or `latest`):

```sh
./deploy-k3s/scripts/03-deploy.sh --skip-build --tag <tag-or-latest>
```

### Migrations

Goose migrations live in `migrations/`. New file pattern:

```
make migrate-new name=add_foo_column   # generates migrations/YYYYMMDDHHMMSS_add_foo_column.sql
# Edit the file with -- +goose Up / -- +goose Down sections
```

`03-deploy.sh` runs a one-shot Job (`manifests/migrate/job.yaml`) that
executes `goose up` against Neon (direct compute endpoint, not pooler — see
file comment). The Job blocks api/worker rollout and aborts the deploy on
failure. No app pod runs `AutoMigrate`; api/worker startup verifies
`goose_db_version` is current and refuses to boot on mismatch.

### Grafana

URL: https://grafana.88oakapps.com (creds in `deploy/prod.env`)

Three dashboards in the `honeyDue` folder:

| UID | Title | Use |
|---|---|---|
| `honeydue-eli5-overview` | honeyDue — Overview (ELI5) | Single-screen at-a-glance health: pods up, crashes, errors, RPS, latency, Postgres, memory, top endpoints, push failures, worker activity, recent error logs. Created 2026-06-03. |
| `honeydue-red` | honeyDue API — RED | Rate/Errors/Duration cuts (legacy) |
| `honeydue-logs` | honeyDue — Production Logs | Live log explorer |

For the ELI5 dashboard's queries, **api-side metrics use `service="api"`,
NOT `namespace="honeydue"`.** vmagent's scrape config drops the namespace
label from api metrics — only `service`, `pod`, `node`, `job`, plus the
metric's own labels (route, method, status, etc.) survive. Queries that
filter on `namespace="honeydue"` for api metrics silently match nothing.

### kubectl tunnel (if 6443 is firewalled to your IP)

Currently `6443` is open WAN-side (matching the previous Hetzner posture).
If you tighten that to operator-IPs-only and your IP changes, use an SSH
tunnel:

```sh
ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \
  -i ~/.ssh/ovhcloud \
  -L 127.0.0.1:6443:127.0.0.1:6443 \
  ubuntu@51.81.83.33

cp deploy-k3s/kubeconfig deploy-k3s/kubeconfig.tunnel
sed -i.bak 's|https://51.81.83.33:6443|https://127.0.0.1:6443|' deploy-k3s/kubeconfig.tunnel
export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"
```

---

## 8. Disaster recovery

### "I lost the kubeconfig"

```sh
ssh ovhcloud1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://51.81.83.33:6443|' \
  > deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
```

If `ovhcloud1` is down but `ovhcloud2` or `ovhcloud3` is up, swap in that
node's SSH alias and public IP; the TLS SAN covers all three.

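The "swap host and IP" step can be parameterized so the same pipeline works against any of the three nodes. A minimal sketch, assuming the kubeconfig on each node still points at `127.0.0.1:6443` as it does on `ovhcloud1`; `rewrite_server` is a hypothetical helper.

```sh
# Rewrite the loopback API address in a k3s kubeconfig (stdin) to a node's
# public IP ($1) and write the result to stdout.
# rewrite_server is a hypothetical helper name.
rewrite_server() {
  sed "s|server: https://127.0.0.1:6443|server: https://$1:6443|"
}

# Usage sketch (not executed here), pulling from ovhcloud2 instead of ovhcloud1:
#   ssh ovhcloud2 'sudo cat /etc/rancher/k3s/k3s.yaml' \
#     | rewrite_server 51.81.87.86 > deploy-k3s/kubeconfig
#   chmod 600 deploy-k3s/kubeconfig
```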
### "A node is unresponsive"

```sh
kubectl drain vps-XXX --ignore-daemonsets --delete-emptydir-data
# Reboot via OVH manager or:
ssh ovhcloudN sudo reboot
# Wait for Ready, then:
kubectl uncordon vps-XXX
```

The cluster tolerates 1 node down (etcd quorum 2/3). With 2 down, etcd
loses quorum and the API server stops accepting writes.

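The 2/3 figure follows from etcd's majority rule: a cluster of n members needs floor(n/2) + 1 of them up. As a worked example in shell arithmetic (the function names are illustrative only):

```sh
# Majority quorum for an etcd cluster of $1 members: floor(n/2) + 1.
etcd_quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that can fail while quorum is still held.
etcd_tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }
```

For this cluster, `etcd_quorum 3` gives 2 and `etcd_tolerated 3` gives 1. A fourth node would raise the quorum to 3 without raising the tolerated failures above 1, which is why odd member counts are the usual choice.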
### "etcd quorum lost (2+ nodes dead)"

Bring nodes back online if possible. If not, stop the k3s service on the
surviving node and reset the cluster from a snapshot:

```sh
ssh ovhcloud1 'sudo systemctl stop k3s && sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<latest>'
```

k3s takes automatic etcd snapshots every 12h, keeping 5. List with:

```sh
ssh ovhcloud1 sudo ls -la /var/lib/rancher/k3s/server/db/snapshots/
```

The reset is destructive: cluster state written since the snapshot is lost,
but Neon (actual app data) is unaffected. When the reset finishes, start the
service again with `sudo systemctl start k3s` and follow the instructions k3s
logs for rejoining the other servers.

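To fill in `<latest>` without reading the listing by eye, the newest file can be picked by modification time. This is a sketch that assumes the snapshot directory contains only snapshot files; `latest_snapshot` is a hypothetical helper, and on a node it would need to run as root because the directory is not world-readable.

```sh
# Print the name of the most recently modified file in directory $1.
# Prints nothing if the directory is empty or unreadable.
# latest_snapshot is a hypothetical helper name.
latest_snapshot() {
  ls -1t "$1" 2>/dev/null | head -n 1
}

# Usage sketch (not executed here):
#   ssh ovhcloud1 'sudo ls -1t /var/lib/rancher/k3s/server/db/snapshots/ | head -n 1'
```

Confirm the chosen file's timestamp against the listing before restoring, since a restore from the wrong snapshot discards more state than necessary.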
### "I have to rebuild the whole cluster from scratch"

Provision 3 fresh boxes, then follow the sequence in §6 exactly. End-to-end is
~30 min. The dependencies that make this possible:

| Stays put through rebuild | Where |
|---|---|
| Application data | Neon Postgres (managed) |
| User uploads | Backblaze B2 (managed) |
| Container images | `gitea.treytartt.com` (self-hosted, but not on the OVH cluster) |
| Operator secrets | `deploy-k3s/secrets/` + `config.yaml` + `deploy/prod.env` on the operator workstation (gitignored) |
| DNS | Cloudflare control panel |

If `gitea.treytartt.com` ever moved onto this cluster, you would have a
circular dependency: rebuilding requires images you can't pull until the
cluster is up. Currently Gitea is NOT in the honeyDue cluster (separate
Hetzner-era host), so this isn't a problem today, but it is worth flagging if
that ever changes.

### "Cutover back to Hetzner / failover to a backup cluster"
|
||
|
||
There is **no warm standby today.** Bringing up a second cluster is the
|
||
same §6 procedure on different hardware, then a Cloudflare DNS swap. The
|
||
worker-swap dance is critical:
|
||
|
||
```sh
|
||
KUBECONFIG=<current> kubectl -n honeydue scale deploy/worker --replicas=0
|
||
# (Update Cloudflare DNS to new cluster's IPs — proxied)
|
||
KUBECONFIG=<new> kubectl -n honeydue scale deploy/worker --replicas=1
|
||
```
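
The ordering in the swap can be guarded with a small check before the new worker is started. This is a sketch under stated assumptions: `worker_replicas` and `old_worker_stopped` are hypothetical helpers, and the `KUBECTL` variable exists only so the check can be exercised against a stub.

```sh
# KUBECTL is overridable so the check can be dry-run against a stub.
KUBECTL="${KUBECTL:-kubectl}"

# Print the desired replica count of deploy/worker for a given kubeconfig.
worker_replicas() {
  KUBECONFIG="$1" "$KUBECTL" -n honeydue get deploy/worker \
    -o jsonpath='{.spec.replicas}'
}

# Succeed only if the old cluster's worker is scaled to 0.
old_worker_stopped() {
  [ "$(worker_replicas "$1")" = "0" ]
}
```

A possible use is `old_worker_stopped <current> && KUBECONFIG=<new> kubectl -n honeydue scale deploy/worker --replicas=1`. The check reads the desired replica count only, so it is still worth confirming that no old worker pods remain before relying on it.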

---

## 9. Known gotchas

### 9.1 First-deploy "0 up-to-date" across all Deployments

**Symptoms:** `kubectl get deploy` shows `READY 0/N, UP-TO-DATE 0` for
api/worker/admin/web/redis. `kubectl get events` shows
`FailedCreate: error looking up service account honeydue/<name>: serviceaccount "..." not found`.

**Cause:** `rbac.yaml` (ServiceAccounts) is NOT applied by `03-deploy.sh`. On
a fresh cluster the SAs don't exist, so the ReplicaSet controller can't
create pods.

**Fix:**

```sh
kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl -n honeydue rollout restart deploy/api deploy/worker deploy/admin deploy/web deploy/redis
```

This was hit during the 2026-06-03 OVH bootstrap. The permanent fix is to add
`kubectl apply -f rbac.yaml` to `03-deploy.sh` between the namespace and
network-policies applies; until that lands, follow §6.7 on every fresh
cluster.
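
One way to catch this before the symptoms appear is to compare the ServiceAccounts that Deployments reference against the ones that exist. The sketch below assumes each Deployment sets `serviceAccountName` explicitly; `missing_names` and the two temporary file names are hypothetical.

```sh
# Print every name listed in file $1 (wanted) that is absent from file $2
# (existing). One name per line in both files.
missing_names() {
  # -v invert, -x whole-line match, -F fixed strings, -f patterns from file.
  grep -vxFf "$2" "$1" || true
}

# Example wiring (needs cluster access, so shown as comments only):
#   kubectl -n honeydue get deploy \
#     -o jsonpath='{range .items[*]}{.spec.template.spec.serviceAccountName}{"\n"}{end}' \
#     | grep . | sort -u > wanted.txt
#   kubectl -n honeydue get sa -o name | sed 's|^serviceaccount/||' > existing.txt
#   missing_names wanted.txt existing.txt
```

Empty output means every referenced ServiceAccount exists; any printed name points back at `rbac.yaml`.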

### 9.2 vmagent SD broken on fresh deploy ("0 pods up" in Grafana)

**Symptoms:**

- Grafana panels using `kube_*` metrics or `up{job=...}` show 0
- vmagent logs: `dial tcp 10.43.0.1:443: connect: connection refused` every ~30 s
- A direct connection test from a pod is also refused

**Cause:** k3s's NetworkPolicy controller evaluates egress rules *after*
kube-proxy's DNAT (not before, contrary to spec). Pod traffic to the
`kubernetes` Service (`10.43.0.1:443`) gets DNAT'd to `<node_ip>:6443`, and
only *then* does the policy check run. Without an explicit egress rule for
`:6443`, the packet is rejected.

The `allow-egress-from-vmagent` NetPol in `network-policies.yaml` includes
both rules:

```yaml
- to:
    - ipBlock: { cidr: 10.43.0.0/16 }
  ports:
    - { port: 443, protocol: TCP }
- to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except: [10.42.0.0/16]
  ports:
    - { port: 6443, protocol: TCP }
```

**If this happens:** confirm `network-policies.yaml` was applied:

```sh
kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443
```

Corroborating evidence for the diagnosis: `kube-state-metrics` in
`kube-system` works fine (no NetPols in that namespace).

### 9.3 vmagent appears healthy but no data in Grafana

vmagent's `/-/healthy` returns 200 as long as the process is alive and
remote-write is TCP-functional. It doesn't check that scrapes are actually
*succeeding*. To cover that gap, the liveness probe in `vmagent.yaml` queries
`/api/v1/targets` and fails the pod if no target is `up`. After ~3 failures
(~3 min), kubelet restarts it.

If vmagent runs for weeks but Grafana is empty, the probe was disabled or
the exec command broke.
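
The probe's actual exec command is not reproduced in this runbook. As an illustration of the check it performs, the sketch below tests a `/api/v1/targets` response for at least one healthy target. `any_target_up` is a hypothetical helper, and it assumes the response uses the Prometheus-style `"health":"up"` field.

```sh
# Read a /api/v1/targets JSON response on stdin and succeed if at least one
# target reports health "up". A grep-based check avoids needing jq.
any_target_up() {
  grep -Eq '"health"[[:space:]]*:[[:space:]]*"up"'
}
```

To test by hand, pipe the endpoint's output into the function, for example `wget -qO- http://127.0.0.1:8429/api/v1/targets | any_target_up`. The port (vmagent's default listen port) and the presence of `wget` in the image are assumptions to verify against `vmagent.yaml`.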

### 9.4 vmagent bearer token destroyed by direct `kubectl apply`

The committed `vmagent.yaml` has `bearer_token: TOKEN_PLACEHOLDER`. The real
token is `sed`-substituted at deploy time by `03-deploy.sh`. Applying the
file directly:

```sh
kubectl apply -f deploy-k3s/manifests/observability/vmagent.yaml   # WRONG
```

overwrites the Secret with the literal `TOKEN_PLACEHOLDER`, and remote-writes
then fail with 401. To restore without a full redeploy:

```sh
OBS_TOKEN_B64=$(kubectl -n honeydue get secret honeydue-secrets \
  -o jsonpath='{.data.OBS_INGEST_TOKEN}')
kubectl -n honeydue patch secret vmagent-remote-write --type=json \
  -p="[{\"op\":\"replace\",\"path\":\"/data/bearer_token\",\"value\":\"${OBS_TOKEN_B64}\"}]"
kubectl -n honeydue rollout restart deploy/vmagent
```

Or just re-run `./deploy-k3s/scripts/03-deploy.sh`; the sed handles it.
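
A small guard can make the mistake harder to repeat by refusing any manifest that still carries the placeholder. This is a sketch only; `refuse_if_placeholder` is a hypothetical helper and is not part of `03-deploy.sh`.

```sh
# Fail if a manifest still contains the literal placeholder, so it is never
# applied un-substituted.
refuse_if_placeholder() {
  if grep -q 'TOKEN_PLACEHOLDER' "$1"; then
    echo "refusing to apply $1: TOKEN_PLACEHOLDER not substituted" >&2
    return 1
  fi
}
```

Chained as `refuse_if_placeholder FILE && kubectl apply -f FILE`, the apply only runs on a rendered file.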

### 9.5 Dashboard queries: api metrics need `service="api"` not `namespace="honeydue"`

vmagent's scrape config (`vmagent-config` ConfigMap) explicitly chooses which
Kubernetes pod-metadata labels to copy onto each scraped series. **Namespace
isn't one of them.** Labels you can use on api-side metrics:

- `service` (literal `"api"`)
- `job` (literal `"api"`)
- `pod` (the api pod name)
- `node` (the k8s node name)
- `cluster` (vmagent external_label, currently `"honeydue-k3s"`)
- `environment` (vmagent external_label, currently `"prod"`)
- Plus each metric's own labels (`method`, `route`, `status` for HTTP; etc.)

`kube_*` metrics from kube-state-metrics DO carry `namespace` natively
(KSM publishes it as a label, vmagent passes it through). Loki streams have
`namespace` because alloy-logs explicitly relabels it. So the rule is:

| Metric prefix | Use |
|---|---|
| `kube_*` | `namespace="honeydue"` |
| `http_*`, `gorm_*`, `go_*`, `process_*` (api) | `service="api"` |
| Loki logs `{...}` | `namespace="honeydue"` |
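
The table's rule for metrics can be written down as a lookup, which is handy when generating dashboard queries. The helper names below are illustrative; only the prefixes and selectors come from the table.

```sh
# Map a metric name to the label selector that scopes it to honeyDue,
# following the table above.
selector_for() {
  case "$1" in
    kube_*)                        echo '{namespace="honeydue"}' ;;
    http_*|gorm_*|go_*|process_*)  echo '{service="api"}' ;;
    *)                             return 1 ;;
  esac
}

# Build a PromQL selector for a metric, e.g. for pasting into Grafana.
promql_selector() {
  sel=$(selector_for "$1") || return 1
  printf '%s%s\n' "$1" "$sel"
}

promql_selector go_goroutines   # prints go_goroutines{service="api"}
```

Unknown prefixes return non-zero instead of guessing a selector.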

### 9.6 Cluster-label collision when two clusters run together

Both Hetzner and OVH vmagents push as `cluster=honeydue-k3s, environment=prod`
(same external_labels). During the migration overlap this made dashboards
sum both clusters' data. The simplest narrowing during overlap is by node
name pattern (`node=~"vps-.*"` for OVH, `node=~"ubuntu-.*"` for Hetzner). If
you ever bring up a backup cluster long-term, change one cluster's
`external_labels.cluster` to something distinct (e.g. `honeydue-ovh`
vs. `honeydue-backup`).

### 9.7 Worker double-firing scheduled jobs

If two `worker` Deployments run concurrently (e.g. two clusters both pointing
at the same Neon DB), Asynq schedulers each fire crons independently, so
users get duplicate emails. Workaround: scale all-but-one worker to 0. This
is the exact mechanic used during cutovers (§6.11).

### 9.8 Node kubeconfig mode

`/etc/rancher/k3s/k3s.yaml` on each node is mode `0600` because we install
with `--write-kubeconfig-mode=0600`. Pinning the restrictive mode was
intentional (k3s documents `--write-kubeconfig-mode=644` as the way to make
the file world-readable). Non-root tooling on the node that expects to read
the file will fail (none exists today); don't loosen the mode without
coordinating.

---

## 10. Differences from MIGRATION_NOTES.md (Hetzner-era)

`MIGRATION_NOTES.md` documents the Swarm → k3s migration on Hetzner
(2026-04-24). Most of it still applies, with these OVH-specific deltas:

| What MIGRATION_NOTES says | What OVH actually has |
|---|---|
| `hetzner-k3s` provisioner | Manual k3s install (§6) |
| Hetzner Load Balancer (not used) → Cloudflare round-robin | Same: Cloudflare round-robin (§4) |
| Traefik as DaemonSet + hostNetwork via HelmChartConfig | Traefik default Deployment + klipper-lb svclb DaemonSet. The `traefik-helmchartconfig.yaml` file is **NOT applied** on OVH. |
| `servicelb` disabled (`--disable=servicelb`) | `servicelb` enabled (we didn't pass `--disable=servicelb`). This is what makes klipper-lb work. |
| sysctl `net.ipv4.ip_unprivileged_port_start=0` for hostNetwork Traefik | Not needed; klipper-lb proxies the port binding instead |
| UFW rules between 3 Hetzner IPs | UFW rules between 3 OVH IPs (51.81.83.33, 51.81.87.86, 51.81.85.248) |
| Kubeconfig at `~/.kube/honeydue-k3s.yaml` | Kubeconfig at `deploy-k3s/kubeconfig` |
| TLS at origin: not configured (CF Flexible) | Same: CF Flexible. `cloudflare-origin-cert` Secret exists (carried over) but Ingress doesn't reference it. |

---

## 11. Outstanding follow-ups (deferred, not blocking)

1. **No warm standby / rollback cluster.** OVH is solo production. An OVH
   outage is a real outage; recovery time = §6 procedure (~30 min). User
   plans to bring a second cluster up as a target.
2. **UFW allows 80/443 from world.** Hetzner had a network-layer Cloudflare-IP
   allowlist on these ports. OVH currently relies on the L7
   `cloudflare-only` Traefik middleware, which protects admin but NOT api /
   web / apex. Those routes have to accept traffic from any client, but with
   the ports open to the world anyone who knows the origin IPs can bypass
   Cloudflare and DDoS them directly. Fix: add ufw allow rules restricting
   `80/tcp` and `443/tcp` to Cloudflare's published IP ranges
   (~22 IPv4 prefixes from https://www.cloudflare.com/ips-v4/).
3. **Cloudflare TLS Flexible → Full(strict).** Origin certs exist as a Secret
   but Ingress doesn't terminate TLS. Upgrading to Full(strict) requires
   Traefik configured with the cert + an HTTPS entrypoint + an Ingress
   `tls:` block.
4. **`rbac.yaml` + `pod-disruption-budgets.yaml` should be in `03-deploy.sh`.**
   They're currently bootstrap-only. Adding them is idempotent and prevents
   the §9.1 footgun.
5. **Push notification metrics are log-derived, not counters.** Successes
   aren't logged or counted. Proper Prometheus instrumentation (~15 lines in
   `internal/push/client.go`) would give a real success/failure ratio.
6. **Worker has no `/metrics` endpoint.** `cmd/worker/main.go` serves `:6060`
   for healthz only. Adding Asynq's Prometheus collector
   (`metrics.NewQueueMetricsCollector` from `asynq/x/metrics`) + a
   ServiceMonitor + uncommenting the `worker` job stanza in the
   `vmagent-config` ConfigMap would give real queue depth and job latency.
7. **Ory Kratos.** Manifests exist (`manifests/kratos/`) but the deploy
   is gated on operator-side prerequisites (Neon `kratos` database,
   `auth.myhoneydue.com` DNS, real Apple+Google OIDC clients, Kratos image
   tag pinned). Until `kratos-secrets` exists, `03-deploy.sh` silently
   skips the Kratos apply.
8. **Hetzner cluster fully retired?** The `config.yaml` `nodes:` block
   describes OVH; the backup kubeconfig is at `kubeconfig.hetzner.bak`. The
   boxes themselves are operator-managed.
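
For follow-up 2, the allow rules can be generated from Cloudflare's published list instead of typed by hand. The sketch below only prints the commands (a dry run) so they can be reviewed first; `cloudflare_ufw_rules` is a hypothetical helper, and the existing world-open rules for 80/443 would still have to be removed separately for the allowlist to have any effect.

```sh
# Read CIDR prefixes on stdin (one per line) and print the ufw commands that
# would allow 80/tcp and 443/tcp from each. Blank lines and '#' comments are
# skipped. The "|| [ -n ... ]" handles a final line with no trailing newline.
cloudflare_ufw_rules() {
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    case "$cidr" in
      ''|'#'*) continue ;;
    esac
    printf 'ufw allow proto tcp from %s to any port 80,443\n' "$cidr"
  done
}
```

For example, `curl -fsS https://www.cloudflare.com/ips-v4/ | cloudflare_ufw_rules` prints one rule per prefix, which can then be run with `sudo` on each node once checked.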

### 11.1 Dashboard observability gaps (raised 2026-06-03 during dashboard build)

Surfaced while building the `honeydue-eli5-overview` Grafana dashboard. Each
needs code or infra changes to expose; none blocks today's operations.

9. **node-exporter not deployed.** No node-level metrics today
    (`node_filesystem_avail_bytes`, `node_memory_*`, `node_load1`, etc.).
    The dashboard's pod-level memory/CPU panels are app-process only, so a
    node running out of disk would silently fail the cluster before any
    dashboard signal showed it. Highest-priority Tier-3 item. Fix: deploy
    `node-exporter` as a DaemonSet (~50 lines of YAML), add a scrape stanza
    to `vmagent-config`, add a `Node disk free` stat panel.
10. **Traefik metrics not enabled.** Traefik can expose `/metrics` with
    `traefik_entrypoint_requests_total` and
    `traefik_service_request_duration_seconds`, giving edge-level visibility
    into requests that never reached api pods (404s, redirects, middleware
    blocks). Enable via a HelmChartConfig override that sets
    `metrics.prometheus.entryPoint=metrics`, adds a `:9100` entryPoint, and
    adds a scrape stanza. Skipped today to avoid Traefik restart risk; safe
    additive change when ready.
11. **Push notification success/failure counters** (already #5). Add
    `prometheus.NewCounterVec` in `internal/push/client.go` with labels
    `platform={ios,android}, outcome={success,failed,breaker_open,disabled}`.
    Increments at every Send/SendActionable branch. Replaces the
    log-derived "Push failures" stat on the dashboard with a real success
    rate.
12. **Worker queue / job metrics** (already #6). Asynq has a built-in
    Prometheus exporter (`asynq/x/metrics`). Wire it into the worker's
    `:6060` health server (a single `healthMux.Handle` line) and
    uncomment the worker scrape stanza in `vmagent-config`. Surfaces
    queue depth, retry count, processing time per task type.
13. **Cache hit / miss rate.** `internal/services/cache_service.go` has
    no counters. Add a Counter with labels `{operation=get|set, result=hit|miss}`
    around the cache wrapper. ~10 lines. Useful once real traffic flows
    to verify the ETag and Redis caches are paying their keep.
14. **APNs send-latency histogram.** Wrap `internal/push/apns.go::Send`
    in a `prometheus.NewHistogramVec` keyed on outcome. Tells you when
    Apple's gateway is slow (which correlates with their incident page).
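
Once any of these gaps is closed, a quick way to confirm the new series is actually exposed is to look for it in the endpoint's text output before touching dashboards. A minimal sketch, assuming the standard Prometheus text exposition format; `metric_present` is a hypothetical helper.

```sh
# Read Prometheus text exposition on stdin and succeed if a sample for the
# given metric name is present (bare name, or name followed by labels).
# "# HELP" and "# TYPE" lines never match because they start with '#'.
metric_present() {
  grep -Eq "^$1(\{| )"
}
```

Fed with the output of a `/metrics` endpoint, `metric_present node_load1` succeeds only when a sample for exactly that metric name is present.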

---

## 12. Audit trail

| Date | Change |
|---|---|
| 2026-04-24 | Initial k3s cluster on Hetzner (Swarm → k3s migration); see MIGRATION_NOTES.md |
| 2026-04-25 | `config.yaml` reconstructed from live ConfigMap (original file lost) |
| 2026-05-15 | Audit fixes: Redis auth required, admin basic auth, secrets-encryption flag |
| 2026-05-16 | `02-setup-secrets.sh` started carrying B2 credentials (was a manifest/script drift) |
| 2026-06-02 | Kratos scaffolding committed (not deployed) |
| 2026-06-03 | **Hetzner → OVH BHS cutover.** New 3-node cluster on 51.81.83.33, .87.86, .85.248. DNS cut on Cloudflare. Hetzner kubeconfig moved to `.bak`. Grafana `honeydue-eli5-overview` dashboard created. Hetzner cluster powered off later same day. |
| 2026-06-03 | Dashboard build-out: extended `honeydue-eli5-overview` to 22 panels covering Tier-1 (HTTP status, CPU per pod, goroutines, top slow) and Tier-2 (GC, network I/O, pod uptime, top 5xx) signals. Surfaced Tier-3 instrumentation gaps in §11.1. |