Stack of optimizations against the same Hetzner→Neon transatlantic link.
The trace revealed every visible ms was network/proxy overhead — DB
execution itself is sub-millisecond per query (verified via EXPLAIN
ANALYZE: index scans on every hot path).
Connection layer:
- DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer
transaction-mode keeps backend Postgres connections warm so we no
longer pay the ~110ms Postgres-startup RTT on cold queries.
- GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s,
MaxIdleTime added (default 0 = never close idle).
- Eager pool warm-up at boot via parallel pings — first user request
no longer pays the ~440ms TCP+TLS+startup handshake.
- Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will
evict cold keys instead of erroring at the 256MB limit.
Auth layer:
- TokenCacheTTL 5min → 1 hour (Redis token cache).
- UserCacheTTL 30s → 5min (in-memory User cache, per pod).
- UserCache gains a 5,000-entry LRU cap so a flood of unique users
can't blow up pod RSS. ~5MB worst-case per pod.
- Token + user lookup collapsed from 2 GORM Preload queries into a
single INNER JOIN. Saves 1 RTT per cold-cache request.
- Auth middleware's m.db.* now use db.WithContext(ctx) so the SQL
spans nest under the parent HTTP request in Jaeger.
Service layer:
- TaskService.ListTasks: replaced two-step
FindResidenceIDsByUser → GetKanbanDataForMultipleResidences
with a single GetKanbanDataForUser that uses a Postgres subquery
for residence-access. One round-trip instead of two.
- New CacheService residence-IDs cache: \"residence_ids_user:<id>\"
with 5-min TTL. Wired into Task/Residence/Contractor/Document
services for the four hot read paths that need this list.
- Cache invalidation on every relevant mutation: CreateResidence,
DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence
invalidates every member of the residence, not just the owner.
What this stacks up to (Hetzner→Neon, before US migration):
Path Before After (target)
Cache-warm authed read ~800ms ~100-200ms
Cache-cold authed read (1st in 1hr) ~2500ms ~500-700ms
First request after deploy ~2500ms ~700-900ms
The endgame US-region migration on top of this gets us to ~30-50ms
warm-cache, but we're shippable at ~150ms warm right now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Echo gzip middleware was wrapping promhttp's pre-gzipped output, so
vmagent received double-compressed bytes that failed the Prometheus
parser with binary garbage. Skipping /metrics in the gzip Skipper.
Three deploy-script fixes uncovered while shipping this:
- _config.sh had backticks around \"kubectl get cm\" inside the python
heredoc, which bash treated as command substitution when KUBECONFIG
was set. Quoted the literal instead.
- 03-deploy.sh now passes --platform linux/amd64 to all docker builds
so arm64 Macs don't push images that fail with \"exec format error\"
on the Hetzner CX nodes.
- OBS_INGEST_TOKEN lookup was reading deploy-k3s/prod.env instead of
the actual deploy/prod.env at the repo root.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
generate_env was missing 5 keys that exist in the live honeydue-config
ConfigMap (drift introduced over time by manual kubectl patches):
STATIC_DIR, STORAGE_UPLOAD_DIR, STORAGE_BASE_URL, B2_REGION, B2_USE_SSL.
Without these, running 03-deploy.sh would silently drop them and
break static asset serving + B2 region/TLS.
Also:
- Move B2_KEY_ID/B2_APP_KEY out of generate_env: they're credentials
and belong in honeydue-secrets, not cleartext in the ConfigMap. The
api/worker deployments still need to be wired to read them via
envFrom: secretRef before B2 uploads will work — pre-existing gap,
not caused by this commit.
- Use the in-namespace short DNS form for REDIS_URL ('redis:6379') to
match what the live cluster has — pods' resolv.conf search path
already covers honeydue.svc.cluster.local.
- config.yaml.example: add b2_region, b2_use_ssl, upload_dir, base_url,
static_dir under storage so a fresh bootstrap sets them correctly.
Verified by sourcing _config.sh and diffing generate_env output against
`kubectl get cm honeydue-config -o jsonpath='{.data}'`: clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Next.js 16 webapp in sibling repo honeyDueAPI-Web now runs
alongside api/worker/admin on the cluster. Uses a server-side proxy
pattern: browser hits app.myhoneydue.com, Next.js route handlers
forward to the Go API with an httpOnly cookie, so no CORS entry or
Allowed-Hosts change is needed on the API side.
Availability mirrors api (3 replicas, PDB minAvailable:2,
topologySpreadConstraints across nodes).
Changes:
- deploy-k3s/manifests/web/deployment.yaml: 3 replicas, readOnly root
FS, drops all caps, mounts emptyDir for /app/.next/cache and /tmp,
reads API_URL from honeydue-config.
- deploy-k3s/manifests/web/service.yaml: ClusterIP :3000.
- deploy-k3s/manifests/rbac.yaml: ServiceAccount web with
automountServiceAccountToken: false.
- deploy-k3s/manifests/pod-disruption-budgets.yaml: web-pdb
minAvailable: 2.
- deploy-k3s/manifests/ingress/ingress-simple.yaml: route
app.myhoneydue.com → web:3000.
- deploy-k3s/scripts/_config.sh: emit API_URL into the ConfigMap.
- deploy-k3s/scripts/03-deploy.sh: build + push + apply the web image
alongside api/worker/admin. Reads NEXT_PUBLIC_POSTHOG_KEY and
NEXT_PUBLIC_POSTHOG_HOST from the operator shell env (not committed).
Also adds the --build-arg NEXT_PUBLIC_API_URL wiring for the admin
image that was previously only done manually.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the prod deploy-k3s/ setup but runs all services in-cluster
on a single node: PostgreSQL (replaces Neon), MinIO S3-compatible
storage (replaces B2), Redis, API, worker, and admin.
Includes fully automated setup scripts (00-init through 04-verify),
server hardening (SSH, fail2ban, ufw), Let's Encrypt TLS via Traefik,
network policies, RBAC, and security contexts matching prod.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>