Commit Graph

12 Commits

Author SHA1 Message Date
Trey t 88fb1751c7 Cut /api/tasks/ p99 from ~2500ms toward ~150-300ms
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Stack of optimizations against the same Hetzner→Neon transatlantic link.
The trace revealed every visible ms was network/proxy overhead — DB
execution itself is sub-millisecond per query (verified via EXPLAIN
ANALYZE: index scans on every hot path).

Connection layer:
- DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer
  transaction-mode keeps backend Postgres connections warm so we no
  longer pay the ~110ms Postgres-startup RTT on cold queries.
- GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s,
  MaxIdleTime added (default 0 = never close idle).
- Eager pool warm-up at boot via parallel pings — first user request
  no longer pays the ~440ms TCP+TLS+startup handshake.
- Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will
  evict cold keys instead of erroring at the 256MB limit.

Auth layer:
- TokenCacheTTL 5min → 1 hour (Redis token cache).
- UserCacheTTL 30s → 5min (in-memory User cache, per pod).
- UserCache gains a 5,000-entry LRU cap so a flood of unique users
  can't blow up pod RSS. ~5MB worst-case per pod.
- Token + user lookup collapsed from 2 GORM Preload queries into a
  single INNER JOIN. Saves 1 RTT per cold-cache request.
- Auth middleware's m.db.* now use db.WithContext(ctx) so the SQL
  spans nest under the parent HTTP request in Jaeger.

Service layer:
- TaskService.ListTasks: replaced two-step
  FindResidenceIDsByUser → GetKanbanDataForMultipleResidences
  with a single GetKanbanDataForUser that uses a Postgres subquery
  for residence-access. One round-trip instead of two.
- New CacheService residence-IDs cache: \"residence_ids_user:<id>\"
  with 5-min TTL. Wired into Task/Residence/Contractor/Document
  services for the four hot read paths that need this list.
- Cache invalidation on every relevant mutation: CreateResidence,
  DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence
  invalidates every member of the residence, not just the owner.

What this stacks up to (Hetzner→Neon, before US migration):
  Path                                 Before        After (target)
  Cache-warm authed read               ~800ms        ~100-200ms
  Cache-cold authed read (1st in 1hr)  ~2500ms       ~500-700ms
  First request after deploy           ~2500ms       ~700-900ms

The endgame US-region migration on top of this gets us to ~30-50ms
warm-cache, but we're shippable at ~150ms warm right now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:13:50 -05:00
Trey t bc3da007db Wire OpenTelemetry tracing — HTTP, B2, APNs, FCM, asynq, GORM (partial)
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Step 1 — OTel SDK: cmd/api and cmd/worker initialize a tracer provider
that exports OTLP/HTTP to obs.88oakapps.com (Jaeger all-in-one). Sampling
is AlwaysSample in dev (DEBUG=true) and TraceIDRatioBased(0.1) in prod,
overridable via OTEL_TRACES_SAMPLER_ARG. Service names are honeydue-api
and honeydue-worker. otelecho.Middleware opens a span per HTTP request.

Step 2 — Manual spans: storage_service.Upload now takes ctx and emits
storage.upload + b2.PutObject spans (size_bytes, key, mime_type, bucket,
result attrs). APNs Send/SendWithCategory and FCM sendOne emit per-token
spans with topic, status_code, reason. Asynq middleware emits
asynq.handle:<task_type> per job with retry/payload attrs and records
asynq_job_duration_seconds.

Step 3 — Database: otelgorm plugin registered in database.Connect, so
any SQL emitted via db.WithContext(ctx) attaches to the request span.
Every repository now exposes WithContext(ctx) *XRepository as the
migration helper. TaskService.ListTasks and GetTasksByResidence are
migrated end-to-end (ctx threaded through handler → service → repo);
remaining services adopt the same pattern incrementally — pre-migration
methods still emit untraced SQL via the unchanged db field.

OBS_TRACES_URL and OBS_INGEST_TOKEN flow from deploy/prod.env →
honeydue-secrets → api+worker Deployments via secretKeyRef (optional).
02-setup-secrets.sh sources them from prod.env on next run; manifests
mark both env vars optional so the deployment rolls without traces if
the secret is absent.

ch15 observability doc now lists what produces spans today vs the
remaining migration work, with the explicit per-method pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:28:05 -05:00
Trey t d3708e6c72 Fix /metrics double-gzip + deploy script for amd64 build
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The Echo gzip middleware was wrapping promhttp's pre-gzipped output, so
vmagent received double-compressed bytes that failed the Prometheus
parser with binary garbage. Skipping /metrics in the gzip Skipper.

Three deploy-script fixes uncovered while shipping this:
- _config.sh had backticks around \"kubectl get cm\" inside the python
  heredoc, which bash treated as command substitution when KUBECONFIG
  was set. Quoted the literal instead.
- 03-deploy.sh now passes --platform linux/amd64 to all docker builds
  so arm64 Macs don't push images that fail with \"exec format error\"
  on the Hetzner CX nodes.
- OBS_INGEST_TOKEN lookup was reading deploy-k3s/prod.env instead of
  the actual deploy/prod.env at the repo root.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:42:15 -05:00
Trey t 372d4d2d37 deploy-k3s: apply observability manifests during 03-deploy
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
vmagent.yaml lives under manifests/observability/; the deploy script now
substitutes the OBS_INGEST_TOKEN from deploy/prod.env into the manifest
before apply, and waits on the vmagent rollout. Manual kubectl apply is
no longer needed after the next deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:16:59 -05:00
Trey t df78d9ccd8 Add Prometheus metrics + vmagent push to obs.88oakapps.com
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and
FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via
statement-level callbacks (no ctx plumbing needed). Storage and push
clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the
network round-trip points.

Existing internal/monitoring metrics move to /metrics/legacy so the
canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups.

deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica
vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics
every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write
with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN
in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods
and to the public obs endpoint over :443; the obs side runs
VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate.

docs/observability-plan.md captures the full plan including resource
budget, instrumentation table, 4-step rollout, and migration triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:16:17 -05:00
Trey t 1cd6cafa9d deploy-k3s: wire B2_KEY_ID/B2_APP_KEY into api Deployment
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The B2 credentials existed in honeydue-secrets (created by
02-setup-secrets.sh) but were never referenced from the api
Deployment, so StorageConfig.IsS3() returned false at runtime →
StorageService fell back to local filesystem. With
readOnlyRootFilesystem=true on the api container, that local
fallback would silently fail on every upload — meaning every
photo, document, and task-completion upload was broken in prod
since the k3s migration on 2026-04-24.

Adding both as secretKeyRef on the api container only (the worker
doesn't perform uploads). Verified end-to-end with a registered
test user: source PDF (sha256=3af3a645...) → POST /api/uploads/document/
→ POST /api/documents/ → GET /api/media/document/:id → byte-identical
download. Storage init log now reports "Storage service initialized (S3)".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 00:53:25 -05:00
Trey t 57cef36379 deploy-k3s: align _config.sh::generate_env with live ConfigMap
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
generate_env was missing 5 keys that exist in the live honeydue-config
ConfigMap (drift introduced over time by manual kubectl patches):
STATIC_DIR, STORAGE_UPLOAD_DIR, STORAGE_BASE_URL, B2_REGION, B2_USE_SSL.
Without these, running 03-deploy.sh would silently drop them and
break static asset serving + B2 region/TLS.

Also:
- Move B2_KEY_ID/B2_APP_KEY out of generate_env: they're credentials
  and belong in honeydue-secrets, not cleartext in the ConfigMap. The
  api/worker deployments still need to be wired to read them via
  envFrom: secretRef before B2 uploads will work — pre-existing gap,
  not caused by this commit.
- Use the in-namespace short DNS form for REDIS_URL ('redis:6379') to
  match what the live cluster has — pods' resolv.conf search path
  already covers honeydue.svc.cluster.local.
- config.yaml.example: add b2_region, b2_use_ssl, upload_dir, base_url,
  static_dir under storage so a fresh bootstrap sets them correctly.

Verified by sourcing _config.sh and diffing generate_env output against
`kubectl get cm honeydue-config -o jsonpath='{.data}'`: clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 00:38:37 -05:00
Trey t 9ea058347f Fix Apple Sign In: update bundle IDs from old com.tt.honeyDue.* to com.myhoneydue.*
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The iOS app was renamed (MyCrib → Casera → honeyDue) and the bundle ID
was updated to com.myhoneydue.honeyDue (release) / .dev (debug), but
APPLE_CLIENT_ID and APNS_TOPIC across env templates and k3s configs
still pointed at the old com.tt.honeyDue.honeyDueDev value. This made
verifyAudience reject every Apple identity token (aud claim mismatch).

Updated:
- deploy/prod.env.example: bundle ID + comment that empty client_id
  rejects all tokens with DEBUG=false
- .env.example: add Sign in with Apple block (was missing entirely)
- deploy-k3s{,-dev}/config.yaml.example: apple_auth.client_id default
- deploy-k3s-dev/scripts/00-init.sh: same
- docker-compose.dev.yml: APNS_TOPIC fallback
- docs/deployment/10-secrets-config.md: doc reference

The live deploy/prod.env and local .env are .gitignored — they were
edited in place and need to ship via deploy_prod.sh to take effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:58:44 -05:00
Trey t ace03d2340 Security hardening: TLS at origin, security headers, network policies, admin probe fix
Four related hardening changes made on the live cluster during this
session. Each manifest captures the final working state so a fresh
`kubectl apply` of the repo reproduces it.

1. Cloudflare Full (strict) TLS — ingresses now carry `tls:` blocks
   pointing at `cloudflare-origin-cert` secret (installed imperatively
   from the CF Origin CA PEM). CF SSL mode flipped from Flexible to
   Full (strict). CF↔origin is now HTTPS; origin serves a CF-issued
   cert that only CF can validate.

2. Traefik middleware attached to all three ingresses — `rate-limit`
   (100/min avg, 200 burst) and `security-headers` (frame-deny,
   nosniff, HSTS, referrer policy, permissions policy). `admin-auth`
   middleware was also defined in middleware.yaml but is not attached
   (needs an unset basic-auth secret) and was deleted at runtime.

3. `security-headers` middleware: stripped the
   Content-Security-Policy entry. The Go API sets its own CSP in
   internal/router/router.go that permits Google Fonts for the
   landing page. Two CSP headers combine via intersection (most
   restrictive wins), which would break the landing page. Next.js
   apps set their own CSP via middleware. Header kept documentation
   comments explain this.

4. NetworkPolicies — default-deny + explicit allows, applied. Added
   missing policies for `web`. Corrected the Traefik ingress rule: the
   scaffold used `namespaceSelector: kube-system`, but our Traefik
   runs as a DaemonSet with `hostNetwork: true`, so traffic arrives
   with the NODE IP as source. Fixed to an `ipBlock` list of the
   three node IPs plus the cluster pod CIDR (10.42.0.0/16).

5. admin livenessProbe path fix: was hitting /admin/ (404) which
   caused a 6-hour crashloop cycle (87 restarts) before the bug was
   caught. Fixed to / — matches the startupProbe and readinessProbe
   paths that were corrected earlier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:50:47 -05:00
Trey t 15359401fa Deploy honeyDueAPI-Web to k3s at app.myhoneydue.com
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The Next.js 16 webapp in sibling repo honeyDueAPI-Web now runs
alongside api/worker/admin on the cluster. Uses a server-side proxy
pattern: browser hits app.myhoneydue.com, Next.js route handlers
forward to the Go API with an httpOnly cookie, so no CORS entry or
Allowed-Hosts change is needed on the API side.

Availability mirrors api (3 replicas, PDB minAvailable:2,
topologySpreadConstraints across nodes).

Changes:
- deploy-k3s/manifests/web/deployment.yaml: 3 replicas, readOnly root
  FS, drops all caps, mounts emptyDir for /app/.next/cache and /tmp,
  reads API_URL from honeydue-config.
- deploy-k3s/manifests/web/service.yaml: ClusterIP :3000.
- deploy-k3s/manifests/rbac.yaml: ServiceAccount web with
  automountServiceAccountToken: false.
- deploy-k3s/manifests/pod-disruption-budgets.yaml: web-pdb
  minAvailable: 2.
- deploy-k3s/manifests/ingress/ingress-simple.yaml: route
  app.myhoneydue.com → web:3000.
- deploy-k3s/scripts/_config.sh: emit API_URL into the ConfigMap.
- deploy-k3s/scripts/03-deploy.sh: build + push + apply the web image
  alongside api/worker/admin. Reads NEXT_PUBLIC_POSTHOG_KEY and
  NEXT_PUBLIC_POSTHOG_HOST from the operator shell env (not committed).
  Also adds the --build-arg NEXT_PUBLIC_API_URL wiring for the admin
  image that was previously only done manually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 10:11:17 -05:00
Trey t 6f303dbbaa Migrate prod deploy from Swarm to K3s; add full deployment book
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00
Trey t 34553f3bec Add K3s dev deployment setup for single-node VPS
Mirrors the prod deploy-k3s/ setup but runs all services in-cluster
on a single node: PostgreSQL (replaces Neon), MinIO S3-compatible
storage (replaces B2), Redis, API, worker, and admin.

Includes fully automated setup scripts (00-init through 04-verify),
server hardening (SSH, fail2ban, ufw), Let's Encrypt TLS via Traefik,
network policies, RBAC, and security contexts matching prod.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 21:30:39 -05:00