12 Commits

Author SHA1 Message Date
Trey t b66151ddd9 feat(auth): scaffold Ory Kratos identity service — phase 1 (infrastructure)
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
First phase of replacing the hand-rolled auth (internal/services/auth_service.go
et al.) with Ory Kratos. This commit is infrastructure only — Kratos will run
but nothing consumes it yet; the Go API still does its own auth until phase 2.

Adds deploy-k3s/manifests/kratos/:
- configmap.yaml  — kratos.yml, identity schema, Google/Apple OIDC claim
                     mappers (no secrets in the ConfigMap)
- migrate-job.yaml — `kratos migrate sql`, run before the Deployment
- kratos.yaml     — Deployment (x2), Service, NetworkPolicies
- ingress.yaml    — auth.myhoneydue.com -> Kratos public API :4433
- README.md       — operator prerequisites + deploy runbook

Wiring:
- 02-setup-secrets.sh creates kratos-secrets, gated on a config.yaml `kratos:`
  block (DSN, cookie/cipher, SMTP URI, OIDC client secret, Apple key).
- 03-deploy.sh applies the Kratos manifests + runs the migrate Job, gated on
  the kratos-secrets Secret existing.

Both gates mean the existing stack deploys completely unaffected until the
operator completes the prerequisites (Neon `kratos` DB, auth.myhoneydue.com
DNS, Apple/Google OAuth apps, Kratos image version). Pre-production, so no
user-data migration — see manifests/kratos/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:24:38 -05:00
Trey t 93fddc3769 feat(observability): ship pod logs to Loki via Grafana Alloy
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Adds a Grafana Alloy DaemonSet that tails honeydue-namespace pod logs
from /var/log/pods and pushes them to Loki at obs.88oakapps.com,
reusing the existing OBS_INGEST_TOKEN (14-day retention).

- deploy-k3s/manifests/observability/alloy-logs.yaml — DaemonSet + RBAC
  + token Secret + Alloy config. Runs as root (/var/log/pods is 0750
  root:root) but otherwise locked down: all caps dropped, read-only
  root filesystem, seccomp RuntimeDefault, read-only hostPath mount.
- network-policies.yaml — allow-egress-from-alloy-logs (DNS + k8s API
  + obs HTTPS), mirroring the vmagent egress policy.
- 03-deploy.sh — applies alloy-logs with the OBS_INGEST_TOKEN
  substitution and waits for the DaemonSet rollout.

The Loki container, nginx /loki/api/v1/push route, and Grafana Loki
datasource live on the obs server and are not repo-managed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 20:04:09 -05:00
Trey t c77ff07ce9 fix(security): remediate 2026-05-12 audit findings (Stages 2–5)
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps),
tracked in deploy-k3s/SECURITY.md, plus fixes from two independent
post-remediation reviews.

Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on
  re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)

IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log +
  webhook_event_log table creation, append-only audit log)

Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join
  (C9/H9), device-token reassignment (C8/LOW-3)

Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis
  password out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics
  lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban +
  unattended-upgrades at provision; secret-rotation runbook

Build, vet, and the full test suite (incl. -race) pass; the goose
migration chain is verified against PostgreSQL 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:28:33 -05:00
Trey t 139a990ebc fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.

Changes:

- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).

- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.

- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target — automatic recovery from future
    stale-SD incidents

- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.

- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.

- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so admin client cert can't be committed by accident.

Verified via kubectl --dry-run on all three modified manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 00:30:11 -05:00
Trey t 12b2f9d43b Adopt pressly/goose for schema migrations
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate path,
which had two compounding problems:
- AutoMigrate ran on every pod startup (~5 min over the transatlantic
  link) even when no schema changes had landed
- pg_advisory_lock is session-scoped, which silently fails through
  Neon's pgbouncer transaction-mode pooler — turns out this is a
  known and documented limitation that bites golang-migrate too

Goose was chosen over golang-migrate (the other heavyweight) because:
- Goose wraps each migration file in a transaction by default, so a
  failure rolls back cleanly instead of leaving a "dirty" version
  state requiring manual force-reset (golang-migrate's known
  weakness, per its own issue tracker — see #1001 + Atlas's writeup)
- Goose's locking is opt-in. We don't opt in: migrations run as a
  single Kubernetes Job, which IS the singleton process. No advisory
  lock needed at all.

Layout:
- migrations/000001_init.sql — schema-only pg_dump of the live Neon
  DB at adoption, stripped of psql-only directives that block goose's
  bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had
  their effects folded into this baseline; deleted from the live tree
  but preserved in git history at 58e6997.
- Dockerfile installs `goose v3.22.1` at build time and copies the
  binary into the api image. The migrate Job reuses the api image with
  command=goose, so no separate image to build/push/version.
- deploy-k3s/manifests/migrate/job.yaml: a one-shot Job that strips
  the -pooler segment from DB_HOST (advisory lock won't survive
  pgbouncer transaction-mode), runs `goose up`, exits.
- deploy-k3s/scripts/03-deploy.sh: deletes any prior Job, applies the
  fresh one, `kubectl wait --for=condition=complete --timeout=10m`,
  then proceeds with api/worker rollout. Job failure aborts the deploy
  before any new app pod sees a stale schema.
- internal/database/database.go::RequireSchemaApplied checks
  goose_db_version on startup. api/worker refuse to boot if the
  table is missing or its latest row has is_applied=false — the
  fail-fast for "operator forgot to run migrate."
- Makefile: migrate-up / migrate-down / migrate-status / migrate-new
  for local workflow.

Production DB was bootstrapped manually:
  $ goose -dir migrations postgres "$DSN" version  # creates table
  $ psql ... -c "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"

Smoke test against fresh Postgres locally: 50 user tables created in
284ms via `goose up`, version_id=1 + is_applied=t recorded.

Verified the local goose CLI talks to prod successfully:
  $ goose ... status
  Applied At                  Migration
  =======================================
  Mon Apr 27 03:43:55 2026 -- 000001_init.sql

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:46:36 -05:00
Trey t 88fb1751c7 Cut /api/tasks/ p99 from ~2500ms toward ~150-300ms
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Stack of optimizations against the same Hetzner→Neon transatlantic link.
The trace revealed every visible ms was network/proxy overhead — DB
execution itself is sub-millisecond per query (verified via EXPLAIN
ANALYZE: index scans on every hot path).

Connection layer:
- DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer
  transaction-mode keeps backend Postgres connections warm so we no
  longer pay the ~110ms Postgres-startup RTT on cold queries.
- GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s,
  MaxIdleTime added (default 0 = never close idle).
- Eager pool warm-up at boot via parallel pings — first user request
  no longer pays the ~440ms TCP+TLS+startup handshake.
- Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will
  evict cold keys instead of erroring at the 256MB limit.

Auth layer:
- TokenCacheTTL 5min → 1 hour (Redis token cache).
- UserCacheTTL 30s → 5min (in-memory User cache, per pod).
- UserCache gains a 5,000-entry LRU cap so a flood of unique users
  can't blow up pod RSS. ~5MB worst-case per pod.
- Token + user lookup collapsed from 2 GORM Preload queries into a
  single INNER JOIN. Saves 1 RTT per cold-cache request.
- Auth middleware's m.db.* now use db.WithContext(ctx) so the SQL
  spans nest under the parent HTTP request in Jaeger.

Service layer:
- TaskService.ListTasks: replaced two-step
  FindResidenceIDsByUser → GetKanbanDataForMultipleResidences
  with a single GetKanbanDataForUser that uses a Postgres subquery
  for residence-access. One round-trip instead of two.
- New CacheService residence-IDs cache: \"residence_ids_user:<id>\"
  with 5-min TTL. Wired into Task/Residence/Contractor/Document
  services for the four hot read paths that need this list.
- Cache invalidation on every relevant mutation: CreateResidence,
  DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence
  invalidates every member of the residence, not just the owner.

What this stacks up to (Hetzner→Neon, before US migration):
  Path                                 Before        After (target)
  Cache-warm authed read               ~800ms        ~100-200ms
  Cache-cold authed read (1st in 1hr)  ~2500ms       ~500-700ms
  First request after deploy           ~2500ms       ~700-900ms

The endgame US-region migration on top of this gets us to ~30-50ms
warm-cache, but we're shippable at ~150ms warm right now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:13:50 -05:00
Trey t bc3da007db Wire OpenTelemetry tracing — HTTP, B2, APNs, FCM, asynq, GORM (partial)
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Step 1 — OTel SDK: cmd/api and cmd/worker initialize a tracer provider
that exports OTLP/HTTP to obs.88oakapps.com (Jaeger all-in-one). Sampling
is AlwaysSample in dev (DEBUG=true) and TraceIDRatioBased(0.1) in prod,
overridable via OTEL_TRACES_SAMPLER_ARG. Service names are honeydue-api
and honeydue-worker. otelecho.Middleware opens a span per HTTP request.

Step 2 — Manual spans: storage_service.Upload now takes ctx and emits
storage.upload + b2.PutObject spans (size_bytes, key, mime_type, bucket,
result attrs). APNs Send/SendWithCategory and FCM sendOne emit per-token
spans with topic, status_code, reason. Asynq middleware emits
asynq.handle:<task_type> per job with retry/payload attrs and records
asynq_job_duration_seconds.

Step 3 — Database: otelgorm plugin registered in database.Connect, so
any SQL emitted via db.WithContext(ctx) attaches to the request span.
Every repository now exposes WithContext(ctx) *XRepository as the
migration helper. TaskService.ListTasks and GetTasksByResidence are
migrated end-to-end (ctx threaded through handler → service → repo);
remaining services adopt the same pattern incrementally — pre-migration
methods still emit untraced SQL via the unchanged db field.

OBS_TRACES_URL and OBS_INGEST_TOKEN flow from deploy/prod.env →
honeydue-secrets → api+worker Deployments via secretKeyRef (optional).
02-setup-secrets.sh sources them from prod.env on next run; manifests
mark both env vars optional so the deployment rolls without traces if
the secret is absent.

ch15 observability doc now lists what produces spans today vs the
remaining migration work, with the explicit per-method pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:28:05 -05:00
Trey t d3708e6c72 Fix /metrics double-gzip + deploy script for amd64 build
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The Echo gzip middleware was wrapping promhttp's pre-gzipped output, so
vmagent received double-compressed bytes that failed the Prometheus
parser with binary garbage. Skipping /metrics in the gzip Skipper.

Three deploy-script fixes uncovered while shipping this:
- _config.sh had backticks around \"kubectl get cm\" inside the python
  heredoc, which bash treated as command substitution when KUBECONFIG
  was set. Quoted the literal instead.
- 03-deploy.sh now passes --platform linux/amd64 to all docker builds
  so arm64 Macs don't push images that fail with \"exec format error\"
  on the Hetzner CX nodes.
- OBS_INGEST_TOKEN lookup was reading deploy-k3s/prod.env instead of
  the actual deploy/prod.env at the repo root.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:42:15 -05:00
Trey t 372d4d2d37 deploy-k3s: apply observability manifests during 03-deploy
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
vmagent.yaml lives under manifests/observability/; the deploy script now
substitutes the OBS_INGEST_TOKEN from deploy/prod.env into the manifest
before apply, and waits on the vmagent rollout. Manual kubectl apply is
no longer needed after the next deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:16:59 -05:00
Trey t 57cef36379 deploy-k3s: align _config.sh::generate_env with live ConfigMap
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
generate_env was missing 5 keys that exist in the live honeydue-config
ConfigMap (drift introduced over time by manual kubectl patches):
STATIC_DIR, STORAGE_UPLOAD_DIR, STORAGE_BASE_URL, B2_REGION, B2_USE_SSL.
Without these, running 03-deploy.sh would silently drop them and
break static asset serving + B2 region/TLS.

Also:
- Move B2_KEY_ID/B2_APP_KEY out of generate_env: they're credentials
  and belong in honeydue-secrets, not cleartext in the ConfigMap. The
  api/worker deployments still need to be wired to read them via
  envFrom: secretRef before B2 uploads will work — pre-existing gap,
  not caused by this commit.
- Use the in-namespace short DNS form for REDIS_URL ('redis:6379') to
  match what the live cluster has — pods' resolv.conf search path
  already covers honeydue.svc.cluster.local.
- config.yaml.example: add b2_region, b2_use_ssl, upload_dir, base_url,
  static_dir under storage so a fresh bootstrap sets them correctly.

Verified by sourcing _config.sh and diffing generate_env output against
`kubectl get cm honeydue-config -o jsonpath='{.data}'`: clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 00:38:37 -05:00
Trey t 15359401fa Deploy honeyDueAPI-Web to k3s at app.myhoneydue.com
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The Next.js 16 webapp in sibling repo honeyDueAPI-Web now runs
alongside api/worker/admin on the cluster. Uses a server-side proxy
pattern: browser hits app.myhoneydue.com, Next.js route handlers
forward to the Go API with an httpOnly cookie, so no CORS entry or
Allowed-Hosts change is needed on the API side.

Availability mirrors api (3 replicas, PDB minAvailable:2,
topologySpreadConstraints across nodes).

Changes:
- deploy-k3s/manifests/web/deployment.yaml: 3 replicas, readOnly root
  FS, drops all caps, mounts emptyDir for /app/.next/cache and /tmp,
  reads API_URL from honeydue-config.
- deploy-k3s/manifests/web/service.yaml: ClusterIP :3000.
- deploy-k3s/manifests/rbac.yaml: ServiceAccount web with
  automountServiceAccountToken: false.
- deploy-k3s/manifests/pod-disruption-budgets.yaml: web-pdb
  minAvailable: 2.
- deploy-k3s/manifests/ingress/ingress-simple.yaml: route
  app.myhoneydue.com → web:3000.
- deploy-k3s/scripts/_config.sh: emit API_URL into the ConfigMap.
- deploy-k3s/scripts/03-deploy.sh: build + push + apply the web image
  alongside api/worker/admin. Reads NEXT_PUBLIC_POSTHOG_KEY and
  NEXT_PUBLIC_POSTHOG_HOST from the operator shell env (not committed).
  Also adds the --build-arg NEXT_PUBLIC_API_URL wiring for the admin
  image that was previously only done manually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 10:11:17 -05:00
Trey t 34553f3bec Add K3s dev deployment setup for single-node VPS
Mirrors the prod deploy-k3s/ setup but runs all services in-cluster
on a single node: PostgreSQL (replaces Neon), MinIO S3-compatible
storage (replaces B2), Redis, API, worker, and admin.

Includes fully automated setup scripts (00-init through 04-verify),
server hardening (SSH, fail2ban, ufw), Let's Encrypt TLS via Traefik,
network policies, RBAC, and security contexts matching prod.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 21:30:39 -05:00