d74cfeee6274bd0319063193e601f78f5979d1e4
9 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
b66151ddd9 |
feat(auth): scaffold Ory Kratos identity service — phase 1 (infrastructure)
First phase of replacing the hand-rolled auth (internal/services/auth_service.go
et al.) with Ory Kratos. This commit is infrastructure only — Kratos will run
but nothing consumes it yet; the Go API still does its own auth until phase 2.
Adds deploy-k3s/manifests/kratos/:
- configmap.yaml — kratos.yml, identity schema, Google/Apple OIDC claim
mappers (no secrets in the ConfigMap)
- migrate-job.yaml — `kratos migrate sql`, run before the Deployment
- kratos.yaml — Deployment (x2), Service, NetworkPolicies
- ingress.yaml — auth.myhoneydue.com -> Kratos public API :4433
- README.md — operator prerequisites + deploy runbook
Wiring:
- 02-setup-secrets.sh creates kratos-secrets, gated on a config.yaml `kratos:`
block (DSN, cookie/cipher, SMTP URI, OIDC client secret, Apple key).
- 03-deploy.sh applies the Kratos manifests + runs the migrate Job, gated on
the kratos-secrets Secret existing.
Both gates mean the existing stack deploys completely unaffected until the
operator completes the prerequisites (Neon `kratos` DB, auth.myhoneydue.com
DNS, Apple/Google OAuth apps, Kratos image version). Pre-production, so no
user-data migration — see manifests/kratos/README.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
93fddc3769 |
feat(observability): ship pod logs to Loki via Grafana Alloy
Adds a Grafana Alloy DaemonSet that tails honeydue-namespace pod logs from
/var/log/pods and pushes them to Loki at obs.88oakapps.com, reusing the
existing OBS_INGEST_TOKEN (14-day retention).
- deploy-k3s/manifests/observability/alloy-logs.yaml — DaemonSet + RBAC +
token Secret + Alloy config. Runs as root (/var/log/pods is 0750 root:root)
but otherwise locked down: all caps dropped, read-only root filesystem,
seccomp RuntimeDefault, read-only hostPath mount.
- network-policies.yaml — allow-egress-from-alloy-logs (DNS + k8s API + obs
HTTPS), mirroring the vmagent egress policy.
- 03-deploy.sh — applies alloy-logs with the OBS_INGEST_TOKEN substitution
and waits for the DaemonSet rollout.
The Loki container, nginx /loki/api/v1/push route, and Grafana Loki
datasource live on the obs server and are not repo-managed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
c77ff07ce9 |
fix(security): remediate 2026-05-12 audit findings (Stages 2–5)
Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps), tracked
in deploy-k3s/SECURITY.md, plus fixes from two independent post-remediation
reviews.
Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on
re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)
IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log +
webhook_event_log table creation, append-only audit log)
Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join (C9/H9),
device-token reassignment (C8/LOW-3)
Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis password
out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics lockdown;
kubeconfig 0600, etcd secrets-encryption, fail2ban + unattended-upgrades at
provision; secret-rotation runbook
Build, vet, and the full test suite (incl. -race) pass; the goose migration
chain is verified against PostgreSQL 16.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
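The hashed token storage (C1) and constant-time comparison items in this commit can be illustrated with a minimal Go sketch. The commit message does not show the actual implementation, so the function names newToken, hashToken, and tokenMatches, the 32-byte token length, and the hex encoding are all assumptions made for illustration. The idea shown: only the SHA-256 digest is persisted, and a presented token is hashed and compared against the stored digest with crypto/subtle.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex-encoded SHA-256 digest of a token.
// Hypothetical helper: the real storage format is not shown in the commit.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// newToken generates a random token for the client and the digest
// that would be persisted instead of the token itself.
// The 32-byte length is an assumption.
func newToken() (token, digest string, err error) {
	raw := make([]byte, 32)
	if _, err = rand.Read(raw); err != nil {
		return "", "", err
	}
	token = hex.EncodeToString(raw)
	return token, hashToken(token), nil
}

// tokenMatches hashes the presented token and compares it to the stored
// digest in constant time, so comparison time does not depend on how many
// leading bytes happen to match.
func tokenMatches(presented, storedDigest string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedDigest)) == 1
}

func main() {
	token, digest, err := newToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("stored digest length:", len(digest))
	fmt.Println("valid token accepted:", tokenMatches(token, digest))
	fmt.Println("tampered token accepted:", tokenMatches(token+"x", digest))
}
```

Because only the digest is stored, a leaked database row would not by itself yield a usable bearer token under this scheme.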
||
|
|
139a990ebc |
fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics
vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.
Changes:
- network-policies.yaml: commit the previously-cluster-only NetPols
(allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
produces a working cluster. The vmagent egress rule now includes :6443
to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the
kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
panels need to count pods, replicas, etc. Runs in kube-system with
cluster-scoped RBAC.
- observability/vmagent.yaml:
* add kube-state-metrics scrape job to the ConfigMap
* add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
* replace the misleading liveness probe (was /-/healthy, which lies
while SD is broken) with an exec probe that checks /api/v1/targets
for at least one healthy target — automatic recovery from future
stale-SD incidents
- scripts/03-deploy.sh: actually apply network-policies.yaml (was
committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
trap, bearer-token recovery procedure, drift-detection diff, and a
post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
kubectl sessions) so admin client cert can't be committed by accident.
Verified via kubectl --dry-run on all three modified manifests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
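The liveness check described in this commit (at least one healthy target reported by /api/v1/targets) can be sketched in Go as follows. This is not the probe from the commit, which is an exec probe whose contents are not shown here. The sketch assumes the endpoint returns a Prometheus-style JSON body with data.activeTargets[].health set to "up" for healthy targets; the function name hasHealthyTarget and the struct are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetsResponse models only the fields this check needs.
// The shape is an assumption based on the Prometheus targets API format.
type targetsResponse struct {
	Status string `json:"status"`
	Data   struct {
		ActiveTargets []struct {
			Health string `json:"health"`
		} `json:"activeTargets"`
	} `json:"data"`
}

// hasHealthyTarget reports whether the response body lists at least one
// active target whose health is "up".
func hasHealthyTarget(body []byte) (bool, error) {
	var resp targetsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decode targets response: %w", err)
	}
	for _, t := range resp.Data.ActiveTargets {
		if t.Health == "up" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Sample bodies are invented for illustration.
	healthy := []byte(`{"status":"success","data":{"activeTargets":[{"health":"down"},{"health":"up"}]}}`)
	stale := []byte(`{"status":"success","data":{"activeTargets":[]}}`)

	ok, err := hasHealthyTarget(healthy)
	fmt.Println(ok, err)
	ok, err = hasHealthyTarget(stale)
	fmt.Println(ok, err)
}
```

A check of this kind fails when service discovery yields no targets at all, which is the condition the old /-/healthy probe did not detect.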
||
|
|
12b2f9d43b |
Adopt pressly/goose for schema migrations
Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate path,
which had two compounding problems:
- AutoMigrate ran on every pod startup (~5 min over the transatlantic
link) even when no schema changes had landed
- pg_advisory_lock is session-scoped, which silently fails through
Neon's pgbouncer transaction-mode pooler — turns out this is a
known and documented limitation that bites golang-migrate too
Goose was chosen over golang-migrate (the other heavyweight) because:
- Goose wraps each migration file in a transaction by default, so a
failure rolls back cleanly instead of leaving a "dirty" version
state requiring manual force-reset (golang-migrate's known
weakness, per its own issue tracker — see #1001 + Atlas's writeup)
- Goose's locking is opt-in. We don't opt in: migrations run as a
single Kubernetes Job, which IS the singleton process. No advisory
lock needed at all.
Layout:
- migrations/000001_init.sql — schema-only pg_dump of the live Neon
DB at adoption, stripped of psql-only directives that block goose's
bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had
their effects folded into this baseline; deleted from the live tree
but preserved in git history at
|
||
|
|
d3708e6c72 |
Fix /metrics double-gzip + deploy script for amd64 build
The Echo gzip middleware was wrapping promhttp's pre-gzipped output, so
vmagent received double-compressed bytes that failed the Prometheus parser
with binary garbage. Skipping /metrics in the gzip Skipper.
Three deploy-script fixes uncovered while shipping this:
- _config.sh had backticks around "kubectl get cm" inside the python
heredoc, which bash treated as command substitution when KUBECONFIG was
set. Quoted the literal instead.
- 03-deploy.sh now passes --platform linux/amd64 to all docker builds so
arm64 Macs don't push images that fail with "exec format error" on the
Hetzner CX nodes.
- OBS_INGEST_TOKEN lookup was reading deploy-k3s/prod.env instead of the
actual deploy/prod.env at the repo root.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
372d4d2d37 |
deploy-k3s: apply observability manifests during 03-deploy
vmagent.yaml lives under manifests/observability/; the deploy script now
substitutes the OBS_INGEST_TOKEN from deploy/prod.env into the manifest
before apply, and waits on the vmagent rollout. Manual kubectl apply is no
longer needed after the next deploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
15359401fa |
Deploy honeyDueAPI-Web to k3s at app.myhoneydue.com
The Next.js 16 webapp in sibling repo honeyDueAPI-Web now runs alongside
api/worker/admin on the cluster. Uses a server-side proxy pattern: browser
hits app.myhoneydue.com, Next.js route handlers forward to the Go API with an
httpOnly cookie, so no CORS entry or Allowed-Hosts change is needed on the
API side. Availability mirrors api (3 replicas, PDB minAvailable:2,
topologySpreadConstraints across nodes).
Changes:
- deploy-k3s/manifests/web/deployment.yaml: 3 replicas, readOnly root FS,
drops all caps, mounts emptyDir for /app/.next/cache and /tmp, reads
API_URL from honeydue-config.
- deploy-k3s/manifests/web/service.yaml: ClusterIP :3000.
- deploy-k3s/manifests/rbac.yaml: ServiceAccount web with
automountServiceAccountToken: false.
- deploy-k3s/manifests/pod-disruption-budgets.yaml: web-pdb minAvailable: 2.
- deploy-k3s/manifests/ingress/ingress-simple.yaml: route
app.myhoneydue.com → web:3000.
- deploy-k3s/scripts/_config.sh: emit API_URL into the ConfigMap.
- deploy-k3s/scripts/03-deploy.sh: build + push + apply the web image
alongside api/worker/admin. Reads NEXT_PUBLIC_POSTHOG_KEY and
NEXT_PUBLIC_POSTHOG_HOST from the operator shell env (not committed). Also
adds the --build-arg NEXT_PUBLIC_API_URL wiring for the admin image that
was previously only done manually.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
34553f3bec |
Add K3s dev deployment setup for single-node VPS
Mirrors the prod deploy-k3s/ setup but runs all services in-cluster on a
single node: PostgreSQL (replaces Neon), MinIO S3-compatible storage
(replaces B2), Redis, API, worker, and admin. Includes fully automated setup
scripts (00-init through 04-verify), server hardening (SSH, fail2ban, ufw),
Let's Encrypt TLS via Traefik, network policies, RBAC, and security contexts
matching prod.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|