12de5a230a660318d8520e51b16120d5eca6d101
5 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6de90acef7 |
feat(kratos): deploy Ory Kratos to production (Apple-only OIDC)
Auth was structurally broken — the api's Kratos middleware was pointing at http://kratos:4433 but Kratos wasn't deployed. The only thing keeping users logged in was a 5-min Redis cache; once it expired the middleware called Whoami → no DNS → 401 → forced relogin with no path back. This commit deploys Kratos for real: Manifests: - kratos.yaml + migrate-job.yaml: pin oryd/kratos:v26.2.0@sha256:92eedc... (CalVer current stable as of 2026-06-03) - configmap.yaml: drop Google OIDC provider (not in scope); fill the Apple provider with real Services ID / Team ID / Key ID — Apple now sits at providers[0] - kratos.yaml: drop the Google-secret env binding; rebind APPLE_PRIVATE_KEY to PROVIDERS_0_APPLE_PRIVATE_KEY (shifted from index 1) - network-policies.yaml: add a kratos egress rule to allow-egress-from-api. Without this, even with kratos running, the api gets "connection refused" on http://kratos:4433 (post-DNAT NetworkPolicy enforcement — runbook §9.2). Operator prerequisites that were completed alongside this commit: - Neon kratos database created (separate from honeyDue, owner neondb_owner) - Cloudflare DNS for auth.myhoneydue.com (3 A records, proxied) - kratos: block added to config.yaml (gitignored): DSN to the Neon DIRECT endpoint, cookie + cipher secrets generated, Fastmail SMTPS URI, .p8 contents inline Out of scope intentionally: - Google sign-in (additive; can append providers[] later) - Migrating existing auth_user rows onto Kratos identities — pre-prod; existing users will need to sign in fresh, which creates a new Kratos identity and a new local user row (per migration plan in manifests/kratos/README.md). Verified end-to-end: - 338 schema migrations applied successfully - 2/2 kratos pods Ready - api → kratos:4433/sessions/whoami returns 401 for invalid token (was "connection refused" before this commit's NetworkPolicy patch) - auth.myhoneydue.com resolves through CF; cloudflare-only middleware keeps the origin protected exactly like the other hostnames Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
93fddc3769 |
feat(observability): ship pod logs to Loki via Grafana Alloy
Adds a Grafana Alloy DaemonSet that tails honeydue-namespace pod logs from /var/log/pods and pushes them to Loki at obs.88oakapps.com, reusing the existing OBS_INGEST_TOKEN (14-day retention). - deploy-k3s/manifests/observability/alloy-logs.yaml — DaemonSet + RBAC + token Secret + Alloy config. Runs as root (/var/log/pods is 0750 root:root) but otherwise locked down: all caps dropped, read-only root filesystem, seccomp RuntimeDefault, read-only hostPath mount. - network-policies.yaml — allow-egress-from-alloy-logs (DNS + k8s API + obs HTTPS), mirroring the vmagent egress policy. - 03-deploy.sh — applies alloy-logs with the OBS_INGEST_TOKEN substitution and waits for the DaemonSet rollout. The Loki container, nginx /loki/api/v1/push route, and Grafana Loki datasource live on the obs server and are not repo-managed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
139a990ebc |
fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics
vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.
Changes:
- network-policies.yaml: commit the previously-cluster-only NetPols
(allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
produces a working cluster. The vmagent egress rule now includes :6443
to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the
kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
panels need to count pods, replicas, etc. Runs in kube-system with
cluster-scoped RBAC.
- observability/vmagent.yaml:
* add kube-state-metrics scrape job to the ConfigMap
* add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
* replace the misleading liveness probe (was /-/healthy, which lies
while SD is broken) with an exec probe that checks /api/v1/targets
for at least one healthy target — automatic recovery from future
stale-SD incidents
- scripts/03-deploy.sh: actually apply network-policies.yaml (was
committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
trap, bearer-token recovery procedure, drift-detection diff, and a
post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
kubectl sessions) so admin client cert can't be committed by accident.
Verified via kubectl --dry-run on all three modified manifests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ace03d2340 |
Security hardening: TLS at origin, security headers, network policies, admin probe fix
Four related hardening changes made on the live cluster during this session. Each manifest captures the final working state so a fresh `kubectl apply` of the repo reproduces it. 1. Cloudflare Full (strict) TLS — ingresses now carry `tls:` blocks pointing at `cloudflare-origin-cert` secret (installed imperatively from the CF Origin CA PEM). CF SSL mode flipped from Flexible to Full (strict). CF↔origin is now HTTPS; origin serves a CF-issued cert that only CF can validate. 2. Traefik middleware attached to all three ingresses — `rate-limit` (100/min avg, 200 burst) and `security-headers` (frame-deny, nosniff, HSTS, referrer policy, permissions policy). `admin-auth` middleware was also defined in middleware.yaml but is not attached (needs an unset basic-auth secret) and was deleted at runtime. 3. `security-headers` middleware: stripped the Content-Security-Policy entry. The Go API sets its own CSP in internal/router/router.go that permits Google Fonts for the landing page. Two CSP headers combine via intersection (most restrictive wins), which would break the landing page. Next.js apps set their own CSP via middleware. Header kept documentation comments explain this. 4. NetworkPolicies — default-deny + explicit allows, applied. Added missing policies for `web`. Corrected the Traefik ingress rule: the scaffold used `namespaceSelector: kube-system`, but our Traefik runs as a DaemonSet with `hostNetwork: true`, so traffic arrives with the NODE IP as source. Fixed to an `ipBlock` list of the three node IPs plus the cluster pod CIDR (10.42.0.0/16). 5. admin livenessProbe path fix: was hitting /admin/ (404) which caused a 6-hour crashloop cycle (87 restarts) before the bug was caught. Fixed to / — matches the startupProbe and readinessProbe paths that were corrected earlier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
34553f3bec |
Add K3s dev deployment setup for single-node VPS
Mirrors the prod deploy-k3s/ setup but runs all services in-cluster on a single node: PostgreSQL (replaces Neon), MinIO S3-compatible storage (replaces B2), Redis, API, worker, and admin. Includes fully automated setup scripts (00-init through 04-verify), server hardening (SSH, fail2ban, ufw), Let's Encrypt TLS via Traefik, network policies, RBAC, and security contexts matching prod. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |