Commit Graph

4 Commits

Author SHA1 Message Date
Trey t 93fddc3769 feat(observability): ship pod logs to Loki via Grafana Alloy
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Adds a Grafana Alloy DaemonSet that tails honeydue-namespace pod logs
from /var/log/pods and pushes them to Loki at obs.88oakapps.com,
reusing the existing OBS_INGEST_TOKEN (14-day retention).

- deploy-k3s/manifests/observability/alloy-logs.yaml — DaemonSet + RBAC
  + token Secret + Alloy config. Runs as root (/var/log/pods is 0750
  root:root) but otherwise locked down: all caps dropped, read-only
  root filesystem, seccomp RuntimeDefault, read-only hostPath mount.
- network-policies.yaml — allow-egress-from-alloy-logs (DNS + k8s API
  + obs HTTPS), mirroring the vmagent egress policy.
- 03-deploy.sh — applies alloy-logs with the OBS_INGEST_TOKEN
  substitution and waits for the DaemonSet rollout.

The Loki container, nginx /loki/api/v1/push route, and Grafana Loki
datasource live on the obs server and are not repo-managed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 20:04:09 -05:00
Trey t 139a990ebc fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.

Changes:

- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).

- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.

- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target — automatic recovery from future
    stale-SD incidents

- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.

- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.

- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so admin client cert can't be committed by accident.

Verified via kubectl --dry-run on all three modified manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 00:30:11 -05:00
Trey t ace03d2340 Security hardening: TLS at origin, security headers, network policies, admin probe fix
Four related hardening changes made on the live cluster during this
session. Each manifest captures the final working state so a fresh
`kubectl apply` of the repo reproduces it.

1. Cloudflare Full (strict) TLS — ingresses now carry `tls:` blocks
   pointing at `cloudflare-origin-cert` secret (installed imperatively
   from the CF Origin CA PEM). CF SSL mode flipped from Flexible to
   Full (strict). CF↔origin is now HTTPS; origin serves a CF-issued
   cert that only CF can validate.

2. Traefik middleware attached to all three ingresses — `rate-limit`
   (100/min avg, 200 burst) and `security-headers` (frame-deny,
   nosniff, HSTS, referrer policy, permissions policy). `admin-auth`
   middleware was also defined in middleware.yaml but is not attached
   (needs an unset basic-auth secret) and was deleted at runtime.

3. `security-headers` middleware: stripped the
   Content-Security-Policy entry. The Go API sets its own CSP in
   internal/router/router.go that permits Google Fonts for the
   landing page. Two CSP headers combine via intersection (most
   restrictive wins), which would break the landing page. Next.js
   apps set their own CSP via middleware. Header kept documentation
   comments explain this.

4. NetworkPolicies — default-deny + explicit allows, applied. Added
   missing policies for `web`. Corrected the Traefik ingress rule: the
   scaffold used `namespaceSelector: kube-system`, but our Traefik
   runs as a DaemonSet with `hostNetwork: true`, so traffic arrives
   with the NODE IP as source. Fixed to an `ipBlock` list of the
   three node IPs plus the cluster pod CIDR (10.42.0.0/16).

5. admin livenessProbe path fix: was hitting /admin/ (404) which
   caused a 6-hour crashloop cycle (87 restarts) before the bug was
   caught. Fixed to / — matches the startupProbe and readinessProbe
   paths that were corrected earlier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:50:47 -05:00
Trey t 34553f3bec Add K3s dev deployment setup for single-node VPS
Mirrors the prod deploy-k3s/ setup but runs all services in-cluster
on a single node: PostgreSQL (replaces Neon), MinIO S3-compatible
storage (replaces B2), Redis, API, worker, and admin.

Includes fully automated setup scripts (00-init through 04-verify),
server hardening (SSH, fail2ban, ufw), Let's Encrypt TLS via Traefik,
network policies, RBAC, and security contexts matching prod.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 21:30:39 -05:00