Adds a Grafana Alloy DaemonSet that tails honeydue-namespace pod logs
from /var/log/pods and pushes them to Loki at obs.88oakapps.com,
reusing the existing OBS_INGEST_TOKEN (14-day retention).
- deploy-k3s/manifests/observability/alloy-logs.yaml — DaemonSet + RBAC
+ token Secret + Alloy config. Runs as root (/var/log/pods is 0750
root:root) but otherwise locked down: all caps dropped, read-only
root filesystem, seccomp RuntimeDefault, read-only hostPath mount.
- network-policies.yaml — allow-egress-from-alloy-logs (DNS + k8s API
+ obs HTTPS), mirroring the vmagent egress policy.
- 03-deploy.sh — applies alloy-logs with the OBS_INGEST_TOKEN
substitution and waits for the DaemonSet rollout.
The Loki container, nginx /loki/api/v1/push route, and Grafana Loki
datasource live on the obs server and are not repo-managed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.
Changes:
- network-policies.yaml: commit the previously-cluster-only NetPols
(allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
produces a working cluster. The vmagent egress rule now includes :6443
to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the
kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
panels need to count pods, replicas, etc. Runs in kube-system with
cluster-scoped RBAC.
- observability/vmagent.yaml:
* add kube-state-metrics scrape job to the ConfigMap
* add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
* replace the misleading liveness probe (was /-/healthy, which lies
while SD is broken) with an exec probe that checks /api/v1/targets
for at least one healthy target — automatic recovery from future
stale-SD incidents
- scripts/03-deploy.sh: actually apply network-policies.yaml (was
committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
trap, bearer-token recovery procedure, drift-detection diff, and a
post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
kubectl sessions) so admin client cert can't be committed by accident.
Verified via kubectl --dry-run on all three modified manifests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four related hardening changes made on the live cluster during this
session. Each manifest captures the final working state so a fresh
`kubectl apply` of the repo reproduces it.
1. Cloudflare Full (strict) TLS — ingresses now carry `tls:` blocks
pointing at `cloudflare-origin-cert` secret (installed imperatively
from the CF Origin CA PEM). CF SSL mode flipped from Flexible to
Full (strict). CF↔origin is now HTTPS; origin serves a CF-issued
cert that only CF can validate.
2. Traefik middleware attached to all three ingresses — `rate-limit`
(100/min avg, 200 burst) and `security-headers` (frame-deny,
nosniff, HSTS, referrer policy, permissions policy). `admin-auth`
middleware was also defined in middleware.yaml but is not attached
(needs an unset basic-auth secret) and was deleted at runtime.
3. `security-headers` middleware: stripped the
Content-Security-Policy entry. The Go API sets its own CSP in
internal/router/router.go that permits Google Fonts for the
landing page. Two CSP headers combine via intersection (most
restrictive wins), which would break the landing page. Next.js
apps set their own CSP via middleware. Header kept documentation
comments explain this.
4. NetworkPolicies — default-deny + explicit allows, applied. Added
missing policies for `web`. Corrected the Traefik ingress rule: the
scaffold used `namespaceSelector: kube-system`, but our Traefik
runs as a DaemonSet with `hostNetwork: true`, so traffic arrives
with the NODE IP as source. Fixed to an `ipBlock` list of the
three node IPs plus the cluster pod CIDR (10.42.0.0/16).
5. admin livenessProbe path fix: was hitting /admin/ (404) which
caused a 6-hour crashloop cycle (87 restarts) before the bug was
caught. Fixed to / — matches the startupProbe and readinessProbe
paths that were corrected earlier.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the prod deploy-k3s/ setup but runs all services in-cluster
on a single node: PostgreSQL (replaces Neon), MinIO S3-compatible
storage (replaces B2), Redis, API, worker, and admin.
Includes fully automated setup scripts (00-init through 04-verify),
server hardening (SSH, fail2ban, ufw), Let's Encrypt TLS via Traefik,
network policies, RBAC, and security contexts matching prod.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>