52bf1ff3c78c52e93d3b2a0424a9d7727a1a1e9d
4 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c77ff07ce9 |
fix(security): remediate 2026-05-12 audit findings (Stages 2–5)
Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps), tracked in deploy-k3s/SECURITY.md, plus fixes from two independent post-remediation reviews. Auth & sessions: - SHA-256 hashed auth-token storage (C1); prior-token cache eviction on re-login (MEDIUM-1) - local Google JWKS verification, iss/aud/exp checks (C2/C3) - constant-time login + generic errors (L1/LIVE-L11/LIVE-L13) - per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3) - verified-email gating, login rate limiting (LIVE-L19, H1-H3) IAP & webhooks: - Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6) - migrations 000003-000006 (token hashing, IAP replay, audit_log + webhook_event_log table creation, append-only audit log) Authorization & races: - file-ownership owner-OR-member fix (C7), atomic share-code join (C9/H9), device-token reassignment (C8/LOW-3) Secrets & deploy: - secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis password out of the ConfigMap (HIGH-1); B2 keys reconciled - digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban + unattended-upgrades at provision; secret-rotation runbook Build, vet, and the full test suite (incl. -race) pass; the goose migration chain is verified against PostgreSQL 16. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2004f9c5b2 |
fix(observability): relax vmagent liveness probe — was crash-looping every ~5m
The previous probe had timeoutSeconds=1 which is too tight for the shell pipeline (sh + wget + grep + comparison). On a busy node the wget call regularly exceeded 1s, the exec timed out, and 3 consecutive timeouts triggered SIGTERM. Result: vmagent restarted ~5x per 30 min, causing brief gaps that made the Grafana "Pods up" panel render 0 whenever a refresh happened to coincide with a restart. The relaxed probe still catches the original failure mode (zero healthy targets) but only kills the pod after 10 full minutes of consecutive failure (5 attempts × 2 min period), not 3 minutes (3 × 1 min). timeoutSeconds: 1 → 5 periodSeconds: 60 → 120 failureThreshold: 3 → 5 initialDelaySeconds: 120 → 180 Also added wget -T 4 inside the command so wget itself bounds its network call to 4s — leaving 1s of slack within the 5s exec budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
139a990ebc |
fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics
vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.
Changes:
- network-policies.yaml: commit the previously-cluster-only NetPols
(allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
produces a working cluster. The vmagent egress rule now includes :6443
to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the
kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
panels need to count pods, replicas, etc. Runs in kube-system with
cluster-scoped RBAC.
- observability/vmagent.yaml:
* add kube-state-metrics scrape job to the ConfigMap
* add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
* replace the misleading liveness probe (was /-/healthy, which lies
while SD is broken) with an exec probe that checks /api/v1/targets
for at least one healthy target — automatic recovery from future
stale-SD incidents
- scripts/03-deploy.sh: actually apply network-policies.yaml (was
committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
trap, bearer-token recovery procedure, drift-detection diff, and a
post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
kubectl sessions) so admin client cert can't be committed by accident.
Verified via kubectl --dry-run on all three modified manifests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
df78d9ccd8 |
Add Prometheus metrics + vmagent push to obs.88oakapps.com
Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via statement-level callbacks (no ctx plumbing needed). Storage and push clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the network round-trip points. Existing internal/monitoring metrics move to /metrics/legacy so the canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups. deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods and to the public obs endpoint over :443; the obs side runs VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate. docs/observability-plan.md captures the full plan including resource budget, instrumentation table, 4-step rollout, and migration triggers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |