honeyDueAPI

admin/honeyDueAPI

Fork 0

Commit Graph

Author	SHA1	Message	Date
Trey t	e448ec66dc	docs(runbook): rewrite for OVH BHS cluster + Tier-3 observability TODOs Brings the runbook in line with the 2026-06-03 Hetzner → OVH cutover: - Section 1-5: topology, machines (3x OVH VPS-1 BHS), software versions, network/firewall, DNS, filesystem layout — all reflect the live OVH install instead of the historical Hetzner setup. - Section 6: canonical install-from-clean-boxes procedure (the literal commands run on 2026-06-03), so anyone can stand up a backup cluster by following along. - Section 9: keeps existing gotchas (vmagent NetPol, token-blown-away, healthy-but-empty) and adds four new ones discovered during the OVH build: rbac.yaml not in 03-deploy.sh, namespace label missing from api metrics (use service="api"), cluster-label collision when two clusters push concurrently, worker double-firing on cutover. - Section 11.1: enumerates Tier-3 observability gaps surfaced while building the honeydue-eli5-overview dashboard (node-exporter not deployed, Traefik metrics off, push success counters absent, worker /metrics endpoint absent, cache hit rate uninstrumented, APNs latency uninstrumented). - Section 12: dated audit trail of cluster changes. Pure documentation; no code or manifest changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 09:34:35 -05:00
Trey t	139a990ebc	fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details vmagent's k8s service discovery has been silently broken for 17+ days because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination matched none of vmagent's egress rules → TCP RST → "connection refused" on every SD watch attempt. Grafana panels using kube_* or up{} metrics returned empty as a result. Changes: - network-policies.yaml: commit the previously-cluster-only NetPols (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy produces a working cluster. The vmagent egress rule now includes :6443 to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for scraping kube-state-metrics). - observability/kube-state-metrics.yaml: new manifest. Provides the kube_pod_, kube_deployment_, kube_service_* metrics that Grafana panels need to count pods, replicas, etc. Runs in kube-system with cluster-scoped RBAC. - observability/vmagent.yaml: * add kube-state-metrics scrape job to the ConfigMap * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works * replace the misleading liveness probe (was /-/healthy, which lies while SD is broken) with an exec probe that checks /api/v1/targets for at least one healthy target — automatic recovery from future stale-SD incidents - scripts/03-deploy.sh: actually apply network-policies.yaml (was committed but never applied) and apply kube-state-metrics.yaml. - RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe trap, bearer-token recovery procedure, drift-detection diff, and a post-redeploy verification checklist. - .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled kubectl sessions) so admin client cert can't be committed by accident. Verified via kubectl --dry-run on all three modified manifests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 00:30:11 -05:00

Author

SHA1

Message

Date

Trey t

e448ec66dc

docs(runbook): rewrite for OVH BHS cluster + Tier-3 observability TODOs

Brings the runbook in line with the 2026-06-03 Hetzner → OVH cutover:

- Section 1-5: topology, machines (3x OVH VPS-1 BHS), software versions,
  network/firewall, DNS, filesystem layout — all reflect the live OVH
  install instead of the historical Hetzner setup.
- Section 6: canonical install-from-clean-boxes procedure (the literal
  commands run on 2026-06-03), so anyone can stand up a backup cluster
  by following along.
- Section 9: keeps existing gotchas (vmagent NetPol, token-blown-away,
  healthy-but-empty) and adds four new ones discovered during the OVH
  build: rbac.yaml not in 03-deploy.sh, namespace label missing from api
  metrics (use service="api"), cluster-label collision when two clusters
  push concurrently, worker double-firing on cutover.
- Section 11.1: enumerates Tier-3 observability gaps surfaced while
  building the honeydue-eli5-overview dashboard (node-exporter not
  deployed, Traefik metrics off, push success counters absent, worker
  /metrics endpoint absent, cache hit rate uninstrumented, APNs latency
  uninstrumented).
- Section 12: dated audit trail of cluster changes.

Pure documentation; no code or manifest changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-03 09:34:35 -05:00

Trey t

139a990ebc

fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics

Backend CI / Test (push) Has been cancelled

Details

Backend CI / Contract Tests (push) Has been cancelled

Details

Backend CI / Build (push) Has been cancelled

Details

Backend CI / Lint (push) Has been cancelled

Details

Backend CI / Secret Scanning (push) Has been cancelled

Details

vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.

Changes:

- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).

- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.

- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target — automatic recovery from future
    stale-SD incidents

- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.

- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.

- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so admin client cert can't be committed by accident.

Verified via kubectl --dry-run on all three modified manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 00:30:11 -05:00

2 Commits