honeyDueAPI

admin/honeyDueAPI

Fork 0

Commit Graph

Author	SHA1	Message	Date
Trey t	2004f9c5b2	fix(observability): relax vmagent liveness probe — was crash-looping every ~5m Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details The previous probe had timeoutSeconds=1 which is too tight for the shell pipeline (sh + wget + grep + comparison). On a busy node the wget call regularly exceeded 1s, the exec timed out, and 3 consecutive timeouts triggered SIGTERM. Result: vmagent restarted ~5x per 30 min, causing brief gaps that made the Grafana "Pods up" panel render 0 whenever a refresh happened to coincide with a restart. The relaxed probe still catches the original failure mode (zero healthy targets) but only kills the pod after 10 full minutes of consecutive failure (5 attempts × 2 min period), not 3 minutes (3 × 1 min). timeoutSeconds: 1 → 5 periodSeconds: 60 → 120 failureThreshold: 3 → 5 initialDelaySeconds: 120 → 180 Also added wget -T 4 inside the command so wget itself bounds its network call to 4s — leaving 1s of slack within the 5s exec budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 00:39:23 -05:00
Trey t	139a990ebc	fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details vmagent's k8s service discovery has been silently broken for 17+ days because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination matched none of vmagent's egress rules → TCP RST → "connection refused" on every SD watch attempt. Grafana panels using kube_* or up{} metrics returned empty as a result. Changes: - network-policies.yaml: commit the previously-cluster-only NetPols (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy produces a working cluster. The vmagent egress rule now includes :6443 to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for scraping kube-state-metrics). - observability/kube-state-metrics.yaml: new manifest. Provides the kube_pod_, kube_deployment_, kube_service_* metrics that Grafana panels need to count pods, replicas, etc. Runs in kube-system with cluster-scoped RBAC. - observability/vmagent.yaml: * add kube-state-metrics scrape job to the ConfigMap * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works * replace the misleading liveness probe (was /-/healthy, which lies while SD is broken) with an exec probe that checks /api/v1/targets for at least one healthy target — automatic recovery from future stale-SD incidents - scripts/03-deploy.sh: actually apply network-policies.yaml (was committed but never applied) and apply kube-state-metrics.yaml. - RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe trap, bearer-token recovery procedure, drift-detection diff, and a post-redeploy verification checklist. - .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled kubectl sessions) so admin client cert can't be committed by accident. Verified via kubectl --dry-run on all three modified manifests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 00:30:11 -05:00
Trey t	df78d9ccd8	Add Prometheus metrics + vmagent push to obs.88oakapps.com Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via statement-level callbacks (no ctx plumbing needed). Storage and push clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the network round-trip points. Existing internal/monitoring metrics move to /metrics/legacy so the canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups. deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods and to the public obs endpoint over :443; the obs side runs VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate. docs/observability-plan.md captures the full plan including resource budget, instrumentation table, 4-step rollout, and migration triggers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 14:16:17 -05:00

Author

SHA1

Message

Date

Trey t

2004f9c5b2

fix(observability): relax vmagent liveness probe — was crash-looping every ~5m

Backend CI / Test (push) Has been cancelled

Details

Backend CI / Contract Tests (push) Has been cancelled

Details

Backend CI / Lint (push) Has been cancelled

Details

Backend CI / Secret Scanning (push) Has been cancelled

Details

Backend CI / Build (push) Has been cancelled

Details

The previous probe had timeoutSeconds=1 which is too tight for the
shell pipeline (sh + wget + grep + comparison). On a busy node the
wget call regularly exceeded 1s, the exec timed out, and 3 consecutive
timeouts triggered SIGTERM. Result: vmagent restarted ~5x per 30 min,
causing brief gaps that made the Grafana "Pods up" panel render 0
whenever a refresh happened to coincide with a restart.

The relaxed probe still catches the original failure mode (zero healthy
targets) but only kills the pod after 10 full minutes of consecutive
failure (5 attempts × 2 min period), not 3 minutes (3 × 1 min).

  timeoutSeconds: 1 → 5
  periodSeconds: 60 → 120
  failureThreshold: 3 → 5
  initialDelaySeconds: 120 → 180

Also added wget -T 4 inside the command so wget itself bounds its
network call to 4s — leaving 1s of slack within the 5s exec budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 00:39:23 -05:00

Trey t

139a990ebc

fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics

Backend CI / Test (push) Has been cancelled

Details

Backend CI / Contract Tests (push) Has been cancelled

Details

Backend CI / Build (push) Has been cancelled

Details

Backend CI / Lint (push) Has been cancelled

Details

Backend CI / Secret Scanning (push) Has been cancelled

Details

vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.

Changes:

- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).

- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.

- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target — automatic recovery from future
    stale-SD incidents

- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.

- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.

- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so admin client cert can't be committed by accident.

Verified via kubectl --dry-run on all three modified manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 00:30:11 -05:00

Trey t

df78d9ccd8

Add Prometheus metrics + vmagent push to obs.88oakapps.com

Backend CI / Test (push) Has been cancelled

Details

Backend CI / Contract Tests (push) Has been cancelled

Details

Backend CI / Build (push) Has been cancelled

Details

Backend CI / Lint (push) Has been cancelled

Details

Backend CI / Secret Scanning (push) Has been cancelled

Details

Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and
FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via
statement-level callbacks (no ctx plumbing needed). Storage and push
clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the
network round-trip points.

Existing internal/monitoring metrics move to /metrics/legacy so the
canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups.

deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica
vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics
every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write
with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN
in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods
and to the public obs endpoint over :443; the obs side runs
VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate.

docs/observability-plan.md captures the full plan including resource
budget, instrumentation table, 4-step rollout, and migration triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-25 14:16:17 -05:00

3 Commits