Fix timeout middleware panic on proxy/WebSocket routes and worker healthcheck

The TimeoutMiddleware wraps the response writer in *http.timeoutWriter, which
does not implement http.Flusher. When the admin reverse proxy or WebSocket
upgrader tries to flush, the handler panics and crashes the container, which
surfaces as 502 Bad Gateway. Skip the timeout for /admin, /_next, and /ws routes.
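
A minimal sketch of the skip logic, assuming a plain net/http stack built on
http.TimeoutHandler. The helper names (shouldSkipTimeout, withTimeout,
canFlush) are hypothetical, and the actual middleware in this repo may be
structured differently:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// skipPrefixes lists the route prefixes named in this commit message.
var skipPrefixes = []string{"/admin", "/_next", "/ws"}

// shouldSkipTimeout reports whether path is one of the skipped prefixes
// or sits underneath one (so "/ws" and "/ws/chat" match, "/wsx" does not).
func shouldSkipTimeout(path string) bool {
	for _, p := range skipPrefixes {
		if path == p || strings.HasPrefix(path, p+"/") {
			return true
		}
	}
	return false
}

// withTimeout applies http.TimeoutHandler to every route except the skipped
// ones, which receive the original ResponseWriter untouched.
func withTimeout(next http.Handler, d time.Duration) http.Handler {
	timed := http.TimeoutHandler(next, d, "request timed out")
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if shouldSkipTimeout(r.URL.Path) {
			next.ServeHTTP(w, r)
			return
		}
		timed.ServeHTTP(w, r)
	})
}

// canFlush serves one request for path and reports whether the inner handler
// was given a writer that implements http.Flusher.
func canFlush(path string) bool {
	var ok bool
	h := withTimeout(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, ok = w.(http.Flusher)
	}), time.Second)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
	return ok
}

func main() {
	for _, p := range []string{"/admin/users", "/ws", "/api/tasks/"} {
		fmt.Printf("%-14s flusher=%v\n", p, canFlush(p))
	}
}
```

In this sketch /admin/users and /ws reach the handler with a writer that
supports Flush, while /api/tasks/ still goes through the timeout wrapper.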

Also fix the Dockerfile HEALTHCHECK to detect the worker process. The worker
has no HTTP server, so the curl-based check always failed and the container
was marked unhealthy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Trey t
2026-03-02 19:56:12 -06:00
parent 7690f07a2b
commit 7438dfd9b1
8 changed files with 12 additions and 1 deletions

docs/GO_TO_PROD.md Normal file

@@ -0,0 +1,260 @@
# Go To Prod Plan
This document is a phased production-readiness plan for the Casera Go API repo.
Execute phases in order. Do not skip exit criteria.
## How To Use This Plan
1. Create an issue/epic per phase.
2. Track each checklist item as a task.
3. Only advance phases after all exit criteria pass in CI and staging.
## Phase 0 - Baseline And Drift Cleanup
Goal: eliminate known repo/config drift before hardening.
### Tasks
1. Fix stale admin build/run targets in [`Makefile`](/Users/treyt/Desktop/code/MyCribAPI_GO/Makefile) that reference `cmd/admin` (non-existent).
2. Align worker env vars in [`docker-compose.yml`](/Users/treyt/Desktop/code/MyCribAPI_GO/docker-compose.yml) with Go config:
- use `TASK_REMINDER_HOUR`
- use `OVERDUE_REMINDER_HOUR`
- use `DAILY_DIGEST_HOUR`
3. Align supported locales in [`internal/i18n/i18n.go`](/Users/treyt/Desktop/code/MyCribAPI_GO/internal/i18n/i18n.go) with translation files in [`internal/i18n/translations`](/Users/treyt/Desktop/code/MyCribAPI_GO/internal/i18n/translations).
4. Remove any committed secrets/keys from repo and history; rotate immediately.
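For task 2, one way to keep compose and Go config from drifting silently is to validate the three hour settings at startup. The following is a minimal sketch, not the repo's actual config loader: `hourFromEnv` and `fromMap` are hypothetical helpers, and the default of `9` is a placeholder.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// hourFromEnv reads an hour-of-day setting through lookup and validates 0..23.
// A missing or empty value falls back to def; a malformed or out-of-range
// value is returned as an error so the process can refuse to start.
func hourFromEnv(lookup func(string) (string, bool), key string, def int) (int, error) {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return def, nil
	}
	h, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s: %q is not an integer", key, raw)
	}
	if h < 0 || h > 23 {
		return 0, fmt.Errorf("%s: hour %d out of range 0-23", key, h)
	}
	return h, nil
}

// fromMap adapts a map to the lookup signature so tests never touch the
// real process environment.
func fromMap(m map[string]string) func(string) (string, bool) {
	return func(k string) (string, bool) {
		v, ok := m[k]
		return v, ok
	}
}

func main() {
	// The default below is a placeholder, not the repo's real default.
	keys := []string{"TASK_REMINDER_HOUR", "OVERDUE_REMINDER_HOUR", "DAILY_DIGEST_HOUR"}
	for _, key := range keys {
		h, err := hourFromEnv(os.LookupEnv, key, 9)
		if err != nil {
			fmt.Fprintln(os.Stderr, "config error:", err)
			os.Exit(1)
		}
		fmt.Printf("%s=%d\n", key, h)
	}
}
```

Failing fast on a bad value makes a misnamed or mistyped key visible at boot rather than at the first missed reminder.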
### Validation
1. `go test ./...`
2. `go build ./cmd/api ./cmd/worker`
3. `docker compose config` succeeds.
### Exit Criteria
1. No stale targets or mismatched env keys remain.
2. CI and local boot work with a single source-of-truth config model.
---
## Phase 1 - Non-Negotiable CI Gates
Goal: block regressions by policy.
### Tasks
1. Update [`/.github/workflows/backend-ci.yml`](/Users/treyt/Desktop/code/MyCribAPI_GO/.github/workflows/backend-ci.yml) with required jobs:
- `lint` (`go vet ./...`, `gofmt -l .`)
- `test` (`go test -race -count=1 ./...`)
- `contract` (`go test -v -run "TestRouteSpecContract|TestKMPSpecContract" ./internal/integration/`)
- `build` (`go build ./cmd/api ./cmd/worker`)
2. Add `govulncheck ./...` job.
3. Add secret scanning (for example, gitleaks).
4. Set branch protection on `main` and `develop`:
- require PR
- require all status checks
- require at least one review
- dismiss stale reviews on new commits
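To make the `lint` gate concrete: `gofmt -l .` lists files whose content differs from gofmt's canonical output. The sketch below approximates that check for a single source buffer using the standard `go/format` package. `isGofmtClean` is a hypothetical helper for illustration; the CI job itself should keep calling `gofmt` directly.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
)

// isGofmtClean reports whether src is already in canonical gofmt style.
// Source that does not parse is reported through err.
func isGofmtClean(src []byte) (bool, error) {
	formatted, err := format.Source(src)
	if err != nil {
		return false, err
	}
	return bytes.Equal(src, formatted), nil
}

func main() {
	samples := []struct {
		name string
		src  string
	}{
		{"clean", "package demo\n\nfunc Add(a, b int) int { return a + b }\n"},
		{"messy", "package demo\n\nfunc Add(a,b int) int { return a+b }\n"},
	}
	for _, s := range samples {
		ok, err := isGofmtClean([]byte(s.src))
		fmt.Printf("%s: clean=%v err=%v\n", s.name, ok, err)
	}
}
```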
### Validation
1. Open test PR with intentional formatting error; ensure merge is blocked.
2. Open test PR with OpenAPI/route drift; ensure merge is blocked.
### Exit Criteria
1. No direct merge path exists without passing all gates.
---
## Phase 2 - Contract, Data, And Migration Safety
Goal: guarantee deploy safety for API behavior and schema changes.
### Tasks
1. Keep OpenAPI as source of truth in [`docs/openapi.yaml`](/Users/treyt/Desktop/code/MyCribAPI_GO/docs/openapi.yaml).
2. Require route/schema updates in same PR as handler changes.
3. Add migration checks in CI:
- migrate up on clean DB
- migrate down one step
- migrate up again
4. Add DB constraints for business invariants currently enforced only in service code.
5. Add idempotency protections for webhook/job handlers.
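For task 5, the core of an idempotency guard is "claim the event ID, run the handler once, release the claim on failure". The sketch below shows that shape with an in-memory map. `Deduper` and `errTransient` are hypothetical names, and a real deployment would likely need a shared store (for example a unique DB constraint or Redis) so that multiple API or worker instances agree.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errTransient stands in for a retryable handler failure.
var errTransient = errors.New("transient failure")

// Deduper remembers processed event IDs so a redelivered event runs once.
// It is process-local and unbounded, which is fine for a sketch only.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// Do runs fn only if id has not been claimed before and reports whether fn
// ran. If fn fails, the claim is released so a later retry can succeed.
func (d *Deduper) Do(id string, fn func() error) (bool, error) {
	d.mu.Lock()
	if _, dup := d.seen[id]; dup {
		d.mu.Unlock()
		return false, nil
	}
	d.seen[id] = struct{}{} // claim before running so concurrent duplicates are rejected
	d.mu.Unlock()

	if err := fn(); err != nil {
		d.mu.Lock()
		delete(d.seen, id)
		d.mu.Unlock()
		return true, err
	}
	return true, nil
}

func main() {
	d := NewDeduper()
	handled := 0
	for _, id := range []string{"evt_1", "evt_1", "evt_2"} {
		ran, err := d.Do(id, func() error { handled++; return nil })
		fmt.Println(id, "ran:", ran, "err:", err)
	}
	fmt.Println("handled:", handled)
}
```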
### Validation
1. Run migration smoke test pipeline against ephemeral Postgres.
2. Re-run integration contract tests after each endpoint change.
### Exit Criteria
1. Schema changes are reversible and validated before merge.
2. API contract drift is caught pre-merge.
---
## Phase 3 - Test Hardening For Failure Modes
Goal: increase confidence in edge cases and concurrency.
### Tasks
1. Add table-driven tests for task lifecycle transitions:
- cancel/uncancel
- archive/unarchive
- complete/quick-complete
- recurring next due date transitions
2. Add timezone boundary tests around midnight and DST.
3. Add concurrency tests for race-prone flows in services/repositories.
4. Add fuzz/property tests for:
- task categorization predicates
- reminder schedule logic
5. Add unauthorized-access tests for media/document/task cross-residence access.
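Tasks 1 and 2 combine naturally: a table of instants near midnight, each judged in a specific zone. The sketch below uses a hypothetical `isDueToday` predicate (not taken from this repo) and `time.FixedZone` so it runs without a tz database. Real DST cases would need `time.LoadLocation` with named zones, and in the repo this table would live in a `_test.go` file using `t.Run`.

```go
package main

import (
	"fmt"
	"time"
)

// isDueToday reports whether due falls on the same calendar day as now,
// with both instants evaluated in the user's location loc.
func isDueToday(due, now time.Time, loc *time.Location) bool {
	dy, dm, dd := due.In(loc).Date()
	ny, nm, nd := now.In(loc).Date()
	return dy == ny && dm == nm && dd == nd
}

type dueCase struct {
	name     string
	due, now time.Time
	loc      *time.Location
	want     bool
}

// boundaryCases pairs the same UTC instants with different zones to show
// that the answer flips around local midnight.
func boundaryCases() []dueCase {
	minus6 := time.FixedZone("UTC-6", -6*60*60)
	at := func(day, hour, minute int) time.Time {
		return time.Date(2026, time.March, day, hour, minute, 0, 0, time.UTC)
	}
	return []dueCase{
		{"late evening local, next day in UTC", at(3, 5, 30), at(2, 23, 0), minus6, true},
		{"same instants judged in UTC", at(3, 5, 30), at(2, 23, 0), time.UTC, false},
		{"due at local midnight, now one minute before", at(3, 6, 0), at(3, 5, 59), minus6, false},
		{"same pair judged in UTC", at(3, 6, 0), at(3, 5, 59), time.UTC, true},
	}
}

// failures runs the table and returns the names of cases that disagree.
func failures() []string {
	var failed []string
	for _, c := range boundaryCases() {
		if got := isDueToday(c.due, c.now, c.loc); got != c.want {
			failed = append(failed, c.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failing cases:", failures())
}
```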
### Validation
1. `go test -race -count=1 ./...` stays green.
2. New tests fail when logic is intentionally broken (mutation spot checks).
### Exit Criteria
1. High-risk flows have explicit edge-case coverage.
---
## Phase 4 - Security Hardening
Goal: reduce breach and abuse risk.
### Tasks
1. Add strict request size/time limits for upload and auth endpoints.
2. Add rate limits for:
- login
- forgot/reset password
- verification endpoints
- webhooks
3. Ensure logs redact secrets/tokens/PII payloads.
4. Enforce least-privilege for runtime creds and service accounts.
5. Enable dependency update cadence with security review.
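For task 2, a fixed-window counter keyed per client is about the simplest limiter that works. The sketch below is an assumption-laden illustration: `Limiter`, the key format, and the limit of 3 per minute are hypothetical, the state is process-local, and a multi-instance deployment would likely need a shared store such as Redis.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type window struct {
	start time.Time
	count int
}

// Limiter is a fixed-window rate limiter keyed by an arbitrary string,
// for example "login:" plus the client IP.
type Limiter struct {
	mu     sync.Mutex
	limit  int
	period time.Duration
	hits   map[string]*window
}

func NewLimiter(limit int, period time.Duration) *Limiter {
	return &Limiter{limit: limit, period: period, hits: make(map[string]*window)}
}

// Allow records one attempt for key at time now and reports whether it is
// within the limit. Passing now explicitly keeps the limiter testable.
func (l *Limiter) Allow(key string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	w, ok := l.hits[key]
	if !ok || now.Sub(w.start) >= l.period {
		w = &window{start: now} // start a fresh window for this key
		l.hits[key] = w
	}
	if w.count >= l.limit {
		return false
	}
	w.count++
	return true
}

// base and at give deterministic timestamps for the demo.
var base = time.Date(2026, time.March, 2, 12, 0, 0, 0, time.UTC)

func at(sec int) time.Time { return base.Add(time.Duration(sec) * time.Second) }

func main() {
	l := NewLimiter(3, time.Minute)
	for _, s := range []int{0, 10, 20, 30, 60} {
		fmt.Printf("t+%02ds allowed=%v\n", s, l.Allow("login:203.0.113.7", at(s)))
	}
}
```

Fixed windows allow short bursts at window edges; a sliding window or token bucket trades a little complexity for smoother enforcement.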
### Validation
1. Abuse test scripts for brute-force and oversized payload attempts.
2. Verify logs do not expose secrets under failure paths.
### Exit Criteria
1. Security scans pass and abuse protections are enforced in runtime.
---
## Phase 5 - Observability And Operations
Goal: make production behavior measurable and actionable.
### Tasks
1. Standardize request correlation IDs across API and worker logs.
2. Define SLOs:
- API availability
- p95 latency for key endpoints
- worker queue delay
3. Add dashboards + alerts for:
- 5xx error rate
- auth failures
- queue depth/retry spikes
- DB latency
4. Add dead-letter queue review and replay procedure.
5. Document incident runbooks in [`docs/`](/Users/treyt/Desktop/code/MyCribAPI_GO/docs):
- DB outage
- Redis outage
- push provider outage
- webhook backlog
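For task 1, the API side of correlation IDs usually comes down to one middleware: reuse an inbound ID if present, otherwise mint one, then expose it on the context and the response. The sketch below assumes plain `net/http` and an `X-Request-ID` header. Both are assumptions rather than this repo's conventions, and `WithRequestID`, `RequestID`, and `roundTrip` are hypothetical names. The worker would then need the same ID carried in the job payload.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type ctxKey struct{}

const headerRequestID = "X-Request-ID"

// newID returns 16 hex characters of random data.
func newID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// WithRequestID reuses the inbound header when present, otherwise generates
// an ID, then stores it in the request context and echoes it on the response.
// A production version would probably validate or cap the inbound value.
func WithRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(headerRequestID)
		if id == "" {
			id = newID()
		}
		w.Header().Set(headerRequestID, id)
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// RequestID returns the ID stored by WithRequestID, or "" if none is set.
func RequestID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// roundTrip sends one request through the middleware and returns the ID the
// handler saw plus the ID echoed in the response header.
func roundTrip(inbound string) (seen, echoed string) {
	h := WithRequestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = RequestID(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/api/tasks/", nil)
	if inbound != "" {
		req.Header.Set(headerRequestID, inbound)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return seen, rec.Header().Get(headerRequestID)
}

func main() {
	seen, echoed := roundTrip("abc-123")
	fmt.Println("inbound reused:", seen, echoed)
	seen, echoed = roundTrip("")
	fmt.Println("generated matches:", seen == echoed)
}
```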
### Validation
1. Trigger synthetic failures in staging and confirm alerts fire.
2. Execute at least one incident drill and capture MTTR.
### Exit Criteria
1. Team can detect and recover from common failures quickly.
---
## Phase 6 - Performance And Capacity
Goal: prove headroom before production growth.
### Tasks
1. Define load profiles for hot endpoints:
- `/api/tasks/`
- `/api/static_data/`
- `/api/auth/login/`
2. Run load and soak tests in staging.
3. Capture query plans for slow SQL and add indexes where needed.
4. Validate Redis/cache fallback behavior under cache loss.
5. Tune worker concurrency and queue weights from measured data.
### Validation
1. Meet agreed latency/error SLOs under target load.
2. No sustained queue growth under steady-state load.
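Checking validation item 1 means reducing raw load-test samples to a percentile. The sketch below uses the nearest-rank method, which is one common definition; load-testing tools may interpolate instead, so compare like with like. `percentile`, `ms`, and `rampMs` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of samples using the
// nearest-rank method. It returns 0 for an empty sample and does not modify
// the input slice.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// ms converts whole milliseconds to a time.Duration.
func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

// rampMs returns n samples of step, 2*step, ..., n*step milliseconds in
// descending order, so the demo also exercises the sort.
func rampMs(n, step int) []time.Duration {
	out := make([]time.Duration, 0, n)
	for i := n; i >= 1; i-- {
		out = append(out, ms(i*step))
	}
	return out
}

func main() {
	samples := rampMs(20, 10) // 10ms .. 200ms
	fmt.Println("p50:", percentile(samples, 50))
	fmt.Println("p95:", percentile(samples, 95))
}
```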
### Exit Criteria
1. Capacity plan is documented with clear limits and scaling triggers.
---
## Phase 7 - Release Discipline And Recovery
Goal: safe deployments and verified rollback/recovery.
### Tasks
1. Adopt canary or blue/green deploy strategy.
2. Add automatic rollback triggers based on SLO violations.
3. Add pre-deploy checklist:
- migrations reviewed
- CI green
- queue backlog healthy
- dependencies healthy
4. Validate backups with restore drills (not just backup existence).
5. Document RPO/RTO targets alongside the values actually measured in drills.
### Validation
1. Perform one full staging rollback rehearsal.
2. Perform one restore-from-backup rehearsal.
### Exit Criteria
1. Deploy and rollback are repeatable, scripted, and tested.
---
## Definition Of Done (Every PR)
1. `go vet ./...`
2. `gofmt -l .` returns no files
3. `go test -race -count=1 ./...`
4. Contract tests pass
5. OpenAPI updated for endpoint changes
6. Migrations added and reversible for schema changes
7. Security impact reviewed for auth/uploads/media/webhooks
8. Observability impact reviewed for new critical paths
---
## Recommended Execution Timeline
1. Week 1: Phase 0 + Phase 1
2. Week 2: Phase 2
3. Week 3-4: Phase 3 + Phase 4
4. Week 5: Phase 5
5. Week 6: Phase 6 + Phase 7 rehearsal
Adjust timeline based on team size and release pressure, but keep ordering.