Go To Prod Plan
This document is a phased production-readiness plan for the Casera Go API repo. Execute phases in order. Do not skip exit criteria.
How To Use This Plan
- Create an issue/epic per phase.
- Track each checklist item as a task.
- Only advance phases after all exit criteria pass in CI and staging.
Phase 0 - Baseline And Drift Cleanup
Goal: eliminate known repo/config drift before hardening.
Tasks
- Fix stale admin build/run targets in Makefile that reference cmd/admin (non-existent).
- Align worker env vars in docker-compose.yml with Go config:
  - use TASK_REMINDER_HOUR
  - use OVERDUE_REMINDER_HOUR
  - use DAILY_DIGEST_HOUR
- Align supported locales in internal/i18n/i18n.go with translation files in internal/i18n/translations.
- Remove any committed secrets/keys from repo and history; rotate immediately.
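One way to keep the worker env keys from drifting again is to read them through a single validated helper. The sketch below is illustrative only: `hourFromEnv`, the fallback hour of 8, and the 0-23 range check are assumptions, not the repo's actual config code.

```go
package main

import (
	"fmt"
	"strconv"
)

// hourFromEnv parses an hour-of-day setting (0-23). The lookup parameter has
// the same shape as os.LookupEnv so a fake can be passed in tests.
// Hypothetical helper; the default value is an assumption.
func hourFromEnv(lookup func(string) (string, bool), key string, def int) (int, error) {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return def, nil
	}
	h, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s: not an integer: %q", key, raw)
	}
	if h < 0 || h > 23 {
		return 0, fmt.Errorf("%s: hour %d out of range 0-23", key, h)
	}
	return h, nil
}

func main() {
	// Fake environment: one valid key, one missing key, one invalid value.
	env := map[string]string{"TASK_REMINDER_HOUR": "9", "DAILY_DIGEST_HOUR": "24"}
	lookup := func(k string) (string, bool) { v, ok := env[k]; return v, ok }

	keys := []string{"TASK_REMINDER_HOUR", "OVERDUE_REMINDER_HOUR", "DAILY_DIGEST_HOUR"}
	for _, key := range keys {
		h, err := hourFromEnv(lookup, key, 8)
		fmt.Println(key, h, err)
	}
}
```

Failing fast on an invalid value at startup makes a mismatched compose file visible immediately instead of silently falling back to a default.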
Validation
- go test ./...
- go build ./cmd/api ./cmd/worker
- docker compose config succeeds.
Exit Criteria
- No stale targets or mismatched env keys remain.
- CI and local boot work with a single source-of-truth config model.
Phase 1 - Non-Negotiable CI Gates
Goal: block regressions by policy.
Tasks
- Update /.github/workflows/backend-ci.yml with required jobs:
  - lint (go vet ./..., gofmt -l .)
  - test (go test -race -count=1 ./...)
  - contract (go test -v -run "TestRouteSpecContract|TestKMPSpecContract" ./internal/integration/)
  - build (go build ./cmd/api ./cmd/worker)
- Add govulncheck ./... job.
- Add secret scanning (for example, gitleaks).
- Set branch protection on main and develop:
  - require PR
  - require all status checks
  - require at least one review
  - dismiss stale reviews on new commits
Validation
- Open test PR with intentional formatting error; ensure merge is blocked.
- Open test PR with OpenAPI/route drift; ensure merge is blocked.
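The route drift check can be reduced to a set comparison between what the router registers and what the spec documents. This is a minimal sketch, not the existing contract test: `routeDrift`, the HTTP methods, and the `/api/debug/` route are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// toSet turns a slice of "METHOD /path" strings into a lookup set.
func toSet(items []string) map[string]bool {
	set := make(map[string]bool, len(items))
	for _, it := range items {
		set[it] = true
	}
	return set
}

// routeDrift reports drift in both directions: routes the router serves that
// the spec does not mention, and routes the spec promises that are not served.
func routeDrift(registered, documented []string) (undocumented, unimplemented []string) {
	reg := toSet(registered)
	doc := toSet(documented)
	for r := range reg {
		if !doc[r] {
			undocumented = append(undocumented, r)
		}
	}
	for d := range doc {
		if !reg[d] {
			unimplemented = append(unimplemented, d)
		}
	}
	// Sort so failure output is stable between runs.
	sort.Strings(undocumented)
	sort.Strings(unimplemented)
	return undocumented, unimplemented
}

func main() {
	registered := []string{"GET /api/tasks/", "POST /api/auth/login/", "GET /api/debug/"}
	documented := []string{"GET /api/tasks/", "POST /api/auth/login/", "GET /api/static_data/"}
	undocumented, unimplemented := routeDrift(registered, documented)
	fmt.Println("undocumented:", undocumented)
	fmt.Println("unimplemented:", unimplemented)
}
```

A contract test would fail when either slice is non-empty, which is what blocks the merge.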
Exit Criteria
- No direct merge path exists without passing all gates.
Phase 2 - Contract, Data, And Migration Safety
Goal: guarantee deploy safety for API behavior and schema changes.
Tasks
- Keep OpenAPI as source of truth in docs/openapi.yaml.
- Require route/schema updates in same PR as handler changes.
- Add migration checks in CI:
  - migrate up on clean DB
  - migrate down one step
  - migrate up again
- Add DB constraints for business invariants currently enforced only in service code.
- Add idempotency protections for webhook/job handlers.
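Idempotency for webhook/job handlers usually comes down to claiming an event ID before applying side effects. The sketch below uses an in-memory store purely for illustration; `seenStore` and `handleOnce` are hypothetical names, and a real deployment would likely back the claim with a database unique constraint or a Redis key instead.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errTransient is a sample failure used to show the retry path.
var errTransient = errors.New("transient failure")

// seenStore remembers which event IDs have been claimed (in memory only).
type seenStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenStore() *seenStore { return &seenStore{seen: make(map[string]struct{})} }

// claim records the key and reports whether this caller claimed it first.
func (s *seenStore) claim(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[key]; ok {
		return false
	}
	s.seen[key] = struct{}{}
	return true
}

// release forgets the key so a later delivery can be processed.
func (s *seenStore) release(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.seen, key)
}

// handleOnce runs fn only if eventID has not been claimed. A failed run
// releases the claim so the provider's retry is not dropped.
func handleOnce(s *seenStore, eventID string, fn func() error) (ran bool, err error) {
	if !s.claim(eventID) {
		return false, nil
	}
	if err := fn(); err != nil {
		s.release(eventID)
		return true, err
	}
	return true, nil
}

func main() {
	store := newSeenStore()
	applied := 0
	for i := 0; i < 3; i++ {
		ran, err := handleOnce(store, "evt_123", func() error { applied++; return nil })
		fmt.Println("ran:", ran, "err:", err)
	}
	fmt.Println("side effects applied:", applied)
}
```

Releasing the claim on failure is a design choice: it trades a possible duplicate attempt for never losing an event whose first delivery failed.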
Validation
- Run migration smoke test pipeline against ephemeral Postgres.
- Re-run integration contract tests after each endpoint change.
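The up, down-one-step, up-again check can be written once against a small interface and run against whichever migration tool the repo uses. The `migrator` interface and `fakeMigrator` below are hypothetical stand-ins; in CI the interface would be backed by the real tool pointed at an ephemeral Postgres.

```go
package main

import (
	"errors"
	"fmt"
)

// migrator is a hypothetical abstraction over the migration tool.
type migrator interface {
	Up() error
	Down(steps int) error
	Version() (int, error)
}

// smokeTest applies all migrations, rolls back one step, and re-applies,
// checking the schema version after each move.
func smokeTest(m migrator) error {
	if err := m.Up(); err != nil {
		return fmt.Errorf("initial up: %w", err)
	}
	latest, err := m.Version()
	if err != nil {
		return fmt.Errorf("version after up: %w", err)
	}
	if err := m.Down(1); err != nil {
		return fmt.Errorf("down one step: %w", err)
	}
	v, err := m.Version()
	if err != nil {
		return fmt.Errorf("version after down: %w", err)
	}
	if v != latest-1 {
		return fmt.Errorf("after down: version %d, want %d", v, latest-1)
	}
	if err := m.Up(); err != nil {
		return fmt.Errorf("second up: %w", err)
	}
	if v, err = m.Version(); err != nil {
		return fmt.Errorf("version after second up: %w", err)
	}
	if v != latest {
		return fmt.Errorf("after second up: version %d, want %d", v, latest)
	}
	return nil
}

// fakeMigrator simulates a tool with `latest` migrations for the demo.
type fakeMigrator struct {
	version, latest int
	brokenDown      bool // simulates a migration without a working down step
}

func (f *fakeMigrator) Up() error { f.version = f.latest; return nil }

func (f *fakeMigrator) Down(steps int) error {
	if f.brokenDown {
		return errors.New("down migration not implemented")
	}
	if steps > f.version {
		return errors.New("not enough applied migrations")
	}
	f.version -= steps
	return nil
}

func (f *fakeMigrator) Version() (int, error) { return f.version, nil }

func main() {
	fmt.Println("reversible:", smokeTest(&fakeMigrator{latest: 12}))
	fmt.Println("irreversible:", smokeTest(&fakeMigrator{latest: 12, brokenDown: true}))
}
```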
Exit Criteria
- Schema changes are reversible and validated before merge.
- API contract drift is caught pre-merge.
Phase 3 - Test Hardening For Failure Modes
Goal: increase confidence in edge cases and concurrency.
Tasks
- Add table-driven tests for task lifecycle transitions:
  - cancel/uncancel
  - archive/unarchive
  - complete/quick-complete
  - recurring next due date transitions
- Add timezone boundary tests around midnight and DST.
- Add concurrency tests for race-prone flows in services/repositories.
- Add fuzz/property tests for:
  - task categorization predicates
  - reminder schedule logic
- Add unauthorized-access tests for media/document/task cross-residence access.
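For the lifecycle transitions, a table keeps every allowed and rejected move in one reviewable place. The state names and transition rules below are assumptions for illustration and may not match the service's real rules; in a `_test.go` file the loop body would become a `t.Run` subtest.

```go
package main

import "fmt"

type state string

const (
	active    state = "active"
	cancelled state = "cancelled"
	archived  state = "archived"
	completed state = "completed"
)

type step struct {
	from   state
	action string
}

// allowed lists the assumed legal transitions; anything else is rejected.
var allowed = map[step]state{
	{active, "cancel"}:      cancelled,
	{cancelled, "uncancel"}: active,
	{active, "archive"}:     archived,
	{archived, "unarchive"}: active,
	{active, "complete"}:    completed,
}

// apply returns the next state, or the unchanged state and an error.
func apply(from state, action string) (state, error) {
	to, ok := allowed[step{from, action}]
	if !ok {
		return from, fmt.Errorf("cannot %s a task that is %s", action, from)
	}
	return to, nil
}

type transitionCase struct {
	name    string
	from    state
	action  string
	want    state
	wantErr bool
}

var cases = []transitionCase{
	{"cancel active", active, "cancel", cancelled, false},
	{"uncancel cancelled", cancelled, "uncancel", active, false},
	{"archive active", active, "archive", archived, false},
	{"unarchive archived", archived, "unarchive", active, false},
	{"complete active", active, "complete", completed, false},
	{"uncancel active is rejected", active, "uncancel", active, true},
	{"cancel completed is rejected", completed, "cancel", completed, true},
}

// failures runs every case and returns the names of the ones that fail.
func failures() []string {
	var out []string
	for _, tc := range cases {
		got, err := apply(tc.from, tc.action)
		if (err != nil) != tc.wantErr || got != tc.want {
			out = append(out, tc.name)
		}
	}
	return out
}

func main() {
	fmt.Println("failing cases:", len(failures()))
}
```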
Validation
- go test -race -count=1 ./... stays green.
- New tests fail when logic is intentionally broken (mutation spot checks).
Exit Criteria
- High-risk flows have explicit edge-case coverage.
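The DST boundary task is worth a concrete example, because "same hour tomorrow" is not always 24 hours away. This sketch assumes reminders fire at a fixed local hour; `nextLocalHour` and the America/New_York zone are illustrative choices, not the worker's actual scheduling code.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed zone data so LoadLocation works without system tzdata
)

// nextLocalHour returns the next wall-clock occurrence of hour:00 in loc that
// is strictly after now. Building the time with time.Date keeps the result on
// the local wall clock even when the UTC offset changes overnight.
func nextLocalHour(now time.Time, hour int, loc *time.Location) time.Time {
	local := now.In(loc)
	next := time.Date(local.Year(), local.Month(), local.Day(), hour, 0, 0, 0, loc)
	if !next.After(local) {
		// time.Date normalizes day overflow, so Day()+1 is safe at month end.
		next = time.Date(local.Year(), local.Month(), local.Day()+1, hour, 0, 0, 0, loc)
	}
	return next
}

// gapHours measures the real elapsed hours between 09:00 on the given date
// and the next 09:00 in New York.
func gapHours(year, month, day int) int {
	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}
	start := time.Date(year, time.Month(month), day, 9, 0, 0, 0, loc)
	return int(nextLocalHour(start, 9, loc).Sub(start) / time.Hour)
}

func main() {
	fmt.Println("ordinary day:", gapHours(2024, 6, 1))    // 24
	fmt.Println("spring forward:", gapHours(2024, 3, 9))  // 23
	fmt.Println("fall back:", gapHours(2024, 11, 2))      // 25
}
```

Code that adds a flat 24 hours to the previous run would fire an hour off after each DST change, which is the failure these boundary tests are meant to catch.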
Phase 4 - Security Hardening
Goal: reduce breach and abuse risk.
Tasks
- Add strict request size/time limits for upload and auth endpoints.
- Add rate limits for:
  - login
  - forgot/reset password
  - verification endpoints
  - webhooks
- Ensure logs redact secrets/tokens/PII payloads.
- Enforce least-privilege for runtime creds and service accounts.
- Enable dependency update cadence with security review.
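As a reference point for the rate-limit tasks, here is a minimal fixed-window limiter keyed by caller. It is a single-process sketch with assumed numbers (3 attempts per minute) and hypothetical names; a multi-instance API would more likely keep the counters in Redis or use an existing limiter package.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type hitWindow struct {
	start time.Time
	count int
}

// limiter allows at most `limit` hits per key within each `per` window.
type limiter struct {
	mu      sync.Mutex
	limit   int
	per     time.Duration
	now     func() time.Time // injectable clock for tests
	windows map[string]*hitWindow
}

func newLimiter(limit int, per time.Duration, now func() time.Time) *limiter {
	return &limiter{limit: limit, per: per, now: now, windows: make(map[string]*hitWindow)}
}

// allow records a hit for key and reports whether it is within the limit.
func (l *limiter) allow(key string) bool {
	if l.limit <= 0 {
		return false
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	t := l.now()
	w, ok := l.windows[key]
	if !ok || t.Sub(w.start) >= l.per {
		// First hit for this key, or the previous window has expired.
		l.windows[key] = &hitWindow{start: t, count: 1}
		return true
	}
	if w.count >= l.limit {
		return false
	}
	w.count++
	return true
}

// fakeClock lets the demo and tests move time forward deterministically.
type fakeClock struct{ t time.Time }

func (c *fakeClock) now() time.Time       { return c.t }
func (c *fakeClock) advanceSeconds(n int) { c.t = c.t.Add(time.Duration(n) * time.Second) }

// newDemo builds a limiter with assumed settings: 3 hits per minute.
func newDemo() (*limiter, *fakeClock) {
	clock := &fakeClock{t: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)}
	return newLimiter(3, time.Minute, clock.now), clock
}

func main() {
	l, clock := newDemo()
	for i := 1; i <= 4; i++ {
		fmt.Println("attempt", i, "allowed:", l.allow("login:203.0.113.7"))
	}
	clock.advanceSeconds(61)
	fmt.Println("after window reset:", l.allow("login:203.0.113.7"))
}
```

Keying by endpoint plus client identifier (as in `login:203.0.113.7`) keeps a brute-force attempt on login from consuming the budget for other endpoints.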
Validation
- Abuse test scripts for brute-force and oversized payload attempts.
- Verify logs do not expose secrets under failure paths.
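The oversized-payload check can be exercised in-process with the standard library's `http.MaxBytesReader`. The middleware name `limitBody`, the `/upload` path, and the 64-byte cap are placeholders for illustration; the sketch covers size limits only, not time limits.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// limitBody caps the request body at max bytes before the next handler reads it.
func limitBody(max int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, max)
		next.ServeHTTP(w, r)
	})
}

// readAll is a stand-in handler that consumes the body and maps the
// size-limit error to 413 (http.MaxBytesError requires Go 1.19+).
func readAll(w http.ResponseWriter, r *http.Request) {
	_, err := io.ReadAll(r.Body)
	var tooLarge *http.MaxBytesError
	switch {
	case errors.As(err, &tooLarge):
		http.Error(w, "payload too large", http.StatusRequestEntityTooLarge)
	case err != nil:
		http.Error(w, "bad request", http.StatusBadRequest)
	default:
		w.WriteHeader(http.StatusOK)
	}
}

// statusFor sends a body of bodySize bytes through the limited handler and
// returns the response status code.
func statusFor(bodySize int, max int64) int {
	body := strings.NewReader(strings.Repeat("x", bodySize))
	req := httptest.NewRequest(http.MethodPost, "/upload", body)
	rec := httptest.NewRecorder()
	limitBody(max, http.HandlerFunc(readAll)).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("small body:", statusFor(10, 64))
	fmt.Println("oversized body:", statusFor(100, 64))
}
```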
Exit Criteria
- Security scans pass and abuse protections are enforced in runtime.
Phase 5 - Observability And Operations
Goal: make production behavior measurable and actionable.
Tasks
- Standardize request correlation IDs across API and worker logs.
- Define SLOs:
  - API availability
  - p95 latency for key endpoints
  - worker queue delay
- Add dashboards + alerts for:
  - 5xx error rate
  - auth failures
  - queue depth/retry spikes
  - DB latency
- Add dead-letter queue review and replay procedure.
- Document incident runbooks in docs/:
  - DB outage
  - Redis outage
  - push provider outage
  - webhook backlog
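Correlation IDs are typically attached in one middleware and carried through the request context. The sketch below shows the API side only, using the standard library; the `X-Request-ID` header name and all function names are assumptions, and propagating the ID into worker job payloads is left out.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const headerRequestID = "X-Request-ID" // assumed header name

type requestIDKey struct{}

// newRequestID returns 16 hex characters from 8 random bytes.
func newRequestID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// withRequestID reuses an incoming ID or generates one, echoes it on the
// response, and stores it in the request context for loggers downstream.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(headerRequestID)
		if id == "" {
			id = newRequestID()
		}
		w.Header().Set(headerRequestID, id)
		ctx := context.WithValue(r.Context(), requestIDKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// requestIDFrom returns the ID stored by withRequestID, or "" if absent.
func requestIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(requestIDKey{}).(string)
	return id
}

// roundTrip sends one request through the middleware and returns the ID the
// handler saw in its context and the ID echoed on the response.
func roundTrip(incoming string) (seen, echoed string) {
	h := withRequestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = requestIDFrom(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/api/tasks/", nil)
	if incoming != "" {
		req.Header.Set(headerRequestID, incoming)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return seen, rec.Header().Get(headerRequestID)
}

func main() {
	seen, echoed := roundTrip("abc-123")
	fmt.Println("propagated:", seen, echoed)
	seen, echoed = roundTrip("")
	fmt.Println("generated IDs match:", seen != "" && seen == echoed)
}
```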
Validation
- Trigger synthetic failures in staging and confirm alerts fire.
- Execute at least one incident drill and capture MTTR.
Exit Criteria
- Team can detect and recover from common failures quickly.
Phase 6 - Performance And Capacity
Goal: prove headroom before production growth.
Tasks
- Define load profiles for hot endpoints:
  - /api/tasks/
  - /api/static_data/
  - /api/auth/login/
- Run load and soak tests in staging.
- Capture query plans for slow SQL and add indexes where needed.
- Validate Redis/cache fallback behavior under cache loss.
- Tune worker concurrency and queue weights from measured data.
Validation
- Meet agreed latency/error SLOs under target load.
- No sustained queue growth under steady-state load.
Exit Criteria
- Capacity plan is documented with clear limits and scaling triggers.
Phase 7 - Release Discipline And Recovery
Goal: safe deployments and verified rollback/recovery.
Tasks
- Adopt canary or blue/green deploy strategy.
- Add automatic rollback triggers based on SLO violations.
- Add pre-deploy checklist:
  - migrations reviewed
  - CI green
  - queue backlog healthy
  - dependencies healthy
- Validate backups with restore drills (not just backup existence).
- Document RPO/RTO targets and current measured reality.
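An automatic rollback trigger needs a rule that ignores single noisy samples. One possible rule, sketched here with invented thresholds, is to roll back only when the error rate exceeds the budget for several consecutive measurement windows; `sloWindow` and `shouldRollback` are hypothetical names.

```go
package main

import "fmt"

// sloWindow holds request and error counts for one measurement interval.
type sloWindow struct {
	requests, errors int
}

// errorRate returns errors/requests, treating an empty window as healthy.
func (w sloWindow) errorRate() float64 {
	if w.requests == 0 {
		return 0
	}
	return float64(w.errors) / float64(w.requests)
}

// shouldRollback reports whether each of the most recent `consecutive`
// windows exceeds maxErrorRate. Fewer windows than required means no decision.
func shouldRollback(windows []sloWindow, maxErrorRate float64, consecutive int) bool {
	if consecutive <= 0 || len(windows) < consecutive {
		return false
	}
	for _, w := range windows[len(windows)-consecutive:] {
		if w.errorRate() <= maxErrorRate {
			return false
		}
	}
	return true
}

func main() {
	// Error rates per window: 0.2%, 3%, 4.5%; assumed budget: 1%.
	windows := []sloWindow{{1000, 2}, {1000, 30}, {1000, 45}}
	fmt.Println("two bad windows in a row:", shouldRollback(windows, 0.01, 2))
	fmt.Println("three bad windows in a row:", shouldRollback(windows, 0.01, 3))
}
```

Requiring consecutive violations delays the rollback by a window or two but avoids reverting a healthy deploy because of one spike.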
Validation
- Perform one full staging rollback rehearsal.
- Perform one restore-from-backup rehearsal.
Exit Criteria
- Deploy and rollback are repeatable, scripted, and tested.
Definition Of Done (Every PR)
- go vet ./...
- gofmt -l . returns no files
- go test -race -count=1 ./...
- Contract tests pass
- OpenAPI updated for endpoint changes
- Migrations added and reversible for schema changes
- Security impact reviewed for auth/uploads/media/webhooks
- Observability impact reviewed for new critical paths
Recommended Execution Timeline
- Week 1: Phase 0 + Phase 1
- Week 2: Phase 2
- Week 3-4: Phase 3 + Phase 4
- Week 5: Phase 5
- Week 6: Phase 6 + Phase 7 rehearsal
Adjust the timeline to team size and release pressure, but keep the phase ordering.