# Go To Prod Plan

This document is a phased production-readiness plan for the Casera Go API repo.
Execute phases in order. Do not skip exit criteria.

## How To Use This Plan

1. Create an issue/epic per phase.
2. Track each checklist item as a task.
3. Only advance phases after all exit criteria pass in CI and staging.

## Phase 0 - Baseline And Drift Cleanup

Goal: eliminate known repo/config drift before hardening.

### Tasks

1. Fix stale admin build/run targets in [`Makefile`](/Users/treyt/Desktop/code/MyCribAPI_GO/Makefile) that reference `cmd/admin` (non-existent).
2. Align worker env vars in [`docker-compose.yml`](/Users/treyt/Desktop/code/MyCribAPI_GO/docker-compose.yml) with Go config:
   - use `TASK_REMINDER_HOUR`
   - use `OVERDUE_REMINDER_HOUR`
   - use `DAILY_DIGEST_HOUR`
3. Align supported locales in [`internal/i18n/i18n.go`](/Users/treyt/Desktop/code/MyCribAPI_GO/internal/i18n/i18n.go) with translation files in [`internal/i18n/translations`](/Users/treyt/Desktop/code/MyCribAPI_GO/internal/i18n/translations).
4. Remove any committed secrets/keys from repo and history; rotate immediately.

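One way to keep the worker's schedule keys from drifting again is to parse them in a single place and fail fast on bad values. The sketch below is a minimal illustration, not the repo's actual config code: the three key names come from the task list above, while `parseHour`, `loadWorkerHours`, and the default hours are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseHour validates an hour-of-day setting (0-23).
// An empty raw value means "unset" and yields the default.
func parseHour(key, raw string, def int) (int, error) {
	if raw == "" {
		return def, nil
	}
	h, err := strconv.Atoi(raw)
	if err != nil || h < 0 || h > 23 {
		return 0, fmt.Errorf("%s: want an hour in 0-23, got %q", key, raw)
	}
	return h, nil
}

// loadWorkerHours reads every schedule key from the environment.
// The defaults are placeholders, not values taken from the repo.
func loadWorkerHours() (map[string]int, error) {
	defaults := map[string]int{
		"TASK_REMINDER_HOUR":    9,
		"OVERDUE_REMINDER_HOUR": 10,
		"DAILY_DIGEST_HOUR":     8,
	}
	hours := make(map[string]int, len(defaults))
	for key, def := range defaults {
		h, err := parseHour(key, os.Getenv(key), def)
		if err != nil {
			return nil, err
		}
		hours[key] = h
	}
	return hours, nil
}

func main() {
	hours, err := loadWorkerHours()
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Println(hours)
}
```

Because the worker exits on an invalid value, a mistyped entry in `docker-compose.yml` would surface at boot rather than as a reminder that silently never fires.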
### Validation

1. `go test ./...`
2. `go build ./cmd/api ./cmd/worker`
3. `docker compose config` succeeds.

### Exit Criteria

1. No stale targets or mismatched env keys remain.
2. CI and local boot work with a single source-of-truth config model.

---

## Phase 1 - Non-Negotiable CI Gates

Goal: block regressions by policy.

### Tasks

1. Update [`/.github/workflows/backend-ci.yml`](/Users/treyt/Desktop/code/MyCribAPI_GO/.github/workflows/backend-ci.yml) with required jobs:
   - `lint` (`go vet ./...`, `gofmt -l .`)
   - `test` (`go test -race -count=1 ./...`)
   - `contract` (`go test -v -run "TestRouteSpecContract|TestKMPSpecContract" ./internal/integration/`)
   - `build` (`go build ./cmd/api ./cmd/worker`)
2. Add `govulncheck ./...` job.
3. Add secret scanning (for example, gitleaks).
4. Set branch protection on `main` and `develop`:
   - require PR
   - require all status checks
   - require at least one review
   - dismiss stale reviews on new commits

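The formatting gate is easy to get wrong: `gofmt -l` lists unformatted files but still exits with status 0, so the `lint` job has to fail on non-empty output rather than on the exit code. As a rough illustration of what the check compares, the sketch below uses the standard `go/format` package; `isFormatted` is a hypothetical helper, not something the repo is known to contain.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
)

// isFormatted reports whether src is already in canonical gofmt form.
// It returns an error when src does not parse as Go source.
func isFormatted(src string) (bool, error) {
	out, err := format.Source([]byte(src))
	if err != nil {
		return false, err
	}
	return bytes.Equal(out, []byte(src)), nil
}

func main() {
	clean := "package p\n\nfunc f() {}\n"
	messy := "package p\nfunc  f(){}\n"
	for _, src := range []string{clean, messy} {
		ok, err := isFormatted(src)
		fmt.Printf("formatted=%v err=%v\n", ok, err)
	}
}
```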
### Validation

1. Open a test PR with an intentional formatting error; ensure merge is blocked.
2. Open a test PR with OpenAPI/route drift; ensure merge is blocked.

### Exit Criteria

1. No direct merge path exists without passing all gates.

---

## Phase 2 - Contract, Data, And Migration Safety

Goal: guarantee deploy safety for API behavior and schema changes.

### Tasks

1. Keep OpenAPI as source of truth in [`docs/openapi.yaml`](/Users/treyt/Desktop/code/MyCribAPI_GO/docs/openapi.yaml).
2. Require route/schema updates in same PR as handler changes.
3. Add migration checks in CI:
   - migrate up on clean DB
   - migrate down one step
   - migrate up again
4. Add DB constraints for business invariants currently enforced only in service code.
5. Add idempotency protections for webhook/job handlers.

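For the idempotency task, the core idea is to claim an event ID before doing any work and to skip deliveries whose ID was already claimed. The following is a minimal in-memory sketch under stated assumptions: `IdempotencyStore`, `Claim`, and `HandleOnce` are hypothetical names, and a production version would more likely persist claims in Postgres or Redis so they survive restarts and are shared between instances.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyStore remembers which event IDs were already handled.
// In-memory only: claims are lost on restart and not shared across processes.
type IdempotencyStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{seen: make(map[string]struct{})}
}

// Claim marks eventID as seen and reports whether it was new.
func (s *IdempotencyStore) Claim(eventID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, dup := s.seen[eventID]; dup {
		return false
	}
	s.seen[eventID] = struct{}{}
	return true
}

// HandleOnce runs fn only for the first delivery of eventID.
// The first result is true when fn was invoked. If fn fails, the claim
// is released so that a later retry can process the event.
func (s *IdempotencyStore) HandleOnce(eventID string, fn func() error) (bool, error) {
	if !s.Claim(eventID) {
		return false, nil
	}
	if err := fn(); err != nil {
		s.mu.Lock()
		delete(s.seen, eventID)
		s.mu.Unlock()
		return true, err
	}
	return true, nil
}

func main() {
	store := NewIdempotencyStore()
	processed := 0
	// Simulate a provider delivering the same webhook three times.
	for i := 0; i < 3; i++ {
		ran, err := store.HandleOnce("evt_123", func() error {
			processed++
			return nil
		})
		fmt.Printf("delivery %d: ran=%v err=%v\n", i+1, ran, err)
	}
	fmt.Println("times processed:", processed)
}
```

Releasing the claim on failure is a design choice: it favors at-least-once processing over at-most-once, which suits handlers whose side effects are themselves safe to retry.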
### Validation

1. Run migration smoke test pipeline against ephemeral Postgres.
2. Re-run integration contract tests after each endpoint change.

### Exit Criteria

1. Schema changes are reversible and validated before merge.
2. API contract drift is caught pre-merge.

---

## Phase 3 - Test Hardening For Failure Modes

Goal: increase confidence in edge cases and concurrency.

### Tasks

1. Add table-driven tests for task lifecycle transitions:
   - cancel/uncancel
   - archive/unarchive
   - complete/quick-complete
   - recurring next due date transitions
2. Add timezone boundary tests around midnight and DST.
3. Add concurrency tests for race-prone flows in services/repositories.
4. Add fuzz/property tests for:
   - task categorization predicates
   - reminder schedule logic
5. Add unauthorized-access tests for media/document/task cross-residence access.

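A table-driven lifecycle test lists each transition as a row of (starting state, action, expected result), including the rows that must be rejected. The sketch below shows the shape of such a table. The states and transition rules are hypothetical stand-ins for the real service logic, and in a `_test.go` file the loop in `checkCases` would normally be a `t.Run` per row.

```go
package main

import "fmt"

type State string

const (
	Active    State = "active"
	Cancelled State = "cancelled"
	Archived  State = "archived"
	Completed State = "completed"
)

// transitions maps action -> current state -> next state.
// Hypothetical rules for illustration; the real service may differ.
var transitions = map[string]map[State]State{
	"cancel":    {Active: Cancelled},
	"uncancel":  {Cancelled: Active},
	"archive":   {Active: Archived},
	"unarchive": {Archived: Active},
	"complete":  {Active: Completed},
}

// Apply returns the state after action, or an error (and the unchanged
// state) when the action is not allowed from the current state.
func Apply(from State, action string) (State, error) {
	to, ok := transitions[action][from]
	if !ok {
		return from, fmt.Errorf("cannot %s a task that is %s", action, from)
	}
	return to, nil
}

type transitionCase struct {
	name    string
	from    State
	action  string
	want    State
	wantErr bool
}

// checkCases runs every row and returns one message per failing row.
func checkCases(cases []transitionCase) []string {
	var failures []string
	for _, tc := range cases {
		got, err := Apply(tc.from, tc.action)
		if (err != nil) != tc.wantErr || got != tc.want {
			failures = append(failures, fmt.Sprintf("%s: got (%s, %v)", tc.name, got, err))
		}
	}
	return failures
}

func main() {
	cases := []transitionCase{
		{"cancel active", Active, "cancel", Cancelled, false},
		{"uncancel cancelled", Cancelled, "uncancel", Active, false},
		{"cancel twice", Cancelled, "cancel", Cancelled, true},
		{"complete archived", Archived, "complete", Archived, true},
		{"unknown action", Active, "explode", Active, true},
	}
	fmt.Println("failing rows:", len(checkCases(cases)))
}
```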
### Validation

1. `go test -race -count=1 ./...` stays green.
2. New tests fail when logic is intentionally broken (mutation spot checks).

### Exit Criteria

1. High-risk flows have explicit edge-case coverage.

---

## Phase 4 - Security Hardening

Goal: reduce breach and abuse risk.

### Tasks

1. Add strict request size/time limits for upload and auth endpoints.
2. Add rate limits for:
   - login
   - forgot/reset password
   - verification endpoints
   - webhooks
3. Ensure logs redact secrets/tokens/PII payloads.
4. Enforce least-privilege for runtime creds and service accounts.
5. Enable dependency update cadence with security review.

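Request size limits and rate limits can both be expressed as HTTP middleware. The sketch below uses only the standard library and assumes plain `net/http` handlers; the repo's actual router and middleware stack may differ. `windowLimiter` and `protect` are hypothetical names, the fixed-window counter is one of several possible algorithms, and the limits shown are placeholders.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
	"time"
)

// windowLimiter is a fixed-window counter keyed by client.
// It never evicts old keys, so it is suitable only as an illustration.
type windowLimiter struct {
	mu        sync.Mutex
	limit     int
	windowSec int64
	window    map[string]int64
	count     map[string]int
}

func newWindowLimiter(limit int, windowSec int64) *windowLimiter {
	return &windowLimiter{
		limit:     limit,
		windowSec: windowSec,
		window:    map[string]int64{},
		count:     map[string]int{},
	}
}

// Allow records one attempt for key at nowUnix and reports whether the
// attempt is within the limit for the current window.
func (l *windowLimiter) Allow(key string, nowUnix int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	current := nowUnix / l.windowSec
	if l.window[key] != current {
		l.window[key] = current
		l.count[key] = 0
	}
	l.count[key]++
	return l.count[key] <= l.limit
}

// protect rate-limits by client IP and caps the request body size.
func protect(l *windowLimiter, maxBytes int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// RemoteAddr is "ip:port"; behind a proxy the real client IP
		// would have to come from a trusted forwarding header instead.
		host, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			host = r.RemoteAddr
		}
		if !l.Allow(host, time.Now().Unix()) {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		// Reads beyond maxBytes return an error to the handler.
		r.Body = http.MaxBytesReader(w, r.Body, maxBytes)
		next.ServeHTTP(w, r)
	})
}

func main() {
	login := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	})
	h := protect(newWindowLimiter(2, 60), 1<<10, login)
	for i := 0; i < 3; i++ {
		req := httptest.NewRequest(http.MethodPost, "/api/auth/login/", strings.NewReader("{}"))
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, req)
		fmt.Println("status:", rec.Code)
	}
}
```

A fixed window allows short bursts at window boundaries; a token bucket or sliding window is a common alternative when that matters, and a shared store would be needed once the API runs as more than one instance.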
### Validation

1. Run abuse test scripts for brute-force and oversized payload attempts.
2. Verify logs do not expose secrets under failure paths.

### Exit Criteria

1. Security scans pass and abuse protections are enforced in runtime.

---

## Phase 5 - Observability And Operations

Goal: make production behavior measurable and actionable.

### Tasks

1. Standardize request correlation IDs across API and worker logs.
2. Define SLOs:
   - API availability
   - p95 latency for key endpoints
   - worker queue delay
3. Add dashboards + alerts for:
   - 5xx error rate
   - auth failures
   - queue depth/retry spikes
   - DB latency
4. Add dead-letter queue review and replay procedure.
5. Document incident runbooks in [`docs/`](/Users/treyt/Desktop/code/MyCribAPI_GO/docs):
   - DB outage
   - Redis outage
   - push provider outage
   - webhook backlog

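Correlation IDs are usually attached at the edge of the API and then copied into every log line and job payload derived from that request. The sketch below shows one way to do this with standard `net/http` middleware. The header name `X-Request-ID` and the helpers `resolveID`, `withCorrelationID`, and `correlationID` are assumptions for illustration, not names taken from the repo.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// headerName is an assumption; use whichever header the platform forwards.
const headerName = "X-Request-ID"

type ctxKey struct{}

// resolveID keeps an incoming ID so one trace can span services, and
// mints a random 16-hex-character ID when the caller sent none.
func resolveID(incoming string) string {
	if incoming != "" {
		return incoming
	}
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "unavailable"
	}
	return hex.EncodeToString(b)
}

// withCorrelationID stores the ID in the request context and echoes it
// in the response so clients can quote it in bug reports.
func withCorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := resolveID(r.Header.Get(headerName))
		w.Header().Set(headerName, id)
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// correlationID returns the ID stored by withCorrelationID, or "".
func correlationID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

func main() {
	h := withCorrelationID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// A job enqueued here would carry the same ID in its payload so
		// that worker logs can be joined with API logs.
		fmt.Fprintf(w, "request_id=%s", correlationID(r.Context()))
	}))
	req := httptest.NewRequest(http.MethodGet, "/api/tasks/", nil)
	req.Header.Set(headerName, "abc123")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	fmt.Println(rec.Body.String())
}
```

Accepting a caller-supplied ID verbatim is convenient for tracing, but the value ends up in logs, so a real implementation would likely validate its length and character set first.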
### Validation

1. Trigger synthetic failures in staging and confirm alerts fire.
2. Execute at least one incident drill and capture MTTR.

### Exit Criteria

1. Team can detect and recover from common failures quickly.

---

## Phase 6 - Performance And Capacity

Goal: prove headroom before production growth.

### Tasks

1. Define load profiles for hot endpoints:
   - `/api/tasks/`
   - `/api/static_data/`
   - `/api/auth/login/`
2. Run load and soak tests in staging.
3. Capture query plans for slow SQL and add indexes where needed.
4. Validate Redis/cache fallback behavior under cache loss.
5. Tune worker concurrency and queue weights from measured data.

### Validation

1. Meet agreed latency/error SLOs under target load.
2. No sustained queue growth under steady-state load.

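Checking a latency SLO after a load run comes down to computing a percentile over the recorded samples and comparing it, together with the error rate, against the agreed budget. The sketch below uses the nearest-rank percentile definition, which is one of several in common use; most load-testing tools report percentiles themselves, so this is only meant to make the arithmetic concrete. `percentile` and `meetsSLO` are hypothetical helpers and the thresholds are placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (1-100) of samples using the
// nearest-rank method. It returns 0 for an empty slice.
func percentile(samples []int64, p int) int64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int64(nil), samples...) // copy; leave the input untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// Integer ceiling of p*n/100 avoids floating-point rounding issues.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// meetsSLO reports whether p95 latency and error rate are within budget.
// A run with no requests is treated as a failure because it proves nothing.
func meetsSLO(latenciesMs []int64, errors, total int, maxP95Ms int64, maxErrorRate float64) bool {
	if total == 0 {
		return false
	}
	errRate := float64(errors) / float64(total)
	return percentile(latenciesMs, 95) <= maxP95Ms && errRate <= maxErrorRate
}

func main() {
	latencies := []int64{40, 10, 30, 20}
	fmt.Println("p50:", percentile(latencies, 50))
	fmt.Println("p95:", percentile(latencies, 95))
	fmt.Println("within budget:", meetsSLO(latencies, 0, 4, 50, 0.01))
}
```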
### Exit Criteria

1. Capacity plan is documented with clear limits and scaling triggers.

---

## Phase 7 - Release Discipline And Recovery

Goal: safe deployments and verified rollback/recovery.

### Tasks

1. Adopt canary or blue/green deploy strategy.
2. Add automatic rollback triggers based on SLO violations.
3. Add pre-deploy checklist:
   - migrations reviewed
   - CI green
   - queue backlog healthy
   - dependencies healthy
4. Validate backups with restore drills (not just backup existence).
5. Document RPO/RTO targets and current measured reality.

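An automatic rollback trigger needs a rule that reacts to sustained SLO violations without firing on a single noisy measurement. One simple rule is "roll back when the last N measurement windows all violate the SLO". The sketch below illustrates that rule; the types `SLO` and `Window`, the function `shouldRollback`, and the thresholds are hypothetical, and the actual trigger would depend on the chosen deploy tooling.

```go
package main

import "fmt"

// SLO holds the thresholds a canary must stay within.
type SLO struct {
	MaxErrorRate float64
	MaxP95Ms     int64
}

// Window is one measurement interval observed after a deploy.
type Window struct {
	Requests int
	Errors   int
	P95Ms    int64
}

// violates reports whether the window breaks either threshold.
// A window without traffic is not counted as a violation.
func (w Window) violates(slo SLO) bool {
	if w.Requests == 0 {
		return false
	}
	rate := float64(w.Errors) / float64(w.Requests)
	return rate > slo.MaxErrorRate || w.P95Ms > slo.MaxP95Ms
}

// shouldRollback reports whether the most recent `consecutive` windows
// all violate the SLO.
func shouldRollback(windows []Window, slo SLO, consecutive int) bool {
	if consecutive <= 0 || len(windows) < consecutive {
		return false
	}
	for _, w := range windows[len(windows)-consecutive:] {
		if !w.violates(slo) {
			return false
		}
	}
	return true
}

func main() {
	slo := SLO{MaxErrorRate: 0.01, MaxP95Ms: 500}
	windows := []Window{
		{Requests: 1000, Errors: 5, P95Ms: 200},  // healthy
		{Requests: 1000, Errors: 50, P95Ms: 200}, // error rate too high
		{Requests: 1000, Errors: 2, P95Ms: 900},  // latency too high
	}
	fmt.Println("rollback after 2 bad windows:", shouldRollback(windows, slo, 2))
	fmt.Println("rollback after 3 bad windows:", shouldRollback(windows, slo, 3))
}
```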
### Validation

1. Perform one full staging rollback rehearsal.
2. Perform one restore-from-backup rehearsal.

### Exit Criteria

1. Deploy and rollback are repeatable, scripted, and tested.

---

## Definition Of Done (Every PR)

1. `go vet ./...`
2. `gofmt -l .` returns no files
3. `go test -race -count=1 ./...`
4. Contract tests pass
5. OpenAPI updated for endpoint changes
6. Migrations added and reversible for schema changes
7. Security impact reviewed for auth/uploads/media/webhooks
8. Observability impact reviewed for new critical paths

---

## Recommended Execution Timeline

1. Week 1: Phase 0 + Phase 1
2. Week 2: Phase 2
3. Weeks 3-4: Phase 3 + Phase 4
4. Week 5: Phase 5
5. Week 6: Phase 6 + Phase 7 rehearsal

Adjust the timeline based on team size and release pressure, but keep the ordering.