4 Commits

Author SHA1 Message Date
Trey t 1347ffadf5 docs: presigned-URL upload flow + B2 lifecycle setup
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
09-storage.md:
  - Replaced the "Upload flow" section. The previous text described the
    multipart-via-API path that was removed in b7f8329. Now documents
    the three-step direct-to-B2 flow (presign → POST to B2 → attach
    via upload_ids[]) with an ASCII diagram and a server-side
    enforcement-points table.
  - Replaced the "Future: signed URLs" placeholder (since presigned
    URLs are now the present, not the future).
  - Added "Lifecycle and retention" subsections covering the
    pending_uploads cleanup cron (worker, 30 * * * *), the B2 bucket
    lifecycle as backstop (uploads/ prefix, 7-day hide + 1-day delete),
    and the still-open user-deletion cascade gap.

14-deployment-process.md:
  - Added a "One-time B2 bucket lifecycle (manual)" section explaining
    why the rule can't live in the deploy script (B2's S3 lifecycle
    API is partial), the exact rule to apply via the Backblaze
    console, and a verification command.

docs/deployment/README.md:
  - Updated the chapter 9 description to mention presigned-URL uploads.

README.md (root):
  - Added a paragraph under "Object storage" pointing to the new
    upload architecture and the relevant deployment-book chapters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:44:08 -07:00
Trey t 8d9ca2e6ed docs(deployment): rewrite migration prose for goose adoption
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Update the deployment book and glossary to reflect the goose-based
schema migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition
  failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover
  from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as
  tests-only
- references: add goose links; flag AutoMigrate as tests-only
2026-04-26 23:01:32 -05:00
Trey t 77cfcc0b27 docs: rewrite ch15 observability + cross-refs for the live obs stack
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint,
the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget,
and a token-rotation runbook. The "what we still don't have" section
keeps log aggregation, alerting, and full distributed tracing as the
honest gap list.

Other touched docs:
- 00-overview: \"deliberately absent\" no longer claims we have no
  metrics — calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh
  (full build + push + apply + obs vmagent), with the manual
  kubectl-set-image flow kept as the single-service path. Notes the
  IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
  CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat
  sheet block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:05:06 -05:00
Trey t 6f303dbbaa Migrate prod deploy from Swarm to K3s; add full deployment book
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00