Commit Graph

153 Commits

Author SHA1 Message Date
Trey t 52bf1ff3c7 perf(task): offload completion notification fan-out to Asynq worker
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
POST /api/task-completions/ was spending ~1.5-1.75s synchronously on
APNs push + SMTP email + B2 image fetches inside sendTaskCompletedNotification.
Per-user loop made it scale linearly with residence membership; one image
attached + one residence user is the 1.75s baseline observed in the live
honeydue-eli5-overview Grafana panel.

Replace the inline call (and the fire-and-forget goroutine in QuickComplete,
which violated the project's "no goroutines in handlers" rule) with an
Asynq job:

  - new task type notification:task_completed (worker/scheduler.go)
  - new payload {task_id, completion_id} — IDs only, worker re-reads
    canonical state from Postgres so concurrent edits between enqueue
    and dequeue are reflected
  - new HandleTaskCompletedNotification on jobs.Handler delegates to
    TaskService.SendTaskCompletedNotificationByID
  - new dispatchTaskCompletedNotification in task_service.go picks
    between enqueue (preferred) and inline (fallback) when Redis is
    unreachable or the enqueuer isn't wired (tests / local dev)

Other changes required to wire it up:

  - widen worker.NewTaskClient signature to accept asynq.RedisClientOpt
    so the file-mounted Redis password (audit HIGH-1) can be supplied;
    no prior callers, no breakage
  - extend worker.Enqueuer interface with EnqueueTaskCompletedNotification
  - add TaskEnqueuer field to router.Dependencies; wire from cmd/api/main.go
    with the standard typed-nil interface guard
  - wire a worker-side TaskService in cmd/worker/main.go so the handler
    can use the shared SendTaskCompletedNotificationByID implementation
    (storage service shared with the existing upload-cleanup wiring)

Expected impact on POST /api/task-completions/ p50:
  ~1.75s -> ~120-170ms (DB + tx + Asynq enqueue only)

Notifications still deliver; they just go via the worker instead of in
the request path. MaxRetry=3; "row not found" returns nil so a deleted
task/completion doesn't churn the retry loop.

All 31 test packages pass. No DB migrations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-03 09:34:52 -05:00
Trey t 3d3ba84df0 fix(auth): delete the Kratos identity on account deletion
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Account deletion removed all local data but left the Ory Kratos
identity intact — an orphaned identity that can still authenticate.
Close the gap:

- kratos.Client gains the admin API: NewClient(publicURL, adminURL)
  and DeleteIdentity (DELETE /admin/identities/{id}; a 404 is treated
  as success so a retry after a partial failure is idempotent).
- AuthService.DeleteAccount deletes the Kratos identity FIRST; if that
  call fails it aborts before touching local data, so the operation is
  retryable rather than partially applied.
- KRATOS_ADMIN_URL config (default http://kratos:4434) + router wiring.
- kratos NetworkPolicy split: the api pods may now reach the admin API
  :4434 (Traefik still reaches only the public API :4433).
- kratos CORS: allow_credentials + OPTIONS so the web browser flows
  (ory_kratos_session cookie) work; origins stay an explicit allowlist.
- Regression tests: identity teardown happens, and a Kratos failure
  aborts the deletion instead of orphaning local data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:55:33 -05:00
Trey t 81578f6e27 feat(auth): replace hand-rolled auth with Ory Kratos — phase 2 backend
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Delegates all credential management (login, register, password reset,
email verification, social sign-in) to Ory Kratos. The Go API now acts
as a resource server: the new KratosAuth middleware validates sessions
against the Kratos whoami endpoint, writes the local User mirror into
Echo context, and all existing domain handlers continue working
unchanged. Hand-rolled token auth, AuthToken model, apple_auth/
google_auth services, and the auth refresh flow are removed. Tests are
updated to use the fake-token middleware pattern so existing integration
assertions require no rewrite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:55:56 -05:00
Trey t c77ff07ce9 fix(security): remediate 2026-05-12 audit findings (Stages 2–5)
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps),
tracked in deploy-k3s/SECURITY.md, plus fixes from two independent
post-remediation reviews.

Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on
  re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)

IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log +
  webhook_event_log table creation, append-only audit log)

Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join
  (C9/H9), device-token reassignment (C8/LOW-3)

Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis
  password out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics
  lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban +
  unattended-upgrades at provision; secret-rotation runbook

Build, vet, and the full test suite (incl. -race) pass; the goose
migration chain is verified against PostgreSQL 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:28:33 -05:00
Trey t 7cc5448a7c fix(uploads): switch from S3 POST policy to presigned PUT
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backblaze B2's S3-compatible endpoint does not implement the S3 POST
Object operation. It returns HTTP 501 to every POST regardless of URL
style — both path-style (https://s3.<region>.backblazeb2.com/<bucket>/)
and virtual-hosted-style (https://<bucket>.s3.<region>.backblazeb2.com/).

Yesterday's BucketLookupDNS fix produced virtual-hosted URLs, which is
correct for AWS but doesn't help here — B2 rejects POST on either form.
Verified with `curl -X POST https://...backblazeb2.com/honeyDueProd/`
returning 501 directly, with no signature involved.

Replace minio-go's PresignedPostPolicy with PresignHeader + http.MethodPut.
The signed URL now points at a single PUT endpoint, with Content-Type
and Content-Length signed via headers — B2/S3/MinIO all accept it. Drop
the min/max content-length range (we sign exactly one length now);
post-upload size verification still happens in VerifyAndClaim via HEAD.

Response shape:
  - URL  (was: signed POST endpoint)  → now: signed PUT URL
  - Fields → renamed to Headers; client sends them as request headers,
    not multipart form parts
  - Method (new): always "PUT", emitted explicitly so clients don't
    have to hardcode

Companion KMP/iOS commits switch the client paths from multipart POST
to single PUT. Existing builds in the field will need to be rebuilt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 15:41:48 -05:00
Trey t 4efc87559a fix(uploads): force virtual-hosted-style URLs for B2 presigned POST
Backblaze B2's S3-compatible endpoint only implements POST Object on
virtual-hosted-style URLs (https://<bucket>.s3.<region>.backblazeb2.com/).
Path-style POST returns HTTP 501 Not Implemented.

minio-go's BucketLookupAuto only flips to virtual-hosted for AWS,
Google, and Aliyun endpoints — for B2 it falls through to path-style,
which is why every PresignedPostPolicy() call has been handing the
mobile clients a URL that B2 then refuses with 501.

Force BucketLookupDNS only when the endpoint is backblazeb2.com so
MinIO dev (no DNS for arbitrary buckets at minio:9000) keeps its
path-style default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:34:05 -05:00
Trey t b7f83293b8 refactor(uploads): drop legacy multipart code paths
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The presigned-URL upload flow (POST /api/uploads/presign + direct B2 POST
+ upload_ids[] in entity creation) is now the only image upload path. The
legacy multipart routes and DTO fields used by older clients are removed:

Removed:
  - POST /api/uploads/image/        (legacy multipart upload → URL)
  - POST /api/uploads/document/     (legacy multipart upload → URL)
  - POST /api/uploads/completion/   (legacy multipart upload → URL)
  - Multipart branch in POST /api/task-completions/ (now JSON-only)
  - CreateTaskCompletionRequest.ImageURLs DTO field
  - UpdateTaskCompletionRequest.ImageURLs DTO field
  - CreateDocumentRequest.ImageURLs DTO field
  - Service-layer ImageURLs loops in task_service.CreateCompletion,
    task_service.UpdateCompletion, document_service.CreateDocument
  - Tests exercising the removed paths
  - Now-unused imports (strings/time/decimal) in task_handler.go

Kept:
  - DELETE /api/uploads/  (orphan-cleanup endpoint, still useful)
  - POST /api/uploads/presign/  (the new path)
  - POST /api/documents/:id/images/  (uses storage_service.Upload directly,
    same multipart pattern but separate code path; deferred for now)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:19:21 -07:00
Trey t 29c9014a33 feat(uploads): direct-to-B2 presigned uploads with content-length-range policy
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Replaces the multipart-via-API path for image uploads with a three-step
direct-to-storage flow:

  1. Client POSTs /api/uploads/presign with content_length + content_type;
     server validates size (10 MB cap), mime allow-list per category, rate
     limit (50/hour/user via Redis sliding window), and concurrent unclaimed
     cap (10 in-flight per user). On success it persists a pending_uploads
     row, signs an S3 POST policy with content-length-range bound to the
     claimed length ±256 bytes, and returns the URL+fields.
  2. Client POSTs the bytes directly to B2 using the signed policy. B2
     enforces size, content-type, and key match before accepting.
  3. Client passes upload_ids[] to /api/task-completions/ or /api/documents/.
     Service HEADs each B2 object, verifies size matches expected_bytes
     within slack, marks pending_uploads claimed_at, and creates the
     associated TaskCompletionImage / DocumentImage rows.

Bytes never traverse our API server. The 1 MB Echo BodyLimit middleware
that was rejecting all task-completion image uploads becomes irrelevant
for this path. Existing multipart endpoints stay functional alongside,
soak-testing the new path before legacy removal.

Cleanup:
  - cmd/worker registers a new hourly cron (TypeUploadCleanup, "30 * * * *")
    that reaps pending_uploads where claimed_at IS NULL AND expires_at < NOW().
    Reaps both the B2 object and the row.
  - B2 bucket lifecycle rule on `uploads/` prefix (7 days hide → 1 day delete)
    documented in deploy-k3s/manifests/b2-lifecycle.md as a backstop.

Schema:
  - migrations/000002_pending_uploads.sql adds the table + partial index for
    cleanup + nullable pending_upload_id FKs on task_taskcompletionimage and
    task_documentimage.

Policy (single tier, no free/pro split):
  - 10 MB cap per upload
  - 50 presigns/hour/user
  - 10 concurrent unclaimed uploads/user
  - allow-list: jpeg/png/heic/heif/webp for image categories;
    + pdf for document_file

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:36:42 -07:00
Trey t 9bee436e86 perf(subscription-status): cache + parallelize + invalidate on mutations
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
GET /api/subscription/status/ was the slowest endpoint in the API at
p50≈1750ms / p95≈2425ms — about 12× the floor for our cluster→Neon
geography. Jaeger traces showed seven sequential SQL queries each
costing roughly one transatlantic RTT (~110ms), with the actual queries
running in 0.073ms at the database. Pure network serialization, not slow
SQL.

Three changes, in order of leverage:

1. Cache the assembled SubscriptionStatusResponse per-user in Redis with
   a 5-minute TTL. Hot path collapses to a single Redis GET (~5ms) on
   warm reads; the TTL is a safety net against missed invalidations.

2. Parallelize the three independent COUNT queries in getUserUsage
   (task_task / task_contractor / task_document) via golang.org/x/sync
   errgroup. Three RTTs collapse to one. Also dropped the redundant
   residence_residence COUNT — len(residenceIDs) from FindResidenceIDsByOwner
   is the same number, no need to re-query.

3. Wire explicit invalidation into every mutation that could change a
   user's response — residence/task/contractor/document CRUD,
   residence membership changes (JoinWithCode, RemoveUser, DeleteResidence),
   and every subscription tier flip across the IAP/Stripe/webhook surface.
   Residence-scoped invalidations fan out to every user with access via a
   new ResidenceRepository.FindUserIDsByResidence helper, so members of a
   shared residence don't see stale `usage` numbers when another member
   adds a task.

Net effect: warm path goes from ~1350ms to ~5ms (Redis hit). Cold path
goes from ~1350ms to ~250-450ms (5 sequential queries → 2 phases:
residence IDs lookup, then parallel task/contractor/document counts).

Also fixed a pre-existing CheckLimit signature drift in
internal/integration/subscription_is_free_test.go that was blocking the
package build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:00:23 -07:00
Trey t 0798ae8d74 fix(testutil): use shared-cache SQLite so concurrent reads see same DB
SetupTestDB used `sqlite.Open(":memory:")`, which creates a *separate*
in-memory database for every connection in GORM's pool. Sequential tests
never noticed because the pool keeps reusing one connection — but the
moment any code path issued concurrent reads (e.g. errgroup-driven
parallel COUNT queries), a goroutine could pull a fresh connection, see
no migrated tables, and explode with "no such table".

Switched to `file:testdb_<n>?mode=memory&cache=shared&_journal=memory`
with a per-test atomic counter so every connection in the pool sees the
same in-memory DB and tests stay isolated from each other through the
unique cache namespace. As a bonus, this also resolves the pre-existing
TestTaskHandler_QuickComplete flake — same root cause, just intermittent
because the pool occasionally handed out a second connection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:00:03 -07:00
Trey t 8fce568532 fix(config): replace sync.Once reset-from-Do with mutex
Load()'s validation-failure path reassigned cfgOnce = sync.Once{} from
inside Do(). When Do() returned and tried to unlock the original mutex,
the Once struct had already been replaced with a fresh one whose mutex
was unlocked, panicking with "sync: unlock of unlocked mutex" on every
boot where any required env var was missing or invalid.

Replaced the Once with a plain sync.Mutex around a nil-check on the
package-level cfg, building the candidate into a local first and only
assigning to cfg after validate() succeeds. Same caching semantics, no
race, and a failed Load() leaves cfg nil so the next caller retries
cleanly.

Also documented AppleAuthConfig.TeamID as currently dead — it's loaded
from APPLE_TEAM_ID but no service reads it. Wire-up point noted for
when Sign in with Apple revocation/refresh is added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:35:54 -07:00
Trey t 12b2f9d43b Adopt pressly/goose for schema migrations
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate path,
which had two compounding problems:
- AutoMigrate ran on every pod startup (~5 min over the transatlantic
  link) even when no schema changes had landed
- pg_advisory_lock is session-scoped, which silently fails through
  Neon's pgbouncer transaction-mode pooler — turns out this is a
  known and documented limitation that bites golang-migrate too

Goose was chosen over golang-migrate (the other heavyweight) because:
- Goose wraps each migration file in a transaction by default, so a
  failure rolls back cleanly instead of leaving a "dirty" version
  state requiring manual force-reset (golang-migrate's known
  weakness, per its own issue tracker — see #1001 + Atlas's writeup)
- Goose's locking is opt-in. We don't opt in: migrations run as a
  single Kubernetes Job, which IS the singleton process. No advisory
  lock needed at all.

Layout:
- migrations/000001_init.sql — schema-only pg_dump of the live Neon
  DB at adoption, stripped of psql-only directives that block goose's
  bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had
  their effects folded into this baseline; deleted from the live tree
  but preserved in git history at 58e6997.
- Dockerfile installs `goose v3.22.1` at build time and copies the
  binary into the api image. The migrate Job reuses the api image with
  command=goose, so no separate image to build/push/version.
- deploy-k3s/manifests/migrate/job.yaml: a one-shot Job that strips
  the -pooler segment from DB_HOST (advisory lock won't survive
  pgbouncer transaction-mode), runs `goose up`, exits.
- deploy-k3s/scripts/03-deploy.sh: deletes any prior Job, applies the
  fresh one, `kubectl wait --for=condition=complete --timeout=10m`,
  then proceeds with api/worker rollout. Job failure aborts the deploy
  before any new app pod sees a stale schema.
- internal/database/database.go::RequireSchemaApplied checks
  goose_db_version on startup. api/worker refuse to boot if the
  table is missing or its latest row has is_applied=false — the
  fail-fast for "operator forgot to run migrate."
- Makefile: migrate-up / migrate-down / migrate-status / migrate-new
  for local workflow.

Production DB was bootstrapped manually:
  $ goose -dir migrations postgres "$DSN" version  # creates table
  $ psql ... -c "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"

Smoke test against fresh Postgres locally: 50 user tables created in
284ms via `goose up`, version_id=1 + is_applied=t recorded.

Verified the local goose CLI talks to prod successfully:
  $ goose ... status
  Applied At                  Migration
  =======================================
  Mon Apr 27 03:43:55 2026 -- 000001_init.sql

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:46:36 -05:00
Trey t d96f317d20 Revert "Fix migration deadlock under Neon pooler"
This reverts commit 30966c6f5e.
2026-04-26 22:22:07 -05:00
Trey t 30966c6f5e Fix migration deadlock under Neon pooler
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Today's pooler-endpoint switch broke MigrateWithLock: pg_advisory_lock is
session-scoped, but PgBouncer transaction-mode releases the underlying
Postgres session after every transaction. The lock was being released
the moment we acquired it, and on the next pod's startup the migration
either deadlocked or proceeded without serialization. Visible as
\"Acquiring migration advisory lock...\" hanging until the startup
probe killed the pod (as just happened on the b67f7f9 deploy).

Fix: open a parallel *gorm.DB pointed at the *direct* Neon endpoint
(DB_HOST with the -pooler segment stripped) for migrations only. That
keeps a real persistent session, so pg_advisory_lock works correctly.
The migration runs against this direct connection; the runtime pool
keeps using the pooler for everything else.

Side effects:
- Migrate() now delegates to migrate(target *gorm.DB) which lets
  MigrateWithLock pass the direct DB. Tests on SQLite still call
  Migrate() through the global pool unchanged.
- Migration DB uses MaxOpenConns=1, MaxIdleConns=1 — we just hold
  the lock on it, no connection pressure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 21:53:52 -05:00
Trey t b67f7f9e6b Cache SubscriptionSettings + cut monitoring poll noise
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Trace data revealed subscription_subscriptionsettings was consuming
1,983s of cumulative DB time per day (180× more than the next-largest
table) for a 32-byte singleton row of admin-toggleable global flags.
Root cause was a 30-second poll loop in monitoring.Service per pod
plus uncached reads on every authed status check / CreateResidence /
Stripe webhook. Fix is layered:

1. Redis cache for SubscriptionSettings — same shape as the
   residence-IDs cache. 30-min TTL, explicit invalidation on admin
   write. New CacheService.{Cache,GetCached,Invalidate}SubscriptionSettings
   plus a cachedSubscriptionSettings helper in services/.

2. SubscriptionService, StripeService, and both admin handlers
   (settings + limitations) now read through the cache. Admin write
   handlers invalidate so toggles propagate cluster-wide within ms
   instead of waiting for the TTL.

3. monitoring.Service.syncSettingsFromDB also reads from Redis first
   (raw redis.Client to avoid a services→monitoring import cycle).
   Polling interval bumped 30s → 5min. Combined with Redis-shared
   cache, cluster-wide DB hits from this poll go from ~480/hour to
   ~2/hour — a 240× reduction.

4. StripeService.CreateCheckoutSession now takes ctx so the cached
   settings span (and the Stripe webhook trace) stay attached to the
   request. Handler call site updated.

5. Admin handlers' direct h.db.First calls switched to
   db.WithContext(ctx) so the resulting orphan SQL spans nest under
   the admin request span in Jaeger.

Net DB query rate for subscription_subscriptionsettings should drop
from 0.101/sec to ~0/sec with occasional invalidation-driven refills,
and the table's cumulative DB time from 1,983s/day to ~10s/day.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 21:29:30 -05:00
Trey t 88fb1751c7 Cut /api/tasks/ p99 from ~2500ms toward ~150-300ms
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Stack of optimizations against the same Hetzner→Neon transatlantic link.
The trace revealed every visible ms was network/proxy overhead — DB
execution itself is sub-millisecond per query (verified via EXPLAIN
ANALYZE: index scans on every hot path).

Connection layer:
- DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer
  transaction-mode keeps backend Postgres connections warm so we no
  longer pay the ~110ms Postgres-startup RTT on cold queries.
- GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s,
  MaxIdleTime added (default 0 = never close idle).
- Eager pool warm-up at boot via parallel pings — first user request
  no longer pays the ~440ms TCP+TLS+startup handshake.
- Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will
  evict cold keys instead of erroring at the 256MB limit.

Auth layer:
- TokenCacheTTL 5min → 1 hour (Redis token cache).
- UserCacheTTL 30s → 5min (in-memory User cache, per pod).
- UserCache gains a 5,000-entry LRU cap so a flood of unique users
  can't blow up pod RSS. ~5MB worst-case per pod.
- Token + user lookup collapsed from 2 GORM Preload queries into a
  single INNER JOIN. Saves 1 RTT per cold-cache request.
- Auth middleware's m.db.* now use db.WithContext(ctx) so the SQL
  spans nest under the parent HTTP request in Jaeger.

Service layer:
- TaskService.ListTasks: replaced two-step
  FindResidenceIDsByUser → GetKanbanDataForMultipleResidences
  with a single GetKanbanDataForUser that uses a Postgres subquery
  for residence-access. One round-trip instead of two.
- New CacheService residence-IDs cache: \"residence_ids_user:<id>\"
  with 5-min TTL. Wired into Task/Residence/Contractor/Document
  services for the four hot read paths that need this list.
- Cache invalidation on every relevant mutation: CreateResidence,
  DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence
  invalidates every member of the residence, not just the owner.

What this stacks up to (Hetzner→Neon, before US migration):
  Path                                 Before        After (target)
  Cache-warm authed read               ~800ms        ~100-200ms
  Cache-cold authed read (1st in 1hr)  ~2500ms       ~500-700ms
  First request after deploy           ~2500ms       ~700-900ms

The endgame US-region migration on top of this gets us to ~30-50ms
warm-cache, but we're shippable at ~150ms warm right now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:13:50 -05:00
Trey t d9b5f85c3d Thread ctx through auth middleware DB lookups
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The auth middleware's m.db.Preload + m.db.First calls were running without
ctx, so on cache miss the resulting SQL queries appeared as orphan
gorm.Query / gorm.Row spans in Jaeger. Now they nest under the parent
HTTP request span like every other repo call.

This was the last orphaned-SQL source on the request hot path. Combined
with the seven service migrations, every authenticated API call now
produces a fully-nested flame graph: HTTP → auth-token-lookup (cache hit)
or HTTP → auth-token-SQL (cache miss) → service → service-SQL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 16:36:47 -05:00
Trey t e881d37de0 Migrate Auth/Contractor/Document/Notification/Subscription services to ctx
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Every public method on these five services now takes ctx context.Context as
the first arg and routes its repo calls through .WithContext(ctx). With
TaskService and ResidenceService already migrated, this means every
in-process service that touches Postgres now produces a flame graph in
Jaeger where the SQL spans nest under the parent HTTP request span.

Endpoints now fully traced (HTTP → service → SQL):
- /api/auth/login, /register, /logout, /me, /verify-email, /resend-verification
- /api/auth/forgot-password, /verify-reset, /reset-password, /update-profile
- /api/contractors/* (CRUD + favorite + by-residence + tasks)
- /api/documents/* (CRUD + activate/deactivate + image upload/delete)
- /api/notifications/* (list, count, mark-read, prefs, devices)
- /api/subscription/* (status, purchase, cancel, triggers, promotions)
- All previously-migrated /api/tasks/* and /api/residences/* paths

Internal helpers also threaded:
- TaskService.sendTaskCompletedNotification → forwards ctx
- TaskService.UpdateUserTimezone → forwards ctx to NotificationService
- ResidenceService.CreateResidence → forwards ctx to SubscriptionService.CheckLimit
- NotificationService.registerAPNSDevice / registerGCMDevice → both take ctx

~75 method signatures, ~120 handler/test call sites updated. Tests pass
green; the only failure is the pre-existing flaky TaskHandler_QuickComplete
SQLite race that fails ~60% of runs on master.

Step 3 of the observability plan is now genuinely complete: every API
endpoint backed by a Go service emits a per-request flame graph with
HTTP → service → SQL spans, plus B2/APNs/FCM/asynq spans where applicable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 16:26:21 -05:00
Trey t 65a9aae4e5 Migrate TaskService + ResidenceService to ctx-aware repos
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Every public method on TaskService and ResidenceService now takes
ctx context.Context as the first arg and routes its repo calls through
.WithContext(ctx). With otelgorm registered, this means every API
endpoint backed by these two services produces a flame graph in Jaeger
where the SQL spans nest under the parent HTTP request span — instead
of appearing as orphaned queries.

Endpoints now fully traced (HTTP → service → SQL):
- GET    /api/tasks/                       (already shipped)
- GET    /api/tasks/by-residence/:id/      (already shipped)
- GET    /api/tasks/:id/
- POST   /api/tasks/
- POST   /api/tasks/bulk/
- PUT    /api/tasks/:id/
- DELETE /api/tasks/:id/
- POST   /api/tasks/:id/in-progress/
- POST   /api/tasks/:id/cancel/
- POST   /api/tasks/:id/uncancel/
- POST   /api/tasks/:id/archive/
- POST   /api/tasks/:id/unarchive/
- POST   /api/tasks/:id/complete/
- POST   /api/tasks/:id/quick-complete/
- GET    /api/tasks/completions/* (CRUD)
- GET    /api/static_data/ (categories, priorities, frequencies)
- GET    /api/residences/
- GET    /api/residences/my/
- GET    /api/residences/summary/
- GET    /api/residences/:id/
- POST   /api/residences/
- PUT    /api/residences/:id/
- DELETE /api/residences/:id/
- Share-code + member management endpoints
- GET    /api/residences/:id/report/

Mechanical work: ~50 method signatures, ~80 handler call sites,
~25 test call sites updated. Internal sendTaskCompletedNotification
helper also takes ctx so background notification SQL nests correctly.

The remaining services (ContractorService, DocumentService,
AuthService, NotificationService, SubscriptionService) follow the same
pattern; they continue to emit untraced SQL until migrated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 16:04:01 -05:00
Trey t 3f5bf21e09 tracing: bump semconv to v1.40.0 to match runtime resource schema
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Pods crashed at startup with "build resource: conflicting Schema URL:
https://opentelemetry.io/schemas/1.40.0 and https://opentelemetry.io/schemas/1.27.0"
because resource.Default() in the SDK targets v1.40.0. Aligning here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:35:46 -05:00
Trey t bc3da007db Wire OpenTelemetry tracing — HTTP, B2, APNs, FCM, asynq, GORM (partial)
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Step 1 — OTel SDK: cmd/api and cmd/worker initialize a tracer provider
that exports OTLP/HTTP to obs.88oakapps.com (Jaeger all-in-one). Sampling
is AlwaysSample in dev (DEBUG=true) and TraceIDRatioBased(0.1) in prod,
overridable via OTEL_TRACES_SAMPLER_ARG. Service names are honeydue-api
and honeydue-worker. otelecho.Middleware opens a span per HTTP request.

Step 2 — Manual spans: storage_service.Upload now takes ctx and emits
storage.upload + b2.PutObject spans (size_bytes, key, mime_type, bucket,
result attrs). APNs Send/SendWithCategory and FCM sendOne emit per-token
spans with topic, status_code, reason. Asynq middleware emits
asynq.handle:<task_type> per job with retry/payload attrs and records
asynq_job_duration_seconds.

Step 3 — Database: otelgorm plugin registered in database.Connect, so
any SQL emitted via db.WithContext(ctx) attaches to the request span.
Every repository now exposes WithContext(ctx) *XRepository as the
migration helper. TaskService.ListTasks and GetTasksByResidence are
migrated end-to-end (ctx threaded through handler → service → repo);
remaining services adopt the same pattern incrementally — pre-migration
methods still emit untraced SQL via the unchanged db field.

OBS_TRACES_URL and OBS_INGEST_TOKEN flow from deploy/prod.env →
honeydue-secrets → api+worker Deployments via secretKeyRef (optional).
02-setup-secrets.sh sources them from prod.env on next run; manifests
mark both env vars optional so the deployment rolls without traces if
the secret is absent.

ch15 observability doc now lists what produces spans today vs the
remaining migration work, with the explicit per-method pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:28:05 -05:00
Trey t d3708e6c72 Fix /metrics double-gzip + deploy script for amd64 build
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The Echo gzip middleware was wrapping promhttp's pre-gzipped output, so
vmagent received double-compressed bytes that failed the Prometheus
parser with binary garbage. Skipping /metrics in the gzip Skipper.

Three deploy-script fixes uncovered while shipping this:
- _config.sh had backticks around \"kubectl get cm\" inside the python
  heredoc, which bash treated as command substitution when KUBECONFIG
  was set. Quoted the literal instead.
- 03-deploy.sh now passes --platform linux/amd64 to all docker builds
  so arm64 Macs don't push images that fail with \"exec format error\"
  on the Hetzner CX nodes.
- OBS_INGEST_TOKEN lookup was reading deploy-k3s/prod.env instead of
  the actual deploy/prod.env at the repo root.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:42:15 -05:00
Trey t df78d9ccd8 Add Prometheus metrics + vmagent push to obs.88oakapps.com
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and
FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via
statement-level callbacks (no ctx plumbing needed). Storage and push
clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the
network round-trip points.

Existing internal/monitoring metrics move to /metrics/legacy so the
canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups.

deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica
vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics
every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write
with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN
in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods
and to the public obs endpoint over :443; the obs side runs
VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate.

docs/observability-plan.md captures the full plan including resource
budget, instrumentation table, 4-step rollout, and migration triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:16:17 -05:00
Trey t 6f303dbbaa Migrate prod deploy from Swarm to K3s; add full deployment book
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00
Trey T 4ec4bbbfe8 Auto-seed lookups + admin + templates on first API boot
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Add a data_migration that runs seeds/001_lookups.sql,
seeds/003_admin_user.sql, and seeds/003_task_templates.sql exactly
once on startup and invalidates the Redis seeded_data cache afterwards
so /api/static_data/ returns fresh results. Removes the need to
remember `./dev.sh seed-all`; the data_migrations tracking row prevents
re-runs, and each INSERT uses ON CONFLICT DO UPDATE so re-execution is
safe.
2026-04-15 08:37:55 -05:00
Trey t 237c6b84ee Onboarding: template backlink, bulk-create endpoint, climate-region scoring
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Clients that send users through a multi-task onboarding step no longer
loop N POST /api/tasks/ calls and no longer create "orphan" tasks with
no reference to the TaskTemplate they came from.

Task model
- New task_template_id column + GORM FK (migration 000016)
- CreateTaskRequest.template_id, TaskResponse.template_id
- task_service.CreateTask persists the backlink

Bulk endpoint
- POST /api/tasks/bulk/ — 1-50 tasks in a single transaction,
  returns every created row + TotalSummary. Single residence access
  check, per-entry residence_id is overridden with batch value
- task_handler.BulkCreateTasks + task_service.BulkCreateTasks using
  db.Transaction; task_repo.CreateTx + FindByIDTx helpers

Climate-region scoring
- templateConditions gains ClimateRegionID; suggestion_service scores
  residence.PostalCode -> ZipToState -> GetClimateRegionIDByState against
  the template's conditions JSON (no penalty on mismatch / unknown ZIP)
- regionMatchBonus 0.35, totalProfileFields 14 -> 15
- Standalone GET /api/tasks/templates/by-region/ removed; legacy
  task_tasktemplate_regions many-to-many dropped (migration 000017).
  Region affinity now lives entirely in the template's conditions JSON

Tests
- +11 cases across task_service_test, task_handler_test, suggestion_
  service_test: template_id persistence, bulk rollback + cap + auth,
  region match / mismatch / no-ZIP / unknown-ZIP / stacks-with-others

Docs
- docs/openapi.yaml: /tasks/bulk/ + BulkCreateTasks schemas, template_id
  on TaskResponse + CreateTaskRequest, /templates/by-region/ removed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:23:57 -05:00
Trey t 33eee812b6 Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run
Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service

Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy

Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
  AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown

Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
  multi-arch recipe, connection-pool tuning, renumbered sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:22:43 -05:00
Trey t ca818e8478 Merge branch 'master' of github.com:akatreyt/MyCribAPI_GO
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
2026-04-01 20:45:43 -05:00
Trey T bec880886b Coverage priorities 1-5: test pure functions, extract interfaces, mock-based handler tests
- Priority 1: Test NewSendEmailTask + NewSendPushTask (5 tests)
- Priority 2: Test customHTTPErrorHandler — all 15+ branches (21 tests)
- Priority 3: Extract Enqueuer interface + payload builders in worker pkg (5 tests)
- Priority 4: Extract ClassifyFile/ComputeRelPath in migrate-encrypt (6 tests)
- Priority 5: Define Handler interfaces, refactor to accept them, mock-based tests (14 tests)
- Fix .gitignore: /worker instead of worker to stop ignoring internal/worker/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-01 20:30:09 -05:00
Trey t 2e10822e5a Add S3-compatible storage backend (B2, MinIO, AWS S3)
Introduces a StorageBackend interface with local filesystem and S3
implementations. The StorageService delegates raw I/O to the backend
while keeping validation, encryption, and URL generation unchanged.

Backend selection is config-driven: set B2_ENDPOINT + B2_KEY_ID +
B2_APP_KEY + B2_BUCKET_NAME for S3 mode, or STORAGE_UPLOAD_DIR for
local mode. STORAGE_USE_SSL=false for in-cluster MinIO (HTTP).

All existing tests pass unchanged — the local backend preserves
identical behavior to the previous direct-filesystem implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 21:31:24 -05:00
Trey T 00fd674b56 Remove dead climate region code from suggestion engine
Suggestion engine now purely uses home profile features (heating,
cooling, pool, etc.) for template matching. Climate region field
and matching block removed — ZIP code is no longer collected.
2026-03-30 11:19:04 -05:00
Trey T cb7080c460 Smart onboarding: residence home profile + suggestion engine
14 new optional residence fields (heating, cooling, water heater, roof,
pool, sprinkler, septic, fireplace, garage, basement, attic, exterior,
flooring, landscaping) with JSONB conditions on templates.

Suggestion engine scores templates against home profile: string match
+0.25, bool +0.3, property type +0.15, universal base 0.3. Graceful
degradation from minimal to full profile info.

GET /api/tasks/suggestions/?residence_id=X returns ranked templates.
54 template conditions across 44 templates in seed data.
8 suggestion service tests.
2026-03-30 09:02:03 -05:00
Trey T 4c9a818bd9 Comprehensive TDD test suite for task logic — ~80 new tests
Predicates (20 cases): IsRecurring, IsOneTime, IsDueSoon,
HasCompletions, GetCompletionCount, IsUpcoming edge cases

Task creation (10): NextDueDate initialization, all frequency types,
past dates, all optional fields, access validation

One-time completion (8): NextDueDate→nil, InProgress reset,
notes/cost/rating, double completion, backdated completed_at

Recurring completion (16): Daily/Weekly/BiWeekly/Monthly/Quarterly/
Yearly/Custom frequencies, late/early completion timing, multiple
sequential completions, no-original-DueDate, CompletedFromColumn capture

QuickComplete (5): one-time, recurring, widget notes, 404, 403

State transitions (10): Cancel→Complete, Archive→Complete, InProgress
cycles, recurring full lifecycle, Archive→Unarchive column restore

Kanban column priority (7): verify chain priority order for all columns

Optimistic locking (7): correct/stale version, conflict on complete/
cancel/archive/mark-in-progress, rollback verification

Deletion (5): single/multi/middle completion deletion, NextDueDate
recalculation, InProgress restore behavior documented

Edge cases (9): boundary dates, late/early recurring, nil/zero frequency
days, custom intervals, version conflicts

Handler validation (4): rating bounds, title/description length,
custom interval validation

All 679 tests pass.
2026-03-26 17:36:50 -05:00
Trey T 7f0300cc95 Add custom_interval_days to TaskResponse DTO
Field existed in Task model but was missing from API response.
Aligns Go API contract with KMM mobile model.
2026-03-26 17:06:34 -05:00
Trey T 6df27f203b Add rate limit response headers (X-RateLimit-*, Retry-After)
Custom rate limiter replacing Echo built-in, with per-IP token bucket.
Every response includes X-RateLimit-Limit, Remaining, Reset headers.
429 responses additionally include Retry-After (seconds).
CORS updated to expose rate limit headers to mobile clients.
4 unit tests for header behavior and per-IP isolation.
2026-03-26 14:36:48 -05:00
Trey T b679f28e55 Production hardening: security, resilience, observability, and compliance
Password complexity: custom validator requiring uppercase, lowercase, digit (min 8 chars)
Token expiry: 90-day token lifetime with refresh endpoint (60-90 day renewal window)
Health check: /api/health/ now pings Postgres + Redis, returns 503 on failure
Audit logging: async audit_log table for auth events (login, register, delete, etc.)
Circuit breaker: APNs/FCM push sends wrapped with 5-failure threshold, 30s recovery
FK indexes: 27 missing foreign key indexes across all tables (migration 017)
CSP header: default-src 'none'; frame-ancestors 'none'
Gzip compression: level 5 with media endpoint skipper
Prometheus metrics: /metrics endpoint using existing monitoring service
External timeouts: 15s push, 30s SMTP, context timeouts on all external calls

Migrations: 016 (token created_at), 017 (FK indexes), 018 (audit_log)
Tests: circuit breaker (15), audit service (8), token refresh (7), health (4),
       middleware expiry (5), validator (new)
2026-03-26 14:05:28 -05:00
Trey T 4abc57535e Add delete account endpoint and file encryption at rest
Delete Account (Plan #2):
- DELETE /api/auth/account/ with password or "DELETE" confirmation
- Cascade delete across 15+ tables in correct FK order
- Auth provider detection (email/apple/google) for /auth/me/
- File cleanup after account deletion
- Handler + repository tests (12 tests)

Encryption at Rest (Plan #3):
- AES-256-GCM envelope encryption (per-file DEK wrapped by KEK)
- Encrypt on upload, auto-decrypt on serve via StorageService.ReadFile()
- MediaHandler serves decrypted files via c.Blob()
- TaskService email image loading uses ReadFile()
- cmd/migrate-encrypt CLI tool with --dry-run for existing files
- Encryption service + storage service tests (18 tests)
2026-03-26 10:41:01 -05:00
Trey T 72866e935e Disable auth rate limiters in debug mode for UI test suites
Rate limiters on login/register/password-reset endpoints cause 429 errors
when running parallel UI tests that create many accounts. In debug mode,
skip rate limiters entirely so test suites can run without throttling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 15:06:18 -05:00
Trey t 42a5533a56 Fix 113 hardening issues across entire Go backend
Security:
- Replace all binding: tags with validate: + c.Validate() in admin handlers
- Add rate limiting to auth endpoints (login, register, password reset)
- Add security headers (HSTS, XSS protection, nosniff, frame options)
- Wire Google Pub/Sub token verification into webhook handler
- Replace ParseUnverified with proper OIDC/JWKS key verification
- Verify inner Apple JWS signatures in webhook handler
- Add io.LimitReader (1MB) to all webhook body reads
- Add ownership verification to file deletion
- Move hardcoded admin credentials to env vars
- Add uniqueIndex to User.Email
- Hide ConfirmationCode from JSON serialization
- Mask confirmation codes in admin responses
- Use http.DetectContentType for upload validation
- Fix path traversal in storage service
- Replace os.Getenv with Viper in stripe service
- Sanitize Redis URLs before logging
- Separate DEBUG_FIXED_CODES from DEBUG flag
- Reject weak SECRET_KEY in production
- Add host check on /_next/* proxy routes
- Use explicit localhost CORS origins in debug mode
- Replace err.Error() with generic messages in all admin error responses

Critical fixes:
- Rewrite FCM to HTTP v1 API with OAuth 2.0 service account auth
- Fix user_customuser -> auth_user table names in raw SQL
- Fix dashboard verified query to use UserProfile model
- Add escapeLikeWildcards() to prevent SQL wildcard injection

Bug fixes:
- Add bounds checks for days/expiring_soon query params (1-3650)
- Add receipt_data/transaction_id empty-check to RestoreSubscription
- Change Active bool -> *bool in device handler
- Check all unchecked GORM/FindByIDWithProfile errors
- Add validation for notification hour fields (0-23)
- Add max=10000 validation on task description updates

Transactions & data integrity:
- Wrap registration flow in transaction
- Wrap QuickComplete in transaction
- Move image creation inside completion transaction
- Wrap SetSpecialties in transaction
- Wrap GetOrCreateToken in transaction
- Wrap completion+image deletion in transaction

Performance:
- Batch completion summaries (2 queries vs 2N)
- Reuse single http.Client in IAP validation
- Cache dashboard counts (30s TTL)
- Batch COUNT queries in admin user list
- Add Limit(500) to document queries
- Add reminder_stage+due_date filters to reminder queries
- Parse AllowedTypes once at init
- In-memory user cache in auth middleware (30s TTL)
- Timezone change detection cache
- Optimize P95 with per-endpoint sorted buffers
- Replace crypto/md5 with hash/fnv for ETags

Code quality:
- Add sync.Once to all monitoring Stop()/Close() methods
- Replace 8 fmt.Printf with zerolog in auth service
- Log previously discarded errors
- Standardize delete response shapes
- Route hardcoded English through i18n
- Remove FileURL from DocumentResponse (keep MediaURL only)
- Thread user timezone through kanban board responses
- Initialize empty slices to prevent null JSON
- Extract shared field map for task Update/UpdateTx
- Delete unused SoftDeleteModel, min(), formatCron, legacy handlers

Worker & jobs:
- Wire Asynq email infrastructure into worker
- Register HandleReminderLogCleanup with daily 3AM cron
- Use per-user timezone in HandleSmartReminder
- Replace direct DB queries with repository calls
- Delete legacy reminder handlers (~200 lines)
- Delete unused task type constants

Dependencies:
- Replace archived jung-kurt/gofpdf with go-pdf/fpdf
- Replace unmaintained gomail.v2 with wneessen/go-mail
- Add TODO for Echo jwt v3 transitive dep removal

Test infrastructure:
- Fix MakeRequest/SeedLookupData error handling
- Replace os.Exit(0) with t.Skip() in scope/consistency tests
- Add 11 new FCM v1 tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 23:14:13 -05:00
Trey t 3b86d0aae1 Include completion_summary in my-residences list endpoint
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 00:14:24 -05:00
Trey t 6803f6ec18 Add honeycomb completion heatmap and data migration framework
- Add completion_summary endpoint data to residence detail response
- Track completed_from_column on task completions (overdue/due_soon/upcoming)
- Add GetCompletionSummary repo method with monthly aggregation
- Add one-time data migration framework (data_migrations table + registry)
- Add backfill migration to classify historical completions
- Add standalone backfill script for manual/dry-run usage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 00:05:10 -05:00
Trey t 739b245ee6 Fix PDF report UTF-8 encoding for residence names and task fields
Add UnicodeTranslatorFromDescriptor to convert UTF-8 strings to
Windows-1252 for gofpdf built-in fonts. Prevents garbled characters
in residence names, task titles, categories, priorities, and statuses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 11:23:44 -05:00
Trey t 7bd2cbabe9 Fix broken email icon by updating old domain references to myhoneydue.com
The email icon URL was pointing to honeyDue.treytartt.com which now returns 404.
Updated to api.myhoneydue.com along with BASE_URL, FROM_EMAIL, and CORS defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 13:38:55 -06:00
Trey t bf309f5ff9 Move admin dashboard to admin.myhoneydue.com subdomain
- Remove Next.js basePath "/admin" — admin now serves at root
- Update all internal links from /admin/xxx to /xxx
- Change Go proxy to host-based routing: admin subdomain requests
  proxy to Next.js, /admin/* redirects to main web app
- Update timeout middleware skipper for admin subdomain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 12:35:31 -06:00
Trey t 1fdc29af1c Add admin subdomain redirect for admin.myhoneydue.com
When ADMIN_HOST is set, redirects root "/" to "/admin/" so
admin.myhoneydue.com works without needing the /admin path suffix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 12:25:26 -06:00
Trey t 4976eafc6c Rebrand from Casera/MyCrib to honeyDue
Total rebrand across all Go API source files:
- Go module path: casera-api -> honeydue-api
- All imports updated (130+ files)
- Docker: containers, images, networks renamed
- Email templates: support email, noreply, icon URL
- Domains: casera.app/mycrib.treytartt.com -> honeyDue.treytartt.com
- Bundle IDs: com.tt.casera -> com.tt.honeyDue
- IAP product IDs updated
- Landing page, admin panel, config defaults
- Seeds, CI workflows, Makefile, docs
- Database table names preserved (no migration needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 06:33:38 -06:00
Trey t 793e50ce52 Add regional task templates API with climate zone lookup
Adds a new endpoint GET /api/tasks/templates/by-region/?zip= that resolves
ZIP codes to IECC climate regions and returns relevant home maintenance
task templates. Includes climate region model, region lookup service with
tests, seed data for all 8 climate zones with 50+ templates, and OpenAPI spec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 15:15:30 -06:00
Trey t 72db9050f8 Add Stripe billing, free trials, and cross-platform subscription guards
- Stripe integration: add StripeService with checkout sessions, customer
  portal, and webhook handling for subscription lifecycle events.
- Free trials: auto-start configurable trial on first subscription check,
  with admin-controllable duration and enable/disable toggle.
- Cross-platform guard: prevent duplicate subscriptions across iOS, Android,
  and Stripe by checking existing platform before allowing purchase.
- Subscription model: add Stripe fields (customer_id, subscription_id,
  price_id), trial fields (trial_start, trial_end, trial_used), and
  SubscriptionSource/IsTrialActive helpers.
- API: add trial and source fields to status response, update OpenAPI spec.
- Clean up stale migration and audit docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 11:36:14 -06:00
Trey t d5bb123cd0 Redesign email templates to match web landing page Warm Sage design system
Rewrote all 11 email templates to use the Casera web brand: Outfit font via
Google Fonts, sage green (#6B8F71) brand stripe, cream (#FAFAF7) background,
pill-shaped clay (#C4856A) CTA buttons, icon-badge feature cards, numbered
tip cards, linen callout boxes, and refined light footer. Extracted reusable
helpers (emailButton, emailCodeBox, emailCalloutBox, emailAlertBox,
emailFeatureItem, emailTipCard) for consistent component composition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 20:02:41 -06:00
Trey t 7438dfd9b1 Fix timeout middleware panic on proxy/WebSocket routes and worker healthcheck
The TimeoutMiddleware wraps the response writer in *http.timeoutWriter which
doesn't implement http.Flusher. When the admin reverse proxy or WebSocket
upgrader tries to flush, it panics and crashes the container (502 Bad Gateway).
Skip timeout for /admin, /_next, and /ws routes.

Also fix the Dockerfile HEALTHCHECK to detect the worker process — the worker
has no HTTP server so the curl-based check always failed, marking it unhealthy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 19:56:12 -06:00