Trey t 1347ffadf5
docs: presigned-URL upload flow + B2 lifecycle setup
09-storage.md:
  - Replaced the "Upload flow" section. The previous text described the
    multipart-via-API path that was removed in b7f8329. Now documents
    the three-step direct-to-B2 flow (presign → POST to B2 → attach
    via upload_ids[]) with an ASCII diagram and a server-side
    enforcement-points table.
  - Replaced the "Future: signed URLs" placeholder (since presigned
    URLs are now the present, not the future).
  - Added "Lifecycle and retention" subsections covering the
    pending_uploads cleanup cron (worker, 30 * * * *), the B2 bucket
    lifecycle as backstop (uploads/ prefix, 7-day hide + 1-day delete),
    and the still-open user-deletion cascade gap.

14-deployment-process.md:
  - Added a "One-time B2 bucket lifecycle (manual)" section explaining
    why the rule can't live in the deploy script (B2's S3 lifecycle
    API is partial), the exact rule to apply via the Backblaze
    console, and a verification command.

docs/deployment/README.md:
  - Updated the chapter 9 description to mention presigned-URL uploads.

README.md (root):
  - Added a paragraph under "Object storage" pointing to the new
    upload architecture and the relevant deployment-book chapters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:44:08 -07:00


09 — Object Storage (Backblaze B2)

Summary

User-uploaded files (photos, documents, task completion attachments) go to Backblaze B2 via its S3-compatible API. The Go API uses minio-go/v7 as the client. This works around a Swarm-era problem where named volumes are per-node — uploads on node A were invisible to replicas on B and C. With k3s we could use a shared PVC instead, but B2 is cheaper, offsite, and already set up.

Why Backblaze B2

Decision matrix

| Option | Price per TB stored | Egress | Pros | Cons |
|---|---|---|---|---|
| Backblaze B2 | $6/mo | $0.01/GB, free via CF | Cheap, hard spending caps, S3-compatible | US-West/East regions only (not EU) |
| AWS S3 Standard | $23/mo | $0.09/GB | Most ubiquitous | Expensive |
| Cloudflare R2 | $15/mo | Free (!) | Zero egress, CF-native | Newer, fewer features |
| DigitalOcean Spaces | $5/mo for 250GB + $0.01/GB | Free 1TB, $0.01/GB after | Simple | Less reliable than AWS |
| Local PVC on k3s | $0 | $0 | Already in cluster | Per-node, no HA, no offsite |

B2 won because:

  1. Hard spending cap — unique in the industry. No surprise AWS bill.
  2. Cheapest at rest, at 3-4× cheaper than S3 ($6 vs. $23 per TB per month).
  3. Free egress through Cloudflare — we already use CF; when we eventually serve upload URLs through CF, egress is free.
  4. Mature S3-compatible API — minio-go talks to it natively.

Rejected:

  • R2 was the close second. Zero egress is amazing. Rejected primarily for inertia (B2 already set up in the MyCrib era). A future migration to R2 would be reasonable.
  • Local PVC doesn't work for our setup because we want uploads durable and accessible from any node/replica.

Configuration

Bucket: honeyDueProd (mixed case; B2 allows this, minio-go handles it via path-style addressing — see §path-style below).

Region: us-east-005 (B2's South Carolina region — closer to our Neon DB in AWS us-east-1 than the West Coast options).

Endpoint: s3.us-east-005.backblazeb2.com

Environment variables

From ConfigMap:

| Var | Value |
|---|---|
| B2_ENDPOINT | s3.us-east-005.backblazeb2.com |
| B2_BUCKET_NAME | honeyDueProd |
| B2_REGION | us-east-005 |
| B2_USE_SSL | true (but see §vestigial var below) |

From Secret:

| Var | Value |
|---|---|
| B2_KEY_ID | App key ID (B2-specific identifier) |
| B2_APP_KEY | App key secret |

App key scope

The B2 app key is bucket-scoped, not account-scoped. Can only read/write the honeyDueProd bucket. Cannot:

  • List other buckets
  • Delete the bucket
  • Create new buckets
  • Touch account settings

This is the B2 equivalent of an IAM role with least privilege. If the key leaks, the damage is limited to the honeyDueProd bucket.

The minio-go client

The Go app uses github.com/minio/minio-go/v7 — a Go SDK compatible with any S3-flavored API. Relevant code at internal/services/storage_backend_s3.go:

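client, err := minio.New(endpoint, &minio.Options{
    // Static credentials from B2_KEY_ID / B2_APP_KEY; the empty string is the (unused) session token.
    Creds:  credentials.NewStaticV4(keyID, appKey, ""),
    // true selects HTTPS; see the vestigial-variable section for how this value is derived.
    Secure: useSSL,
    Region: region,
})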

Path-style vs virtual-hosted addressing

S3's URL scheme has two flavors:

  • Virtual-hosted: https://mybucket.s3.amazonaws.com/mykey
  • Path-style: https://s3.amazonaws.com/mybucket/mykey

With virtual-hosted style, the bucket name becomes part of the hostname, so it must be DNS-compatible, which rules out uppercase letters. honeyDueProd fails this.

With path-style, the bucket name is just a URL path segment — any valid string works.

minio-go auto-detects: for AWS S3 it prefers virtual-hosted; for non-AWS endpoints (like B2) it defaults to path-style. So honeyDueProd with capital letters works transparently.
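The difference between the two addressing styles can be illustrated with a small stdlib-only sketch. This is not minio-go's actual lookup logic; the objectURL helper, the regular expression (an approximation of the S3 naming rule), and the honeydue-prod bucket name are assumptions made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// dnsCompatible approximates the S3 rule for virtual-hosted bucket names:
// 3-63 characters, lowercase letters, digits, dots, and hyphens, starting
// and ending with a letter or digit.
var dnsCompatible = regexp.MustCompile(`^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$`)

// objectURL (hypothetical helper) builds the URL for key in bucket. It uses
// the virtual-hosted form only when requested and when the bucket name is
// DNS-compatible; otherwise it falls back to path-style.
func objectURL(endpoint, bucket, key string, preferVirtualHosted bool) string {
	if preferVirtualHosted && dnsCompatible.MatchString(bucket) {
		return fmt.Sprintf("https://%s.%s/%s", bucket, endpoint, key)
	}
	return fmt.Sprintf("https://%s/%s/%s", endpoint, bucket, key)
}

func main() {
	endpoint := "s3.us-east-005.backblazeb2.com"
	// Mixed-case name: only path-style is possible.
	fmt.Println(objectURL(endpoint, "honeyDueProd", "users/42/profile/a.jpg", true))
	// Hypothetical lowercase name: virtual-hosted works.
	fmt.Println(objectURL(endpoint, "honeydue-prod", "users/42/profile/a.jpg", true))
}
```

The first call falls back to path-style because of the capital letters in honeyDueProd, which is the situation the production bucket is in.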

The B2_USE_SSL vestigial variable

prod.env has B2_USE_SSL=true. But the Go app's internal/config/config.go:295 reads the env var STORAGE_USE_SSL, not B2_USE_SSL:

S3UseSSL: viper.GetString("STORAGE_USE_SSL") == "" || viper.GetBool("STORAGE_USE_SSL"),

Whoever wrote the original config used B2_USE_SSL in prod.env and STORAGE_USE_SSL in the code. They don't match.

Net effect: the app reads STORAGE_USE_SSL, which is unset, so the first operand (GetString(...) == "") is true and the whole expression evaluates to true. SSL is always on, whatever value B2_USE_SSL holds.

This is a dormant bug. Anyone setting B2_USE_SSL=false expecting to disable TLS would be surprised it stays on. Fortunately that's the right default for production B2 (which only accepts HTTPS anyway).
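The behavior can be reproduced without viper. The sketch below is an approximation that takes the environment as a map and uses strconv.ParseBool in place of viper.GetBool; the useSSL function name is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// useSSL mirrors the shape of the config expression: an unset or empty
// STORAGE_USE_SSL yields true; otherwise the value is parsed as a boolean
// (unparseable values count as false). B2_USE_SSL is never consulted.
func useSSL(env map[string]string) bool {
	raw := env["STORAGE_USE_SSL"]
	if raw == "" {
		return true
	}
	enabled, err := strconv.ParseBool(raw)
	return err == nil && enabled
}

func main() {
	// Setting the variable from prod.env has no effect.
	fmt.Println(useSSL(map[string]string{"B2_USE_SSL": "false"})) // true
	// Only the variable the code actually reads can turn TLS off.
	fmt.Println(useSSL(map[string]string{"STORAGE_USE_SSL": "false"})) // false
}
```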

TODO: Rename STORAGE_USE_SSL → B2_USE_SSL in the Go code to match the config. Documented in Chapter 19 §Vestigial config.

What we store there

Today (limited rollout):

  • User profile photos
  • Task completion photos
  • Document uploads (PDFs, images attached to records)

File keys follow a hierarchy like:

users/<user_id>/profile/<uuid>.jpg
residences/<residence_id>/documents/<uuid>.pdf
tasks/<task_id>/completions/<uuid>.jpg

Max file size is 10 MB per upload (STORAGE_MAX_FILE_SIZE=10485760, i.e. 10 MiB). Allowed MIME types: image/jpeg, image/png, image/gif, image/webp, application/pdf (STORAGE_ALLOWED_TYPES).
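A minimal sketch of how these two limits could be checked before accepting an upload is shown below. The validateUpload function and its error messages are hypothetical, and the sketch uses one global allowlist rather than a per-category one.

```go
package main

import (
	"errors"
	"fmt"
)

// maxFileSize matches STORAGE_MAX_FILE_SIZE from the text.
const maxFileSize = 10485760

// allowedTypes matches the STORAGE_ALLOWED_TYPES list from the text.
var allowedTypes = map[string]bool{
	"image/jpeg":      true,
	"image/png":       true,
	"image/gif":       true,
	"image/webp":      true,
	"application/pdf": true,
}

// validateUpload (hypothetical) rejects disallowed MIME types and sizes
// that are non-positive or above the cap.
func validateUpload(contentType string, contentLength int64) error {
	if !allowedTypes[contentType] {
		return fmt.Errorf("content type %q is not allowed", contentType)
	}
	if contentLength <= 0 {
		return errors.New("content length must be positive")
	}
	if contentLength > maxFileSize {
		return fmt.Errorf("content length %d exceeds the %d-byte cap", contentLength, maxFileSize)
	}
	return nil
}

func main() {
	fmt.Println(validateUpload("image/png", 2_000_000))
	fmt.Println(validateUpload("text/html", 2_000_000))
	fmt.Println(validateUpload("application/pdf", 20_000_000))
}
```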

Access control

Upload flow (current — direct-to-B2 with presigned POST)

Image and document uploads go directly from the client to B2. The api server only signs a short-lived POST policy; the bytes never traverse our cluster. This is the WhatsApp / Slack architecture, and it keeps the api from becoming a proxy bottleneck.

  1. Client POSTs to /api/uploads/presign with {category, content_type, content_length}.
  2. api validates auth, the per-user quota (10 concurrent in-flight, 50/hour rate limit), the allowed MIME types, and the 10 MB cap. On success it creates a pending_uploads row, signs a B2 POST policy with a content-length-range condition bound to the claimed length ±256 bytes, and returns {id, upload_url, fields, key, expires_at}.
  3. Client multipart-POSTs the bytes directly to B2 using the returned fields. B2 enforces the size cap at the protocol level, so clients can't bypass it by lying about Content-Length.
  4. Client POSTs to the entity-creation endpoint (/api/task-completions/, /api/documents/) with upload_ids: [id]. The service HEADs each B2 object, verifies that its size matches expected_bytes, sets pending_uploads.claimed_at, and writes the task_completion_image / document_image row referencing the upload.

The signed URL is valid for 15 minutes; presigns are not reusable.

The B2 bucket stays private — only the api ever holds the key material. Clients can't list or GET directly without a presign.

┌──────────┐   1) presign        ┌────────┐
│  client  │ ──────────────────► │  api   │
│          │ ◄────────────────── │        │  POST policy + key
│          │                     └────────┘
│          │                                   row in
│          │                          pending_uploads
│          │                          (claimed_at NULL)
│          │   2) POST bytes      ┌────────┐
│          │ ──────────────────►  │   B2   │  enforces policy
│          │ ◄────────────────── │        │
│          │                     └────────┘
│          │   3) attach          ┌────────┐
│          │ ──────────────────►  │  api   │  HEAD B2 object,
│          │  upload_ids: [id]    │        │  mark claimed_at,
│          │                     └────────┘  insert image row
└──────────┘

Server-side enforcement summary:

| Check | Where | Reject if |
|---|---|---|
| Auth | api middleware | unauthenticated |
| Mime allowlist | upload_service.go:allowedContentTypes | not in list for category |
| Size cap (10 MB) | api before signing + B2 policy | content_length > 10 MiB |
| Concurrency cap (10) | CountUnclaimedActiveForUser | already 10 unclaimed in-flight |
| Rate limit (50/hr) | Redis sliding window upload:presign:<uid>:<bucket> | 51st presign in the same hour |
| Size at upload time | B2 (signed policy) | bytes outside content-length-range |
| Ownership at attach | FindUnclaimedForUser | upload_id belongs to a different user |
| Bytes match claim | s3.Stat() + bytes comparison | actual size differs from expected ±256 |

Download flow (current)

  1. Client requests /api/media/<key>
  2. Go API checks the user can access this key
  3. Go API fetches from B2 and streams back to the client

This proxies every download through the api. For high-traffic media that's inefficient (api becomes an egress bottleneck) — could be replaced with presigned GET URLs on the same bucket. Not yet shipped; download volume is low enough that the proxy is fine for now.

Lifecycle and retention

Orphan cleanup (pending_uploads)

Every presign creates a row in pending_uploads with expires_at = now + 15 min. If the client never finishes the upload, or finishes but never calls the attach endpoint, the row stays unclaimed. An hourly cron in the worker reaps them:

  • maintenance:upload_cleanup — cron 30 * * * *. Selects unclaimed rows past expires_at, deletes the corresponding B2 object, deletes the row. Up to 500 per tick; the next tick picks up any overflow. Worker logs include reaped count.

The worker constructs a StorageService at startup; if storage init fails (e.g. B2_KEY_ID / B2_APP_KEY not wired into the worker deployment), the cleanup handler logs a warning and no-ops. See deploy-k3s/manifests/worker/deployment.yaml — both B2 secrets are required envs on this pod.

Bucket lifecycle (backstop)

A B2 lifecycle rule on the uploads/ prefix is the safety net if the worker is offline for an extended period:

  • Hide objects 7 days after upload.
  • Delete 1 day after hidden.

This is configured manually via the Backblaze console (B2's S3 lifecycle API isn't fully implemented). See deploy-k3s/manifests/b2-lifecycle.md for the exact rule and the b2 bucket get-info verification command.

User-deletion cascade

When a user deletes their account, the app deletes their task_* / document rows. The associated B2 objects survive — same compliance gap as before, not yet automated. Two approaches:

  • Walk the image rows on user delete and RemoveObject each (simple, synchronous, slow for users with many uploads).
  • Tag objects with a user_id metadata header at upload time, then use a B2 lifecycle rule scoped to a deleted-users prefix.

The first approach is the next item in the upload roadmap.
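A rough sketch of the row-walking approach is shown below. The objectRemover interface is a simplified, hypothetical wrapper (the real minio-go RemoveObject call also takes a context, bucket name, and options), and collecting failed keys for a later retry is an assumption, not a documented design.

```go
package main

import (
	"errors"
	"fmt"
)

// objectRemover is a hypothetical, simplified storage interface.
type objectRemover interface {
	RemoveObject(key string) error
}

// removeUserObjects deletes each key synchronously and reports which keys
// could not be removed so the caller can retry or log them.
func removeUserObjects(store objectRemover, keys []string) (removed int, failed []string) {
	for _, key := range keys {
		if err := store.RemoveObject(key); err != nil {
			failed = append(failed, key)
			continue
		}
		removed++
	}
	return removed, failed
}

// fakeStore is an in-memory objectRemover used only for the demo.
type fakeStore map[string]bool

func (f fakeStore) RemoveObject(key string) error {
	if !f[key] {
		return errors.New("no such object: " + key)
	}
	delete(f, key)
	return nil
}

func main() {
	store := fakeStore{"tasks/1/completions/a.jpg": true}
	keys := []string{"tasks/1/completions/a.jpg", "tasks/1/completions/missing.jpg"}
	removed, failed := removeUserObjects(store, keys)
	fmt.Println(removed, failed)
}
```

One call per object is what makes this approach slow for users with many uploads.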

Backup of B2

We have no backup of B2 objects. B2 itself replicates within the region, but:

  • Accidental deletion via our app = data gone
  • B2 itself being compromised = data gone

B2 offers Object Lock (WORM — write once read many) which prevents deletion for a retention period. Not enabled; revisit if/when user data sensitivity justifies it.

Cost projection

Current usage is small — estimated <50 GB stored.

50 GB × $0.006/GB = $0.30/mo storage
1 GB/mo egress (mostly uncached media served via api) → $0.01 (first
  3× of stored amount is free anyway, so effectively $0)

Total B2 cost: < $1/mo. Hard spending cap set to $20/mo in B2 console — if we ever breach that, something's wrong and we want to know immediately.

At 100k users each uploading ~10 MB average:

  • 1 TB stored = $6/mo
  • Egress depends on access patterns; with signed URLs served through CF the egress could still be ~free

Operator cheat sheet

# List bucket contents (requires mc or aws CLI configured with B2 creds)
mc alias set b2 https://s3.us-east-005.backblazeb2.com <KEY_ID> <APP_KEY>
mc ls b2/honeyDueProd/

# Count objects
mc find b2/honeyDueProd/ --type f | wc -l

# Download an object
mc cp b2/honeyDueProd/<key> ./

# Check B2 console for usage graphs:
#   https://secure.backblaze.com/b2_buckets.htm

To check the B2 configuration a running api pod sees:

# Check the in-cluster client config
kubectl exec -n honeydue deploy/api -- env | grep B2_

References