Step 1 — OTel SDK: cmd/api and cmd/worker initialize a tracer provider
that exports OTLP/HTTP to obs.88oakapps.com (Jaeger all-in-one). Sampling
is AlwaysSample in dev (DEBUG=true) and TraceIDRatioBased(0.1) in prod,
overridable via OTEL_TRACES_SAMPLER_ARG. Service names are honeydue-api
and honeydue-worker. otelecho.Middleware opens a span per HTTP request.
Step 2 — Manual spans: storage_service.Upload now takes ctx and emits
storage.upload + b2.PutObject spans (size_bytes, key, mime_type, bucket,
result attrs). APNs Send/SendWithCategory and FCM sendOne emit per-token
spans with topic, status_code, reason. Asynq middleware emits
asynq.handle:<task_type> per job with retry/payload attrs and records
asynq_job_duration_seconds.
Step 3 — Database: otelgorm plugin registered in database.Connect, so
any SQL emitted via db.WithContext(ctx) attaches to the request span.
Every repository now exposes WithContext(ctx) *XRepository as the
migration helper. TaskService.ListTasks and GetTasksByResidence are
migrated end-to-end (ctx threaded through handler → service → repo);
remaining services adopt the same pattern incrementally — pre-migration
methods still emit untraced SQL via the unchanged db field.
OBS_TRACES_URL and OBS_INGEST_TOKEN flow from deploy/prod.env →
honeydue-secrets → api+worker Deployments via secretKeyRef (optional).
02-setup-secrets.sh sources them from prod.env on next run; manifests
mark both env vars optional so the deployment rolls without traces if
the secret is absent.
ch15 observability doc now lists what produces spans today vs the
remaining migration work, with the explicit per-method pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint,
the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget,
and a token-rotation runbook. The "what we still don't have" section
keeps log aggregation, alerting, and full distributed tracing as the
honest gap list.
Other touched docs:
- 00-overview: \"deliberately absent\" no longer claims we have no
metrics — calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh
(full build + push + apply + obs vmagent), with the manual
kubectl-set-image flow kept as the single-service path. Notes the
IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat
sheet block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Echo gzip middleware was wrapping promhttp's pre-gzipped output, so
vmagent received double-compressed bytes that failed the Prometheus
parser with binary garbage. Skipping /metrics in the gzip Skipper.
Three deploy-script fixes uncovered while shipping this:
- _config.sh had backticks around \"kubectl get cm\" inside the python
heredoc, which bash treated as command substitution when KUBECONFIG
was set. Quoted the literal instead.
- 03-deploy.sh now passes --platform linux/amd64 to all docker builds
so arm64 Macs don't push images that fail with \"exec format error\"
on the Hetzner CX nodes.
- OBS_INGEST_TOKEN lookup was reading deploy-k3s/prod.env instead of
the actual deploy/prod.env at the repo root.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vmagent.yaml lives under manifests/observability/; the deploy script now
substitutes the OBS_INGEST_TOKEN from deploy/prod.env into the manifest
before apply, and waits on the vmagent rollout. Manual kubectl apply is
no longer needed after the next deploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and
FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via
statement-level callbacks (no ctx plumbing needed). Storage and push
clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the
network round-trip points.
Existing internal/monitoring metrics move to /metrics/legacy so the
canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups.
deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica
vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics
every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write
with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN
in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods
and to the public obs endpoint over :443; the obs side runs
VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate.
docs/observability-plan.md captures the full plan including resource
budget, instrumentation table, 4-step rollout, and migration triggers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The B2 credentials existed in honeydue-secrets (created by
02-setup-secrets.sh) but were never referenced from the api
Deployment, so StorageConfig.IsS3() returned false at runtime →
StorageService fell back to local filesystem. With
readOnlyRootFilesystem=true on the api container, that local
fallback would silently fail on every upload — meaning every
photo, document, and task-completion upload was broken in prod
since the k3s migration on 2026-04-24.
Adding both as secretKeyRef on the api container only (the worker
doesn't perform uploads). Verified end-to-end with a registered
test user: source PDF (sha256=3af3a645...) → POST /api/uploads/document/
→ POST /api/documents/ → GET /api/media/document/:id → byte-identical
download. Storage init log now reports "Storage service initialized (S3)".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
generate_env was missing 5 keys that exist in the live honeydue-config
ConfigMap (drift introduced over time by manual kubectl patches):
STATIC_DIR, STORAGE_UPLOAD_DIR, STORAGE_BASE_URL, B2_REGION, B2_USE_SSL.
Without these, running 03-deploy.sh would silently drop them and
break static asset serving + B2 region/TLS.
Also:
- Move B2_KEY_ID/B2_APP_KEY out of generate_env: they're credentials
and belong in honeydue-secrets, not cleartext in the ConfigMap. The
api/worker deployments still need to be wired to read them via
envFrom: secretRef before B2 uploads will work — pre-existing gap,
not caused by this commit.
- Use the in-namespace short DNS form for REDIS_URL ('redis:6379') to
match what the live cluster has — pods' resolv.conf search path
already covers honeydue.svc.cluster.local.
- config.yaml.example: add b2_region, b2_use_ssl, upload_dir, base_url,
static_dir under storage so a fresh bootstrap sets them correctly.
Verified by sourcing _config.sh and diffing generate_env output against
`kubectl get cm honeydue-config -o jsonpath='{.data}'`: clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The iOS app was renamed (MyCrib → Casera → honeyDue) and the bundle ID
was updated to com.myhoneydue.honeyDue (release) / .dev (debug), but
APPLE_CLIENT_ID and APNS_TOPIC across env templates and k3s configs
still pointed at the old com.tt.honeyDue.honeyDueDev value. This made
verifyAudience reject every Apple identity token (aud claim mismatch).
Updated:
- deploy/prod.env.example: bundle ID + comment that empty client_id
rejects all tokens with DEBUG=false
- .env.example: add Sign in with Apple block (was missing entirely)
- deploy-k3s{,-dev}/config.yaml.example: apple_auth.client_id default
- deploy-k3s-dev/scripts/00-init.sh: same
- docker-compose.dev.yml: APNS_TOPIC fallback
- docs/deployment/10-secrets-config.md: doc reference
The live deploy/prod.env and local .env are .gitignored — they were
edited in place and need to ship via deploy_prod.sh to take effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four related hardening changes made on the live cluster during this
session. Each manifest captures the final working state so a fresh
`kubectl apply` of the repo reproduces it.
1. Cloudflare Full (strict) TLS — ingresses now carry `tls:` blocks
pointing at `cloudflare-origin-cert` secret (installed imperatively
from the CF Origin CA PEM). CF SSL mode flipped from Flexible to
Full (strict). CF↔origin is now HTTPS; origin serves a CF-issued
cert that only CF can validate.
2. Traefik middleware attached to all three ingresses — `rate-limit`
(100/min avg, 200 burst) and `security-headers` (frame-deny,
nosniff, HSTS, referrer policy, permissions policy). `admin-auth`
middleware was also defined in middleware.yaml but is not attached
(needs an unset basic-auth secret) and was deleted at runtime.
3. `security-headers` middleware: stripped the
Content-Security-Policy entry. The Go API sets its own CSP in
internal/router/router.go that permits Google Fonts for the
landing page. Two CSP headers combine via intersection (most
restrictive wins), which would break the landing page. Next.js
apps set their own CSP via middleware. Header kept documentation
comments explain this.
4. NetworkPolicies — default-deny + explicit allows, applied. Added
missing policies for `web`. Corrected the Traefik ingress rule: the
scaffold used `namespaceSelector: kube-system`, but our Traefik
runs as a DaemonSet with `hostNetwork: true`, so traffic arrives
with the NODE IP as source. Fixed to an `ipBlock` list of the
three node IPs plus the cluster pod CIDR (10.42.0.0/16).
5. admin livenessProbe path fix: was hitting /admin/ (404) which
caused a 6-hour crashloop cycle (87 restarts) before the bug was
caught. Fixed to / — matches the startupProbe and readinessProbe
paths that were corrected earlier.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Next.js 16 webapp in sibling repo honeyDueAPI-Web now runs
alongside api/worker/admin on the cluster. Uses a server-side proxy
pattern: browser hits app.myhoneydue.com, Next.js route handlers
forward to the Go API with an httpOnly cookie, so no CORS entry or
Allowed-Hosts change is needed on the API side.
Availability mirrors api (3 replicas, PDB minAvailable:2,
topologySpreadConstraints across nodes).
Changes:
- deploy-k3s/manifests/web/deployment.yaml: 3 replicas, readOnly root
FS, drops all caps, mounts emptyDir for /app/.next/cache and /tmp,
reads API_URL from honeydue-config.
- deploy-k3s/manifests/web/service.yaml: ClusterIP :3000.
- deploy-k3s/manifests/rbac.yaml: ServiceAccount web with
automountServiceAccountToken: false.
- deploy-k3s/manifests/pod-disruption-budgets.yaml: web-pdb
minAvailable: 2.
- deploy-k3s/manifests/ingress/ingress-simple.yaml: route
app.myhoneydue.com → web:3000.
- deploy-k3s/scripts/_config.sh: emit API_URL into the ConfigMap.
- deploy-k3s/scripts/03-deploy.sh: build + push + apply the web image
alongside api/worker/admin. Reads NEXT_PUBLIC_POSTHOG_KEY and
NEXT_PUBLIC_POSTHOG_HOST from the operator shell env (not committed).
Also adds the --build-arg NEXT_PUBLIC_API_URL wiring for the admin
image that was previously only done manually.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Next.js bakes NEXT_PUBLIC_* vars into the client JS bundle at build
time, not runtime. The admin image was being built with
admin/.env.local containing NEXT_PUBLIC_API_URL=http://localhost:8000,
hardcoding localhost into the browser bundle. The runtime configMap
value had no effect on the already-compiled JS, causing prod admin
login to throw CORS errors hitting localhost.
Fix:
- Dockerfile: admin-builder stage accepts ARG NEXT_PUBLIC_API_URL and
strips any committed .env.local/.env.development.local before
npm run build.
- .dockerignore: explicitly exclude admin/.env.* (root-level .env.*
pattern doesn't match nested paths), so a local dev .env.local can
never sneak into the build context again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Getting Started (all 3 install paths): simplify to one "start the app,
first-boot auto-seeds lookups + admin + templates" step + optional
dev test-data seed
- Seed Data table: mark 001_lookups / 003_admin_user / 003_task_templates
as auto-seeded via internal/database/migration_seed_initial_data.go;
only 002_test_data.sql is manual now
- Environment Variables: split into logical groups (core, server, admin
seed, email, push, B2, worker schedules, feature flags, Apple/Google),
added ~45 vars that weren't documented. Defer full reference to
docs/deployment/10-secrets-config.md
- Add ADMIN_EMAIL / ADMIN_PASSWORD to the admin-seed group
- Tech Stack: add Backblaze B2 (minio-go), Fastmail/go-mail, Cloudflare,
K3s (production orchestrator)
- Project Structure: add deploy-k3s/ and docs/deployment/; mark
docker-compose.yml as Swarm-era legacy
- Docker subsection: clarify compose files are local dev; point at
deployment book for prod workflow
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a data_migration that runs seeds/001_lookups.sql,
seeds/003_admin_user.sql, and seeds/003_task_templates.sql exactly
once on startup and invalidates the Redis seeded_data cache afterwards
so /api/static_data/ returns fresh results. Removes the need to
remember `./dev.sh seed-all`; the data_migrations tracking row prevents
re-runs, and each INSERT uses ON CONFLICT DO UPDATE so re-execution is
safe.
The `000016_task_template_id` and `000017_drop_task_template_regions_join`
migrations introduced on gitea collided with the existing unpadded 016/017
migrations (authtoken_created_at, fk_indexes). Renamed them to 021/022 so
they extend the shipped sequence instead of replacing real migrations.
Also removed the padded 000012-000015 files which were duplicate content
of the shipped 012-015 unpadded migrations.
Dockerfile builder image bumped from golang:1.24-alpine to 1.25-alpine to
match go.mod's `go 1.25` directive.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clients that send users through a multi-task onboarding step no longer
loop N POST /api/tasks/ calls and no longer create "orphan" tasks with
no reference to the TaskTemplate they came from.
Task model
- New task_template_id column + GORM FK (migration 000016)
- CreateTaskRequest.template_id, TaskResponse.template_id
- task_service.CreateTask persists the backlink
Bulk endpoint
- POST /api/tasks/bulk/ — 1-50 tasks in a single transaction,
returns every created row + TotalSummary. Single residence access
check, per-entry residence_id is overridden with batch value
- task_handler.BulkCreateTasks + task_service.BulkCreateTasks using
db.Transaction; task_repo.CreateTx + FindByIDTx helpers
Climate-region scoring
- templateConditions gains ClimateRegionID; suggestion_service scores
residence.PostalCode -> ZipToState -> GetClimateRegionIDByState against
the template's conditions JSON (no penalty on mismatch / unknown ZIP)
- regionMatchBonus 0.35, totalProfileFields 14 -> 15
- Standalone GET /api/tasks/templates/by-region/ removed; legacy
task_tasktemplate_regions many-to-many dropped (migration 000017).
Region affinity now lives entirely in the template's conditions JSON
Tests
- +11 cases across task_service_test, task_handler_test, suggestion_
service_test: template_id persistence, bulk rollback + cap + auth,
region match / mismatch / no-ZIP / unknown-ZIP / stacks-with-others
Docs
- docs/openapi.yaml: /tasks/bulk/ + BulkCreateTasks schemas, template_id
on TaskResponse + CreateTaskRequest, /templates/by-region/ removed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service
Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy
Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown
Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
multi-arch recipe, connection-pool tuning, renumbered sections
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduces a StorageBackend interface with local filesystem and S3
implementations. The StorageService delegates raw I/O to the backend
while keeping validation, encryption, and URL generation unchanged.
Backend selection is config-driven: set B2_ENDPOINT + B2_KEY_ID +
B2_APP_KEY + B2_BUCKET_NAME for S3 mode, or STORAGE_UPLOAD_DIR for
local mode. STORAGE_USE_SSL=false for in-cluster MinIO (HTTP).
All existing tests pass unchanged — the local backend preserves
identical behavior to the previous direct-filesystem implementation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirrors the prod deploy-k3s/ setup but runs all services in-cluster
on a single node: PostgreSQL (replaces Neon), MinIO S3-compatible
storage (replaces B2), Redis, API, worker, and admin.
Includes fully automated setup scripts (00-init through 04-verify),
server hardening (SSH, fail2ban, ufw), Let's Encrypt TLS via Traefik,
network policies, RBAC, and security contexts matching prod.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Suggestion engine now purely uses home profile features (heating,
cooling, pool, etc.) for template matching. Climate region field
and matching block removed — ZIP code is no longer collected.
14 new optional residence fields (heating, cooling, water heater, roof,
pool, sprinkler, septic, fireplace, garage, basement, attic, exterior,
flooring, landscaping) with JSONB conditions on templates.
Suggestion engine scores templates against home profile: string match
+0.25, bool +0.3, property type +0.15, universal base 0.3. Graceful
degradation from minimal to full profile info.
GET /api/tasks/suggestions/?residence_id=X returns ranked templates.
54 template conditions across 44 templates in seed data.
8 suggestion service tests.
Custom rate limiter replacing Echo built-in, with per-IP token bucket.
Every response includes X-RateLimit-Limit, Remaining, Reset headers.
429 responses additionally include Retry-After (seconds).
CORS updated to expose rate limit headers to mobile clients.
4 unit tests for header behavior and per-IP isolation.
Rate limiters on login/register/password-reset endpoints cause 429 errors
when running parallel UI tests that create many accounts. In debug mode,
skip rate limiters entirely so test suites can run without throttling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add UnicodeTranslatorFromDescriptor to convert UTF-8 strings to
Windows-1252 for gofpdf built-in fonts. Prevents garbled characters
in residence names, task titles, categories, priorities, and statuses.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The email icon URL was pointing to honeyDue.treytartt.com which now returns 404.
Updated to api.myhoneydue.com along with BASE_URL, FROM_EMAIL, and CORS defaults.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove Next.js basePath "/admin" — admin now serves at root
- Update all internal links from /admin/xxx to /xxx
- Change Go proxy to host-based routing: admin subdomain requests
proxy to Next.js, /admin/* redirects to main web app
- Update timeout middleware skipper for admin subdomain
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When ADMIN_HOST is set, redirects root "/" to "/admin/" so
admin.myhoneydue.com works without needing the /admin path suffix.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a new endpoint GET /api/tasks/templates/by-region/?zip= that resolves
ZIP codes to IECC climate regions and returns relevant home maintenance
task templates. Includes climate region model, region lookup service with
tests, seed data for all 8 climate zones with 50+ templates, and OpenAPI spec.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Stripe integration: add StripeService with checkout sessions, customer
portal, and webhook handling for subscription lifecycle events.
- Free trials: auto-start configurable trial on first subscription check,
with admin-controllable duration and enable/disable toggle.
- Cross-platform guard: prevent duplicate subscriptions across iOS, Android,
and Stripe by checking existing platform before allowing purchase.
- Subscription model: add Stripe fields (customer_id, subscription_id,
price_id), trial fields (trial_start, trial_end, trial_used), and
SubscriptionSource/IsTrialActive helpers.
- API: add trial and source fields to status response, update OpenAPI spec.
- Clean up stale migration and audit docs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrote all 11 email templates to use the Casera web brand: Outfit font via
Google Fonts, sage green (#6B8F71) brand stripe, cream (#FAFAF7) background,
pill-shaped clay (#C4856A) CTA buttons, icon-badge feature cards, numbered
tip cards, linen callout boxes, and refined light footer. Extracted reusable
helpers (emailButton, emailCodeBox, emailCalloutBox, emailAlertBox,
emailFeatureItem, emailTipCard) for consistent component composition.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alpine's busybox pgrep -x doesn't match process names correctly.
Use pgrep -f /app/worker to match the full command path instead.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The TimeoutMiddleware wraps the response writer in *http.timeoutWriter which
doesn't implement http.Flusher. When the admin reverse proxy or WebSocket
upgrader tries to flush, it panics and crashes the container (502 Bad Gateway).
Skip timeout for /admin, /_next, and /ws routes.
Also fix the Dockerfile HEALTHCHECK to detect the worker process — the worker
has no HTTP server so the curl-based check always failed, marking it unhealthy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Dockerfile: use --platform=$BUILDPLATFORM + ARG TARGETARCH instead of
hardcoded GOARCH=arm64, enabling cross-compilation and native builds
on both arm64 (M1) and amd64 (prod server)
- docker-compose.yml: rewrite for Docker Swarm — image refs, deploy
sections, overlay network, no container_name/depends_on conditions,
DB/Redis ports not exposed externally
- docker-compose.dev.yml: rewrite as self-contained dev compose with
build targets, container_name, depends_on, dev-safe defaults
- Makefile: switch to docker compose v2, point dev targets at
docker-compose.dev.yml, add docker-build-prod target
- Delete stale docker/Dockerfile (Go 1.21) and docker/docker-compose.yml
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alpine Linux resolves localhost to IPv6 ::1, but Next.js binds to
IPv4 0.0.0.0 — causing the healthcheck to fail with connection refused.
Also update worker env vars from legacy Celery names to current ones.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>