Trey t 7e77e3bbab

docs/deployment: record security hardening pass + webapp + APNs

Mark roadmap items done (network policies, Traefik middleware, CF Full
strict, CF IP UFW restriction, webapp deploy, APNs wired up, admin
URL-baking fix, admin probe bug). Update Chapter 4 (firewall rule
inventory now shows CF-only :443, no :80), Chapter 6 (request flow
walks through TLS on :443 and middleware hops), Chapter 13 (CF SSL
mode is Full strict, not Flexible; documents the origin cert
install), Chapter 7 (adds the web service section — proxy pattern,
3 replicas, PostHog build-args), and Appendix C (web manifests, CF
origin cert paths on disk, APNs .p8 path, updated network-policies
applied status).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-24 15:50:59 -05:00

20 — Roadmap

Summary

A consolidated list of known gaps, improvements, and scaling triggers. Items are grouped by category and roughly ordered by priority. This is the "if we had more time" list referenced throughout the book.

High priority (do soon)

Uptime monitoring

Why: Right now we find out the site is down when users complain.

How: Set up Uptime Kuma (self-hosted) or Better Stack Uptime (free tier) to ping https://api.myhoneydue.com/api/health/ every minute, with Slack/email alerts on failure.

Effort: ~30 min for Uptime Kuma deploy, ~10 min for Better Stack signup.
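Until one of those services is set up, a cron-driven probe covers the same ground. The sketch below is only an illustration: the health URL is the one named above, while ALERT_WEBHOOK is a hypothetical Slack-style incoming webhook and the function names are made up here.

```shell
# Minimal external health probe (sketch). HEALTH_URL is from the roadmap text;
# ALERT_WEBHOOK is a hypothetical Slack-style incoming webhook URL.
HEALTH_URL="${HEALTH_URL:-https://api.myhoneydue.com/api/health/}"
ALERT_WEBHOOK="${ALERT_WEBHOOK:-}"

# Map an HTTP status code to "up" or "down". Only 2xx counts as up.
classify_status() {
  case "$1" in
    2[0-9][0-9]) echo up ;;
    *)           echo down ;;
  esac
}

# Fetch the endpoint once and print the status code.
# curl prints 000 when the connection itself fails.
probe_once() {
  local code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$HEALTH_URL")" || true
  echo "${code:-000}"
}

# Probe, log one line, and post to the webhook when the site is down.
# Returns non-zero on failure so cron wrappers can react too.
check_and_alert() {
  local code state
  code="$(probe_once)"
  state="$(classify_status "$code")"
  echo "$(date -u +%FT%TZ) $HEALTH_URL $code $state"
  if [ "$state" = down ] && [ -n "$ALERT_WEBHOOK" ]; then
    curl -s -X POST -H 'Content-Type: application/json' \
      -d "{\"text\":\"honeyDue API health check failed (HTTP $code)\"}" \
      "$ALERT_WEBHOOK" >/dev/null
  fi
  [ "$state" = up ]
}
```

Run check_and_alert every minute from a machine outside the cluster; a probe running on the nodes it watches goes down with them.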

Cloudflare origin IP restriction ✓ DONE (2026-04-24)

Both :80 and :443 Anywhere rules removed on all 3 nodes. Only CF's 15 IPv4 + 7 IPv6 ranges allowed on :443. Direct-connect attempts from non-CF IPs time out.

Still TODO: monthly automated refresh of the CF IP list. Ranges change rarely, so refreshing the rules by hand on a regular cadence is acceptable for now. The helper for that, scripts/ufw-cf-refresh.sh, is not yet written.
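A possible starting point for scripts/ufw-cf-refresh.sh, assuming Cloudflare's published lists at https://www.cloudflare.com/ips-v4 and https://www.cloudflare.com/ips-v6. This is a sketch, not the real script: the function names are invented here, and it only prints rules rather than applying them.

```shell
# Sketch of what scripts/ufw-cf-refresh.sh could do (not the real script).

# Read CIDRs on stdin, one per line, and print one ufw command per range.
# Blank lines and anything that does not look like a CIDR are skipped.
emit_ufw_rules() {
  local cidr
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    cidr="$(printf '%s' "$cidr" | tr -d '[:space:]')"
    case "$cidr" in
      */*) echo "ufw allow proto tcp from $cidr to any port 443" ;;
    esac
  done
}

# Fetch Cloudflare's published IPv4 and IPv6 ranges.
# The extra echo guards against a missing trailing newline in either list.
fetch_cf_ranges() {
  curl -fsS https://www.cloudflare.com/ips-v4
  echo
  curl -fsS https://www.cloudflare.com/ips-v6
  echo
}
```

Review the output of `fetch_cf_ranges | emit_ufw_rules` before piping it to a root shell. The sketch only adds ranges; removing ranges Cloudflare has retired would still be a manual step.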

Enable network policies in k3s ✓ DONE (2026-04-24)

Applied with one scaffold correction: Traefik runs as a DaemonSet with hostNetwork: true, so traffic from it arrives with the node IP as source rather than a pod IP. The original scaffold used namespaceSelector: kube-system which doesn't match hostNetwork traffic. Fixed by using an ipBlock list of the three node IPs plus the cluster pod CIDR 10.42.0.0/16.

Also added policies for web (missing from the original scaffold).
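The ipBlock fix can be sketched as a small manifest generator. Everything specific in it is an assumption for illustration: the honeydue namespace, the app label, the port, and the node IPs passed in (documentation-range addresses in the usage line). Only the 10.42.0.0/16 pod CIDR comes from the text above.

```shell
# Render an ingress NetworkPolicy that admits hostNetwork Traefik by node IP.
# Usage: render_traefik_ingress_policy <app-label> <port> <node-ip>...
# Namespace, label key and port are placeholders, not the repo's real values.
render_traefik_ingress_policy() {
  local app="$1" port="$2"
  shift 2
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-traefik-to-$app
  namespace: honeydue
spec:
  podSelector:
    matchLabels:
      app: $app
  policyTypes:
    - Ingress
  ingress:
    - from:
EOF
  # One /32 ipBlock per node, because hostNetwork traffic carries the node IP.
  local ip
  for ip in "$@"; do
    printf '        - ipBlock:\n            cidr: %s/32\n' "$ip"
  done
  # Pod CIDR, so in-cluster callers are still admitted.
  cat <<EOF
        - ipBlock:
            cidr: 10.42.0.0/16
      ports:
        - protocol: TCP
          port: $port
EOF
}
```

For example, `render_traefik_ingress_policy api 8000 203.0.113.10 203.0.113.11 203.0.113.12 | kubectl apply --dry-run=client -f -` checks the generated manifest without changing the cluster.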

Apply Traefik security middleware ✓ DONE (2026-04-24)

security-headers + rate-limit attached to all three ingresses (api, admin, web). admin-auth is defined but not attached (needs an admin-basic-auth secret we haven't created). cloudflare-only IP allowlist exists but is redundant with the UFW-level CF restriction — keep for defense in depth if we ever expose another layer.

One scaffold correction: we removed the Content-Security-Policy header from security-headers.customResponseHeaders. The Go API sets its own CSP in internal/router/router.go, and when a response carries two CSP headers the browser enforces both, so a resource loads only if every policy allows it (effectively the most restrictive wins). That would break the Google Fonts on the marketing landing page. Next.js apps set their own CSP via middleware.
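A quick regression check for the duplicate-CSP problem is to count the headers a response actually carries. This is a sketch with invented function names; the URL to test is whichever ingress you want to confirm.

```shell
# Count Content-Security-Policy headers in a raw header dump read from stdin.
# The trailing colon in the pattern keeps Content-Security-Policy-Report-Only out.
count_csp_headers() {
  grep -ci '^content-security-policy:' || true
}

# Fetch a URL with GET, dump only the response headers, and fail if more
# than one CSP header is present.
check_single_csp() {
  local url="$1" n
  n="$(curl -s -o /dev/null -D - "$url" | count_csp_headers)"
  echo "$url: $n CSP header(s)"
  [ "$n" -le 1 ]
}
```

Something like `check_single_csp https://api.myhoneydue.com/api/health/` after each middleware change would catch the header being reintroduced.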

Medium priority

Upgrade to CF Full (strict) SSL ✓ DONE (2026-04-24)

Origin CA cert (*.myhoneydue.com + myhoneydue.com, 15-year validity) stored as cloudflare-origin-cert TLS secret. All three ingresses reference it via tls: blocks. CF mode flipped from Flexible to Full (strict). Verified by:

direct-connect to origin on :443 serves the Origin cert (subject CN=CloudFlare Origin Certificate)
CF edge continues to serve its own Let's Encrypt cert to browsers
both layers now TLS-encrypted
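The first check can be repeated with openssl. A minimal sketch, with invented function names; the node IP and hostname are whatever you pass in. Because non-CF IPs now time out on :443, it has to run from a host the firewall still admits, such as one of the nodes.

```shell
# Print the subject of the certificate served by one origin node.
# Arguments: node IP, hostname to send as SNI.
origin_cert_subject() {
  local ip="$1" host="$2"
  openssl s_client -connect "$ip:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -subject
}

# Succeed only when a subject line names the Cloudflare Origin CA certificate.
is_cf_origin_subject() {
  case "$1" in
    *"CloudFlare Origin Certificate"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage would look like `is_cf_origin_subject "$(origin_cert_subject <node-ip> api.myhoneydue.com)" && echo ok`.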

Migration Job for schema changes

Why: Currently every api pod runs MigrateWithLock() on startup, serializing on a Postgres advisory lock. Adds 90-240s to cold startup and caused bug #13 in Chapter 19.

How: Create a Kubernetes Job resource that runs the api image with a --migrate-only flag. Job runs once per deploy, completes when schema is current. api pods get an initContainer that waits for the Job to complete.

Requires Go code change to support --migrate-only flag.

Effort: 3-4 hours (code + job manifest + testing).
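A rough shape for the Job manifest, generated per deploy. This is a sketch under assumptions: --migrate-only is the proposed flag and does not exist yet, the honeydue namespace and the Job name are placeholders, and pulling env from honeydue-secrets is a guess at how the api image gets its database settings. The image path is the one used elsewhere in this chapter.

```shell
# Render a one-shot migration Job for a given image tag (sketch).
# Usage: render_migrate_job <sha>
render_migrate_job() {
  local sha="$1"
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: api-migrate-$sha
  namespace: honeydue
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: gitea.treytartt.com/admin/honeydue-api:$sha
          args: ["--migrate-only"]
          envFrom:
            - secretRef:
                name: honeydue-secrets
EOF
}
```

A deploy script could run `render_migrate_job "$SHA" | kubectl apply -f -`, then `kubectl -n honeydue wait --for=condition=complete "job/api-migrate-$SHA" --timeout=10m` before rolling the api pods. That gates the rollout from the deploy script instead of an initContainer, which is a simpler variant of the plan above rather than the same thing.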

Redis password

Why: Redis runs in the cluster with no auth. Any compromised pod could read cache or queue state.

How: Set REDIS_PASSWORD in honeydue-secrets, update api/worker env, update Redis command to include --requirepass. Already partially wired up in the manifests.

Effort: 20 min.
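Generating the password itself needs nothing beyond coreutils. A small sketch, with an invented function name:

```shell
# Generate a 64-character lowercase hex password from 32 random bytes.
# od -v prevents repeated lines from being collapsed into "*".
gen_redis_password() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}
```

The value would then go into honeydue-secrets as REDIS_PASSWORD. Hex output avoids quoting problems when the same string is passed to --requirepass and to client connection settings.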

Image signing with cosign

Why: No guarantee that an image pulled from Gitea is the one we built. Gitea compromise = arbitrary code execution in cluster.

How:

Install cosign on build machine
Sign images as part of deploy: cosign sign gitea.treytartt.com/admin/honeydue-api:<sha>
Deploy Kyverno (or Connaisseur) to cluster
Apply cluster policy requiring all images have valid cosign signatures

Effort: 4-6 hours.
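The signing half of those steps might be wrapped like this. It is a sketch that assumes a key pair created with cosign generate-key-pair (cosign.key and cosign.pub in the working directory); the function names are invented here. Signing by digest rather than by tag is an addition to the plan above, since a tag can be moved after signing.

```shell
# Image repository as used in this chapter.
IMAGE_REPO="gitea.treytartt.com/admin/honeydue-api"

# Build the reference to sign: prefer an immutable digest when one is given,
# otherwise fall back to the commit-sha tag.
# Usage: image_ref <sha> [sha256:<digest>]
image_ref() {
  local sha="$1" digest="${2:-}"
  if [ -n "$digest" ]; then
    echo "$IMAGE_REPO@$digest"
  else
    echo "$IMAGE_REPO:$sha"
  fi
}

# Sign and verify with a local key pair (cosign.key / cosign.pub).
sign_image()   { cosign sign --key cosign.key "$(image_ref "$@")"; }
verify_image() { cosign verify --key cosign.pub "$(image_ref "$@")"; }
```

Running verify_image in the deploy script right after sign_image gives an early failure if the signature did not land in the registry, before the cluster policy has a chance to reject the pod.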

etcd encryption at rest

Why: Kubernetes Secrets are stored in etcd unencrypted by default. Node disk compromise = plaintext secrets.

How: K3s supports --secrets-encryption flag at server install. Need to recreate cluster or re-install k3s server on each node.

Effort: 1 hour.

Automated unattended-upgrades

Why: Currently OS patches require manual apt upgrade. Security patches can be delayed.

How:

sudo apt install unattended-upgrades
# Configure /etc/apt/apt.conf.d/50unattended-upgrades for security-only
sudo dpkg-reconfigure -plow unattended-upgrades

Effort: 30 min per node.

fail2ban

Why: SSH is open to the world. No rate limiting on failed attempts. Bot noise is constant.

How: sudo apt install fail2ban; sudo systemctl enable --now fail2ban. Default config bans IPs after 5 failed attempts for 10 min.

Effort: 15 min per node.
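If we want those defaults pinned rather than inherited, a jail.local can spell them out. A sketch with an invented function name; findtime = 10m is assumed to match the stock default, and the port argument is only needed if SSH moves off 22.

```shell
# Print a minimal jail.local that pins the sshd jail to the values quoted above.
# Usage: render_sshd_jail [port]   (defaults to the "ssh" service name)
render_sshd_jail() {
  local port="${1:-ssh}"
  cat <<EOF
[sshd]
enabled  = true
port     = $port
maxretry = 5
findtime = 10m
bantime  = 10m
EOF
}
```

For example, `render_sshd_jail | sudo tee /etc/fail2ban/jail.local`, restart fail2ban, then confirm with `sudo fail2ban-client status sshd`.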

Move SSH off port 22

Why: Port 22 attracts constant scanner noise. Moving to a non-default port cuts >90% of attempts.

How:

Edit /etc/ssh/sshd_config on each node: Port 2222
UFW rule: sudo ufw allow 2222/tcp
Update ~/.ssh/config on operator: Port 2222
Restart sshd: sudo systemctl restart ssh
Remove UFW rule for port 22 after verifying

Effort: 30 min (and pray).
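To pray a little less, the last step can be gated on a real login over the new port. A sketch with invented function names; it assumes key-based auth works from the operator machine and that the old rule was created as `ufw allow 22/tcp` (a rule added as `ufw allow OpenSSH` would need a matching delete instead).

```shell
# Succeed only if a non-interactive SSH login on the given port works.
ssh_port_works() {
  local host="$1" port="$2"
  ssh -p "$port" -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}

# Print the cleanup command for a node, but only once the new port is proven.
# Usage: close_port_22_if_safe <host> [new-port]
close_port_22_if_safe() {
  local host="$1" new_port="${2:-2222}"
  if ssh_port_works "$host" "$new_port"; then
    echo "ssh -p $new_port $host sudo ufw delete allow 22/tcp"
  else
    echo "refusing: $host not reachable on port $new_port" >&2
    return 1
  fi
}
```

It prints the command instead of running it, so the final lockout-capable step stays a deliberate copy and paste. Running `sudo sshd -t` on the node before restarting sshd also catches config typos while the old session is still open.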

Lower priority

Prometheus + Grafana

Why: Historical metrics, dashboards, alerting.

How: kube-prometheus-stack Helm chart. Adds ~500 MB RAM across cluster.

Effort: 4-6 hours including dashboard setup.

Loki log aggregation

Why: Cross-pod log queries, longer retention.

How: grafana/loki + promtail DaemonSet. Integrates with existing Grafana.

Effort: 2-3 hours.

OpenTelemetry tracing

Why: Request-level profiling. Show which hop dominates p99 latency.

How: Add OpenTelemetry SDK to Go app; export to Jaeger/Tempo.

Effort: 8-12 hours including tuning.

Hetzner private network

Why: Currently all inter-node traffic (including Flannel overlay) goes over public network. Private network = less attack surface, no bandwidth costs (if metered in future).

How: Attach Hetzner vswitch to the 3 nodes, reconfigure Flannel to advertise private IPs, update UFW rules to allow from private IP range instead of specific public IPs.

Effort: 2-3 hours including testing Flannel reconfig.

Move secrets to Vault

Why: Kubernetes Secrets are base64-encoded etcd values. Vault is purpose-built for secret management with audit logs, dynamic secrets, rotation policies.

How: Deploy Vault in the cluster (or external), migrate secret values, use Vault Agent Injector or External Secrets Operator.

Effort: 6-8 hours.

Not high priority until we have multiple engineers who shouldn't see every secret, or compliance requirements.

Automated backups to B2

Why: Neon's backup is Neon's problem. If Neon-as-a-company disappeared, we'd lose everything.

How: Nightly pg_dump | gzip, uploaded to B2 through its S3-compatible API (aws s3 cp with a custom endpoint, or s3cmd), run as a CronJob in the cluster.

Effort: 2 hours.
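The body of that CronJob could be a script along these lines. It is a sketch: DATABASE_URL, B2_BUCKET and B2_ENDPOINT are assumed environment variable names, the key layout is invented, and it uses the aws CLI against B2's S3-compatible endpoint. It needs bash for pipefail.

```shell
# Build a date-stamped object key from a YYYY-MM-DD date (defaults to today, UTC).
backup_key() {
  local day="${1:-$(date -u +%F)}"
  local year month
  year="${day%%-*}"
  month="${day#*-}"
  month="${month%%-*}"
  echo "pg/$year/$month/honeydue-$day.sql.gz"
}

# Stream a compressed dump straight to the bucket without touching local disk.
# pipefail (bash) makes a failed pg_dump fail the whole pipeline instead of
# silently uploading a truncated file.
run_backup() (
  set -o pipefail
  pg_dump "$DATABASE_URL" | gzip \
    | aws s3 cp - "s3://$B2_BUCKET/$(backup_key)" --endpoint-url "$B2_ENDPOINT"
)
```

A backup only counts once a restore has been tried, so it is worth periodically pulling one object back and loading it into a scratch database.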

Multi-region

Why: ~100 ms CF→origin hop could be reduced by having origins in multiple regions. Not needed at current scale.

How: Add 2 more Hetzner nodes in ash (Ashburn, US). Separate k3s cluster (or one stretched cluster — painful). Cloudflare Load Balancing for geo-based routing.

Effort: Days of work, doubling cost. Don't until traffic justifies.

CF Workers for static + caching

Why: Certain endpoints (the marketing landing page, public API lookups) could serve from CF Workers with near-zero origin load.

How: Move static pages to Cloudflare Pages; cache API responses with Cache-Control: public, max-age=300.

Effort: 4-6 hours.

WireGuard-encrypted overlay

Why: Current Flannel VXLAN is plaintext between nodes. An attacker with Hetzner-internal network access could read pod-to-pod traffic.

How: K3s supports --flannel-backend=wireguard-native. Reinstall k3s server on each node with the new backend.

Effort: 2-3 hours (requires brief downtime).

Scaling triggers

Trigger	Action
p99 latency > 500ms sustained	Investigate with tracing; consider CF Workers for cached paths
API CPU > 70% sustained	HPA already configured; may need more nodes
DB connections at Neon limit	Upgrade to Neon's Scale plan or reduce DB_MAX_OPEN_CONNS
Redis memory > 80%	Scale Redis memory; consider cache sharding
B2 storage > 500 GB	Evaluate if R2 (free egress) is cheaper overall
Active users > 100k	Evaluate multi-region, CF Pro, paid monitoring
Revenue > $5k/mo	Hire ops help; this document assumes solo operator

Known gaps we accept

No canary deploys: all-or-nothing rollouts via kubectl set image
No feature flags (app-level): code is deployed as-is. Can't toggle features without re-deploying
No A/B testing infra: out of scope for current product stage
No Windows/tablet-specific CDN rules: CF serves everyone the same responses
No explicit blue-green: rolling updates only

Stuff to delete when brave

deploy/ (the Swarm era) — once we've been on k3s 30 days
Legacy UFW rules from the Swarm era (2377, 7946, 4789, ESP, 500, 3000) — they don't hurt but they're confusing
deploy-k3s/manifests/secrets.yaml.example — we don't use this pattern, we create secrets imperatively

Stuff that could go wrong and we should plan for

Hetzner price hike: one already took effect on 2026-04-01. If another comes, we could migrate to Netcup or OVH for savings.
Neon free tier ends: Neon could change its pricing policy. Fallback: self-host Postgres on a Hetzner box or migrate to Supabase.
Cloudflare Free plan changes: CF could restrict Free features. Fallback: BunnyCDN, or raw nodes without CDN.
Gitea host outage: If Gitea is down, deploys can't pull new images. Existing pods continue. For long outages, we'd cache images locally or temporarily push to Docker Hub.

Progress tracker

As items are done, mark them here. Think of this as a running changelog.

k3s migration from Swarm (2026-04-24)
Traefik DaemonSet + hostNetwork (2026-04-24)
Admin seed via ADMIN_EMAIL + ADMIN_PASSWORD (2026-04-24)
Documentation book (this doc set) (2026-04-24)
Web client deployed at app.myhoneydue.com (2026-04-24) — Next.js 16 standalone, 3 replicas with PDB, proxy pattern to api, see Chapter 7.
Admin URL-baking fix (2026-04-24) — Dockerfile ARG NEXT_PUBLIC_API_URL, .dockerignore hardening for admin/.env.*.
Auto-seed initial data on first API boot (2026-04-24) — 20260414_seed_initial_data migration populates lookups, admin user, task templates. See commit 4ec4bbb.
APNs wired up (2026-04-24) — Key ID 5L5BVF5G48, Team ID X86BR9WTLD, sandbox mode. Secret honeydue-apns-key, FEATURE_PUSH_ENABLED=true.
Traefik middleware: security-headers + rate-limit attached to all three ingresses (2026-04-24). CSP is stripped from the middleware because the Go API sets its own.
Admin liveness probe path fix (2026-04-24) — was hitting /admin/ (404) and crashlooping every ~90s for 6 hours before the bug was caught. Fixed to /.
Network policies applied (2026-04-24) — default-deny + explicit allows. Traefik hostNetwork is matched via node IP ipBlocks, not namespaceSelector. See Chapter 5.
Cloudflare Full (strict) SSL (2026-04-24) — Origin CA cert installed as cloudflare-origin-cert secret, ingresses have tls: blocks, CF mode flipped from Flexible. Both user↔CF and CF↔origin now TLS.
UFW CF-IP allowlist on all 3 nodes (2026-04-24) — 15 IPv4 + 7 IPv6 CF ranges allow :443; Anywhere rules for :80 and :443 deleted. Direct-connect from non-CF IPs times out.
Still open: all other items above
