Operations
The gateway has no database, no migrations, and no background workers. Operating it means deploying the proxy and the portal, watching Traefik metrics, and tuning resilience settings.
1. Deployment
| Artifact | Image | Port | Built from |
|---|---|---|---|
| Traefik | traefik:v3.6 | :80 web, :8080 dashboard/metrics | config/traefik.yml + config/dynamic/ mounted |
| Portal | static build → nginxinc/nginx-unprivileged:1.27-alpine | :8080 | portal/Dockerfile |
| Local-dev gateway | nginx:1.27-alpine (local-nx-gateway) | :80 (host net) | local/docker-compose.yml + local/nginx.conf |
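
As a rough illustration of the Traefik row above, a compose service might look like the fragment below. This is a sketch rather than the project's actual compose file: the service name, the published host ports, and the container paths under /etc/traefik are assumptions. Only the image tag, the two ports, the mounted config paths, and the read-only Docker socket come from this page.

```yaml
# Hypothetical compose fragment; service name, host ports, and container paths are assumptions.
services:
  traefik:
    image: traefik:v3.6
    ports:
      - "80:80"      # web entrypoint
      - "8080:8080"  # dashboard and Prometheus metrics
    volumes:
      # Static config (Traefik reads /etc/traefik/traefik.yml by default)
      - ./config/traefik.yml:/etc/traefik/traefik.yml:ro
      # Dynamic config directory for the file provider (container path assumed)
      - ./config/dynamic:/etc/traefik/dynamic:ro
      # Read-only Docker socket, used for label-based service discovery
      - /var/run/docker.sock:/var/run/docker.sock:ro
```

Because both config paths are bind-mounted, changing them does not require rebuilding an image.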
Portal build & run
cd packages/gateway/portal
bun install # separate dependency tree from monorepo root
bun run dev # Astro dev server on :3003
bun run rebuild # clean + production static build
bun run lint # scripts/lint.sh

The Traefik and Nginx configs require no build step; they are mounted directly. Backend services self-register with Traefik via Docker labels, so no gateway redeploy is needed when a service is added (see Routing).
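
To make the self-registration concrete, a backend's compose entry could carry labels along these lines. This is a minimal sketch under stated assumptions: the service name `example-svc`, its image, the path prefix, and port 3100 are hypothetical. The `web` entrypoint and the `rate-limit` middleware name are taken from this page, and the `@file` suffix assumes the middleware is defined through the file provider.

```yaml
# Hypothetical backend service; name, image, path prefix, and port are illustrative only.
services:
  example-svc:
    image: example/example-svc:latest
    labels:
      # Required because exposedByDefault is false
      - "traefik.enable=true"
      # Route requests under the (assumed) API prefix to this service
      - "traefik.http.routers.example-svc.rule=PathPrefix(`/v1/api/example`)"
      - "traefik.http.routers.example-svc.entrypoints=web"
      # Reuse the shared middleware defined by the file provider
      - "traefik.http.routers.example-svc.middlewares=rate-limit@file"
      # Container port Traefik should forward to
      - "traefik.http.services.example-svc.loadbalancer.server.port=3100"
```

Once the container starts with these labels, the router should appear under HTTP Routers in the Traefik dashboard.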
Native-dev gateway
# Linux only — host networking lets Nginx reach 127.0.0.1:31xx and bind :80
docker compose -f packages/gateway/local/docker-compose.yml up -d
curl http://localhost/__gateway_health # → {"status":"ok","gateway":"local-nx-gateway"}

2. Observability
| Signal | Source | Where to look |
|---|---|---|
| Access logs | Traefik JSON access log (Authorization/Cookie dropped) | container stdout / Loki |
| App logs | Traefik JSON log, level: INFO | container stdout |
| Metrics | Prometheus on Traefik :8080 | Prometheus + Grafana |
| Dashboard | Traefik dashboard :8080 (basic-auth) | /dashboard/ |
| Per-service health | gateway-portal | portal Monitor page |
Full metric list, Prometheus scrape config, and Grafana provisioning: see Observability. Key metrics: traefik_service_requests_total, traefik_service_request_duration_seconds, traefik_service_server_up.
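
The signals in the table above map onto a handful of static-config options. The fragment below is a sketch of how config/traefik.yml might express them, not a copy of the real file: the entrypoint name `traefik` for :8080 is an assumption, and the actual file may set additional options.

```yaml
# Sketch of the observability-related parts of a Traefik static config.
entryPoints:
  web:
    address: ":80"
  traefik:              # entrypoint name assumed; serves dashboard and metrics
    address: ":8080"

log:
  level: INFO
  format: json

accessLog:
  format: json
  fields:
    headers:
      defaultMode: keep
      names:
        # Keep credentials out of the access log
        Authorization: drop
        Cookie: drop

metrics:
  prometheus:
    entryPoint: traefik  # exposes /metrics on :8080
```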
3. Security
| Concern | Mitigation | Source |
|---|---|---|
| TLS | Terminated at edge Nginx — Traefik has no HTTPS entrypoint | config/traefik.yml |
| Rate limit (general) | rate-limit 200/s, burst 400, per-IP | middlewares.yml |
| Rate limit (auth) | rate-limit-auth 30/min, burst 60, per-IP | middlewares.yml |
| Circuit breaker | NetworkErrorRatio() > 0.10 \|\| LatencyAtQuantileMS(95.0) > 3000 | middlewares.yml |
| Security headers | XSS filter, nosniff, frame deny; strip Server/X-Powered-By | middlewares.yml |
| Dashboard / portal access | basic auth (dashboard-auth) | middlewares.yml |
| Docker socket | read-only mount; exposedByDefault: false | config/traefik.yml |
| Real client IP | ipStrategy.depth=1 reads X-Forwarded-For behind Nginx | middlewares.yml |
Per-IP rate limiting only works correctly because ipStrategy.depth=1 extracts the real client IP; without it, every request would share Nginx's IP bucket. See Resilience.
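
The mitigations in the table could be expressed in a file-provider config roughly as follows. Treat this as a sketch under assumptions: `rate-limit` and `rate-limit-auth` are the names used on this page, while `circuit-breaker` and `security-headers` are hypothetical names, and the real middlewares.yml may differ in detail.

```yaml
# Sketch of config/dynamic/middlewares.yml; some middleware names are assumed.
http:
  middlewares:
    rate-limit:
      rateLimit:
        average: 200   # 200 requests per period
        period: 1s
        burst: 400
        sourceCriterion:
          ipStrategy:
            depth: 1   # take the first X-Forwarded-For entry from the right
    rate-limit-auth:
      rateLimit:
        average: 30    # 30 requests per minute
        period: 1m
        burst: 60
        sourceCriterion:
          ipStrategy:
            depth: 1
    circuit-breaker:   # hypothetical name
      circuitBreaker:
        expression: "NetworkErrorRatio() > 0.10 || LatencyAtQuantileMS(95.0) > 3000"
    security-headers:  # hypothetical name
      headers:
        browserXssFilter: true
        contentTypeNosniff: true
        frameDeny: true
        customResponseHeaders:
          # An empty value removes the header from responses
          Server: ""
          X-Powered-By: ""
```

With `depth: 1`, Traefik keys each bucket on the address that the edge Nginx appended to X-Forwarded-For, which is the client's address as long as Nginx is the only proxy in front of Traefik.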
4. Runbook
4.1 Alert classes (recommended — not yet deployed)
| Alert | Trigger | Check | Fix |
|---|---|---|---|
| GatewayHighErrorRate | 5xx ratio > 5% over 5m | Traefik access logs, target service health | Inspect the failing backend; circuit breaker may already be open |
| GatewayHighLatency | p99 > 2s over 5m | per-service latency histogram | Backend slowness; check DB / downstream |
| BackendDown | traefik_service_server_up == 0 | service /health | Restart/scale the backend; Traefik auto-re-adds on recovery |
Example PromQL + alert rules: see Observability §8.
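
For orientation, the three alert classes could be written as Prometheus alerting rules along the lines below. This is an illustrative sketch, not the rules from Observability: the group name, the `for` durations, and the severity labels are assumptions, and the expressions use only the Traefik metrics named on this page.

```yaml
# Hypothetical Prometheus rule file; group name, durations, and severities are assumed.
groups:
  - name: gateway
    rules:
      - alert: GatewayHighErrorRate
        # Share of 5xx responses across all services over the last 5 minutes
        expr: |
          sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
            / sum(rate(traefik_service_requests_total[5m])) > 0.05
        for: 1m
        labels:
          severity: critical
      - alert: GatewayHighLatency
        # Per-service p99 latency derived from the duration histogram
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, service) (rate(traefik_service_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 1m
        labels:
          severity: warning
      - alert: BackendDown
        expr: traefik_service_server_up == 0
        for: 1m
        labels:
          severity: critical
```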
4.2 Common operations
| Operation | Command / action |
|---|---|
| Tail Traefik logs | docker logs -f <traefik-container> |
| Verify a route registered | Traefik dashboard /dashboard/ → HTTP Routers |
| Reload shared middleware | edit config/dynamic/middlewares.yml — file provider hot-reloads (no restart) |
| Add a backend route (prod) | add Traefik labels to the service compose; auto-discovered |
| Add a backend route (dev) | add upstream + location /v1/api/<svc>/ to local/nginx.conf, restart local-nx-gateway |
| Add a service to the portal | add an entry to portal/src/constants/services.constant.ts, rebuild portal |
| Check dev gateway liveness | curl http://localhost/__gateway_health |
| Inspect a tripped circuit | dashboard → router middlewares; watch NetworkErrorRatio / p95 latency |
5. Related Pages
- Configuration
- Observability
- Resilience
- /runbook/: central runbook for cross-service incidents
- Decisions