Operations
1. Deployment
Single image, deployed in up to two roles via APP_ENV_APPLICATION_ROLES. The api role serves REST and enqueues jobs; the worker role consumes Kafka, renders, encrypts, uploads, and runs the recovery sweep and the WebSocket emitter.
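As a rough illustration of role selection at startup, the sketch below parses a role list into a set. It assumes APP_ENV_APPLICATION_ROLES holds a comma-separated value such as `api,worker`; the actual value format is not stated on this page, and `parseRoles` is a hypothetical helper, not the service's own code.

```typescript
// Hypothetical helper: the real value format and parsing logic are assumptions.
type Role = "api" | "worker";

const KNOWN_ROLES: readonly Role[] = ["api", "worker"];

function parseRoles(raw: string | undefined): Set<Role> {
  const roles = new Set<Role>();
  for (const part of (raw ?? "").split(",")) {
    const name = part.trim().toLowerCase();
    if (name === "") continue; // tolerate stray commas and whitespace
    if (!KNOWN_ROLES.includes(name as Role)) {
      throw new Error(`Unknown role: ${name}`);
    }
    roles.add(name as Role);
  }
  if (roles.size === 0) {
    throw new Error("APP_ENV_APPLICATION_ROLES must name at least one role");
  }
  return roles;
}

// Example: a pod that runs both roles.
const roles = parseRoles("api,worker");
console.log(roles.has("api"), roles.has("worker"));
```

Failing fast on an empty or unknown role keeps a misconfigured pod from starting with nothing to do.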
| Property | Value |
|---|---|
| Image | registry/ledger:<tag> |
| Roles | api, worker, or both (APP_ENV_APPLICATION_ROLES) |
| Container Port | 3000 (external 31060) |
| Probes | GET /healthz (live), GET /readyz (ready) |
| Snowflake ID | 6 (APP_ENV_NODE_ID) |
| Migration mode | RUN_MODE=migrate (migrate.ts) — skips components/services/controllers |
| Scaling | Scale worker replicas and/or APP_ENV_KAFKA_CONSUMER_COUNT for generation throughput |
Traefik labels
```yaml
labels:
  - "traefik.enable=true"
  - "traefik.http.routers.ledger.rule=PathPrefix(`/v1/api/ledger`)"
  - "traefik.http.services.ledger.loadbalancer.server.port=3000"
```
Only the api role needs an ingress route; worker-only pods take no HTTP traffic.
2. Observability
| Signal | Source | Where to look |
|---|---|---|
| Logs | stdout (structured key-value) | kubectl logs <pod> / Loki |
| Pipeline phases | log lines FETCH_* / GENERATE_* / UPLOAD_DONE / COMPLETED / FAILED per ledgerId | worker pods |
| Health | GET /healthz, GET /readyz | Gateway portal |
| WS emit | [notifyJobStatus] lines (warn on emitter-not-ready) | worker pods |
Key log fields
| Field | Source | Notes |
|---|---|---|
| ledgerId | message value | Primary correlation key across the pipeline |
| merchantId / type / period | loaded ledger | Job identity |
| attemptCount | LedgerJob | Lifetime retry count |
| errorCode | failureReason | See runbook |
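Because ledgerId is the primary correlation key, a small filter over exported log lines can isolate one job's history. The sketch below assumes lines made of whitespace-separated `key=value` pairs; the page only says "structured key-value", so the exact layout, the `phase` key, and both helper names are assumptions.

```typescript
// Assumes whitespace-separated key=value pairs; the service's real log layout may differ.
function parseLogFields(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  // Matches key=value or key="quoted value".
  const pattern = /(\w+)=("([^"]*)"|\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    // Group 3 is set only for quoted values; otherwise use the bare token.
    fields[match[1]] = match[3] !== undefined ? match[3] : match[2];
  }
  return fields;
}

// Keep only the lines that belong to one ledger.
function linesForLedger(lines: string[], ledgerId: string): string[] {
  return lines.filter((l) => parseLogFields(l).ledgerId === ledgerId);
}

// Made-up sample lines, for illustration only.
const sample = [
  "phase=UPLOAD_DONE ledgerId=123 attemptCount=1",
  "phase=FAILED ledgerId=456 errorCode=FETCH_DATA_ERROR",
];
console.log(linesForLedger(sample, "456"));
```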
3. Security
| Concern | Mitigation |
|---|---|
| AuthN | JWT (ES256, JWKS from identity) — VerifierApplication |
| AuthZ | Casbin via PolicyDefinition (Redis-cached); every endpoint calls assertMerchantAccess |
| File encryption | AES-256-GCM at rest in S3 (APP_ENV_LEDGER_ENCRYPTION_KEY); files served only via authenticated download |
| Download hardening | Content-Disposition + X-Content-Type-Options: nosniff; decrypt in-memory, no public S3 URLs |
| Secrets | K8s Secret as env (APP_ENV_LEDGER_ENCRYPTION_KEY, Kafka SASL, S3 keys) |
| Network policy | Cilium — allow gateway + Kafka + Redis + S3 only |
| Soft-delete | deletedAt — no hard-delete of ledgers |
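For readers unfamiliar with AES-256-GCM, here is a minimal round-trip sketch using Node's built-in `node:crypto` module. The blob layout (12-byte IV, then 16-byte auth tag, then ciphertext) and the function names are assumptions made for illustration; this page does not document how the service actually frames its encrypted files or encodes APP_ENV_LEDGER_ENCRYPTION_KEY.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Assumed layout: IV (12 bytes) | auth tag (16 bytes) | ciphertext.
const IV_LEN = 12;
const TAG_LEN = 16;

function encryptFile(plain: Buffer, key: Buffer): Buffer {
  const iv = randomBytes(IV_LEN); // fresh IV per file; never reuse with the same key
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const body = Buffer.concat([cipher.update(plain), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), body]);
}

function decryptFile(blob: Buffer, key: Buffer): Buffer {
  const iv = blob.subarray(0, IV_LEN);
  const tag = blob.subarray(IV_LEN, IV_LEN + TAG_LEN);
  const body = blob.subarray(IV_LEN + TAG_LEN);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  // final() throws if the tag does not match, i.e. the blob was altered.
  return Buffer.concat([decipher.update(body), decipher.final()]);
}

// A random 32-byte key stands in for APP_ENV_LEDGER_ENCRYPTION_KEY.
const key = randomBytes(32);
const blob = encryptFile(Buffer.from("ledger file bytes"), key);
console.log(decryptFile(blob, key).toString("utf8"));
```

GCM authenticates as well as encrypts, so a corrupted or tampered object fails at decrypt time rather than yielding garbage bytes to the download endpoint.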
4. Runbook
4.1 Alert classes
| Alert | Trigger | Check | Fix | Escalate |
|---|---|---|---|---|
| ledgerJobRejectedSpike | LedgerJob.status=REJECTED rate up | logs FAILED/errorCode | retry per error code (below) | on-call backend |
| ledgerStalledJobs | many jobs reset by recovery sweep | [RecoveryComponent] Found N stalled job | check worker health / APP_ENV_JOB_TIMEOUT_MS | on-call backend |
| ledgerConsumerLag | ledger.generate lag grows | consumer logs / Kafka lag | scale workers / APP_ENV_KAFKA_CONSUMER_COUNT | on-call SRE |
| ledgerWsNotReady | repeated WS emitter not ready | worker WS Redis connection | check APP_ENV_WEBSOCKET_REDIS_* | on-call SRE |
4.2 Failure codes (failureReason.errorCode)
| Code | Meaning | Action |
|---|---|---|
| FETCH_DATA_ERROR | Source data fetch/parse failed (Zod) | inspect source data; manual retry |
| JOB_EXECUTION_FAILED | Generic pipeline error (render/upload) | check logs; retry |
| ENQUEUE_FAILED | Kafka producer could not publish | check broker/SASL; retry |
| JOB_IN_PROGRESS | Retry/regenerate while PENDING/PROCESSING | wait for completion |
| JOB_NOT_READY | Download before COMPLETED | wait/retry generation |
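The failure-code table can be mirrored as a lookup for on-call tooling. This is only a sketch: the codes and action text are copied from the table, while `ACTIONS`, `actionFor`, and the fallback message for unknown codes are hypothetical.

```typescript
// Codes and actions come from the failure-code table; everything else is illustrative.
type FailureCode =
  | "FETCH_DATA_ERROR"
  | "JOB_EXECUTION_FAILED"
  | "ENQUEUE_FAILED"
  | "JOB_IN_PROGRESS"
  | "JOB_NOT_READY";

const ACTIONS: Record<FailureCode, string> = {
  FETCH_DATA_ERROR: "inspect source data; manual retry",
  JOB_EXECUTION_FAILED: "check logs; retry",
  ENQUEUE_FAILED: "check broker/SASL; retry",
  JOB_IN_PROGRESS: "wait for completion",
  JOB_NOT_READY: "wait/retry generation",
};

function actionFor(code: string): string {
  // Own-property check so inherited keys like "toString" are not treated as codes.
  if (Object.prototype.hasOwnProperty.call(ACTIONS, code)) {
    return ACTIONS[code as FailureCode];
  }
  return "unknown code; check logs and escalate"; // assumed fallback, not from the table
}

console.log(actionFor("ENQUEUE_FAILED"));
```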
4.3 Common operations
| Operation | Command / Action |
|---|---|
| Tail worker logs | kubectl logs -n <ns> -f deploy/ledger-worker |
| Manual retry | POST /v1/api/ledger/ledgers/:id/retry |
| Regenerate DRAFT | POST /v1/api/ledger/ledgers/:id/regenerate |
| Force stalled recovery | wait for sweep (APP_ENV_SWEEP_INTERVAL_MS) or restart worker (initial sweep on boot) |
| Run migration | RUN_MODE=migrate job / bun run migrate:dev (dev) |
Re-processing is never automatic — a committed Kafka message is not replayed. Use retry/regenerate or the recovery sweep.
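Since re-processing has to be triggered explicitly, a small helper for building the retry or regenerate call may be handy. In this sketch only the two paths come from the operations table; the base URL, the bearer-token Authorization header, and `buildLedgerRequest` itself are assumptions.

```typescript
// Only the /retry and /regenerate paths come from the operations table.
// Base URL and bearer-token auth are assumptions for illustration.
type LedgerAction = "retry" | "regenerate";

function buildLedgerRequest(
  baseUrl: string,
  ledgerId: string,
  action: LedgerAction,
  token: string,
) {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const url = `${root}/v1/api/ledger/ledgers/${encodeURIComponent(ledgerId)}/${action}`;
  const init = {
    method: "POST",
    headers: { Authorization: `Bearer ${token}` },
  };
  return { url, init };
}

// Hypothetical gateway host and placeholder token.
const req = buildLedgerRequest("https://gateway.example.com", "123", "retry", "<jwt>");
console.log(req.init.method, req.url);
// To send it: await fetch(req.url, req.init);
```

Keeping request construction separate from the network call makes the helper easy to check without reaching a live gateway.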
5. Related Pages
- Configuration
- Generation Pipeline
- /runbook/: central runbook for cross-service incidents
- Decisions