Operations

1. Deployment

A single image is deployed in one or both of two roles, selected via APP_ENV_APPLICATION_ROLES. The api role serves the REST API and enqueues jobs; the worker role consumes from Kafka, renders, encrypts, and uploads, and also runs the recovery sweep and the WebSocket emitter.

| Property | Value |
| --- | --- |
| Image | `registry/ledger:<tag>` |
| Roles | `api`, `worker`, or both (`APP_ENV_APPLICATION_ROLES`) |
| Container Port | 3000 (external 31060) |
| Probes | `GET /healthz` (live), `GET /readyz` (ready) |
| Snowflake ID | 6 (`APP_ENV_NODE_ID`) |
| Migration mode | `RUN_MODE=migrate` (`migrate.ts`); skips components/services/controllers |
| Scaling | Scale worker replicas and/or `APP_ENV_KAFKA_CONSUMER_COUNT` for generation throughput |
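
As an illustration of the role switch, the sketch below parses a role list into a set that startup code could consult before wiring the REST server or the Kafka consumers. Only the variable name APP_ENV_APPLICATION_ROLES and the two role names come from this page; the comma-separated format and the `parseRoles` helper are assumptions, not the service's actual bootstrap code.

```typescript
// Hypothetical helper. Assumes the value is a comma-separated list such as "api,worker".
type Role = "api" | "worker";

const KNOWN_ROLES: readonly string[] = ["api", "worker"];

function parseRoles(raw: string | undefined): Set<Role> {
  const roles = new Set<Role>();
  for (const part of (raw ?? "").split(",")) {
    const name = part.trim().toLowerCase();
    if (name === "") continue; // tolerate stray commas and whitespace
    if (!KNOWN_ROLES.includes(name)) {
      throw new Error(`Unknown role: ${name}`);
    }
    roles.add(name as Role);
  }
  if (roles.size === 0) {
    throw new Error("APP_ENV_APPLICATION_ROLES must name at least one role");
  }
  return roles;
}

// Example: a pod that runs both roles.
const roles = parseRoles("api, worker");
console.log(roles.has("api"), roles.has("worker"));
```

Failing fast on an empty or unknown role is a design choice in this sketch; it surfaces a misconfigured pod at startup rather than leaving it idle.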

Traefik labels

```yaml
labels:
  - "traefik.enable=true"
  - "traefik.http.routers.ledger.rule=PathPrefix(`/v1/api/ledger`)"
  - "traefik.http.services.ledger.loadbalancer.server.port=3000"
```

Only the api role needs an ingress route. worker-only pods take no HTTP traffic.

2. Observability

| Signal | Source | Where to look |
| --- | --- | --- |
| Logs | stdout (structured key-value) | `kubectl logs <pod>` / Loki |
| Pipeline phases | log lines `FETCH_*` / `GENERATE_*` / `UPLOAD_DONE` / `COMPLETED` / `FAILED` per `ledgerId` | worker pods |
| Health | `GET /healthz`, `GET /readyz` | Gateway portal |
| WS emit | `[notifyJobStatus]` lines (warn on emitter-not-ready) | worker pods |

Key log fields

| Field | Source | Notes |
| --- | --- | --- |
| `ledgerId` | message value | Primary correlation key across the pipeline |
| `merchantId` / `type` / `period` | loaded ledger | Job identity |
| `attemptCount` | `LedgerJob` | Lifetime retry count |
| `errorCode` | `failureReason` | See runbook |
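
Since ledgerId is the correlation key, a small filter can pull one ledger's pipeline trail out of a log dump. The sketch below assumes each line is a run of space-separated `key=value` pairs with optional double quotes; the exact log layout is not specified on this page, so `parseLogLine` and `linesForLedger` are hypothetical helpers and the sample lines are made up for illustration.

```typescript
// Assumption: lines look like `phase=FAILED ledgerId=abc123 msg="some text"`.
function parseLogLine(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  // Matches key=value or key="quoted value".
  const pattern = /(\w+)=("([^"]*)"|\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    // Group 3 holds the unquoted content when the value was quoted.
    fields[match[1]] = match[3] !== undefined ? match[3] : match[2];
  }
  return fields;
}

// Keep only the lines that belong to one ledger, in their original order.
function linesForLedger(lines: string[], ledgerId: string): string[] {
  return lines.filter((line) => parseLogLine(line).ledgerId === ledgerId);
}

// Illustrative input only; real lines may carry different fields.
const sample = [
  'phase=FETCH_START ledgerId=abc123 attemptCount=1',
  'phase=FAILED ledgerId=abc123 errorCode=FETCH_DATA_ERROR msg="source fetch failed"',
  'phase=COMPLETED ledgerId=zzz999 attemptCount=1',
];
console.log(linesForLedger(sample, "abc123").length);
```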

3. Security

| Concern | Mitigation |
| --- | --- |
| AuthN | JWT (ES256, JWKS from identity), via `VerifierApplication` |
| AuthZ | Casbin via `PolicyDefinition` (Redis-cached); every endpoint calls `assertMerchantAccess` |
| File encryption | AES-256-GCM at rest in S3 (`APP_ENV_LEDGER_ENCRYPTION_KEY`); files served only via authenticated download |
| Download hardening | `Content-Disposition` + `X-Content-Type-Options: nosniff`; decrypt in-memory, no public S3 URLs |
| Secrets | K8s Secret as env (`APP_ENV_LEDGER_ENCRYPTION_KEY`, Kafka SASL, S3 keys) |
| Network policy | Cilium: allow gateway + Kafka + Redis + S3 only |
| Soft-delete | `deletedAt`: no hard-delete of ledgers |
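
To make the file-encryption row concrete, here is a minimal AES-256-GCM round trip using Node's built-in `node:crypto` module. It is a sketch under stated assumptions, not the service's implementation: the stored layout (12-byte IV, then 16-byte auth tag, then ciphertext) and the function names are invented for illustration, and a random key stands in for the value of APP_ENV_LEDGER_ENCRYPTION_KEY, whose encoding is not documented here.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const IV_LENGTH = 12; // 96-bit nonce, the size recommended for GCM
const TAG_LENGTH = 16; // default GCM authentication tag size

// Assumed blob layout: IV | auth tag | ciphertext.
function encryptFile(plaintext: Buffer, key: Buffer): Buffer {
  const iv = randomBytes(IV_LENGTH); // fresh nonce for every file
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
}

// Decrypts fully in memory; throws if the blob or the key is wrong.
function decryptFile(blob: Buffer, key: Buffer): Buffer {
  const iv = blob.subarray(0, IV_LENGTH);
  const tag = blob.subarray(IV_LENGTH, IV_LENGTH + TAG_LENGTH);
  const ciphertext = blob.subarray(IV_LENGTH + TAG_LENGTH);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}

// Stand-in 32-byte key; a real deployment would load it from the K8s Secret.
const demoKey = randomBytes(32);
const stored = encryptFile(Buffer.from("ledger contents"), demoKey);
console.log(decryptFile(stored, demoKey).toString("utf8"));
```

Because GCM is authenticated, a tampered object fails in `decipher.final()` instead of yielding corrupted plaintext, which fits the decrypt-in-memory download path described above.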

4. Runbook

4.1 Alert classes

| Alert | Trigger | Check | Fix | Escalate |
| --- | --- | --- | --- | --- |
| `ledgerJobRejectedSpike` | `LedgerJob.status=REJECTED` rate up | logs `FAILED`/`errorCode` | retry per error code (below) | on-call backend |
| `ledgerStalledJobs` | many jobs reset by recovery sweep | `[RecoveryComponent] Found N stalled job` | check worker health / `APP_ENV_JOB_TIMEOUT_MS` | on-call backend |
| `ledgerConsumerLag` | `ledger.generate` lag grows | consumer logs / Kafka lag | scale workers / `APP_ENV_KAFKA_CONSUMER_COUNT` | on-call SRE |
| `ledgerWsNotReady` | repeated WS emitter not ready | worker WS Redis connection | check `APP_ENV_WEBSOCKET_REDIS_*` | on-call SRE |

4.2 Failure codes (failureReason.errorCode)

| Code | Meaning | Action |
| --- | --- | --- |
| `FETCH_DATA_ERROR` | Source data fetch/parse failed (Zod) | inspect source data; manual retry |
| `JOB_EXECUTION_FAILED` | Generic pipeline error (render/upload) | check logs; retry |
| `ENQUEUE_FAILED` | Kafka producer could not publish | check broker/SASL; retry |
| `JOB_IN_PROGRESS` | Retry/regenerate while PENDING/PROCESSING | wait for completion |
| `JOB_NOT_READY` | Download before COMPLETED | wait/retry generation |
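
For tooling that triages failures, the table above can be carried as a lookup. The sketch below copies the codes and actions from the table; the `actionFor` helper and its fallback text for unrecognized codes are assumptions added for illustration.

```typescript
// Codes and actions mirror the failure-code table; nothing here is read from the service.
const FAILURE_ACTIONS = new Map<string, string>([
  ["FETCH_DATA_ERROR", "inspect source data; manual retry"],
  ["JOB_EXECUTION_FAILED", "check logs; retry"],
  ["ENQUEUE_FAILED", "check broker/SASL; retry"],
  ["JOB_IN_PROGRESS", "wait for completion"],
  ["JOB_NOT_READY", "wait/retry generation"],
]);

// Hypothetical fallback for codes the table does not list.
const UNKNOWN_ACTION = "unrecognized code; check logs and escalate";

function actionFor(errorCode: string): string {
  return FAILURE_ACTIONS.get(errorCode) ?? UNKNOWN_ACTION;
}

console.log(actionFor("ENQUEUE_FAILED"));
```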

4.3 Common operations

| Operation | Command / Action |
| --- | --- |
| Tail worker logs | `kubectl logs -n <ns> -f deploy/ledger-worker` |
| Manual retry | `POST /v1/api/ledger/ledgers/:id/retry` |
| Regenerate DRAFT | `POST /v1/api/ledger/ledgers/:id/regenerate` |
| Force stalled recovery | wait for sweep (`APP_ENV_SWEEP_INTERVAL_MS`) or restart worker (initial sweep on boot) |
| Run migration | `RUN_MODE=migrate` job / `bun run migrate:dev` (dev) |

Re-processing is never automatic — a committed Kafka message is not replayed. Use retry/regenerate or the recovery sweep.
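
A manual retry or regenerate can be scripted against the endpoints listed in 4.3. The sketch below builds the path from those routes and sends a POST with the global `fetch` (Node 18+ or Bun). The gateway base URL, the Bearer authorization scheme, and both function names are assumptions; this page only states that endpoints require a JWT.

```typescript
type LedgerAction = "retry" | "regenerate";

// Builds the POST target from the routes in the operations table.
function ledgerActionUrl(baseUrl: string, ledgerId: string, action: LedgerAction): string {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/v1/api/ledger/ledgers/${encodeURIComponent(ledgerId)}/${action}`;
}

// Sends the request and returns the HTTP status for the caller to interpret.
// Assumption: the JWT is passed as a Bearer token.
async function triggerLedgerAction(
  baseUrl: string,
  ledgerId: string,
  action: LedgerAction,
  token: string,
): Promise<number> {
  const response = await fetch(ledgerActionUrl(baseUrl, ledgerId, action), {
    method: "POST",
    headers: { Authorization: `Bearer ${token}` },
  });
  return response.status;
}

// Placeholder host; substitute the real gateway address.
console.log(ledgerActionUrl("https://gateway.example.com/", "12345", "retry"));
```

A retry issued while the job is still PENDING or PROCESSING is expected to be refused with JOB_IN_PROGRESS (see 4.2), so callers should treat that as "wait" rather than as a failure.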

Proprietary and Confidential. Unauthorized copying, distribution, or use of this software is strictly prohibited.