ADR-0001. Kafka-driven async ledger generation via a self-loop topic
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-03-30 |
| Deciders | ledger-team |
| Supersedes | — |
Context
- Generating a ledger is slow and bursty: data fetch + Typst PDF render + ExcelJS XLSX render + AES encryption + S3 upload, easily seconds per document, multiplied across a full-year batch.
- A synchronous HTTP request cannot be held open that long, and a request that crashes midway would leave partial files in S3 and an ambiguous job status.
- Generation must be retriable and survive worker crashes mid-pipeline.
Decision
We will decouple enqueue from execution with a single Kafka topic ledger.generate that the service both produces to (api role) and consumes from (worker role). The HTTP request returns immediately with a LedgerJob in PENDING; the worker executes handleGeneration(ledgerId) and reports progress via WebSocket.
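
The self-loop can be pictured with a minimal sketch in which an in-memory array stands in for the ledger.generate topic. Everything here except the names LedgerJob, handleGeneration, and the PENDING status is hypothetical (InMemoryTopic, enqueueGeneration, drainOnce); a real deployment would use a Kafka client rather than this stand-in.

```typescript
// In-memory stand-in for the ledger.generate topic (assumption: illustration only).
type JobStatus = "PENDING" | "PROCESSING" | "COMPLETED" | "REJECTED";

interface LedgerJob {
  ledgerId: string;
  status: JobStatus;
}

class InMemoryTopic {
  private readonly messages: string[] = [];

  produce(ledgerId: string): void {
    this.messages.push(ledgerId);
  }

  // Returns the next message, or undefined when the topic is drained.
  poll(): string | undefined {
    return this.messages.shift();
  }
}

const topic = new InMemoryTopic();
const jobs = new Map<string, LedgerJob>();

// api role: record the job, publish its id, and return without rendering anything.
function enqueueGeneration(ledgerId: string): LedgerJob {
  const job: LedgerJob = { ledgerId, status: "PENDING" };
  jobs.set(ledgerId, job);
  topic.produce(ledgerId);
  return job;
}

// worker role: the same service consumes what it produced.
function drainOnce(handleGeneration: (ledgerId: string) => void): number {
  let handled = 0;
  for (let id = topic.poll(); id !== undefined; id = topic.poll()) {
    handleGeneration(id);
    handled += 1;
  }
  return handled;
}
```

The point of the sketch is only that the HTTP path ends at produce, and all slow work happens behind poll.
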
The consumer runs with autocommit: false and commits an offset only after upload and finalize succeed. Job state is tracked separately by a LedgerJob state machine (PENDING → PROCESSING → COMPLETED|REJECTED); a worker claims a job via an atomic conditional UPDATE.
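
A rough sketch of the claim and the commit ordering follows. The JobStore class, its transition method, and processMessage are hypothetical names, and the in-memory map only models a conditional UPDATE of the form `UPDATE ... SET status = :to WHERE id = :id AND status = :from`; the actual table and query are not shown in this ADR.

```typescript
type JobStatus = "PENDING" | "PROCESSING" | "COMPLETED" | "REJECTED";

// Hypothetical store; a Map models the job table for illustration.
class JobStore {
  private readonly rows = new Map<string, JobStatus>();

  insert(ledgerId: string): void {
    this.rows.set(ledgerId, "PENDING");
  }

  // Models a conditional UPDATE: the row changes only if it is still in `from`.
  // Returns the number of affected rows (0 or 1), as a SQL driver would report.
  transition(ledgerId: string, from: JobStatus, to: JobStatus): number {
    if (this.rows.get(ledgerId) !== from) return 0;
    this.rows.set(ledgerId, to);
    return 1;
  }

  statusOf(ledgerId: string): JobStatus | undefined {
    return this.rows.get(ledgerId);
  }
}

// Commit happens strictly after generation succeeds. If `generate` throws,
// the error propagates and `commitOffset` is never called.
function processMessage(
  store: JobStore,
  ledgerId: string,
  generate: () => void,
  commitOffset: () => void,
): void {
  if (store.transition(ledgerId, "PENDING", "PROCESSING") === 0) {
    return; // lost the claim; offset handling for this case is outside the sketch
  }
  generate(); // stands in for fetch, render, encrypt, upload, finalize
  store.transition(ledgerId, "PROCESSING", "COMPLETED");
  commitOffset();
}
```

Because the claim is a single conditional write, two workers that receive the same ledgerId cannot both move it to PROCESSING; the second one sees zero affected rows.
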
Consequences
| Pros | Cons |
|---|---|
| Fast HTTP response; long render off the request path | Eventual consistency — client must poll/subscribe for status |
| Idempotent enqueue on (merchantId, type, period) | A committed message is never auto-replayed; recovery needs explicit retry or the stall sweep |
| Horizontal scale via consumer count / worker replicas | Operators must understand the self-loop (no external producer/consumer) |
| Crash recovery via RecoveryComponent re-enqueue of stalled jobs | Slightly more moving parts than a BullMQ queue |
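
Two of the properties in the table, idempotent enqueue and the stall sweep, can be sketched as below. This is an illustration under assumptions, not the code in ledger-queue.service.ts or recovery.component.ts: the function names, the in-memory map, the updatedAtMs field, and the stall threshold parameter are all hypothetical.

```typescript
type JobStatus = "PENDING" | "PROCESSING" | "COMPLETED" | "REJECTED";

interface EnqueueRequest {
  merchantId: string;
  type: string;
  period: string;
}

// Serialising the tuple as JSON avoids collisions from delimiter characters in the fields.
function idempotencyKey(req: EnqueueRequest): string {
  return JSON.stringify([req.merchantId, req.type, req.period]);
}

const ledgerIdByKey = new Map<string, string>();
let nextId = 1;

// A repeated request for the same (merchantId, type, period) returns the existing job.
function enqueueOnce(req: EnqueueRequest): { ledgerId: string; created: boolean } {
  const key = idempotencyKey(req);
  const existing = ledgerIdByKey.get(key);
  if (existing !== undefined) return { ledgerId: existing, created: false };
  const ledgerId = `ledger-${nextId++}`;
  ledgerIdByKey.set(key, ledgerId);
  return { ledgerId, created: true };
}

interface JobRow {
  ledgerId: string;
  status: JobStatus;
  updatedAtMs: number; // hypothetical last-progress timestamp
}

// Stall sweep selection: jobs stuck in PROCESSING longer than the threshold
// are candidates for re-enqueue.
function findStalled(rows: JobRow[], nowMs: number, stallAfterMs: number): string[] {
  return rows
    .filter((r) => r.status === "PROCESSING" && nowMs - r.updatedAtMs > stallAfterMs)
    .map((r) => r.ledgerId);
}
```

In a database-backed version the same idea would typically rest on a unique constraint over the three columns, so that concurrent enqueues cannot both insert.
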
Alternatives Considered
| Option | Pros | Cons | Why rejected |
|---|---|---|---|
| Synchronous HTTP generation | Simplest | Long-held connections, no crash recovery, partial files | Unworkable for batch/large ledgers |
| BullMQ queue | Built-in retries/backoff | Another infra surface; Kafka already in the stack | Reused existing Kafka rather than add Redis-queue semantics |
| Auto-replay on consume failure | Self-healing | Risk of poison-message storms on deterministic parse failures | Manual retry + bounded stall sweep is safer |
References
- ledger/src/services/ledger-queue.service.ts (handleEnqueueGeneration)
- ledger/src/services/ledger-worker.service.ts (handleGeneration)
- ledger/src/components/kafka.component.ts (consumer autocommit: false)
- ledger/src/components/recovery.component.ts (stall sweep)
- Generation Pipeline