ADR-0003. SLA enforcement via repeatable cron monitor + escalation worker
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-04-10 |
| Deciders | support-team |
| Supersedes | — |
Context
- Each ticket has two deadlines (first-response, resolution) computed from its `SlaPolicy`. The service must detect approaching deadlines (warning), breaches, and critical breaches, then notify and escalate.
- Deadline detection is inherently time-driven: nothing in the request path tells the system "this ticket just crossed 75% of its SLA." The options are per-ticket delayed jobs scheduled at each threshold, or a periodic sweep.
- Escalation (Level 1→2→3) must be reliable and retryable, and must take priority over routine work.
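
To make the time-driven nature of detection concrete, the following is a minimal sketch of how a sweep might turn a deadline into a percentage of the SLA window and classify it. The function names, the `SlaLevel` type, and the mapping of 75/90 to warning, 100 to breach, and 150 to critical breach are assumptions for illustration; only the threshold values themselves come from this ADR.

```typescript
// Hypothetical names and shapes; the real use case and entities may differ.
type SlaLevel = 'ok' | 'warning' | 'breach' | 'critical';

// Percent-of-window thresholds, mirroring the 75/90/100/150 values in this ADR.
const THRESHOLDS = [75, 90, 100, 150] as const;

/** Percent of the SLA window consumed at `now`; exceeds 100 once the deadline has passed. */
function percentElapsed(startedAt: Date, deadline: Date, now: Date): number {
  const windowMs = deadline.getTime() - startedAt.getTime();
  if (windowMs <= 0) return Number.POSITIVE_INFINITY; // degenerate window: treat as overdue
  return ((now.getTime() - startedAt.getTime()) / windowMs) * 100;
}

/** Highest threshold reached so far, or null if none has been reached. */
function highestThresholdCrossed(pct: number): number | null {
  let crossed: number | null = null;
  for (const t of THRESHOLDS) {
    if (pct >= t) crossed = t;
  }
  return crossed;
}

/** Assumed mapping: 75 and 90 are warnings, 100 is a breach, 150 is a critical breach. */
function classify(pct: number): SlaLevel {
  const t = highestThresholdCrossed(pct);
  if (t === null) return 'ok';
  if (t >= 150) return 'critical';
  if (t >= 100) return 'breach';
  return 'warning';
}
```

Because the result depends only on the clock, the same ticket can change level between two sweeps without any request touching it, which is why some scheduled mechanism is required.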
Decision
Use a repeatable BullMQ cron job plus a dedicated escalation worker:
- `QueueComponent.scheduleSlaMonitoring()` registers a repeatable job on the `helpdesk.sla-monitor` queue with `pattern: WORKER_CONFIG.SLA_MONITOR_INTERVAL` (every minute) and a fixed `jobId: 'sla-monitor-cron'` so duplicates can't accumulate. On boot it removes stale repeatables first.
- `sla-monitor.worker` (concurrency 1) runs `RunSlaMonitorUseCase`, scanning `SlaTracker` rows in batches (`SLA_BATCH_SIZE = 100`) for warning/breach against `SLA_WARNING_THRESHOLDS` (75/90/100/150).
- Breaches enqueue notification jobs and escalation jobs on `helpdesk.escalation` (concurrency 5, priority 1, 3 retries with exponential backoff), consumed by `escalation.worker` → `ProcessEscalationUseCase`.
- Manual checks are possible via `QueueComponent.triggerSlaCheck({ ticketId })` at HIGH priority.
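
The job options described above might look roughly like the sketch below. It uses local stand-in types rather than importing BullMQ so that it runs on its own; the field names (`jobId`, `repeat.pattern`, `priority`, `attempts`, `backoff`) follow BullMQ's job options, but the cron string, the 1-second base delay, and the doubling formula in `exponentialDelayMs` are assumptions, not values taken from `WORKER_CONFIG`.

```typescript
// Local stand-ins for the subset of BullMQ job options used in this ADR.
interface BackoffSketch {
  type: 'exponential' | 'fixed';
  delay: number; // base delay in milliseconds
}

interface JobOptionsSketch {
  jobId?: string;
  priority?: number;
  attempts?: number;
  backoff?: BackoffSketch;
  repeat?: { pattern: string };
}

// Assumed values: a cron pattern meaning "every minute" and a 1 s base delay.
const SLA_MONITOR_INTERVAL = '* * * * *';
const ASSUMED_BACKOFF_BASE_MS = 1_000;

// Repeatable monitor job: the fixed jobId keeps restarts from registering duplicates.
const monitorJobOptions: JobOptionsSketch = {
  jobId: 'sla-monitor-cron',
  repeat: { pattern: SLA_MONITOR_INTERVAL },
};

// Escalation job: prioritized and retried with exponential backoff.
const escalationJobOptions: JobOptionsSketch = {
  priority: 1,
  attempts: 3,
  backoff: { type: 'exponential', delay: ASSUMED_BACKOFF_BASE_MS },
};

/** Assumed doubling schedule: base * 2^(attemptsMade - 1). */
function exponentialDelayMs(baseMs: number, attemptsMade: number): number {
  return Math.round(baseMs * 2 ** (attemptsMade - 1));
}
```

Note that BullMQ's `attempts` counts total attempts, so `attempts: 3` reads "3 retries" as three tries overall; if three retries after the first failure are intended, the value would need to be 4.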
Consequences
| Pros | Cons |
|---|---|
| One cron sweep handles all tickets — no per-ticket scheduling sprawl | Up to ~1 minute detection latency (the cron interval) |
| Single fixed `jobId` prevents duplicate repeatables across restarts | Sweep cost scales with open-ticket volume (mitigated by batching) |
| Escalation isolated on its own high-priority queue with retries | Concurrency-1 monitor is a throughput ceiling for very large tenants |
| Manual trigger path for targeted re-checks | Worker process must be running, or SLA enforcement silently stops |
Note: Level-2+ senior-agent reassignment is currently disabled in code (commented-out `assignTicketUseCase` call), so the escalation worker only sends notifications. See Operations → Known Issues.
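
The batching mitigation listed in the table can be illustrated with a small keyset-pagination loop. This is a sketch under stated assumptions: `TrackerRow`, `FetchBatch`, `sweep`, and `inMemoryFetcher` are hypothetical names, and the real `RunSlaMonitorUseCase` may page through `SlaTracker` rows differently.

```typescript
// Hypothetical row shape and fetch signature; the real repository API is not shown in this ADR.
interface TrackerRow {
  id: number;
  percentElapsed: number;
}

type FetchBatch = (afterId: number, limit: number) => Promise<TrackerRow[]>;

const SLA_BATCH_SIZE = 100;

/**
 * Visits every row in ascending id order, loading at most SLA_BATCH_SIZE rows
 * at a time. Returns the number of non-empty batches fetched.
 */
async function sweep(
  fetchBatch: FetchBatch,
  onRow: (row: TrackerRow) => void,
): Promise<number> {
  let afterId = 0;
  let batches = 0;
  for (;;) {
    const rows = await fetchBatch(afterId, SLA_BATCH_SIZE);
    if (rows.length === 0) break;
    batches += 1;
    for (const row of rows) {
      onRow(row);
      afterId = row.id; // keyset cursor: resume after the last id seen
    }
    if (rows.length < SLA_BATCH_SIZE) break; // short batch means no more rows
  }
  return batches;
}

/** In-memory stand-in for the database; expects rows sorted by ascending id. */
function inMemoryFetcher(all: TrackerRow[]): FetchBatch {
  return async (afterId, limit) =>
    all.filter((r) => r.id > afterId).slice(0, limit);
}
```

Batching bounds memory per fetch, but total work per sweep still grows with the number of open tickets, which is the cost the table calls out.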
Alternatives Considered
| Option | Why rejected |
|---|---|
| Per-ticket delayed jobs at each threshold | Explosion of scheduled jobs; messy to reschedule on policy/priority change |
| External cron / k8s CronJob hitting an endpoint | Adds infra coupling; BullMQ repeatable keeps scheduling in-app and observable |
| DB trigger / pg_cron | Pushes business logic into the database; hard to test and observe |
References
- `src/components/queue.component.ts` (`scheduleSlaMonitoring`, `triggerSlaCheck`)
- `src/components/workers/sla-monitor.worker.ts`, `escalation.worker.ts`
- `src/application/use-cases/sla-policy/run-sla-monitor.use-case.ts`, `process-escalation.use-case.ts`
- `src/shared/common/constants/common.constant.ts` (`WORKER_CONFIG`, `SLA_WARNING_THRESHOLDS`, `ESCALATION_TIMING`)