ADR-0003. SLA enforcement via repeatable cron monitor + escalation worker

Field	Value
Status	Accepted
Date	2026-04-10
Deciders	Phat Nguyen
Supersedes	-

Context

Each ticket has two deadlines (first-response, resolution) computed from its SlaPolicy. The service must detect approaching deadlines (warning), breaches, and critical breaches, then notify and escalate.
Deadline detection is inherently time-driven: nothing in the request path tells the system "this ticket just crossed 75% of its SLA." Options are per-ticket delayed jobs scheduled at each threshold, or a periodic sweep.
Escalation (Level 1→2→3) must be reliable and retryable, and must take priority over routine work.
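To make the detection problem concrete, the sketch below classifies a single deadline against the thresholds named later in this ADR (75/90/100/150). It assumes the thresholds are percentages of the elapsed SLA window, which matches the "crossed 75% of its SLA" wording above. The `SlaWindow` shape and both function names are hypothetical and are not taken from the codebase.

```typescript
// Percentages of the SLA window, as listed for SLA_WARNING_THRESHOLDS in this ADR.
const SLA_WARNING_THRESHOLDS = [75, 90, 100, 150] as const;

type SlaThreshold = (typeof SLA_WARNING_THRESHOLDS)[number];

// Hypothetical shape: the real SlaTracker entity is not shown in this ADR.
interface SlaWindow {
  startedAt: Date;
  deadline: Date;
}

/** Percentage of the SLA window elapsed at `now`; exceeds 100 after a breach. */
function elapsedPercent(sla: SlaWindow, now: Date): number {
  const total = sla.deadline.getTime() - sla.startedAt.getTime();
  if (total <= 0) {
    throw new RangeError('deadline must be after startedAt');
  }
  return ((now.getTime() - sla.startedAt.getTime()) / total) * 100;
}

/** Highest threshold already crossed, or null while the ticket is below 75%. */
function highestCrossedThreshold(sla: SlaWindow, now: Date): SlaThreshold | null {
  const pct = elapsedPercent(sla, now);
  let crossed: SlaThreshold | null = null;
  // Thresholds are in ascending order, so the last match is the highest one.
  for (const threshold of SLA_WARNING_THRESHOLDS) {
    if (pct >= threshold) {
      crossed = threshold;
    }
  }
  return crossed;
}
```

For a window running from 08:00 to 12:00, a check at 11:00 returns 75 and a check at 14:00 returns 150. Nothing calls this function on its own when a threshold is crossed, which is why some scheduler has to.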

Decision

Use a repeatable BullMQ cron job plus a dedicated escalation worker:

QueueComponent.scheduleSlaMonitoring() registers a repeatable job on the helpdesk.sla-monitor queue with pattern: WORKER_CONFIG.SLA_MONITOR_INTERVAL (every minute) and a fixed jobId: 'sla-monitor-cron' so duplicates can't accumulate. On boot it removes stale repeatables first.
sla-monitor.worker (concurrency 1) runs RunSlaMonitorUseCase, scanning SlaTracker rows in batches (SLA_BATCH_SIZE = 100) for warning/breach against SLA_WARNING_THRESHOLDS (75/90/100/150).
Breaches enqueue notification jobs and escalation jobs on helpdesk.escalation (concurrency 5, priority 1, 3 retries with exponential backoff), consumed by escalation.worker → ProcessEscalationUseCase.
Manual checks are possible via QueueComponent.triggerSlaCheck({ ticketId }) at HIGH priority.
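The registration and retry settings described above could look roughly like the sketch below. It is not the project's `QueueComponent`. The `RepeatableQueueLike` interface is a locally declared subset of the BullMQ `Queue` methods `add`, `getRepeatableJobs` and `removeRepeatableByKey`, so the example type-checks without the package. The cron string, the job name `'sla-monitor'`, the backoff base delay, and the choice to remove every existing repeatable on boot are all assumptions.

```typescript
// Minimal structural subset of a BullMQ Queue, declared locally for this sketch.
interface RepeatableQueueLike {
  getRepeatableJobs(): Promise<Array<{ key: string }>>;
  removeRepeatableByKey(key: string): Promise<boolean>;
  add(name: string, data: unknown, opts: object): Promise<unknown>;
}

// Assumed value: the ADR only says the interval is "every minute".
const SLA_MONITOR_INTERVAL = '* * * * *';

/** Options for the repeatable monitor job: cron pattern plus a fixed jobId. */
function slaMonitorJobOptions() {
  return {
    jobId: 'sla-monitor-cron',
    repeat: { pattern: SLA_MONITOR_INTERVAL },
  };
}

/** Options for escalation jobs: priority 1, retried with exponential backoff. */
function escalationJobOptions() {
  return {
    priority: 1,
    // BullMQ's `attempts` counts total tries; mapping the ADR's "3 retries"
    // to attempts: 3 is an assumption.
    attempts: 3,
    backoff: { type: 'exponential', delay: 1000 }, // base delay in ms is assumed
  };
}

/** Clear existing repeatables on the queue, then register the monitor job. */
async function scheduleSlaMonitoring(queue: RepeatableQueueLike): Promise<void> {
  // Assumption: every repeatable found on this queue at boot counts as stale.
  for (const job of await queue.getRepeatableJobs()) {
    await queue.removeRepeatableByKey(job.key);
  }
  await queue.add('sla-monitor', {}, slaMonitorJobOptions());
}
```

Removing first and then adding under a fixed `jobId` keeps the boot path idempotent: running it twice should leave a single repeatable entry.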

Consequences

Pros	Cons
One cron sweep handles all tickets - no per-ticket scheduling sprawl	Up to ~1 minute detection latency (the cron interval)
Single fixed `jobId` prevents duplicate repeatables across restarts	Sweep cost scales with open-ticket volume (mitigated by batching)
Escalation isolated on its own high-priority queue with retries	Concurrency-1 monitor is a throughput ceiling for very large tenants
Manual trigger path for targeted re-checks	Worker process must be running, or SLA enforcement silently stops
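The batching mitigation mentioned in the table can be illustrated with a cursor-based sweep. This is a simplified, synchronous sketch: the real use case presumably queries the database asynchronously. The `SlaTrackerRow` shape, the `FetchBatch` signature and the assumption of positive, ascending numeric ids are all hypothetical.

```typescript
const SLA_BATCH_SIZE = 100;

// Hypothetical row shape; the real SlaTracker columns are not listed in this ADR.
interface SlaTrackerRow {
  id: number;
  ticketId: string;
}

// Hypothetical data-access function: up to `limit` rows with id > afterId, ordered by id.
type FetchBatch = (afterId: number, limit: number) => SlaTrackerRow[];

/** Walks all rows in pages of `batchSize` and returns the number of batches processed. */
function sweepInBatches(
  fetchBatch: FetchBatch,
  onBatch: (rows: SlaTrackerRow[]) => void,
  batchSize: number = SLA_BATCH_SIZE,
): number {
  let cursor = 0; // assumes ids are positive
  let batches = 0;
  for (;;) {
    const rows = fetchBatch(cursor, batchSize);
    const last = rows[rows.length - 1];
    if (last === undefined) break; // no rows left
    onBatch(rows);
    batches += 1;
    cursor = last.id;
    if (rows.length < batchSize) break; // a short page means the end was reached
  }
  return batches;
}
```

Memory use per sweep stays bounded by the batch size, but the number of batches still grows linearly with open tickets, and with a concurrency of 1 each sweep has to finish inside the cron interval to avoid falling behind.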

Note: Level-2+ senior-agent reassignment is currently disabled in code (the assignTicketUseCase call is commented out), so the escalation worker currently only sends notifications. See Operations → Known Issues.

Alternatives Considered

Option	Why rejected
Per-ticket delayed jobs at each threshold	Explosion of scheduled jobs; messy to reschedule on policy/priority change
External cron / k8s CronJob hitting an endpoint	Adds infra coupling; BullMQ repeatable keeps scheduling in-app and observable
DB trigger / pg_cron	Pushes business logic into the database; hard to test and observe
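As a rough illustration of the "explosion" in the first row, the arithmetic below counts the delayed jobs the per-ticket approach would keep scheduled. It assumes one delayed job per threshold for each of the two deadlines, which is a reading of the option rather than a measured figure; the function name is hypothetical.

```typescript
const DEADLINES_PER_TICKET = 2; // first-response and resolution
const THRESHOLDS_PER_DEADLINE = 4; // 75, 90, 100, 150

/** Delayed jobs needed if every threshold of every deadline gets its own job. */
function delayedJobsNeeded(openTickets: number): number {
  return openTickets * DEADLINES_PER_TICKET * THRESHOLDS_PER_DEADLINE;
}
```

Under that assumption 10,000 open tickets imply 80,000 scheduled jobs, each of which would need rescheduling when a policy or priority changes, against a single repeatable job for the sweep.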

References

src/components/queue.component.ts (scheduleSlaMonitoring, triggerSlaCheck)
src/components/workers/sla-monitor.worker.ts, escalation.worker.ts
src/application/use-cases/sla-policy/run-sla-monitor.use-case.ts, process-escalation.use-case.ts
src/shared/common/constants/common.constant.ts (WORKER_CONFIG, SLA_WARNING_THRESHOLDS, ESCALATION_TIMING)
