ADR-0003. SLA enforcement via repeatable cron monitor + escalation worker

| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-04-10 |
| Deciders | support-team |
| Supersedes | |

Context

  • Each ticket has two deadlines (first-response, resolution) computed from its SlaPolicy. The service must detect approaching deadlines (warning), breaches, and critical breaches, then notify and escalate.
  • Deadline detection is inherently time-driven: nothing in the request path tells the system "this ticket just crossed 75% of its SLA." Options are per-ticket delayed jobs scheduled at each threshold, or a periodic sweep.
  • Escalation (Level 1→2→3) must be reliable and retryable, and must run at higher priority than routine work.
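
To make the threshold idea concrete, here is a minimal sketch of classifying a deadline by the share of its SLA window that has elapsed. The threshold values come from this ADR; the function name classifySla, the SlaStatus shape, and the mapping of 75/90 to warning, 100 to breach, and 150 to critical breach are assumptions for illustration, not the service's actual code.

```typescript
// Thresholds as percentages of the SLA window elapsed (values from this ADR).
const SLA_WARNING_THRESHOLDS = [75, 90, 100, 150] as const;

type SlaLevel = "ok" | "warning" | "breach" | "critical";

interface SlaStatus {
  percentElapsed: number;
  crossed: number | null; // highest threshold reached so far, or null if none
  level: SlaLevel;
}

// Assumed mapping: 75/90 -> warning, 100 -> breach, 150 -> critical.
function classifySla(startedAt: Date, deadline: Date, now: Date): SlaStatus {
  const windowMs = deadline.getTime() - startedAt.getTime();
  if (windowMs <= 0) {
    throw new Error("deadline must be after startedAt");
  }
  const percentElapsed = ((now.getTime() - startedAt.getTime()) / windowMs) * 100;

  // Keep the highest threshold that has been reached.
  let crossed: number | null = null;
  for (const threshold of SLA_WARNING_THRESHOLDS) {
    if (percentElapsed >= threshold) crossed = threshold;
  }

  const level: SlaLevel =
    crossed === null ? "ok"
    : crossed >= 150 ? "critical"
    : crossed >= 100 ? "breach"
    : "warning";

  return { percentElapsed, crossed, level };
}

// A 4-hour window with 3 hours elapsed is at 75%.
const example = classifySla(
  new Date("2026-04-10T08:00:00Z"),
  new Date("2026-04-10T12:00:00Z"),
  new Date("2026-04-10T11:00:00Z"),
);
console.log(example.level, example.crossed); // warning 75
```

Nothing calls this function when a ticket crosses a threshold; something has to evaluate it against the clock, which is why the choice is between scheduled per-ticket jobs and a periodic sweep.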

Decision

Use a repeatable BullMQ cron job plus a dedicated escalation worker:

  1. QueueComponent.scheduleSlaMonitoring() registers a repeatable job on the helpdesk.sla-monitor queue with pattern: WORKER_CONFIG.SLA_MONITOR_INTERVAL (every minute) and a fixed jobId: 'sla-monitor-cron' so duplicates can't accumulate. On boot it removes stale repeatables first.
  2. sla-monitor.worker (concurrency 1) runs RunSlaMonitorUseCase, scanning SlaTracker rows in batches (SLA_BATCH_SIZE = 100) and checking each for warning or breach against SLA_WARNING_THRESHOLDS (75/90/100/150, as a percentage of the SLA window elapsed).
  3. Breaches enqueue notification jobs and escalation jobs on helpdesk.escalation (concurrency 5, priority 1, 3 retries with exponential backoff), consumed by escalation.worker, which runs ProcessEscalationUseCase.
  4. Manual checks are possible via QueueComponent.triggerSlaCheck({ ticketId }) at HIGH priority.
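
The sweep in steps 2 and 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in run-sla-monitor.use-case.ts: the TrackerRow shape, the LoadPage loader, the offset-based paging, and the runSweep name are hypothetical, and notification jobs are left out. The option names priority, attempts, and backoff follow BullMQ's job options; the backoff delay of 1000 ms and the reading of "3 retries" as attempts: 3 are assumed.

```typescript
const SLA_BATCH_SIZE = 100; // batch size stated in this ADR

interface TrackerRow {
  ticketId: string;
  percentElapsed: number; // share of the SLA window already used
}

interface EscalationJob {
  queue: "helpdesk.escalation";
  data: { ticketId: string };
  opts: {
    priority: number;
    attempts: number;
    backoff: { type: "exponential"; delay: number };
  };
}

// Hypothetical loader: returns up to `limit` rows starting at `offset`.
type LoadPage = (offset: number, limit: number) => Promise<TrackerRow[]>;

async function runSweep(
  loadPage: LoadPage,
  enqueue: (job: EscalationJob) => Promise<void>,
): Promise<{ scanned: number; escalated: number }> {
  let offset = 0;
  let scanned = 0;
  let escalated = 0;

  for (;;) {
    const rows = await loadPage(offset, SLA_BATCH_SIZE);
    for (const row of rows) {
      scanned++;
      // 100% or more of the window elapsed counts as a breach.
      if (row.percentElapsed >= 100) {
        await enqueue({
          queue: "helpdesk.escalation",
          data: { ticketId: row.ticketId },
          // Delay value is an assumption; the ADR only states the strategy.
          opts: { priority: 1, attempts: 3, backoff: { type: "exponential", delay: 1000 } },
        });
        escalated++;
      }
    }
    // A short page means there are no more rows to scan.
    if (rows.length < SLA_BATCH_SIZE) break;
    offset += rows.length;
  }

  return { scanned, escalated };
}
```

Passing the loader and the enqueue function as parameters keeps the sketch testable without Redis or a database. Offset paging is used here only for brevity; rows that change during a sweep could be skipped, so a real implementation might page by key instead.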

Consequences

| Pros | Cons |
|---|---|
| One cron sweep handles all tickets; no per-ticket scheduling sprawl | Up to ~1 minute detection latency (the cron interval) |
| Single fixed jobId prevents duplicate repeatables across restarts | Sweep cost scales with open-ticket volume (mitigated by batching) |
| Escalation isolated on its own high-priority queue with retries | Concurrency-1 monitor is a throughput ceiling for very large tenants |
| Manual trigger path for targeted re-checks | Worker process must be running, or SLA enforcement silently stops |
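
The throughput ceiling can be estimated with simple arithmetic. The sketch below is a back-of-envelope calculation: the batch size and one-minute interval come from this ADR, while estimateSweep and the per-batch time passed to it are hypothetical and would need to be measured in a real deployment.

```typescript
const SLA_BATCH_SIZE = 100; // from this ADR
const SWEEP_INTERVAL_MS = 60000; // cron interval: every minute

// With concurrency 1, batches run one after another, so the sweep
// duration is roughly (number of batches) x (time per batch).
function estimateSweep(openTrackers: number, msPerBatch: number) {
  const batches = Math.ceil(openTrackers / SLA_BATCH_SIZE);
  const durationMs = batches * msPerBatch;
  return { batches, durationMs, fitsInInterval: durationMs <= SWEEP_INTERVAL_MS };
}

// Assuming 50 ms per batch (a made-up figure):
console.log(estimateSweep(50000, 50)); // 500 batches, 25000 ms, fits
console.log(estimateSweep(200000, 50)); // 2000 batches, 100000 ms, does not fit
```

Once a sweep takes longer than the interval, detection latency grows beyond the ~1 minute listed above.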

Note: Level-2+ senior-agent reassignment is currently disabled in code (commented-out assignTicketUseCase call), so the escalation worker only sends notifications. See Operations → Known Issues.

Alternatives Considered

| Option | Why rejected |
|---|---|
| Per-ticket delayed jobs at each threshold | Explosion of scheduled jobs; messy to reschedule on policy/priority change |
| External cron / k8s CronJob hitting an endpoint | Adds infra coupling; BullMQ repeatable keeps scheduling in-app and observable |
| DB trigger / pg_cron | Pushes business logic into the database; hard to test and observe |
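
A quick count shows the scale of the job explosion in the first rejected option. The sketch assumes that all four thresholds apply to both deadlines of every open ticket; scheduledJobCounts is a hypothetical helper, not part of the codebase.

```typescript
const DEADLINES_PER_TICKET = 2; // first-response and resolution
const THRESHOLDS_PER_DEADLINE = 4; // 75/90/100/150

// Compares pending scheduled jobs under the two designs.
function scheduledJobCounts(openTickets: number) {
  const perTicketDelayed = openTickets * DEADLINES_PER_TICKET * THRESHOLDS_PER_DEADLINE;
  return { perTicketDelayed, periodicSweep: 1 };
}

console.log(scheduledJobCounts(10000)); // 80000 delayed jobs vs 1 repeatable job
```

Under the same assumption, a policy or priority change on one ticket would mean removing and re-adding up to eight delayed jobs, while the sweep simply reads the new deadlines on its next run.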

References

  • src/components/queue.component.ts (scheduleSlaMonitoring, triggerSlaCheck)
  • src/components/workers/sla-monitor.worker.ts, escalation.worker.ts
  • src/application/use-cases/sla-policy/run-sla-monitor.use-case.ts, process-escalation.use-case.ts
  • src/shared/common/constants/common.constant.ts (WORKER_CONFIG, SLA_WARNING_THRESHOLDS, ESCALATION_TIMING)

Proprietary and Confidential. Unauthorized copying, distribution, or use of this software is strictly prohibited.