Gateway Resilience
1. Overview
Traefik provides built-in resilience patterns — circuit breaker, rate limiting, retry, and active health checks — without plugins or custom code.
Source: packages/gateway/config/dynamic/middlewares.yml (circuit breaker, rate limiting), infrastructure/deployments/develop/*/docker-compose.yml (health checks)
2. Resilience Flow
3. Circuit Breaker
Configuration
Source: packages/gateway/config/dynamic/middlewares.yml (lines 44–49)
```yaml
circuit-breaker:
  circuitBreaker:
    # 5xx-ratio term currently commented out in source:
    # expression: "ResponseCodeRatio(500, 600, 0, 600) > 0.30 || NetworkErrorRatio() > 0.10 || LatencyAtQuantileMS(95.0) > 3000"
    expression: "NetworkErrorRatio() > 0.10 || LatencyAtQuantileMS(95.0) > 3000"
    checkPeriod: 5s
    fallbackDuration: 15s
    recoveryDuration: 30s
```

Expression Breakdown
| Condition | Trigger | Active? |
|---|---|---|
| `NetworkErrorRatio() > 0.10` | More than 10% of requests result in network errors | Yes |
| `LatencyAtQuantileMS(95.0) > 3000` | P95 latency exceeds 3 seconds | Yes |
| `ResponseCodeRatio(500, 600, 0, 600) > 0.30` | More than 30% of responses are 5xx errors | Commented out in source |
The expression uses OR logic — any single active condition triggers the circuit breaker. The 5xx-ratio term is present but commented out in middlewares.yml.
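
The OR semantics can be illustrated with a small Python sketch. It is not how Traefik evaluates the expression internally: `should_trip` and its parameter names are hypothetical, and only the two thresholds are taken from the active expression.

```python
def should_trip(network_error_ratio: float, p95_latency_ms: float) -> bool:
    """Mirror of the active expression: either condition alone opens the circuit."""
    return network_error_ratio > 0.10 or p95_latency_ms > 3000


# Healthy: 2% network errors, P95 of 800 ms.
print(should_trip(0.02, 800))    # False
# Slow but error-free: the latency term alone trips the breaker.
print(should_trip(0.00, 3500))   # True
# Fast but failing: the network-error term alone trips the breaker.
print(should_trip(0.15, 200))    # True
```

Both comparisons are strict, so a network error ratio of exactly 0.10 or a P95 of exactly 3000 ms does not trip the breaker.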
Timing Parameters
| Parameter | Value | Description |
|---|---|---|
| `checkPeriod` | `5s` | How often the expression is evaluated |
| `fallbackDuration` | `15s` | How long the circuit stays open before attempting recovery |
| `recoveryDuration` | `30s` | How long the recovery phase lasts before fully closing |
State Machine
| State | Behavior |
|---|---|
| Closed | Normal operation — requests pass through |
| Open | All requests fail fast with 503 for 15 seconds |
| Recovering | Limited requests allowed to test recovery over 30 seconds |
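
As a rough timeline, the sketch below computes which state the breaker is in a given number of seconds after it opens. It assumes the expression does not match again during recovery (if it does, the breaker is expected to reopen), and the boundary handling at exactly 15 s and 45 s is an assumption. `state_after_trip` is a hypothetical helper, not a Traefik API.

```python
FALLBACK_S = 15   # fallbackDuration from middlewares.yml
RECOVERY_S = 30   # recoveryDuration from middlewares.yml


def state_after_trip(elapsed_s: float) -> str:
    """State `elapsed_s` seconds after the breaker opened, assuming no re-trip."""
    if elapsed_s < 0:
        raise ValueError("elapsed_s must be non-negative")
    if elapsed_s < FALLBACK_S:
        return "open"
    if elapsed_s < FALLBACK_S + RECOVERY_S:
        return "recovering"
    return "closed"


for t in (0, 14, 15, 44, 45):
    print(t, state_after_trip(t))
# 0 open
# 14 open
# 15 recovering
# 44 recovering
# 45 closed
```

With the configured values, a tripped circuit therefore needs at least 45 seconds (15 s open plus 30 s recovering) before it is fully closed again.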
Available Expression Functions
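| Function | Description |
|---|---|
| `NetworkErrorRatio()` | Ratio of network errors to total requests |
| `ResponseCodeRatio(from, to, dividedByFrom, dividedByTo)` | Number of responses with a status code in `[from, to)` divided by the number with a status code in `[dividedByFrom, dividedByTo)` |
| `LatencyAtQuantileMS(quantile)` | Latency in milliseconds at the given quantile (0-100, written as a float such as `95.0`) |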
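
To make the four arguments of `ResponseCodeRatio` concrete, here is a minimal Python sketch that computes the same kind of ratio over a list of status codes. The half-open ranges (lower bound inclusive, upper bound exclusive) follow my reading of the Traefik documentation; returning 0.0 when the denominator is empty is an assumption, and `response_code_ratio` is a hypothetical helper.

```python
from collections import Counter


def response_code_ratio(codes, frm, to, div_from, div_to):
    """Share of responses with status in [frm, to) among those in [div_from, div_to)."""
    counts = Counter(codes)
    numerator = sum(n for code, n in counts.items() if frm <= code < to)
    denominator = sum(n for code, n in counts.items() if div_from <= code < div_to)
    return numerator / denominator if denominator else 0.0


# 10 responses, 4 of them 5xx.
sample = [200] * 5 + [404] + [500, 502, 503, 503]
ratio = response_code_ratio(sample, 500, 600, 0, 600)
print(ratio)          # 0.4
print(ratio > 0.30)   # True: the commented-out 5xx term would match
```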
4. Rate Limiting
Global Rate Limit
Source: packages/gateway/config/dynamic/middlewares.yml (lines 53–59)
Applied to most services — 200 requests/second per client IP:
```yaml
rate-limit:
  rateLimit:
    average: 200
    burst: 400
    sourceCriterion:
      ipStrategy:
        depth: 1
```

Auth-Specific Rate Limit
Source: packages/gateway/config/dynamic/middlewares.yml (lines 62–69)
Stricter limit for authentication endpoints — 30 requests/minute per client IP:
```yaml
rate-limit-auth:
  rateLimit:
    average: 30
    burst: 60
    period: 1m
    sourceCriterion:
      ipStrategy:
        depth: 1
```

Per-IP Rate Limiting
Both rate limiters use `sourceCriterion.ipStrategy.depth: 1`, which takes the rightmost address in the X-Forwarded-For header as the client IP. This is necessary because Traefik sits behind Nginx: without this setting, all requests would appear to come from Nginx's IP and share a single rate limit bucket.
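
The depth-based selection can be sketched in Python as follows. This is an illustration rather than Traefik's code: `client_ip_at_depth` is a hypothetical helper, and the example assumes Nginx appends the address it saw to X-Forwarded-For (for instance via `$proxy_add_x_forwarded_for`), which the source does not state.

```python
def client_ip_at_depth(x_forwarded_for: str, depth: int) -> str:
    """Pick the depth-th address counted from the right of an X-Forwarded-For value."""
    hops = [h.strip() for h in x_forwarded_for.split(",") if h.strip()]
    if depth < 1 or depth > len(hops):
        return ""  # assumption: no usable address at this depth
    return hops[-depth]


# Nginx saw the client at 203.0.113.7 and forwarded that address.
print(client_ip_at_depth("203.0.113.7", 1))            # 203.0.113.7
# A client-supplied (possibly forged) entry sits further left and is ignored.
print(client_ip_at_depth("10.0.0.1, 203.0.113.7", 1))  # 203.0.113.7
```

Counting from the right means entries supplied by the client itself never become the rate limit key, as long as exactly one trusted proxy sits in front of Traefik.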
Rate Limit Response
When the rate limit is exceeded, Traefik returns 429 Too Many Requests.
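
Traefik's rate limiter is documented as a token bucket in which `average` per `period` is the refill rate and `burst` is the bucket size. The Python sketch below models that idea with the `rate-limit-auth` values. The class, its method names, and the explicit `now` timestamps are hypothetical, and the returned 200 only stands for "request forwarded", not the upstream's actual status.

```python
class TokenBucket:
    """Token bucket: refills at average/period tokens per second, holds at most `burst`."""

    def __init__(self, average: int, burst: int, period_s: float = 1.0):
        self.rate = average / period_s   # tokens added per second
        self.capacity = burst
        self.tokens = float(burst)       # start with a full bucket
        self.last = 0.0                  # time of the previous call, in seconds

    def allow(self, now: float) -> int:
        """Return 200 if a token is available at time `now`, otherwise 429."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 200
        return 429


# rate-limit-auth: average=30, burst=60, period=1m
bucket = TokenBucket(average=30, burst=60, period_s=60)
statuses = [bucket.allow(now=0.0) for _ in range(61)]
print(statuses.count(200), statuses.count(429))  # 60 1
print(bucket.allow(now=2.0))                      # 200: one token refilled after 2 s
```

In this model a single client IP can spend the full burst of 60 requests at once, after which it is limited to one request every 2 seconds until the bucket refills.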
5. Active Health Checks
Source: Docker labels in infrastructure/deployments/develop/*/docker-compose.yml
Traefik periodically checks service health endpoints:
```yaml
# Docker label per service
- "traefik.http.services.identity.loadbalancer.healthcheck.path=/v1/api/identity/health"
- "traefik.http.services.identity.loadbalancer.healthcheck.interval=30s"
```

| Parameter | Value | Description |
|---|---|---|
| `path` | `/v1/api/<service>/health` | Health check endpoint (IGNIS built-in) |
| `interval` | `30s` | Check every 30 seconds |
All 7 backend services (identity, commerce, sale, finance, inventory, payment, signal) have health checks configured at 30-second intervals. If a service fails health checks, Traefik stops routing traffic to it until it recovers.
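
The routing effect can be sketched in Python. The sketch assumes Traefik's documented rule that a probe counts as healthy when it returns a 2xx or 3xx status, and it treats a service with no recorded probe as unhealthy, which is an assumption of this example. `health_path` and `routable` are hypothetical helpers, and the status values are made up for illustration.

```python
SERVICES = ["identity", "commerce", "sale", "finance", "inventory", "payment", "signal"]


def health_path(service: str) -> str:
    """Health endpoint following the /v1/api/<service>/health pattern."""
    return f"/v1/api/{service}/health"


def routable(last_status: dict) -> list:
    """Services whose most recent probe returned a 2xx or 3xx status."""
    return [s for s in SERVICES if 200 <= last_status.get(s, 0) < 400]


print(health_path("identity"))  # /v1/api/identity/health

# Illustrative probe results: payment is failing, everything else is fine.
probes = {s: 200 for s in SERVICES}
probes["payment"] = 503
print(routable(probes))
# ['identity', 'commerce', 'sale', 'finance', 'inventory', 'signal']
```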
6. Retry (Per-Service)
Retries can be configured per-service via Docker labels. No service currently configures retry — this is available for future use:
```yaml
# Example: retry up to 3 times for commerce service
- "traefik.http.middlewares.retry-commerce.retry.attempts=3"
- "traefik.http.middlewares.retry-commerce.retry.initialinterval=100ms"
- "traefik.http.routers.commerce.middlewares=retry-commerce@docker,rate-limit@file,circuit-breaker@file,security-headers@file"
```

7. Related Pages
| Document | Description |
|---|---|
| Gateway Overview | Identity card + service catalog |
| Middlewares | Full middleware definitions from source |
| Observability | Prometheus metrics, access logs, Grafana |
| Operations | Deploy, runbook, alert classes |
| Decisions | ADRs |