Gateway Observability

1. Overview

Traefik provides built-in observability through Prometheus metrics, structured JSON access logs, and a real-time dashboard. No custom code or additional exporters are required.

Source: packages/gateway/config/traefik.yml (metrics + access log config), packages/gateway/config/dynamic/middlewares.yml (dashboard routers), infrastructure/deployments/develop/monitoring/docker-compose.yml (Prometheus + Grafana)

2. Observability Stack

3. Prometheus Metrics

Configuration

Source: packages/gateway/config/traefik.yml (lines 37–43)

yaml
metrics:
  prometheus:
    entryPoint: traefik
    addEntryPointsLabels: true
    addRoutersLabels: true
    addServicesLabels: true

The traefik entrypoint listens on :8080 internally, mapped to host port 30100.

Scrape Configuration

Source: infrastructure/deployments/develop/monitoring/config/prometheus/prometheus.yml

yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "traefik"
    static_configs:
      - targets: ["dev-nx-gateway:8080"]
        labels:
          instance: "gateway"

  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

Prometheus scrapes Traefik metrics every 15 seconds via the Docker network (dev-nx-gateway:8080). It also monitors itself at localhost:9090.
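To make the scrape step concrete, here is a minimal sketch of reading the text exposition format that Prometheus collects from the traefik entrypoint. The sample payload, its label values, and the helper names `parse_samples` and `requests_by_service` are assumptions for illustration; they are not captured from the gateway.

```python
import re

# Hypothetical excerpt of a /metrics response; values are illustrative only.
SAMPLE = '''\
# TYPE traefik_service_requests_total counter
traefik_service_requests_total{code="200",method="GET",protocol="http",service="commerce@docker"} 120
traefik_service_requests_total{code="500",method="GET",protocol="http",service="commerce@docker"} 3
traefik_service_server_up{service="commerce@docker",url="http://10.0.0.5:3000"} 1
'''

# metric name, optional {labels}, value, optional timestamp
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def requests_by_service(samples):
    """Sum traefik_service_requests_total across all label sets per service."""
    totals = {}
    for name, labels, value in samples:
        if name == "traefik_service_requests_total":
            service = labels.get("service", "")
            totals[service] = totals.get(service, 0.0) + value
    return totals


samples = parse_samples(SAMPLE)
totals = requests_by_service(samples)
```

In a running environment the same text could be fetched from the host-mapped port (30100) instead of the hard-coded sample.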

Available Metrics

| Metric | Type | Description |
| --- | --- | --- |
| traefik_entrypoint_requests_total | Counter | Total requests per entrypoint |
| traefik_entrypoint_request_duration_seconds | Histogram | Request duration per entrypoint |
| traefik_router_requests_total | Counter | Total requests per router (service) |
| traefik_router_request_duration_seconds | Histogram | Request duration per router |
| traefik_service_requests_total | Counter | Total requests per backend service |
| traefik_service_request_duration_seconds | Histogram | Duration per backend service |
| traefik_service_open_connections | Gauge | Current open connections per service |
| traefik_service_server_up | Gauge | Health status per backend server (1=up, 0=down) |

Example Prometheus Queries

txt
# Request rate per service (last 5 minutes)
sum(rate(traefik_service_requests_total[5m])) by (service)

# Error rate (5xx responses)
sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
/ sum(rate(traefik_service_requests_total[5m]))

# P99 latency per service (buckets aggregated by le and service)
histogram_quantile(0.99, sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (le, service))

# Unhealthy backends
traefik_service_server_up == 0
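The arithmetic behind the error-rate and latency queries can be sketched in a few lines of Python. This is a simplified illustration, not Prometheus's implementation: it ignores counter resets and the extrapolation that `rate()` performs, and the function names and sample numbers are assumptions.

```python
import math


def error_ratio(prev, curr):
    """Share of 5xx responses among requests counted between two scrapes.

    prev and curr map a status code string to the counter value at each scrape.
    """
    increase = {code: curr[code] - prev.get(code, 0.0) for code in curr}
    total = sum(increase.values())
    errors = sum(v for code, v in increase.items() if code.startswith("5"))
    return errors / total if total else 0.0


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The bucket list must include the +Inf bucket. Within the bucket that
    contains the target rank, the value is linearly interpolated.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank falls in +Inf: report highest finite bound
            if count == prev_count:
                return bound  # empty bucket (only reachable when q is 0)
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative numbers only.
ratio = error_ratio({"200": 100, "500": 2}, {"200": 195, "500": 7})
p95 = histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)])
```

Because the estimate interpolates inside a bucket, its precision depends on how closely the bucket boundaries bracket the latencies of interest.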

4. Structured Access Logs

Configuration

Source: packages/gateway/config/traefik.yml (lines 29–35)

yaml
accessLog:
  format: json
  fields:
    headers:
      names:
        Authorization: drop
        Cookie: drop

Log Format

Each request produces a JSON log entry:

json
{
  "level": "info",
  "msg": "",
  "ClientAddr": "192.168.1.100:45678",
  "ClientHost": "192.168.1.100",
  "Duration": 145000000,
  "RequestMethod": "GET",
  "RequestPath": "/v1/api/commerce/products",
  "OriginStatus": 200,
  "ServiceName": "commerce@docker",
  "RouterName": "commerce@docker",
  "time": "2025-01-20T10:00:00Z"
}

The configuration explicitly drops the sensitive Authorization and Cookie headers, so they never appear in log output.
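As a rough illustration of consuming these entries, the sketch below parses one access-log line and converts `Duration`, which Traefik records in nanoseconds, to milliseconds. The `summarize` helper and its output keys are hypothetical; the sample line reuses the field values shown above.

```python
import json

# One access-log line, using the field values from the example entry.
LINE = (
    '{"level":"info","msg":"","ClientHost":"192.168.1.100",'
    '"Duration":145000000,"RequestMethod":"GET",'
    '"RequestPath":"/v1/api/commerce/products","OriginStatus":200,'
    '"ServiceName":"commerce@docker","RouterName":"commerce@docker",'
    '"time":"2025-01-20T10:00:00Z"}'
)


def summarize(line):
    """Extract the fields most useful for triage from one JSON log line."""
    entry = json.loads(line)
    status = entry["OriginStatus"]
    return {
        "service": entry["ServiceName"],
        "method": entry["RequestMethod"],
        "path": entry["RequestPath"],
        "status": status,
        # Duration is in nanoseconds; 1 ms = 1,000,000 ns.
        "duration_ms": entry["Duration"] / 1_000_000,
        "is_server_error": 500 <= status < 600,
    }


summary = summarize(LINE)
```

For the sample entry, `Duration` of 145000000 ns corresponds to 145 ms.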

Application Logs

Source: packages/gateway/config/traefik.yml (lines 25–27)

yaml
log:
  level: INFO
  format: json

Traefik application logs (startup, configuration changes, errors) are also in JSON format at INFO level.

5. Traefik Dashboard

Static Configuration

Source: packages/gateway/config/traefik.yml (lines 7–8)

yaml
api:
  dashboard: true

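The dashboard is enabled, but api.insecure: true is not set, so it is never served without authentication. Access is instead granted through routers defined in the file provider, which attach the dashboard-auth middleware.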

Dashboard Routers

Source: packages/gateway/config/dynamic/middlewares.yml (lines 8–23)

yaml
http:
  routers:
    dashboard:
      rule: "PathPrefix(`/api`) || PathPrefix(`/dashboard`)"
      entryPoints:
        - traefik
      service: api@internal
      middlewares:
        - dashboard-auth
    dashboard-redirect:
      rule: "Path(`/`)"
      entryPoints:
        - traefik
      service: api@internal
      middlewares:
        - redirect-to-dashboard
        - dashboard-auth
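The effect of the two rules can be approximated with a small Python sketch. It assumes `PathPrefix` behaves as a plain prefix match and `Path` as an exact match, and it ignores router priorities because the two rules never match the same path; `ROUTERS` and `match_router` are illustrative names, not part of Traefik.

```python
# Simplified stand-ins for the two router rules on the traefik entrypoint.
ROUTERS = {
    "dashboard": lambda path: path.startswith("/api") or path.startswith("/dashboard"),
    "dashboard-redirect": lambda path: path == "/",
}


def match_router(path):
    """Return the name of the router whose rule matches, or None."""
    for name, matches in ROUTERS.items():
        if matches(path):
            return name
    return None


for candidate in ("/", "/dashboard/", "/api/http/routers", "/other"):
    print(candidate, "->", match_router(candidate))
```

Both routers end in `dashboard-auth`, so every path they serve requires credentials.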

Access

The dashboard is available at http://localhost:30100 (host port mapped to internal port 8080) and is protected by HTTP Basic Authentication (user: nx.eventry).
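For scripted access, a request to the dashboard API needs the same Basic Authentication header a browser would send. The sketch below only builds the request; the password `change-me` is a placeholder, `build_api_request` is a hypothetical helper, and `/api/http/routers` is used as an example path under the `/api` prefix.

```python
import base64
import urllib.request

DASHBOARD_URL = "http://localhost:30100"


def build_api_request(path, user, password):
    """Build a urllib request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        DASHBOARD_URL + path,
        headers={"Authorization": f"Basic {token}"},
    )


# "change-me" is a placeholder; the real password is not part of this document.
request = build_api_request("/api/http/routers", "nx.eventry", "change-me")
```

With the gateway running, passing `request` to `urllib.request.urlopen` would perform the call.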

Dashboard Shows

  • All registered routers and their rules
  • Backend services and their health status
  • Middleware chains applied to each router
  • Real-time request metrics

6. Grafana Integration

Container Configuration

Source: infrastructure/deployments/develop/monitoring/docker-compose.yml

| Property | Value |
| --- | --- |
| Image | grafana/grafana:11.5.2 |
| Host Port | 39300 (internal :3000) |
| Admin User | admin |
| Admin Password | admin |
| Sign Up | Disabled |
| Default Dashboard | traefik-overview.json |

Prometheus Data Source

Source: infrastructure/deployments/develop/monitoring/config/grafana/provisioning/datasources/prometheus.yml

yaml
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://dev-nx-prometheus:9090
    isDefault: true
    editable: false

Dashboard Provisioning

Source: infrastructure/deployments/develop/monitoring/config/grafana/provisioning/dashboards/dashboards.yml

yaml
providers:
  - name: "BANA"
    orgId: 1
    folder: "BANA"
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false

A pre-built dashboard, traefik-overview.json, is provisioned automatically. Its source file is infrastructure/deployments/develop/monitoring/config/grafana/dashboards/traefik-overview.json.

7. Prometheus Container Configuration

Source: infrastructure/deployments/develop/monitoring/docker-compose.yml

| Property | Value |
| --- | --- |
| Image | prom/prometheus:v3.2.1 |
| Host Port | 39090 (internal :9090) |
| Retention | 30 days (--storage.tsdb.retention.time=30d) |
| Lifecycle API | Enabled (--web.enable-lifecycle) |
| Config | /etc/prometheus/prometheus.yml |
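With the host port mapping above, the Prometheus HTTP API can be reached from the host. The following sketch builds the URL for an instant query and for the reload endpoint that `--web.enable-lifecycle` turns on; the helper name `instant_query_url` is an assumption, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Host-mapped Prometheus port from the container configuration.
PROMETHEUS_URL = "http://localhost:39090"

# Reload endpoint exposed when the lifecycle API is enabled (call it with POST).
RELOAD_URL = f"{PROMETHEUS_URL}/-/reload"


def instant_query_url(expr):
    """Return the /api/v1/query URL for a PromQL expression."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': expr})}"


url = instant_query_url("traefik_service_server_up == 0")
```

The resulting URL can be opened with any HTTP client once the monitoring stack is up.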

The following Prometheus alert rules are recommended for production monitoring. They are not currently deployed; add them to a Prometheus alerting rules file when ready:

yaml
groups:
  - name: gateway
    rules:
      - alert: HighErrorRate
        expr: >
          sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
          / sum(rate(traefik_service_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx error rate on gateway"

      - alert: HighLatency
        expr: >
          histogram_quantile(0.99,
            rate(traefik_service_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency exceeds 2 seconds"

      - alert: ServiceDown
        expr: traefik_service_server_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Backend service {{ $labels.service }} is down"

| Document | Description |
| --- | --- |
| Gateway Overview | Identity card + service catalog |
| Operations | Deploy, runbook, alert classes |
| Middlewares | Full middleware definitions from source |
| Resilience | Circuit breaker states, health checks, retry |
| Decisions | ADRs |

Proprietary and Confidential. Unauthorized copying, distribution, or use of this software is strictly prohibited.