Observability

The monitoring stack runs in the nx-watcher namespace. Staging covers basic metrics and logs; production adds distributed tracing with Tempo and the OpenTelemetry Collector.

Architecture Overview

Staging

Not Yet Deployed

The staging observability stack (Prometheus, Grafana, Loki, Promtail) is planned but not yet deployed. The nx-watcher namespace exists but contains no pods. The manifests below describe the target architecture.

Production

Staging vs Production

| Component | Staging | Production |
|---|---|---|
| Prometheus | Yes | Yes |
| Grafana | Yes | Yes |
| Loki + Promtail | Yes | Yes |
| Tempo | No | Yes |
| OTel Collector | No | Yes |
| Alertmanager | Optional | Yes |
| Node scheduling | default pool (no taint) | Dedicated monitoring node (dedicated=monitoring:NoSchedule) |

Node Scheduling

Staging

Monitoring pods run on default nodes alongside application workloads. No taints or tolerations needed.

yaml
spec:
  nodeSelector:
    node.kubernetes.io/pool: default

Production

Monitoring pods run on a dedicated node with taint dedicated=monitoring:NoSchedule. All monitoring workloads must include the toleration.

yaml
spec:
  nodeSelector:
    node.kubernetes.io/pool: monitoring
  tolerations:
    - key: dedicated
      operator: Equal
      value: monitoring
      effect: NoSchedule

Prometheus

Prometheus collects metrics from the backend services, Traefik, and the data layer components (PostgreSQL, Redis, Kafka).

Deployment

StatefulSet: nx-prometheus
yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nx-prometheus
  namespace: nx-watcher
spec:
  serviceName: nx-prometheus
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  template:
    metadata:
      labels:
        app.kubernetes.io/name: prometheus  # must match spec.selector.matchLabels
    spec:
      serviceAccountName: prometheus
      nodeSelector:
        node.kubernetes.io/pool: default  # staging: default, production: monitoring
      # Production only:
      # tolerations:
      #   - key: dedicated
      #     operator: Equal
      #     value: monitoring
      #     effect: NoSchedule
      containers:
        - name: prometheus
          image: prom/prometheus:v3.2.1
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.retention.time=30d
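            - --web.enable-lifecycle
            # Production only: accept remote_write from the Tempo metrics generator
            # - --web.enable-remote-write-receiver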
          ports:
            - containerPort: 9090
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-data
              mountPath: /prometheus
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
  volumeClaimTemplates:
    - metadata:
        name: prometheus-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: csi-sc-vnpaycloud
        resources:
          requests:
            storage: 10Gi
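
The StatefulSet's serviceName and the Grafana datasource URL (nx-prometheus.nx-watcher.svc.cluster.local:9090) both point at a Service named nx-prometheus, which this page does not show. A minimal sketch of what it might look like, assuming a headless Service; the port name `http` is an assumption, not taken from the deployed manifests:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nx-prometheus
  namespace: nx-watcher
spec:
  clusterIP: None              # headless: gives the StatefulSet pod a stable DNS name
  selector:
    app.kubernetes.io/name: prometheus
  ports:
    - name: http               # assumed port name
      port: 9090
      targetPort: 9090
```

The selector only matches if the Prometheus pod template carries the app.kubernetes.io/name: prometheus label.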

Scrape Configuration

ConfigMap: prometheus-config
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: nx-watcher
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      # Traefik metrics (in nx-internal)
      - job_name: traefik
        static_configs:
          - targets: ['nx-traefik.nx-internal.svc.cluster.local:8080']
        metrics_path: /metrics

      # Backend services (auto-discovery via K8s service discovery)
      - job_name: nx-backend
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: [nx-backend]
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
            regex: backend
            action: keep
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
            target_label: service

      # PostgreSQL exporter
      - job_name: postgresql
        static_configs:
          - targets: ['nx-pg-exporter.nx-persistent.svc.cluster.local:9187']

      # Redis exporter
      - job_name: redis
        static_configs:
          - targets: ['nx-redis-exporter.nx-broker.svc.cluster.local:9121']

      # Kafka (JMX exporter)
      - job_name: kafka
        static_configs:
          - targets:
              - 'nx-kafka-0.nx-kafka-headless.nx-broker.svc.cluster.local:9404'
              - 'nx-kafka-1.nx-kafka-headless.nx-broker.svc.cluster.local:9404'
              - 'nx-kafka-2.nx-kafka-headless.nx-broker.svc.cluster.local:9404'

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['nx-alertmanager.nx-watcher.svc.cluster.local:9093']

    rule_files:
      - /etc/prometheus/alerts/*.yml
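
The nx-backend job uses Kubernetes pod discovery, so the prometheus ServiceAccount referenced by the StatefulSet needs permission to list and watch pods in the nx-backend namespace. The following is a sketch under that assumption; the Role and RoleBinding name prometheus-pod-discovery is hypothetical:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: nx-watcher
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-pod-discovery   # hypothetical name
  namespace: nx-backend            # the namespace listed in kubernetes_sd_configs
rules:
  - apiGroups: [""]
    resources: [pods]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-pod-discovery   # hypothetical name
  namespace: nx-backend
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-pod-discovery
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: nx-watcher
```

A namespaced Role is enough here because discovery is restricted to nx-backend; discovering pods in additional namespaces would need one Role and RoleBinding per namespace, or a ClusterRole.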

Grafana

Deployment

Deployment: nx-grafana
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nx-grafana
  namespace: nx-watcher
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: grafana
  template:
    metadata:
      labels:
        app.kubernetes.io/name: grafana  # must match spec.selector.matchLabels
    spec:
      nodeSelector:
        node.kubernetes.io/pool: default  # staging: default, production: monitoring
      # Production only:
      # tolerations:
      #   - key: dedicated
      #     operator: Equal
      #     value: monitoring
      #     effect: NoSchedule
      containers:
        - name: grafana
          image: grafana/grafana:11.5.2
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_USER
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-user
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-password
            - name: GF_USERS_ALLOW_SIGN_UP
              value: "false"
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 300m
              memory: 512Mi
          volumeMounts:
            - name: grafana-data
              mountPath: /var/lib/grafana
            - name: grafana-datasources
              mountPath: /etc/grafana/provisioning/datasources
            - name: grafana-dashboards-config
              mountPath: /etc/grafana/provisioning/dashboards
            - name: grafana-dashboards
              mountPath: /var/lib/grafana/dashboards
      volumes:
        - name: grafana-data
          persistentVolumeClaim:
            claimName: grafana-data
        - name: grafana-datasources
          configMap:
            name: grafana-datasources
        - name: grafana-dashboards-config
          configMap:
            name: grafana-dashboards-config
        - name: grafana-dashboards
          configMap:
            name: grafana-dashboards
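
The Deployment mounts a grafana-dashboards-config ConfigMap into Grafana's dashboard provisioning directory, but its contents are not shown. A plausible minimal version, assuming a single file-based provider that reads the JSON dashboards mounted at /var/lib/grafana/dashboards (the provider name `default` is an assumption):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards-config
  namespace: nx-watcher
data:
  dashboards.yaml: |
    apiVersion: 1
    providers:
      - name: default            # assumed provider name
        type: file
        disableDeletion: true    # dashboards stay managed from the ConfigMap
        options:
          path: /var/lib/grafana/dashboards
```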

Datasources

ConfigMap: grafana-datasources (staging)
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: nx-watcher
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://nx-prometheus.nx-watcher.svc.cluster.local:9090
        isDefault: true
      - name: Loki
        type: loki
        access: proxy
        url: http://nx-loki.nx-watcher.svc.cluster.local:3100

ConfigMap: grafana-datasources (production, adds Tempo)
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: nx-watcher
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
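      - name: Prometheus
        type: prometheus
        uid: prometheus  # fixed uid, referenced by tracesToMetrics and serviceMap
        access: proxy
        url: http://nx-prometheus.nx-watcher.svc.cluster.local:9090
        isDefault: true
      - name: Loki
        type: loki
        uid: loki  # fixed uid, referenced by tracesToLogsV2
        access: proxy
        url: http://nx-loki.nx-watcher.svc.cluster.local:3100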
      - name: Tempo
        type: tempo
        access: proxy
        url: http://nx-tempo.nx-watcher.svc.cluster.local:3200
        jsonData:
          tracesToLogsV2:
            datasourceUid: loki
            filterByTraceID: true
          tracesToMetrics:
            datasourceUid: prometheus
          serviceMap:
            datasourceUid: prometheus

Dashboards

| Dashboard | Description |
|---|---|
| Traefik Overview | Request rate, latency (p50/p95/p99), error rate, active connections |
| Backend Services | Per-service CPU, memory, request count, response times |
| PostgreSQL | Active connections, query duration, cache hit ratio, disk usage |
| Redis | Memory usage, hit/miss ratio, connected clients, command rate |
| Kafka | Broker health, under-replicated partitions, consumer lag, throughput |
| Node Resources | CPU, memory, disk, network per node |
| BullMQ Jobs | Queue depth, processing rate, failed jobs, retry counts |
| Service Map | (Production only) Auto-generated from Tempo traces |

Loki + Promtail

Log aggregation using Loki for storage and Promtail as the collection agent.

Loki Deployment

StatefulSet: nx-loki
yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nx-loki
  namespace: nx-watcher
spec:
  serviceName: nx-loki
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: loki
  template:
    metadata:
      labels:
        app.kubernetes.io/name: loki  # must match spec.selector.matchLabels
    spec:
      nodeSelector:
        node.kubernetes.io/pool: default  # staging: default, production: monitoring
      # Production only:
      # tolerations:
      #   - key: dedicated
      #     operator: Equal
      #     value: monitoring
      #     effect: NoSchedule
      containers:
        - name: loki
          image: grafana/loki:3.4.2
          args:
            - -config.file=/etc/loki/loki.yaml
          ports:
            - containerPort: 3100
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 300m
              memory: 512Mi
          volumeMounts:
            - name: loki-config
              mountPath: /etc/loki
            - name: loki-data
              mountPath: /loki
      volumes:
        - name: loki-config
          configMap:
            name: loki-config
  volumeClaimTemplates:
    - metadata:
        name: loki-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: csi-sc-vnpaycloud
        resources:
          requests:
            storage: 5Gi
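
The StatefulSet reads its configuration from a loki-config ConfigMap that is not reproduced on this page. As a rough sketch only, a single-replica Loki 3.x instance storing data on the /loki volume could be configured along these lines; the schema start date and the choice of filesystem storage are assumptions, not the project's actual settings:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: nx-watcher
data:
  loki.yaml: |
    auth_enabled: false          # single tenant

    server:
      http_listen_port: 3100

    common:
      path_prefix: /loki         # matches the loki-data mountPath
      replication_factor: 1      # single replica
      ring:
        kvstore:
          store: inmemory
      storage:
        filesystem:
          chunks_directory: /loki/chunks
          rules_directory: /loki/rules

    schema_config:
      configs:
        - from: 2024-01-01       # assumed start date
          store: tsdb
          object_store: filesystem
          schema: v13
          index:
            prefix: index_
            period: 24h
```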

Promtail DaemonSet

Promtail runs on all nodes (including system, stateful, and monitoring nodes in production) to collect container logs. It uses tolerations to schedule on tainted nodes.

DaemonSet: promtail
yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: nx-watcher
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: promtail
  template:
    metadata:
      labels:
        app.kubernetes.io/name: promtail  # must match spec.selector.matchLabels
    spec:
      tolerations:
        # Tolerate ALL taints so Promtail runs on every node
        - operator: Exists
      containers:
        - name: promtail
          image: grafana/promtail:3.4.2
          args:
            - -config.file=/etc/promtail/promtail.yaml
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            - name: promtail-config
              mountPath: /etc/promtail
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: promtail-config
          configMap:
            name: promtail-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers

INFO

The tolerations: [{ operator: Exists }] entry means Promtail tolerates all taints, including dedicated=system:NoSchedule on system nodes and dedicated=monitoring:NoSchedule on the monitoring node. This ensures logs are collected from every node in the cluster.

Promtail Configuration

ConfigMap: promtail-config
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: nx-watcher
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080

    positions:
      filename: /tmp/positions.yaml

    clients:
      - url: http://nx-loki.nx-watcher.svc.cluster.local:3100/loki/api/v1/push

    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
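        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          # Tell Promtail which log files belong to each discovered container
          - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
            separator: /
            target_label: __path__
            replacement: /var/log/pods/*$1/*.log
        pipeline_stages:
          - docker: {}  # use `cri: {}` instead on containerd or CRI-O nodes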
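
Because the kubernetes-pods job discovers pods through the Kubernetes API with no namespace restriction, Promtail needs cluster-wide read access to pods. The DaemonSet shown earlier does not set a serviceAccountName, so the following is a sketch of what would likely be required; the name promtail for the ServiceAccount, ClusterRole, and ClusterRoleBinding is an assumption:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: promtail                 # assumed name; set as serviceAccountName in the DaemonSet
  namespace: nx-watcher
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: promtail                 # assumed name
rules:
  - apiGroups: [""]
    resources: [pods]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: promtail                 # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: promtail
subjects:
  - kind: ServiceAccount
    name: promtail
    namespace: nx-watcher
```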

Tempo (Production Only)

Grafana Tempo stores distributed traces sent by the OpenTelemetry Collector.

Deployment

StatefulSet: nx-tempo
yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nx-tempo
  namespace: nx-watcher
spec:
  serviceName: nx-tempo
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: tempo
  template:
    metadata:
      labels:
        app.kubernetes.io/name: tempo  # must match spec.selector.matchLabels
    spec:
      nodeSelector:
        node.kubernetes.io/pool: monitoring
      tolerations:
        - key: dedicated
          operator: Equal
          value: monitoring
          effect: NoSchedule
      containers:
        - name: tempo
          image: grafana/tempo:2.7.1
          args:
            - -config.file=/etc/tempo/tempo.yaml
          ports:
            - name: http
              containerPort: 3200
            - name: otlp-grpc
              containerPort: 4317
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: tempo-config
              mountPath: /etc/tempo
            - name: tempo-data
              mountPath: /var/tempo
      volumes:
        - name: tempo-config
          configMap:
            name: tempo-config
  volumeClaimTemplates:
    - metadata:
        name: tempo-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: csi-sc-vnpaycloud
        resources:
          requests:
            storage: 10Gi

Tempo Configuration

ConfigMap: tempo-config
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
  namespace: nx-watcher
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317

    storage:
      trace:
        backend: local
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal

    compactor:
      compaction:
        block_retention: 72h

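    metrics_generator:
      registry:
        external_labels:
          source: tempo
      storage:
        path: /var/tempo/generator/wal
        remote_write:
          - url: http://nx-prometheus.nx-watcher.svc.cluster.local:9090/api/v1/write

    # Enable the generator processors that feed the Service Map dashboard
    overrides:
      defaults:
        metrics_generator:
          processors: [service-graphs, span-metrics]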
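
Grafana queries Tempo on port 3200 and the OTel Collector exports to nx-tempo.nx-watcher.svc.cluster.local:4317, so both ports need to be reachable through a Service named nx-tempo. That Service is not shown on this page; a minimal sketch, assuming a headless Service that also serves as the StatefulSet's serviceName:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nx-tempo
  namespace: nx-watcher
spec:
  clusterIP: None                # assumption: headless, as for other StatefulSets
  selector:
    app.kubernetes.io/name: tempo
  ports:
    - name: http                 # Grafana datasource queries
      port: 3200
      targetPort: 3200
    - name: otlp-grpc            # trace ingestion from the OTel Collector
      port: 4317
      targetPort: 4317
```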

OpenTelemetry Collector (Production Only)

The OTel Collector receives traces from Traefik and backend services via OTLP gRPC and forwards them to Tempo.

Deployment

Deployment: nx-otel-collector
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nx-otel-collector
  namespace: nx-watcher
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: otel-collector
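  template:
    metadata:
      labels:
        app.kubernetes.io/name: otel-collector  # must match spec.selector and the Service selector
    spec: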
      nodeSelector:
        node.kubernetes.io/pool: monitoring
      tolerations:
        - key: dedicated
          operator: Equal
          value: monitoring
          effect: NoSchedule
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.118.0
          args:
            - --config=/etc/otel/otel-collector.yaml
          ports:
            - name: otlp-grpc
              containerPort: 4317
            - name: otlp-http
              containerPort: 4318
            - name: metrics
              containerPort: 8888
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 300m
              memory: 512Mi
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otel
      volumes:
        - name: otel-config
          configMap:
            name: otel-collector-config

OTel Collector Configuration

ConfigMap: otel-collector-config
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: nx-watcher
data:
  otel-collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 5s
        limit_mib: 400
        spike_limit_mib: 100
      resource:
        attributes:
          - key: cluster
            value: bana-production
            action: upsert

    exporters:
      otlp/tempo:
        endpoint: nx-tempo.nx-watcher.svc.cluster.local:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [otlp/tempo]
      telemetry:
        metrics:
          address: 0.0.0.0:8888

OTel Service

Service: nx-otel-collector
yaml
apiVersion: v1
kind: Service
metadata:
  name: nx-otel-collector
  namespace: nx-watcher
spec:
  selector:
    app.kubernetes.io/name: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
    - name: metrics
      port: 8888
      targetPort: 8888
  type: ClusterIP

Trace Sources

Traces are sent to the OTel Collector from two sources:

Traefik is configured via CLI args in the Traefik Deployment:

yaml
args:
  - --tracing.otlp=true
  - --tracing.otlp.grpc.endpoint=nx-otel-collector.nx-watcher.svc.cluster.local:4317
  - --tracing.otlp.grpc.insecure=true

Backend services are configured via environment variables in the shared ConfigMap:

yaml
OTEL_EXPORTER_OTLP_ENDPOINT: http://nx-otel-collector.nx-watcher.svc.cluster.local:4317
OTEL_EXPORTER_OTLP_PROTOCOL: grpc  # 4317 is the gRPC receiver; http/protobuf would use 4318
OTEL_SERVICE_NAME: bana

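Each IGNIS service reads OTEL_EXPORTER_OTLP_ENDPOINT at startup and auto-instruments HTTP requests, database queries, and Redis commands. Because OTEL_SERVICE_NAME is set in the shared ConfigMap, every service reports under the name bana unless its Deployment overrides the variable.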
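
To make individual services distinguishable in Tempo, a Deployment can override the shared OTEL_SERVICE_NAME. The fragment below illustrates the idea; the container name identity and the ConfigMap name shared-config are hypothetical placeholders, not names taken from the actual manifests:

```yaml
# Fragment of a backend Deployment's pod spec (illustrative names)
containers:
  - name: identity                 # hypothetical service
    envFrom:
      - configMapRef:
          name: shared-config      # hypothetical name of the shared ConfigMap
    env:
      # Entries in `env` take precedence over keys loaded via `envFrom`
      - name: OTEL_SERVICE_NAME
        value: identity
```

This relies on Kubernetes giving explicit env entries priority over envFrom when the same key appears in both.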

Alerting Rules

ConfigMap: prometheus-alerts
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: nx-watcher
data:
  alerts.yml: |
    groups:
      - name: bana
        rules:
          # Pod crash looping
          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total{namespace=~"nx-.*"}[15m]) > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.pod }} is crash looping"

          # High error rate
          - alert: HighErrorRate
            expr: |
              sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) by (service)
              / sum(rate(traefik_service_requests_total[5m])) by (service) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Service {{ $labels.service }} has >5% error rate"

          # PostgreSQL connections near limit
          - alert: PostgreSQLHighConnections
            expr: pg_stat_activity_count > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "PostgreSQL connections above 80"

          # Redis memory usage
          - alert: RedisHighMemory
            expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Redis memory usage above 90%"

          # Kafka under-replicated partitions
          - alert: KafkaUnderReplicated
            expr: kafka_server_replicamanager_underreplicatedpartitions > 0
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "Kafka has under-replicated partitions"

          # Disk usage
          - alert: HighDiskUsage
            expr: |
              (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "PVC {{ $labels.persistentvolumeclaim }} is above 85% usage"

          # Identity service down
          - alert: IdentityServiceDown
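            # `up == 0` covers failed scrapes; `absent` covers the case where
            # pod discovery finds no identity pods at all
            expr: |
              up{job="nx-backend", service="identity"} == 0
              or absent(up{job="nx-backend", service="identity"})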
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Identity service is down — all auth will fail"
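
The scrape configuration loads rules from /etc/prometheus/alerts/*.yml, while the nx-prometheus StatefulSet shown earlier mounts only prometheus-config and prometheus-data. One way to make alerts.yml visible to Prometheus, sketched here as an assumption about how the pieces are meant to fit together, is an additional ConfigMap volume (the volume name prometheus-alerts is chosen to match the ConfigMap):

```yaml
# Fragment of the nx-prometheus StatefulSet
spec:
  template:
    spec:
      containers:
        - name: prometheus
          volumeMounts:
            - name: prometheus-alerts
              mountPath: /etc/prometheus/alerts   # matches rule_files in prometheus.yml
              readOnly: true
      volumes:
        - name: prometheus-alerts
          configMap:
            name: prometheus-alerts
```

Since --web.enable-lifecycle is set, rule changes can be picked up with a POST to the /-/reload endpoint once the updated ConfigMap has propagated to the pod.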

Proprietary and Confidential. Unauthorized copying, distribution, or use of this software is strictly prohibited.