Observability
The monitoring stack runs in the nx-watcher namespace. Staging provides basic metrics and logs; production adds distributed tracing with Tempo and the OpenTelemetry Collector.
Architecture Overview
Staging
Not Yet Deployed
The staging observability stack (Prometheus, Grafana, Loki, Promtail) is planned but not yet deployed. The nx-watcher namespace exists but contains no pods. The manifests below describe the target architecture.
Production
Staging vs Production
| Component | Staging | Production |
|---|---|---|
| Prometheus | Yes | Yes |
| Grafana | Yes | Yes |
| Loki + Promtail | Yes | Yes |
| Tempo | No | Yes |
| OTel Collector | No | Yes |
| Node scheduling | default pool (no taint) | Dedicated monitoring node (dedicated=monitoring:NoSchedule) |
| Alertmanager | Optional | Yes |
Node Scheduling
Staging
Monitoring pods run on default nodes alongside application workloads. No taints or tolerations needed.
spec:
  nodeSelector:
    node.kubernetes.io/pool: default
Production
Monitoring pods run on a dedicated node with taint dedicated=monitoring:NoSchedule. All monitoring workloads must include the toleration.
spec:
  nodeSelector:
    node.kubernetes.io/pool: monitoring
  tolerations:
    - key: dedicated
      operator: Equal
      value: monitoring
      effect: NoSchedule
Prometheus
Collects metrics from all services, Traefik, and data layer components.
Deployment
StatefulSet: nx-prometheus
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nx-prometheus
  namespace: nx-watcher
spec:
  serviceName: nx-prometheus
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  template:
    metadata:
      labels:
        app.kubernetes.io/name: prometheus  # must match spec.selector.matchLabels
    spec:
      serviceAccountName: prometheus
      nodeSelector:
        node.kubernetes.io/pool: default  # staging: default, production: monitoring
      # Production only:
      # tolerations:
      #   - key: dedicated
      #     operator: Equal
      #     value: monitoring
      #     effect: NoSchedule
      containers:
        - name: prometheus
          image: prom/prometheus:v3.2.1
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.retention.time=30d
            - --web.enable-lifecycle
          ports:
            - containerPort: 9090
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-data
              mountPath: /prometheus
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
  volumeClaimTemplates:
    - metadata:
        name: prometheus-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: csi-sc-vnpaycloud
        resources:
          requests:
            storage: 10Gi
Scrape Configuration
ConfigMap: prometheus-config
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: nx-watcher
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      # Traefik metrics (in nx-internal)
      - job_name: traefik
        static_configs:
          - targets: ['nx-traefik.nx-internal.svc.cluster.local:8080']
        metrics_path: /metrics
      # Backend services (auto-discovery via K8s service discovery)
      - job_name: nx-backend
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: [nx-backend]
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
            regex: backend
            action: keep
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
            target_label: service
      # PostgreSQL exporter
      - job_name: postgresql
        static_configs:
          - targets: ['nx-pg-exporter.nx-persistent.svc.cluster.local:9187']
      # Redis exporter
      - job_name: redis
        static_configs:
          - targets: ['nx-redis-exporter.nx-broker.svc.cluster.local:9121']
      # Kafka (JMX exporter)
      - job_name: kafka
        static_configs:
          - targets:
              - 'nx-kafka-0.nx-kafka-headless.nx-broker.svc.cluster.local:9404'
              - 'nx-kafka-1.nx-kafka-headless.nx-broker.svc.cluster.local:9404'
              - 'nx-kafka-2.nx-kafka-headless.nx-broker.svc.cluster.local:9404'
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['nx-alertmanager.nx-watcher.svc.cluster.local:9093']
    rule_files:
      - /etc/prometheus/alerts/*.yml
Grafana
Deployment
Deployment: nx-grafana
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nx-grafana
  namespace: nx-watcher
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: grafana
  template:
    metadata:
      labels:
        app.kubernetes.io/name: grafana  # must match spec.selector.matchLabels
    spec:
      nodeSelector:
        node.kubernetes.io/pool: default  # staging: default, production: monitoring
      # Production only:
      # tolerations:
      #   - key: dedicated
      #     operator: Equal
      #     value: monitoring
      #     effect: NoSchedule
      containers:
        - name: grafana
          image: grafana/grafana:11.5.2
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_USER
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-user
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-password
            - name: GF_USERS_ALLOW_SIGN_UP
              value: "false"
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 300m
              memory: 512Mi
          volumeMounts:
            - name: grafana-data
              mountPath: /var/lib/grafana
            - name: grafana-datasources
              mountPath: /etc/grafana/provisioning/datasources
            - name: grafana-dashboards-config
              mountPath: /etc/grafana/provisioning/dashboards
            - name: grafana-dashboards
              mountPath: /var/lib/grafana/dashboards
      volumes:
        - name: grafana-data
          persistentVolumeClaim:
            claimName: grafana-data
        - name: grafana-datasources
          configMap:
            name: grafana-datasources
        - name: grafana-dashboards-config
          configMap:
            name: grafana-dashboards-config
        - name: grafana-dashboards
          configMap:
            name: grafana-dashboards
Datasources
Staging (Prometheus and Loki):
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: nx-watcher
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://nx-prometheus.nx-watcher.svc.cluster.local:9090
        isDefault: true
      - name: Loki
        type: loki
        access: proxy
        url: http://nx-loki.nx-watcher.svc.cluster.local:3100
Production (adds Tempo):
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: nx-watcher
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        uid: prometheus  # explicit uid so the Tempo datasourceUid references below resolve
        type: prometheus
        access: proxy
        url: http://nx-prometheus.nx-watcher.svc.cluster.local:9090
        isDefault: true
      - name: Loki
        uid: loki  # explicit uid so tracesToLogsV2 below resolves
        type: loki
        access: proxy
        url: http://nx-loki.nx-watcher.svc.cluster.local:3100
      - name: Tempo
        type: tempo
        access: proxy
        url: http://nx-tempo.nx-watcher.svc.cluster.local:3200
        jsonData:
          tracesToLogsV2:
            datasourceUid: loki
            filterByTraceID: true
          tracesToMetrics:
            datasourceUid: prometheus
          serviceMap:
            datasourceUid: prometheus
Dashboards
| Dashboard | Description |
|---|---|
| Traefik Overview | Request rate, latency (p50/p95/p99), error rate, active connections |
| Backend Services | Per-service CPU, memory, request count, response times |
| PostgreSQL | Active connections, query duration, cache hit ratio, disk usage |
| Redis | Memory usage, hit/miss ratio, connected clients, command rate |
| Kafka | Broker health, under-replicated partitions, consumer lag, throughput |
| Node Resources | CPU, memory, disk, network per node |
| BullMQ Jobs | Queue depth, processing rate, failed jobs, retry counts |
| Service Map | (Production only) Auto-generated from Tempo traces |
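The Service Map row depends on two things that the manifests in this document do not show: Tempo's metrics generator must have processors enabled, and Prometheus must accept remote writes from Tempo. A minimal sketch of both additions follows. It assumes the `overrides.defaults` layout used by recent Tempo 2.x releases and Prometheus's `--web.enable-remote-write-receiver` flag; neither fragment is part of the original configuration, so check both against the versions you deploy.

```yaml
# Addition to tempo.yaml (assumption: not in the original tempo-config).
# Enables the processors that emit service-graph and span metrics.
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
---
# Addition to the nx-prometheus container args (assumption).
# Without this flag Prometheus does not serve /api/v1/write.
args:
  - --web.enable-remote-write-receiver
```

With both in place, the `serviceMap` setting on the Tempo datasource should find the service-graph series in Prometheus.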
Loki + Promtail
Log aggregation using Loki for storage and Promtail as the collection agent.
Loki Deployment
StatefulSet: nx-loki
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nx-loki
  namespace: nx-watcher
spec:
  serviceName: nx-loki
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: loki
  template:
    metadata:
      labels:
        app.kubernetes.io/name: loki  # must match spec.selector.matchLabels
    spec:
      nodeSelector:
        node.kubernetes.io/pool: default  # staging: default, production: monitoring
      # Production only:
      # tolerations:
      #   - key: dedicated
      #     operator: Equal
      #     value: monitoring
      #     effect: NoSchedule
      containers:
        - name: loki
          image: grafana/loki:3.4.2
          args:
            - -config.file=/etc/loki/loki.yaml
          ports:
            - containerPort: 3100
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 300m
              memory: 512Mi
          volumeMounts:
            - name: loki-config
              mountPath: /etc/loki
            - name: loki-data
              mountPath: /loki
      volumes:
        - name: loki-config
          configMap:
            name: loki-config
  volumeClaimTemplates:
    - metadata:
        name: loki-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: csi-sc-vnpaycloud
        resources:
          requests:
            storage: 5Gi
Promtail DaemonSet
Promtail runs on all nodes (including system, stateful, and monitoring nodes in production) to collect container logs. It uses tolerations to schedule on tainted nodes.
DaemonSet: promtail
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: nx-watcher
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: promtail
  template:
    metadata:
      labels:
        app.kubernetes.io/name: promtail  # must match spec.selector.matchLabels
    spec:
      tolerations:
        # Tolerate ALL taints so Promtail runs on every node
        - operator: Exists
      containers:
        - name: promtail
          image: grafana/promtail:3.4.2
          args:
            - -config.file=/etc/promtail/promtail.yaml
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            - name: promtail-config
              mountPath: /etc/promtail
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: promtail-config
          configMap:
            name: promtail-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
INFO
The tolerations: [{ operator: Exists }] entry means Promtail tolerates all taints, including dedicated=system:NoSchedule on system nodes and dedicated=monitoring:NoSchedule on the monitoring node. This ensures logs are collected from every node in the cluster.
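A toleration with `operator: Exists` and no key matches every taint, whatever its key or effect, including taints added to the cluster later. If that is broader than intended, a narrower variant is possible. The sketch below assumes the only custom taints in the cluster use the `dedicated` key, as the examples in this document suggest; it is an alternative to consider, not the configuration described above.

```yaml
# Hypothetical alternative to the blanket toleration in the promtail DaemonSet.
tolerations:
  # Tolerate any value of the "dedicated" taint (system, monitoring, ...),
  # but only for the NoSchedule effect.
  - key: dedicated
    operator: Exists
    effect: NoSchedule
```

The trade-off is that a node tainted with a different key would silently stop shipping logs until the toleration list is extended.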
Promtail Configuration
ConfigMap: promtail-config
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: nx-watcher
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://nx-loki.nx-watcher.svc.cluster.local:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
            target_label: service
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
        pipeline_stages:
          - docker: {}
Tempo (Production Only)
Grafana Tempo stores distributed traces sent by the OpenTelemetry Collector.
Deployment
StatefulSet: nx-tempo
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nx-tempo
  namespace: nx-watcher
spec:
  serviceName: nx-tempo
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: tempo
  template:
    metadata:
      labels:
        app.kubernetes.io/name: tempo  # must match spec.selector.matchLabels
    spec:
      nodeSelector:
        node.kubernetes.io/pool: monitoring
      tolerations:
        - key: dedicated
          operator: Equal
          value: monitoring
          effect: NoSchedule
      containers:
        - name: tempo
          image: grafana/tempo:2.7.1
          args:
            - -config.file=/etc/tempo/tempo.yaml
          ports:
            - name: http
              containerPort: 3200
            - name: otlp-grpc
              containerPort: 4317
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: tempo-config
              mountPath: /etc/tempo
            - name: tempo-data
              mountPath: /var/tempo
      volumes:
        - name: tempo-config
          configMap:
            name: tempo-config
  volumeClaimTemplates:
    - metadata:
        name: tempo-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: csi-sc-vnpaycloud
        resources:
          requests:
            storage: 10Gi
Tempo Configuration
ConfigMap: tempo-config
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
  namespace: nx-watcher
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200
    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
    storage:
      trace:
        backend: local
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal
    compactor:
      compaction:
        block_retention: 72h
    metrics_generator:
      registry:
        external_labels:
          source: tempo
      storage:
        path: /var/tempo/generator/wal
        remote_write:
          - url: http://nx-prometheus.nx-watcher.svc.cluster.local:9090/api/v1/write
OpenTelemetry Collector (Production Only)
The OTel Collector receives traces from Traefik and backend services via OTLP gRPC and forwards them to Tempo.
Deployment
Deployment: nx-otel-collector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nx-otel-collector
  namespace: nx-watcher
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: otel-collector
  template:
    metadata:
      labels:
        app.kubernetes.io/name: otel-collector  # must match spec.selector.matchLabels
    spec:
      nodeSelector:
        node.kubernetes.io/pool: monitoring
      tolerations:
        - key: dedicated
          operator: Equal
          value: monitoring
          effect: NoSchedule
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.118.0
          args:
            - --config=/etc/otel/otel-collector.yaml
          ports:
            - name: otlp-grpc
              containerPort: 4317
            - name: otlp-http
              containerPort: 4318
            - name: metrics
              containerPort: 8888
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 300m
              memory: 512Mi
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otel
      volumes:
        - name: otel-config
          configMap:
            name: otel-collector-config
OTel Collector Configuration
ConfigMap: otel-collector-config
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: nx-watcher
data:
  otel-collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 5s
        limit_mib: 400
        spike_limit_mib: 100
      resource:
        attributes:
          - key: cluster
            value: bana-production
            action: upsert
    exporters:
      otlp/tempo:
        endpoint: nx-tempo.nx-watcher.svc.cluster.local:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [otlp/tempo]
      telemetry:
        metrics:
          address: 0.0.0.0:8888
OTel Service
Service: nx-otel-collector
apiVersion: v1
kind: Service
metadata:
  name: nx-otel-collector
  namespace: nx-watcher
spec:
  selector:
    app.kubernetes.io/name: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
    - name: metrics
      port: 8888
      targetPort: 8888
  type: ClusterIP
Trace Sources
Traces are sent to the OTel Collector from two sources:
Traefik, configured via CLI args in the Traefik Deployment:
args:
  - --tracing.otlp=true
  - --tracing.otlp.grpc.endpoint=nx-otel-collector.nx-watcher.svc.cluster.local:4317
  - --tracing.otlp.grpc.insecure=true
Backend services, configured via environment variables in the shared ConfigMap:
OTEL_EXPORTER_OTLP_ENDPOINT: http://nx-otel-collector.nx-watcher.svc.cluster.local:4317
OTEL_SERVICE_NAME: bana
Each IGNIS service reads OTEL_EXPORTER_OTLP_ENDPOINT and auto-instruments HTTP requests, database queries, and Redis commands.
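Because OTEL_SERVICE_NAME is set once in the shared ConfigMap, every service would report itself as bana unless it overrides the value, which would leave traces hard to tell apart by service. Port 4317 is also the collector's gRPC port, while the OpenTelemetry specification's default OTLP protocol is HTTP, so it may be worth pinning the protocol explicitly. The fragment below is a hedged sketch of a per-Deployment override. The service name `identity` is borrowed from the alerting rules in this document; the ConfigMap name `nx-shared-config` and the surrounding layout are hypothetical.

```yaml
# Fragment of a backend Deployment's container spec (hypothetical layout).
envFrom:
  - configMapRef:
      name: nx-shared-config   # hypothetical name for the shared ConfigMap
env:
  # Explicit env entries take precedence over values loaded through envFrom.
  - name: OTEL_SERVICE_NAME
    value: identity
  # Standard OpenTelemetry SDK variable; selects gRPC to match port 4317.
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: grpc
```

Whether the IGNIS auto-instrumentation honors these variables depends on the SDK it wraps, so treat this as a starting point to verify.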
Alerting Rules
ConfigMap: prometheus-alerts
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: nx-watcher
data:
  alerts.yml: |
    groups:
      - name: bana
        rules:
          # Pod crash looping
          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total{namespace=~"nx-.*"}[15m]) > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.pod }} is crash looping"
          # High error rate
          - alert: HighErrorRate
            expr: |
              sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) by (service)
              / sum(rate(traefik_service_requests_total[5m])) by (service) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Service {{ $labels.service }} has >5% error rate"
          # PostgreSQL connections near limit
          - alert: PostgreSQLHighConnections
            expr: pg_stat_activity_count > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "PostgreSQL connections above 80"
          # Redis memory usage
          - alert: RedisHighMemory
            expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Redis memory usage above 90%"
          # Kafka under-replicated partitions
          - alert: KafkaUnderReplicated
            expr: kafka_server_replicamanager_underreplicatedpartitions > 0
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "Kafka has under-replicated partitions"
          # Disk usage
          - alert: HighDiskUsage
            expr: |
              (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "PVC {{ $labels.persistentvolumeclaim }} is above 85% usage"
          # Identity service down
          - alert: IdentityServiceDown
            expr: up{job="nx-backend", service="identity"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Identity service is down; all auth will fail"
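The rule_files glob in prometheus-config reads /etc/prometheus/alerts/*.yml, but the nx-prometheus StatefulSet in this document mounts only the prometheus-config ConfigMap. A minimal sketch of the extra volume that would be needed for these rules to load is shown below. The volume name is an assumption, and the fragment is not part of the original manifest.

```yaml
# Additions to the nx-prometheus pod template (assumed, not in the original manifest).
containers:
  - name: prometheus
    volumeMounts:
      - name: prometheus-alerts            # hypothetical volume name
        mountPath: /etc/prometheus/alerts  # directory matched by rule_files
volumes:
  - name: prometheus-alerts
    configMap:
      name: prometheus-alerts              # the key alerts.yml becomes a file in the mount
```

Note also that PodCrashLooping and HighDiskUsage rely on metrics that normally come from kube-state-metrics and the kubelet. The scrape configuration in this document does not list either as a target, so those two rules would likely stay silent until such jobs are added.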