Infrastructure Overview
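BANA runs staging and production on separate Kubernetes clusters on VNPAY Cloud. Both are built from the same Kustomize base but differ in architectural maturity: staging is deployed, configured, and scaled by hand, while production adds HA pairs for ingress and the API gateway, GitLab CI/CD pipelines, Sealed Secrets, and autoscaling.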
| | Staging | Production |
|---|---|---|
| Purpose | Internal testing, demos, integration env | Live traffic |
| Domain | sgw.staging.bana.com.vn (API), *.staging.bana.com.vn (frontend) | TBD |
| Nodes | 6 (3 control-plane managed by VNPAY + 2 default + 1 stateful) | 7+ (2 system + 3 app + 2 stateful) |
| Ingress | nginx-ingress (single) | nginx-ingress (HA pair) |
| API Gateway | Traefik (single) | Traefik (HA pair) |
| CD | Manual kubectl apply | GitLab CI/CD pipelines |
| Secrets | Manual (create-secrets.sh) | Sealed Secrets |
| TLS | cert-manager with Let's Encrypt | cert-manager |
| Scaling | Manual | HPA + PDB + topology spread |
| Monitoring | Prometheus + Grafana + Loki | Prometheus + Grafana + Loki + Tempo + OTel |
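
As a rough illustration of how one Kustomize base can serve both environments, a staging overlay might look like the sketch below. The directory layout, the namespace, and the `traefik` Deployment name are assumptions made for this example and are not taken from the BANA repository.

```yaml
# overlays/staging/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Hypothetical namespace; applied to every resource in the overlay.
namespace: bana-staging

# The shared base that both environments build on (assumed location).
resources:
  - ../../base

# Staging runs a single Traefik instance; the Deployment name is assumed.
replicas:
  - name: traefik
    count: 1
```

An overlay like this can be rendered with `kubectl kustomize overlays/staging` or applied with `kubectl apply -k overlays/staging`. A production overlay would point at the same base and could set `count: 2` to get the HA pair.
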
Traffic Flow
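The architecture separates edge ingress (nginx-ingress) from the API gateway (Traefik). Requests reach nginx-ingress first, which handles TLS and static content, and API traffic is then passed on to Traefik. In this setup Traefik does not act as an ingress controller; it runs behind the edge as a backend API gateway that applies rate limiting, circuit breaking, and security headers.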
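
The edge-to-gateway hop described above could be expressed as an Ingress that nginx-ingress serves and that forwards everything to the Traefik Service. This is a minimal sketch: the resource name, the ClusterIssuer name, the TLS Secret name, and the Traefik Service name and port are all assumptions; only the staging API hostname comes from the table above.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-edge                                   # hypothetical name
  annotations:
    # Asks cert-manager to issue the certificate; issuer name is assumed.
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  ingressClassName: nginx                          # assumed IngressClass name
  tls:
    - hosts:
        - sgw.staging.bana.com.vn
      secretName: sgw-staging-tls                  # hypothetical Secret name
  rules:
    - host: sgw.staging.bana.com.vn
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: traefik                      # hypothetical Service name
                port:
                  number: 80                       # assumed port
```

With a layout like this, TLS concerns stay on the edge Ingress, while rate limiting, circuit breaking, and security headers would be configured on the Traefik side.
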
Design Principles
| Principle | Decision | Rationale |
|---|---|---|
| Separate clusters | Staging + Production | Zero blast radius, independent scaling |
| Edge vs Gateway | nginx-ingress (edge) + Traefik (API gateway) | Separation of concerns: TLS and static content at the edge, API middleware at the gateway |
| CI/CD | GitLab CI/CD + GitLab Container Registry | Team already uses GitLab |
| TLS | cert-manager with Let's Encrypt | Auto-renewal, no manual cert management |
| Config | Kustomize overlays | Same base, different overlays per environment |
| Secrets | Manual create-secrets.sh (staging) / Sealed Secrets (production) | Staging secrets are created by hand; production secrets are encrypted so they can be stored safely in Git |
| Identity-first | Init containers wait for identity | All VerifierApps need JWKS from IssuerApp |
| Migrations | K8s Jobs before deployment | Separate from runtime, idempotent |
| Payment split | 2 Deployments from 1 image | Same image runs as api or worker, selected by the APP_MODE env var |
| Signal dual-route | REST with middleware, WebSocket without | No rate-limit on persistent connections |
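
Two of the principles above, the payment split and identity-first startup, can be sketched together in one manifest. Everything here other than `APP_MODE` and its `api` / `worker` values is an assumption for illustration: the Deployment names, the image reference, the identity Service URL and JWKS path, and the idea that the payment API is one of the workloads that needs JWKS before it starts.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api                    # hypothetical name
spec:
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      initContainers:
        # Blocks startup until the identity service answers (assumed URL).
        - name: wait-for-identity
          image: busybox:1.36
          command:
            - sh
            - -c
            - until wget -q --spider http://identity/.well-known/jwks.json; do sleep 2; done
      containers:
        - name: payment
          image: registry.example.com/bana/payment:1.0.0   # placeholder image
          env:
            - name: APP_MODE
              value: api               # run the HTTP API
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-worker                 # hypothetical name
spec:
  selector:
    matchLabels:
      app: payment-worker
  template:
    metadata:
      labels:
        app: payment-worker
    spec:
      containers:
        - name: payment
          image: registry.example.com/bana/payment:1.0.0   # same placeholder image
          env:
            - name: APP_MODE
              value: worker            # run the background worker
```

Keeping both Deployments on one image means a single build and tag per release, while the API and the worker can still be scaled and restarted independently.
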
Documentation Structure
| Page | Description |
|---|---|
| Multi-Tenancy (PRD) | Tenant isolation tiers (Pool / Bridge / Silo) and tenant migration strategy |
| Decisions | Cross-cutting infrastructure ADRs |
| Cluster Design | Node pools, namespaces — staging vs production |
| Workloads | Every Deployment/StatefulSet spec |
| Networking | nginx-ingress + Traefik API gateway, TLS, routing |
| Data Layer | PostgreSQL, Redis, Kafka, Typesense |
| Configuration | ConfigMaps, Secrets, env var mapping |
| Observability | Prometheus, Grafana, Loki, Tempo, OpenTelemetry |
| Security & Hardening | Pod security, RBAC, image supply chain, PriorityClass |
| Operations | GitLab CI/CD, migrations, deployment procedures |