Operations
1. Deployment
| Property | Value |
|---|---|
| Image | registry/nx-seller-sale:<tag> |
| Container Port | 3000 |
| External Port | 31030 |
| Snowflake ID | 3 |
| Replicas (default) | 1 (dev) / 2+ (staging+) |
| Resources (req/lim) | 200m / 1 CPU, 512Mi / 2Gi memory |
| HPA target | CPU 70% (when scaled) |
| Migration mode | RUN_MODE=migrate job before rollout; on-boot for dev |
| Live probe | GET /v1/api/sale/healthz |
| Ready probe | GET /v1/api/sale/readyz |
Traefik routing labels
```yaml
labels:
  - "traefik.enable=true"
  - "traefik.http.routers.sale.rule=PathPrefix(`/v1/api/sale`)"
  - "traefik.http.services.sale.loadbalancer.server.port=3000"
```
Required infrastructure
| Dependency | Why |
|---|---|
| PostgreSQL | Primary datastore (schemas sale, allocation) |
| Kafka brokers | Mandatory — producer used for downstream notifications |
| Redis | Optional — auth cache; service starts without it |
| @nx/identity reachable | JWKS verification on every JWT |
| @nx/pricing reachable | Checkout fails without pricing service |
| @nx/mq-pay reachable | Webhook source; must be on the same internal network |
| @nx/signal reachable | WebSocket fanout |
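
The table separates dependencies the service cannot run without from ones it tolerates losing. A minimal sketch of a boot-time gate follows, assuming a hypothetical `DependencyStatus` map filled in by whatever connectivity checks the service performs. Treating PostgreSQL as boot-blocking alongside Kafka is an assumption: the table only marks Kafka as mandatory and Redis as optional.

```typescript
// Hypothetical names throughout: this is not the service's actual boot code.
type Dependency = "postgres" | "kafka" | "redis";

type DependencyStatus = Record<Dependency, boolean>;

// Assumption: PostgreSQL and Kafka block startup; Redis only degrades the auth cache.
const REQUIRED: readonly Dependency[] = ["postgres", "kafka"];
const OPTIONAL: readonly Dependency[] = ["redis"];

interface BootDecision {
  canStart: boolean;
  missingRequired: Dependency[];
  degraded: Dependency[];
}

// Decides whether the service may start and reports which dependencies are down.
function evaluateBoot(status: DependencyStatus): BootDecision {
  const missingRequired = REQUIRED.filter((dep) => !status[dep]);
  const degraded = OPTIONAL.filter((dep) => !status[dep]);
  return { canStart: missingRequired.length === 0, missingRequired, degraded };
}
```

With Redis down and everything else up, `evaluateBoot` reports `canStart: true` and lists `redis` under `degraded`. With Kafka down it reports `canStart: false`.
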
2. Observability
| Signal | Source | Where to look |
|---|---|---|
| Logs | stdout (IGNIS structured logger, key: %s format) | kubectl logs deploy/sale / Loki |
| Health | GET /v1/api/sale/healthz, GET /v1/api/sale/readyz | Gateway portal |
| OpenAPI live spec | GET /v1/api/sale/doc/openapi.json | Gateway portal explorer |
| Metrics | Traefik gateway :30800 (Prometheus scrape) | Grafana — gateway dashboard |
Key log fields
| Field | Source | Notes |
|---|---|---|
| requestId | header X-Request-Id | Propagated cross-service |
| userId | JWT subject | - |
| merchantId | request scope | - |
| saleOrderId | service operations | Critical for trace |
| kitchenTicketItemId | kitchen ticket flow | - |
| topic / partition / offset | Kafka emit logs | - |
Useful log queries
| Question | Query |
|---|---|
| Webhook routing failures | level=error AND PaymentWebhookService |
| Pricing service errors at checkout | level=error AND PricingNetworkService |
| Order add-item lock waits | SELECT FOR UPDATE AND wait |
| Kitchen status mismatch | KITCHEN_TICKET_ITEM_STATUS_CHANGED AND unexpected |
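
Each query above is a conjunction of terms. For ad hoc filtering of raw `kubectl logs` output outside Loki, the same AND semantics can be sketched as a plain substring match. The function name and the sample lines are illustrative. Nothing is assumed about the IGNIS logger's line layout beyond the terms appearing as text.

```typescript
// Returns the lines that contain every term (case-sensitive substring match),
// mirroring queries such as "level=error AND PaymentWebhookService".
function filterLogLines(lines: string[], terms: string[]): string[] {
  return lines.filter((line) => terms.every((term) => line.includes(term)));
}

// Illustrative input only; real log lines will look different.
const sampleLogLines = [
  "level=error scope=PaymentWebhookService msg=routing failed",
  "level=info scope=PaymentWebhookService msg=received",
  "level=error scope=PricingNetworkService msg=timeout",
];

const webhookErrors = filterLogLines(sampleLogLines, ["level=error", "PaymentWebhookService"]);
console.log(webhookErrors.length); // 1
```
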
3. Security
| Concern | Mitigation |
|---|---|
| AuthN | JWT (ES256, JWKS pulled from identity at boot + on-demand) |
| AuthZ | Casbin via PolicyDefinitionService; permissions cached in Redis (or in-memory) |
| Webhook trust | /webhooks/payment has no auth — relies on Cilium network policy to allow only MQ-Pay → sale |
| Service-to-service | BASIC strategy for cross-package calls |
| Secrets | K8s Secret mounted as env (APP_ENV_DB_URL, APP_ENV_KAFKA_SASL_PASSWORD, etc.) |
| TLS | Terminated at Nginx; traffic from Nginx → Traefik → service is plaintext (intra-cluster) |
| Rate limit | Traefik middleware (default 100 rps/IP); webhook endpoint may have separate quota |
| Network policy | Cilium — allow only gateway + Kafka + Postgres + Redis + identity + pricing + MQ-Pay |
| Soft-delete | deletedAt on all sale entities; archive instead of hard-delete |
| Race protection | SELECT ... FOR UPDATE on order rows during add-item, checkout, merge, split |
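
The race-protection row relies on PostgreSQL row locks. As a rough sketch of the statement order for add-item (not the service's actual queries), the plan below locks the order row before touching its items. `"SaleOrder"`, `"SaleOrderItem"`, and `status` appear elsewhere on this page. The `id`, `"saleOrderId"`, `"productId"`, and `quantity` columns, and every TypeScript name, are assumptions made for illustration.

```typescript
// A statement plus its positional parameters, in the shape most PostgreSQL clients accept.
interface Statement {
  text: string;
  values: unknown[];
}

// Hypothetical input shape; the column names used below are assumptions.
interface NewItem {
  productId: string;
  quantity: number;
}

// Builds the ordered statements for an add-item transaction. The SELECT ... FOR UPDATE
// takes a row lock on the order, so a concurrent add-item, checkout, merge, or split
// on the same order blocks at its own SELECT until this transaction ends.
function buildAddItemPlan(orderId: string, item: NewItem): Statement[] {
  return [
    { text: "BEGIN", values: [] },
    { text: 'SELECT id, status FROM "SaleOrder" WHERE id = $1 FOR UPDATE', values: [orderId] },
    {
      text: 'INSERT INTO "SaleOrderItem" ("saleOrderId", "productId", quantity) VALUES ($1, $2, $3)',
      values: [orderId, item.productId, item.quantity],
    },
    { text: "COMMIT", values: [] },
  ];
}
```

All four statements have to run on the same database connection, and a failure at any step should issue `ROLLBACK` instead of `COMMIT` so the lock is released.
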
4. Runbook
4.1 Alert classes
| Alert | Trigger | Check | Fix | Escalate |
|---|---|---|---|---|
| SaleHighErrorRate | 5xx >5% over 5m | kubectl logs deploy/sale \| grep level=error | identify failing endpoint; rollback last deploy if recent | on-call backend |
| SaleWebhookFailures | webhook 5xx errors rising | grep PaymentWebhookController | check payload schema drift; check sister-service status | on-call backend + payment team |
| SaleCheckoutFailures | /checkout 4xx/5xx spike | grep PricingNetworkService; check pricing service health | restart pricing svc; expand pricing replicas | on-call backend |
| SaleKafkaProduceFailure | KafkaProducer errors logged | bun -e 'cluster status' | check broker health, SASL creds | on-call SRE |
| SaleStuckOrder | orders in PROCESSING >1h | DB query WHERE status='PROCESSING' AND processingAt < now() - interval '1h' | check MQ-Pay attempts; manual cancel if stale | on-call backend |
| SaleSessionLeak | open PosSession per device >24h | DB query | manual close + reconciliation | on-call ops |
| SaleAllocationStuck | AllocationUsage.ACTIVE >24h post-cancel | DB query | manual cancel cascade | on-call backend |
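
The SaleStuckOrder check reduces to a status filter plus an age cutoff. The sketch below applies the same predicate in memory, for example to rows already exported from the database. `OrderRow` is a hypothetical projection containing only the fields the check reads, and the one-hour default mirrors the trigger in the table.

```typescript
// Hypothetical projection of an order row; only the fields the alert needs.
interface OrderRow {
  id: string;
  status: string;
  processingAt: Date | null;
}

const ONE_HOUR_MS = 60 * 60 * 1000;

// Mirrors: WHERE status = 'PROCESSING' AND processingAt < now() - interval '1h'
function findStuckOrders(
  orders: OrderRow[],
  now: Date,
  maxAgeMs: number = ONE_HOUR_MS,
): OrderRow[] {
  const cutoff = now.getTime() - maxAgeMs;
  return orders.filter(
    (o) =>
      o.status === "PROCESSING" &&
      o.processingAt !== null &&
      o.processingAt.getTime() < cutoff,
  );
}
```

Rows with a null `processingAt` are skipped, which matches SQL semantics: a comparison against NULL is never true.
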
4.2 Common operations
| Operation | Command |
|---|---|
| Tail logs | kubectl logs -n <ns> -f deploy/sale |
| Run migrations manually | kubectl exec -it deploy/sale -- bun run migrate |
| Manually transition stuck order | DB UPDATE (after backup) — requires senior approval |
| Replay a webhook event | Trigger MQ-Pay redelivery via its admin API |
| Inspect WebSocket subscribers | @nx/signal admin panel |
| Audit cancellation reason | SELECT cancellationReason, count(*) FROM "SaleOrder" WHERE cancelledAt > ... GROUP BY 1 |
4.3 Recovery scenarios
| Scenario | Recovery |
|---|---|
| Service crash mid-checkout | Order remains DRAFT (transaction rolled back); UI prompts retry |
| Webhook arrived but Kafka emit failed | _enqueuePaymentSuccess logs the error; manual replay tool publishes from order snapshot. Inventory remains uncharged until replay. |
| Duplicate webhook redelivery | Handler is idempotent: the event is treated as a no-op if the order is already at the target status |
| Stuck PROCESSING after MQ-Pay timeout | Manually cancel → status CANCELLED → AllocationUsage cancels |
| Wrong total after merge/split | Use transferHistory jsonb on SaleOrderItem to audit; rollback via OrderMergeService.rollback or OrderSplitService reverse op |
| Lost POS session | closeRecountCount tracks recount attempts; manually close + reconcile |
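
The duplicate-redelivery row depends on the handler being idempotent. A minimal sketch of that check follows, under stated assumptions: `DRAFT`, `PROCESSING`, and `CANCELLED` appear on this page, while `PAID` and all type and function names are hypothetical. Validation of which transitions are legal is deliberately left out.

```typescript
// Hypothetical status set; PAID is an assumption, the others appear on this page.
type OrderStatus = "DRAFT" | "PROCESSING" | "PAID" | "CANCELLED";

interface WebhookResult {
  applied: boolean;
  status: OrderStatus;
}

// Applies a status transition requested by a webhook. A redelivered event whose
// target status the order already has is acknowledged without side effects.
function applyWebhook(current: OrderStatus, target: OrderStatus): WebhookResult {
  if (current === target) {
    return { applied: false, status: current }; // duplicate delivery: no-op
  }
  return { applied: true, status: target };
}

const firstDelivery = applyWebhook("PROCESSING", "PAID");
const redelivery = applyWebhook(firstDelivery.status, "PAID");
console.log(firstDelivery.applied, redelivery.applied); // true false
```

Because the second call changes nothing, downstream side effects such as Kafka emits should be tied to `applied === true`.
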
5. Cross-Service Runbook
For incidents that span multiple services, see the central runbook/.
6. Related Pages
- Configuration
- API Events — Kafka topic constants for replay commands
- Integration — sister-service network
- Decisions