
Operations

1. Deployment

| Property | Value |
| --- | --- |
| Image | `registry/nx-seller-sale:<tag>` |
| Container Port | 3000 |
| External Port | 31030 |
| Snowflake ID | 3 |
| Replicas (default) | 1 (dev) / 2+ (staging+) |
| Resources (req/lim) | 200m / 1 CPU, 512Mi / 2Gi memory |
| HPA target | CPU 70% (when scaled) |
| Migration mode | `RUN_MODE=migrate` job before rollout; on-boot for dev |
| Live probe | `GET /v1/api/sale/healthz` |
| Ready probe | `GET /v1/api/sale/readyz` |

Traefik routing labels

```yaml
labels:
  - "traefik.enable=true"
  - "traefik.http.routers.sale.rule=PathPrefix(`/v1/api/sale`)"
  - "traefik.http.services.sale.loadbalancer.server.port=3000"
```

Required infrastructure

| Dependency | Why |
| --- | --- |
| PostgreSQL | Primary datastore (schemas `sale`, `allocation`) |
| Kafka brokers | Mandatory; producer used for downstream notifications |
| Redis | Optional (auth cache); service starts without it |
| `@nx/identity` reachable | JWKS verification on every JWT |
| `@nx/pricing` reachable | Checkout fails without pricing service |
| `@nx/mq-pay` reachable | Webhook source; must be on the same internal network |
| `@nx/signal` reachable | WebSocket fanout |
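
The mandatory/optional split in the table above can be sketched as a small startup check. This is a minimal illustration, not the service's actual boot logic: `Dependency`, `DEPENDENCIES`, and `evaluateStartup` are hypothetical names, and treating PostgreSQL as blocking is an assumption (the table only calls it the primary datastore). The other `@nx/*` services are left out because the table does not say whether they block startup.

```typescript
// Hypothetical sketch: decide whether the service may start given which
// dependencies are reachable. Names and shapes are illustrative only.
type Dependency = { name: string; required: boolean };

const DEPENDENCIES: Dependency[] = [
  { name: "postgres", required: true }, // assumption: primary datastore blocks startup
  { name: "kafka", required: true }, // "Mandatory" per the table
  { name: "redis", required: false }, // "Optional": service starts without it
];

type StartupDecision = {
  canStart: boolean;
  blocking: string[]; // required dependencies that are down
  degraded: string[]; // optional dependencies that are down
};

function evaluateStartup(
  reachable: Record<string, boolean>,
  deps: Dependency[] = DEPENDENCIES,
): StartupDecision {
  // A dependency missing from the map is treated as unreachable.
  const down = deps.filter((d) => !reachable[d.name]);
  const blocking = down.filter((d) => d.required).map((d) => d.name);
  const degraded = down.filter((d) => !d.required).map((d) => d.name);
  return { canStart: blocking.length === 0, blocking, degraded };
}
```

With Redis down the sketch reports a degraded but startable service; with Kafka down it reports a blocked start.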

2. Observability

| Signal | Source | Where to look |
| --- | --- | --- |
| Logs | stdout (IGNIS structured logger, `key: %s` format) | `kubectl logs deploy/sale` / Loki |
| Health | `GET /v1/api/sale/healthz`, `GET /v1/api/sale/readyz` | Gateway portal |
| OpenAPI live spec | `GET /v1/api/sale/doc/openapi.json` | Gateway portal explorer |
| Metrics | Traefik gateway `:30800` (Prometheus scrape) | Grafana (gateway dashboard) |

Key log fields

| Field | Source | Notes |
| --- | --- | --- |
| `requestId` | header `X-Request-Id` | Propagated cross-service |
| `userId` | JWT subject | |
| `merchantId` | request scope | |
| `saleOrderId` | service operations | Critical for trace |
| `kitchenTicketItemId` | kitchen ticket flow | |
| `topic` / `partition` / `offset` | Kafka emit logs | |

Useful log queries

| Question | Query |
| --- | --- |
| Webhook routing failures | `level=error AND PaymentWebhookService` |
| Pricing service errors at checkout | `level=error AND PricingNetworkService` |
| Order add-item lock waits | `SELECT FOR UPDATE AND wait` |
| Kitchen status mismatch | `KITCHEN_TICKET_ITEM_STATUS_CHANGED AND unexpected` |
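
If the logs are searched in Loki, the `A AND B` queries above map naturally onto chained LogQL line filters (`|=`). The helper below is a hedged sketch of that translation: `toLogQL` is a hypothetical function, and the `{app="sale"}` stream selector is an assumption about how the deployment is labelled.

```typescript
// Hypothetical helper: turn an "A AND B" query from the table into a LogQL
// expression with one "line contains" filter per term.
// The default stream selector is an assumption; adjust it to the real labels.
function toLogQL(query: string, selector = '{app="sale"}'): string {
  const terms = query
    .split(/\s+AND\s+/) // only an upper-case AND separates terms
    .map((t) => t.trim())
    .filter((t) => t.length > 0);
  // JSON.stringify yields a double-quoted, escaped string literal.
  const filters = terms.map((t) => `|= ${JSON.stringify(t)}`);
  return [selector, ...filters].join(" ");
}

console.log(toLogQL("level=error AND PaymentWebhookService"));
// prints: {app="sale"} |= "level=error" |= "PaymentWebhookService"
```

Each term becomes a case-sensitive substring filter, so the spelling of class names such as `PricingNetworkService` must match the log output exactly.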

3. Security

| Concern | Mitigation |
| --- | --- |
| AuthN | JWT (ES256, JWKS pulled from identity at boot + on-demand) |
| AuthZ | Casbin via `PolicyDefinitionService`; permissions cached in Redis (or in-memory) |
| Webhook trust | `/webhooks/payment` has no auth; relies on Cilium network policy to allow only MQ-Pay → sale |
| Service-to-service | BASIC strategy for cross-package calls |
| Secrets | K8s Secret mounted as env (`APP_ENV_DB_URL`, `APP_ENV_KAFKA_SASL_PASSWORD`, etc.) |
| TLS | Terminated at Nginx → Traefik → service in plaintext (intra-cluster) |
| Rate limit | Traefik middleware (default 100 rps/IP); webhook endpoint may have separate quota |
| Network policy | Cilium: allow only gateway + Kafka + Postgres + Redis + identity + pricing + MQ-Pay |
| Soft-delete | `deletedAt` on all sale entities; archive instead of hard-delete |
| Race protection | `SELECT ... FOR UPDATE` on order rows during add-item, checkout, merge, split |
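
The race-protection row can be illustrated with a row lock taken inside a transaction. The sketch below is an assumption-laden outline rather than the service's code: the `Tx` interface, `withOrderLock`, and the `id` column are hypothetical, and only the `"SaleOrder"` table name and the `FOR UPDATE` technique come from this document.

```typescript
// Hypothetical minimal query interface. In practice this must wrap ONE
// dedicated connection, because BEGIN/COMMIT are per-connection.
interface Tx {
  query(sql: string, params: unknown[]): Promise<{ rows: Record<string, unknown>[] }>;
}

// Lock the order row, run `work` while the lock is held, then commit.
// Concurrent callers for the same order block on the SELECT until the
// first transaction commits or rolls back.
async function withOrderLock<T>(
  tx: Tx,
  saleOrderId: string,
  work: (order: Record<string, unknown>) => Promise<T>,
): Promise<T> {
  await tx.query("BEGIN", []);
  try {
    const { rows } = await tx.query(
      'SELECT * FROM "SaleOrder" WHERE id = $1 FOR UPDATE', // `id` column is an assumption
      [saleOrderId],
    );
    const order = rows[0];
    if (!order) throw new Error(`SaleOrder ${saleOrderId} not found`);
    const result = await work(order);
    await tx.query("COMMIT", []);
    return result;
  } catch (err) {
    await tx.query("ROLLBACK", []); // releases the row lock on failure
    throw err;
  }
}
```

Add-item, checkout, merge, and split would each pass their mutation as `work`, so two operations on the same order run one after the other instead of interleaving.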

4. Runbook

4.1 Alert classes

| Alert | Trigger | Check | Fix | Escalate |
| --- | --- | --- | --- | --- |
| SaleHighErrorRate | 5xx >5% over 5m | `kubectl logs deploy/sale \| grep level=error` | identify failing endpoint; rollback last deploy if recent | on-call backend |
| SaleWebhookFailures | webhook 5xx errors rising | `grep PaymentWebhookController` | check payload schema drift; check sister-service status | on-call backend + payment team |
| SaleCheckoutFailures | `/checkout` 4xx/5xx spike | `grep PricingNetworkService`; check pricing service health | restart pricing svc; expand pricing replicas | on-call backend |
| SaleKafkaProduceFailure | `KafkaProducer` errors logged | `bun -e 'cluster status'` | check broker health, SASL creds | on-call SRE |
| SaleStuckOrder | orders in `PROCESSING` >1h | DB query `WHERE status='PROCESSING' AND processingAt < now() - interval '1h'` | check MQ-Pay attempts; manual cancel if stale | on-call backend |
| SaleSessionLeak | open `PosSession` per device >24h | DB query | manual close + reconciliation | on-call ops |
| SaleAllocationStuck | `AllocationUsage.ACTIVE` >24h post-cancel | DB query | manual cancel cascade | on-call backend |
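
To make the `SaleHighErrorRate` trigger concrete, the sketch below evaluates "5xx >5% over 5m" against a list of request samples. It only illustrates the threshold semantics; the real alert presumably lives in the monitoring stack, and `RequestSample` and `isHighErrorRate` are hypothetical names.

```typescript
// Hypothetical request sample: when it completed and its HTTP status.
type RequestSample = { timestampMs: number; status: number };

// Returns true when strictly more than `threshold` of the requests inside
// the trailing window ended with a 5xx status.
function isHighErrorRate(
  samples: RequestSample[],
  nowMs: number,
  windowMs = 5 * 60_000, // 5 minutes, per the alert table
  threshold = 0.05, // 5%, per the alert table
): boolean {
  const recent = samples.filter(
    (s) => s.timestampMs > nowMs - windowMs && s.timestampMs <= nowMs,
  );
  if (recent.length === 0) return false; // no traffic, nothing to alert on
  const errors = recent.filter((s) => s.status >= 500 && s.status <= 599).length;
  return errors / recent.length > threshold;
}
```

Because the comparison is strict, exactly 5 failures out of 100 requests does not fire while 6 out of 100 does.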

4.2 Common operations

| Operation | Command |
| --- | --- |
| Tail logs | `kubectl logs -n <ns> -f deploy/sale` |
| Run migrations manually | `kubectl exec -it deploy/sale -- bun run migrate` |
| Manually transition stuck order | DB `UPDATE` (after backup); requires senior approval |
| Replay a webhook event | Trigger MQ-Pay redelivery via its admin API |
| Inspect WebSocket subscribers | `@nx/signal` admin panel |
| Audit cancellation reason | `SELECT cancellationReason, count(*) FROM "SaleOrder" WHERE cancelledAt > ... GROUP BY 1` |

4.3 Recovery scenarios

| Scenario | Recovery |
| --- | --- |
| Service crash mid-checkout | Order remains `DRAFT` (transaction rolled back); UI prompts retry |
| Webhook arrived but Kafka emit failed | `_enqueuePaymentSuccess` logs the error; manual replay tool publishes from order snapshot. Inventory remains uncharged until replay. |
| Duplicate webhook redelivery | Handler is idempotent: it skips the event as a no-op if the order is already at the target status |
| Stuck `PROCESSING` after MQ-Pay timeout | Manually cancel → status `CANCELLED` → `AllocationUsage` cancels |
| Wrong total after merge/split | Use `transferHistory` jsonb on `SaleOrderItem` to audit; rollback via `OrderMergeService.rollback` or `OrderSplitService` reverse op |
| Lost POS session | `closeRecountCount` tracks recount attempts; manually close + reconcile |
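
The duplicate-redelivery row relies on the handler being idempotent. A minimal sketch of that check, assuming a hypothetical `Order` shape and `applyWebhook` function (the real handler's signature is not shown in this document):

```typescript
// Hypothetical order shape; only the status names used below appear in this document.
type Order = { id: string; status: string };
type WebhookResult = { applied: boolean; order: Order };

// Apply a status transition requested by a webhook. If the order is already
// at the target status (for example on redelivery), return it unchanged.
function applyWebhook(order: Order, targetStatus: string): WebhookResult {
  if (order.status === targetStatus) {
    return { applied: false, order }; // no-op: duplicate delivery
  }
  return { applied: true, order: { ...order, status: targetStatus } };
}

const first = applyWebhook({ id: "o1", status: "PROCESSING" }, "CANCELLED");
const second = applyWebhook(first.order, "CANCELLED"); // simulated redelivery
console.log(first.applied, second.applied); // prints: true false
```

Only the first delivery reports `applied: true`, so downstream side effects such as Kafka emits can be gated on that flag.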

5. Cross-Service Runbook

For incidents that span multiple services, see the central `runbook/` directory.

Proprietary and Confidential. Unauthorized copying, distribution, or use of this software is strictly prohibited.