A production-grade distributed e-commerce system built for CS6650 Distributed Systems. It demonstrates asynchronous microservices communication, fault injection, AWS cloud infrastructure, and message durability patterns.

Core business flow:
Customer → Browse Products → Add to Cart → Checkout (payment authorization) → Warehouse processes order asynchronously

Key design highlights:

  • Async order processing via RabbitMQ decouples checkout latency from warehouse processing
  • 10% payment decline rate is intentional — tests system behavior under partial failures
  • ALB Automatic Target Weights (ATW) demonstrates real-time anomaly detection and traffic shifting
  • Publisher Confirms + Manual ACK guarantees no order is silently lost

Tech Stack

| Layer | Technology | Why |
| --- | --- | --- |
| Language | Java 17 | Modern LTS, strong concurrency primitives |
| Web Framework | Spring Boot 3.2.5 | Auto-configuration, REST controllers, DI |
| Message Broker | RabbitMQ 3 | Durable queues, Publisher Confirms, AMQP protocol |
| RabbitMQ Client | amqp-client 5.20.0 | Publisher Confirms, manual ACK, channel pool |
| Channel Pool | Commons Pool2 2.12.0 | GenericObjectPool — reuse expensive Channel objects |
| Container Runtime | Docker + ECS Fargate | Serverless containers, no EC2 management |
| Load Balancer | AWS ALB | Path-based routing, ATW anomaly detection |
| Service Discovery | AWS Cloud Map | Private DNS for internal service resolution |
| IaC | Terraform 1.5+ | Reproducible AWS infrastructure |
| Serialization | Jackson | JSON encode/decode for order messages |
| Local Dev | Docker Compose | Run all 6 services locally |

Architecture Decision Questions

Q: Why use RabbitMQ instead of direct service-to-service HTTP calls for the warehouse?

The warehouse processes orders asynchronously — there’s no reason the customer needs to wait for the warehouse to complete. Using RabbitMQ:

  • Decouples latency: checkout returns 200 as soon as the message is durably stored; warehouse processes at its own pace
  • Absorbs spikes: if the warehouse is slow, messages buffer in the queue instead of backing up into the shopping cart service
  • Fault isolation: if the warehouse crashes, messages stay in the queue and get redelivered when it recovers. With direct HTTP, those orders would be lost.

Q: Why use RabbitMQ over Kafka?

For this use case, RabbitMQ is the better fit:

  • Push-based delivery: RabbitMQ pushes messages to consumers (basicConsume); Kafka consumers must poll and track offsets themselves
  • Per-message ACK: RabbitMQ’s manual ACK model fits our at-least-once requirement cleanly; with Kafka you’d manage offset commits yourself
  • Simpler ops: a single broker, with no cluster coordination to run at this scale

Kafka would be the better choice for high-throughput event streaming or fan-out to multiple consumer groups.

Q: Why does the Shopping Cart call the Credit Card Authorizer synchronously?

  • Payment authorization is a blocking business requirement — you cannot confirm a checkout without knowing if the card is approved. The customer is waiting for a real-time decision. Async payment would require a callback mechanism, adding complexity without benefit for this flow.

  • The trade-off is tight coupling: if CCA is down, checkout fails. We accept this because payment is in the critical path.

Q: Why does the Shopping Cart call CCA via the ALB instead of directly?
  • In this project it routes through the ALB for simplicity — one DNS name for all services. This is a known trade-off: CCA ends up publicly accessible, which is a security risk for a payment service.
  • The better production design is to register CCA with Cloud Map (cca.cs6650.local) the same way RabbitMQ is handled, so Shopping Cart calls it directly via private DNS. This keeps CCA internal-only and removes it from the ALB entirely.


Concurrency & Thread Safety

Q: How do you handle concurrent checkout requests in the Shopping Cart Service?

  • Cart and order state are stored in ConcurrentHashMap with AtomicInteger counters.
  • Reads are lock-free; writes use CAS plus fine-grained per-bin locking, supporting high throughput without a global lock.
  • The RabbitMQ channel pool (max 20) caps concurrent publishes. If request #21 arrives, it waits up to 5 seconds for a channel to be returned. This provides backpressure naturally.
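The cart-side state can be sketched as follows. This is a minimal illustration of the ConcurrentHashMap + AtomicInteger pattern described above; CartStore, addItem, and nextOrderId are hypothetical names, not the service's actual API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of concurrent cart state. merge() is atomic per key, so two
// concurrent add-to-cart calls for the same product never lose an update.
class CartStore {
    // cartId -> (productId -> quantity)
    private final Map<String, ConcurrentHashMap<Integer, Integer>> carts = new ConcurrentHashMap<>();
    private final AtomicInteger orderIds = new AtomicInteger();

    void addItem(String cartId, int productId, int qty) {
        carts.computeIfAbsent(cartId, id -> new ConcurrentHashMap<>())
             .merge(productId, qty, Integer::sum); // atomic read-modify-write per key
    }

    int nextOrderId() { return orderIds.incrementAndGet(); } // CAS increment, no lock

    Integer quantity(String cartId, int productId) {
        ConcurrentHashMap<Integer, Integer> cart = carts.get(cartId);
        return cart == null ? null : cart.get(productId);
    }
}
```

No explicit synchronization is needed: each hot path is a single atomic map or counter operation.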

Q: Why use a channel pool instead of one channel per request?

Creating a new RabbitMQ Channel is cheap but not free — it involves a network round-trip to the broker. More critically, enabling Publisher Confirms (confirmSelect()) and declaring the queue add further per-channel setup cost.

With a pool of 20 channels:

  • Channels are created once, reused across thousands of requests
  • Pool validates channel.isOpen() on borrow — if a channel has been closed, the pool discards it and creates a new one
  • The pool enforces a cap on concurrent confirms, preventing the broker from being overwhelmed
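The pattern can be illustrated with a minimal bounded pool built on java.util.concurrent. The real project uses commons-pool2's GenericObjectPool, and PoolableChannel here is a stand-in interface, not the amqp-client Channel class:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Stand-in for com.rabbitmq.client.Channel.
interface PoolableChannel {
    boolean isOpen();
}

// Minimal sketch of the pooling behavior GenericObjectPool provides:
// bounded size, borrow-with-timeout (backpressure), validate-on-borrow.
class ChannelPool {
    private final BlockingQueue<PoolableChannel> idle;
    private final Supplier<PoolableChannel> factory;

    ChannelPool(int maxTotal, Supplier<PoolableChannel> factory) {
        this.idle = new ArrayBlockingQueue<>(maxTotal);
        this.factory = factory;
        for (int i = 0; i < maxTotal; i++) idle.add(factory.get()); // create once, reuse
    }

    // Borrow with a 5s cap: request #21 waits here, giving natural backpressure.
    PoolableChannel borrow() {
        try {
            PoolableChannel ch = idle.poll(5, TimeUnit.SECONDS);
            if (ch == null) throw new IllegalStateException("no channel available within 5s");
            if (!ch.isOpen()) ch = factory.get(); // validate on borrow; replace closed channels
            return ch;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    void giveBack(PoolableChannel ch) {
        idle.offer(ch);
    }
}
```

GenericObjectPool adds eviction, metrics, and configurable exhaustion policies on top of this core idea.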

Q: How does the Warehouse Consumer handle concurrent message processing?

10 threads each have a dedicated RabbitMQ Channel (channels are not thread-safe). Shared state is protected by atomic types:

  • ConcurrentHashMap<Integer, AtomicLong> for product quantities — lock-free per-key updates
  • AtomicLong orderCount — CAS increment, no lock

basicQos(10) per channel limits unacked messages per thread to 10, preventing any single thread from hoarding the queue.
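The shared-state portion (everything except the RabbitMQ delivery itself) can be sketched like this; WarehouseState and applyOrder are illustrative names:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Warehouse shared state: lock-free per-key updates, CAS counters.
class WarehouseState {
    final Map<Integer, AtomicLong> quantities = new ConcurrentHashMap<>();
    final AtomicLong orderCount = new AtomicLong();

    // Called by each consumer thread after deserializing an order.
    void applyOrder(int productId, long qty) {
        // computeIfAbsent is atomic per key; addAndGet is a CAS increment.
        quantities.computeIfAbsent(productId, id -> new AtomicLong()).addAndGet(qty);
        orderCount.incrementAndGet();
    }
}
```

Because every update is a single atomic operation, ten threads can apply orders concurrently with no explicit locking.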


Reliability & Messaging

Q: What is Publisher Confirms and why is it critical here?

By default, basicPublish is fire-and-forget — the broker might lose the message before writing it to disk. Publisher Confirms (channel.confirmSelect()) makes the broker send an ACK after durably persisting the message.

In Shopping Cart, waitForConfirmsOrDie(5000ms) blocks until the ACK arrives. If the broker NACKs or times out, it throws an exception — checkout returns 500 instead of silently succeeding with a lost order.

Without confirms: Cart returns 200 to the customer, but the order never reached the warehouse. Customer thinks checkout succeeded; warehouse never sees it.

With confirms: Cart only returns 200 if the message is durably stored. If anything fails, the customer gets an error and can retry.
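The confirm-or-fail control flow can be simulated without a broker. BrokerStub below is a hand-rolled stand-in, not the amqp-client API; in the real service the calls are channel.basicPublish(...) followed by channel.waitForConfirmsOrDie(5000):

```java
// Stand-in for a RabbitMQ Channel with Publisher Confirms enabled.
class BrokerStub {
    private final boolean ackWithinTimeout;
    BrokerStub(boolean ackWithinTimeout) { this.ackWithinTimeout = ackWithinTimeout; }

    void publish(String msg) { /* message sent to the broker */ }

    // Mimics waitForConfirmsOrDie: throws if the broker NACKs or times out.
    void waitForConfirmsOrDie(long timeoutMs) {
        if (!ackWithinTimeout) throw new IllegalStateException("confirm timeout");
    }
}

class CheckoutHandler {
    // Returns the HTTP status the cart would send to the customer.
    static int checkout(BrokerStub ch, String orderJson) {
        try {
            ch.publish(orderJson);
            ch.waitForConfirmsOrDie(5000); // block until the broker ACKs durability
            return 200;                    // order is durably stored
        } catch (Exception e) {
            return 500;                    // never report success for a lost order
        }
    }
}
```

The key property: the 200 status is unreachable unless the confirm arrived, so a lost order can never look like a successful checkout.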

Q: What is Manual ACK and what problem does it solve?

With autoAck=true, the broker deletes a message as soon as it’s delivered to the consumer — before the consumer has finished processing it. If the consumer crashes mid-processing, the message is gone.

With autoAck=false, the message stays in the queue until basicAck() is called. The warehouse only ACKs after successfully updating its state. If it crashes, the broker redelivers the message to another thread or on the next startup.

This gives at-least-once delivery — every order is processed at least once. The trade-off is potential duplicate processing if ACK is lost after processing. For this demo, duplicates are acceptable; in production you’d add idempotency keys.
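The manual-ACK lifecycle can be simulated with a small queue stub (a stand-in for broker behavior, not the amqp-client API): delivered-but-unacked messages survive a consumer crash and return to the ready queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simulation of autoAck=false semantics: the broker keeps a delivered
// message in an "unacked" set until basicAck arrives.
class QueueStub {
    private final Queue<String> ready = new ArrayDeque<>();
    private final Queue<String> unacked = new ArrayDeque<>();

    void enqueue(String msg) { ready.add(msg); }

    // Broker hands a message to a consumer but keeps it until ACKed.
    String deliver() {
        String m = ready.poll();
        if (m != null) unacked.add(m);
        return m;
    }

    void ack(String msg) { unacked.remove(msg); }  // processing finished: delete for good

    // Crash before ACK: everything in flight is redelivered.
    void consumerCrashed() { ready.addAll(unacked); unacked.clear(); }

    int depth() { return ready.size() + unacked.size(); }
}
```

With autoAck=true the message would leave depth at delivery time, and a crash would lose it; here depth only drops on an explicit ack.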

Q: What happens if RabbitMQ crashes?

  • The queue is declared with durable=true and messages with PERSISTENT_TEXT_PLAIN. Both survive a broker restart.
  • During downtime, Shopping Cart’s waitForConfirmsOrDie(5s) will time out → checkout returns 500 → client retries
  • After restart, the warehouse reconnects and continues processing from the durable queue

Q: What happens if the Warehouse Consumer crashes mid-processing?

Since ACK is only sent after successful processing, the message is redelivered to another thread or on restart. This is the core guarantee of autoAck=false. Combined with basicQos(10), at most 10 messages per thread are “in flight” at any time — limiting potential redelivery scope.


System Design & Scalability

Q: What are the scalability bottlenecks in this system?

  1. Shopping Cart channel pool (max 20): caps concurrent checkout throughput. Fix: increase pool size or add more cart instances
  2. Single RabbitMQ broker: single point of failure and throughput ceiling. Fix: RabbitMQ cluster with quorum queues
  3. In-memory state: carts and products are lost on restart. Fix: persistent storage (Redis, PostgreSQL)
  4. 10 fixed warehouse threads: throughput is fixed. Fix: autoscaling consumer instances or increasing thread count

Q: How would you make the Warehouse Consumer idempotent?

Add an orderId to each message (already present). Before processing, check a processedOrders set (backed by Redis or a DB). If the orderId is already there, skip processing and ACK. This prevents duplicate quantity updates from redelivered messages.
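A minimal sketch of that dedup check, using an in-memory set as a stand-in for the Redis/DB-backed store (IdempotentProcessor and process are illustrative names):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Dedup on orderId before applying state changes.
class IdempotentProcessor {
    private final Set<String> processedOrders = ConcurrentHashMap.newKeySet();
    final AtomicLong applied = new AtomicLong();

    // Returns true if the order was applied, false if it was a duplicate.
    // Either way the caller should ACK so the broker stops redelivering.
    boolean process(String orderId) {
        if (!processedOrders.add(orderId)) return false; // add() is atomic: only one thread wins
        applied.incrementAndGet(); // the actual quantity update would happen here
        return true;
    }
}
```

In production the set membership check would be Redis SET NX (or a unique DB constraint) so dedup survives consumer restarts.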

Q: What consistency guarantees does this system provide?

  • Checkout → RabbitMQ: at-least-once write — Publisher Confirms guarantee the message is durably stored before checkout returns 200; a client retry after a lost confirm can still enqueue a duplicate
  • RabbitMQ → Warehouse: at-least-once delivery via manual ACK
  • Warehouse state: eventual consistency — quantities in ConcurrentHashMap are eventually accurate after all messages are processed
  • No global transaction: if CCA approves but RabbitMQ times out, customer gets 500. The card was not charged (CCA is a simulation here), so no double-charge risk

API Design

Q: Walk me through what happens for the 10% of checkout requests that fail.

The 10% decline is triggered by the CCA returning 402. The flow:

  1. Cart validates card format → OK
  2. Cart calls CCA via ALB → CCA uses ThreadLocalRandom to decide: 10% of the time returns 402
  3. Cart receives 402 from CCA → immediately returns 402 to the client, without publishing to RabbitMQ
  4. No message is enqueued → no warehouse processing → correct behavior (declined cards don’t create orders)

This is intentional to demonstrate partial failure handling — the system correctly rejects 10% without affecting the 90% success path.
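The CCA decision can be sketched as below. The real service uses ThreadLocalRandom; a java.util.Random is injected here (a deviation for testability with a fixed seed), and CardAuthorizer/authorize are illustrative names:

```java
import java.util.Random;

// Sketch of the CCA decline logic: ~10% of valid cards get 402.
class CardAuthorizer {
    private final Random rng;
    CardAuthorizer(Random rng) { this.rng = rng; }

    // Returns the HTTP status: 402 (Payment Required) roughly 1 time in 10,
    // 200 otherwise. Card format validation happens upstream in the cart.
    int authorize(String cardNumber) {
        return rng.nextInt(10) == 0 ? 402 : 200;
    }
}
```

Because the decline is decided independently per request, a client retry of a declined checkout has a 90% chance of succeeding.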

Q: Why return 402 (Payment Required) instead of 400 (Bad Request) for declined cards?

400 means the request itself is malformed. The request is valid — the card format is correct — but the payment was declined. 402 is the semantic HTTP status for payment failure. It signals to the client that they should try a different payment method, not fix a request format issue.

Q: What’s the difference between a 400 and a 402 in this system?

  • 400: Request is invalid (bad card format, missing fields, invalid cart ID). Client should fix the request.
  • 402: Request is valid but payment was declined. Client should try another card.
  • 404: Cart not found or cart is empty. Client should create a cart first.
  • 500: Internal error (RabbitMQ timeout, unexpected exception). Client can retry.

Cloud & Infrastructure

Q: What is ATW and how does it work?

ATW (Automatic Target Weights) is an AWS ALB feature that monitors the 5XX error rate per registered target and automatically reduces traffic to anomalous instances.

Configuration requirements:

  • load_balancing_algorithm_type = "weighted_random" (round_robin doesn’t support dynamic weights)
  • load_balancing_anomaly_mitigation = "on"

In this project, the bad Product Service instance returns 503 50% of the time. ALB detects the elevated 5XX rate, marks it as anomalous, and gradually reduces its weight from 33% to ~8% over ~2 minutes — without removing it from the target group.

unhealthy_threshold = 10 is set intentionally high so the instance stays registered (for demo visibility). In production you’d set this to 2-3 to quickly remove bad instances.

Q: Why use ECS Fargate instead of EC2?

Fargate is serverless containers — AWS manages the underlying host. Benefits:

  • No EC2 instance management, patching, or right-sizing
  • Pay per task CPU/memory, not per idle instance
  • Tasks launch in ~30 seconds (fine for load tester on-demand runs)
  • Simpler security model: no SSH access needed

Trade-off: slightly higher per-unit cost vs reserved EC2, and less control over networking/host configuration.

Q: What is Cloud Map and why use it for RabbitMQ?

AWS Cloud Map provides private DNS inside a VPC. ECS tasks can reach rabbitmq.cs6650.local:5672 without knowing the actual IP address, which changes every time the Fargate task restarts.

Alternative: hard-code the ECS task’s public IP — but that changes on every deployment. Cloud Map gives a stable internal hostname.

Q: How does path-based routing work in this ALB setup?

The ALB has one listener on port 80 with three forwarding rules evaluated in priority order:

  1. Path matches /product* → forward to Product target group (3 instances, ATW)
  2. Path matches /credit-card-authorizer/* → forward to CCA target group
  3. Path matches /shopping-carts* → forward to Cart target group

This means a single ALB DNS name serves every public-facing service. When Shopping Cart needs to call CCA, it POSTs to http://{ALB_DNS}/credit-card-authorizer/authorize — the ALB routes the request based on the path.


Behavioral / Situational

Q: Why did you build this project?

This was a CS6650 (Scalable Distributed Systems) assignment designed to teach production-grade distributed systems patterns. I wanted to go beyond the assignment by deploying to real AWS infrastructure with Terraform, implementing proper durability guarantees (Publisher Confirms + manual ACK), and demonstrating AWS-native features like ATW.

Q: What would you do differently in production?

  1. Persistent storage: replace in-memory ConcurrentHashMap with Redis (cart state) and PostgreSQL (orders, products)
  2. Idempotency: add deduplication in the Warehouse Consumer using Redis SET NX on orderId
  3. Retry with backoff: add exponential backoff on the Shopping Cart → CCA HTTP call
  4. RabbitMQ cluster: quorum queues for HA, instead of single broker
  5. Distributed tracing: add trace IDs (OpenTelemetry) across Cart → CCA → RabbitMQ → Warehouse for end-to-end visibility
  6. Schema registry: use Avro or Protobuf for the order message schema instead of raw JSON

Q: How did you test this system?

  • Unit tests: validation logic in each service
  • Integration tests: bash scripts (test-local.sh, test-aws.sh) testing all 12 API endpoints with curl — create product, create cart, add item, checkout, verify responses
  • Load test: custom Java load tester firing 200k checkout requests concurrently, measuring throughput and success/failure rates
  • ATW demo: 50k GET /product requests to trigger and observe the weight shift in the ALB console