Asynchronous processing lies at the foundation of scalable, resilient distributed systems. Yet the space is fragmented: message queues vs event streams, at-least-once vs exactly-once semantics, fire-and-forget vs request-reply patterns, and a toolbox of brokers (Kafka, RabbitMQ, SQS, NATS) each optimized for different workloads. This guide untangles the architecture from first principles, grounding each pattern in the underlying guarantees, tradeoffs, and real-world deployment choices.
Why Asynchrony Matters: Decoupling Time and Failure
Synchronous systems are coupled in three dimensions:
- Temporal coupling: Producer must wait for consumer response. Network latency and consumer processing time directly impact producer throughput.
- Failure coupling: Consumer unavailability blocks the producer. A single slow or crashed consumer cascades upstream.
- Scaling coupling: Adding consumers requires redeploying the producer; they must know each other’s location and protocol.
Asynchronous systems decouple these:
- Temporal decoupling: Producer publishes and continues; consumer processes at its own pace. A delay in the queue is not a delay in the producer.
- Failure isolation: Consumer crash does not affect producer. Messages accumulate in the queue until the consumer recovers.
- Scaling decoupling: Add or remove consumers independently. The broker handles discovery and delivery.
At its core, async processing trades latency for throughput, resilience, and operational flexibility. The cost: eventual consistency, complexity of distributed state, and the need for idempotent processing.
Foundational Concept: The Broker as State Machine
A message broker is fundamentally a state machine that:
- Accepts messages from producers
- Durably persists them (in memory, disk, or replicated state)
- Routes them to consumers based on subscriptions or queue bindings
- Tracks consumer progress (offsets, acknowledgments)
- Retains, reorders, or deletes messages based on policy
The broker’s role separates producer and consumer in space (different processes), time (buffering), and protocol (brokers can translate).
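To make the state-machine framing concrete, here is a toy in-memory broker. This is a sketch only: `ToyBroker` and its method names are invented for illustration, and a real broker adds durability, replication, and networking.

```python
import collections

class ToyBroker:
    """Toy state machine: accept, buffer, deliver, track ACKs, delete."""
    def __init__(self):
        self.queues = collections.defaultdict(collections.deque)
        self.unacked = {}   # delivery_tag -> (queue_name, message)
        self.next_tag = 0

    def publish(self, queue_name, message):
        self.queues[queue_name].append(message)       # accept + buffer

    def deliver(self, queue_name):
        if not self.queues[queue_name]:
            return None
        self.next_tag += 1
        message = self.queues[queue_name].popleft()
        self.unacked[self.next_tag] = (queue_name, message)  # in flight
        return self.next_tag, message

    def ack(self, tag):
        self.unacked.pop(tag)                         # done: delete for good

    def nack(self, tag):
        queue_name, message = self.unacked.pop(tag)
        self.queues[queue_name].appendleft(message)   # requeue for redelivery

broker = ToyBroker()
broker.publish("orders", {"order_id": 1})
tag, msg = broker.deliver("orders")
broker.nack(tag)                      # consumer failed: message is requeued
tag, msg = broker.deliver("orders")   # redelivered
broker.ack(tag)                       # processed: message is gone
```

Note how the ACK, not the delivery, is what removes a message: that single design choice is what the delivery-guarantee section below turns on.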
Two architectural families emerge from this base: message queues (traditional task queues) and event streams (append-only logs).

Message Queues: Point-to-Point Task Distribution
A message queue (RabbitMQ, NATS, AWS SQS) follows a point-to-point model:
- One logical producer (possibly multiple processes) sends to a queue
- One or more consumers pull messages from the queue
- Message ownership: a message is owned by one consumer at a time
- Deletion: once a consumer acknowledges (ACKs), the message is removed
Lifecycle example (RabbitMQ with durable queues):
Producer → Broker Queue (disk-persisted) → Consumer 1: ACK → Deleted
↘ Consumer 2: ACK → Deleted
Each message is delivered to one consumer (load-balanced if multiple). This pattern is ideal for task distribution: order processing, image resizing, background jobs.
Event Streams: Replay and Multi-Consumer
An event stream (Kafka, Pulsar, AWS Kinesis) fundamentally differs:
- Append-only log: Events are never deleted (until retention policy expires)
- Offset tracking: Each consumer maintains its own position in the log
- Replay: Consumers can rewind and reprocess from any offset
- Multi-consumer: All subscribers see all events (at their own pace)
Lifecycle example (Kafka):
Producer → Broker Log (offset 0, 1, 2, 3, ...) → Consumer A (reads offset 0-N)
↘ Consumer B (reads offset M-X)
↘ Consumer C (rewinds, replays)
The key difference: queues optimize for producer-driven delivery; streams optimize for consumer-driven reading and temporal flexibility.
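The log semantics can be captured in a few lines of Python. This is a toy sketch: `ToyLog` and its methods are invented for illustration, and a real log is partitioned, replicated, and durable.

```python
class ToyLog:
    """Append-only log: events are never removed; each consumer owns its offset."""
    def __init__(self):
        self.events = []
        self.offsets = {}   # consumer_id -> next offset to read

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1        # offset of the new event

    def poll(self, consumer_id, max_records=10):
        start = self.offsets.get(consumer_id, 0)
        batch = self.events[start:start + max_records]
        self.offsets[consumer_id] = start + len(batch)
        return batch

    def seek(self, consumer_id, offset):
        self.offsets[consumer_id] = offset  # rewind enables replay

log = ToyLog()
for i in range(3):
    log.append({"event": i})

analytics_batch = log.poll("analytics")              # consumer A reads everything
email_batch = log.poll("email", max_records=2)       # consumer B reads at its own pace
log.seek("analytics", 0)
replayed = log.poll("analytics")                     # A rewinds and reprocesses
```

Reading never mutates the log itself, which is why any number of consumers can subscribe without interfering with each other.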
Async Patterns: Four Fundamental Designs
Beyond the broker type, the interaction pattern defines what the producer and consumer each know and when.

1. Fire-and-Forget (One-Way)
Semantics: Producer publishes a message and does not await a response.
# Kafka example
producer.send("user.registered", {"user_id": 123, "email": "alice@example.com"})
# Returns immediately; no callback
Assumptions:
– The broker will persist the message
– The consumer will eventually process it
– The producer does not care about success or failure
Coupling: Very loose. Ideal for notifications, analytics events, audit logs.
Risk: Failure is silent. No feedback if the consumer crashes or rejects the message.
2. Request-Reply (Synchronous over Async)
Semantics: Producer sends a message and blocks, waiting for a reply.
# RabbitMQ RPC pattern
response = rpc_client.call("calculate", {"x": 10, "y": 20}, timeout=5)  # 5-second timeout
print(response) # {"result": 30}
Implementation mechanics:
– Producer creates a unique reply_to queue for itself
– Producer sends message with correlation_id and reply_to address
– Consumer processes the message, sends reply to reply_to
– Producer waits on its reply queue
Coupling: Producer and consumer are tightly coupled through the broker. No temporal decoupling.
Trade-off: You’ve added a network hop and broker latency compared to a direct RPC, but gained resilience (if the consumer crashes mid-request, the broker queues the reply).
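The four mechanics above can be sketched with in-process queues standing in for broker queues. Everything here (`rpc_call`, `rpc_worker`, the message shape) is invented for the sketch; a real implementation would use a broker client library.

```python
import queue
import threading
import uuid

request_q = queue.Queue()   # stands in for the broker's request queue

def rpc_worker():
    """Consumer: process each request, reply to the caller's reply_to queue."""
    while True:
        msg = request_q.get()
        if msg is None:      # shutdown sentinel
            break
        result = msg["body"]["x"] + msg["body"]["y"]
        msg["reply_to"].put({"correlation_id": msg["correlation_id"],
                             "result": result})

def rpc_call(body, timeout=5):
    """Producer: send with correlation_id + reply_to, then block on the reply."""
    reply_to = queue.Queue()              # producer's private reply queue
    corr_id = str(uuid.uuid4())
    request_q.put({"correlation_id": corr_id, "reply_to": reply_to, "body": body})
    reply = reply_to.get(timeout=timeout)
    assert reply["correlation_id"] == corr_id   # match the reply to the request
    return reply["result"]

threading.Thread(target=rpc_worker, daemon=True).start()
answer = rpc_call({"x": 10, "y": 20})   # blocks until the reply arrives
request_q.put(None)                     # stop the worker
```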
3. Async Request-Reply (Callback Pattern)
Semantics: Producer sends a message and registers a callback; consumer replies asynchronously.
# Kafka + offset tracking (consumer returns result to different topic)
producer.send("orders.process", {"order_id": 123})
# Producer doesn't wait; instead listens on "orders.response"
# Consumer processes, sends reply to "orders.response" with correlation_id
Coupling: Looser. Producer and consumer don’t block each other.
Complexity: Producer must manage correlation IDs, handle timeouts, and poll the response topic.
4. Pub-Sub (One-to-Many Event Distribution)
Semantics: One event is published; multiple independent consumers receive it.
# Kafka topic with multiple consumer groups
# Topic: user.registered
# Consumer Group A (email service) processes for sending welcome email
# Consumer Group B (analytics) logs for funnel analysis
# Consumer Group C (crm) syncs to Salesforce
Coupling: Very loose. Each consumer is independent; new consumers can subscribe retroactively (if retention allows).
Strengths: Naturally supports fan-out, decouples business domains, enables event-driven architecture.
Delivery Guarantees: The Cost of Certainty
Every broker claims to offer reliability, but what does “reliable” actually mean? The answer lies in three canonical semantics.

At-Most-Once (Fire-and-Forget)
Guarantee: Each message is delivered zero or one time.
Implementation: Producer sends, broker acknowledges receipt, no further retries.
Failure mode:
– Network partition between producer and broker: message lost
– Broker crashes before persisting: message lost
Latency: Sub-millisecond (no wait for durability).
Use cases: Metrics, non-critical telemetry, monitoring (losing a few data points acceptable).
Example (at-most-once consumption with SQS: delete the message on receipt, before processing):
Producer sends → SQS stores → Consumer receives and deletes → processes (a crash here loses the message)
At-Least-Once (Default in Kafka)
Guarantee: Each message is delivered one or more times.
Implementation:
1. Producer sends message
2. Broker persists to disk/replicas
3. Broker ACKs producer
4. Consumer processes message
5. Consumer sends ACK (offset commit)
6. Broker marks offset as consumed
Failure mode:
– Consumer crashes after processing, before ACK: broker retransmits on consumer restart
– Consumer processes, commits offset, crashes before external side-effect: side-effect not repeated (depends on consumer implementation)
Latency: Several milliseconds (wait for fsync and replica acknowledgment).
Idempotency requirement: Consumers must be idempotent. If a message is redelivered, processing it again must yield the same final state.
Example (Kafka with replication):
# Producer configured with acks='all' (wait for all in-sync replicas)
producer.send("orders", {"order_id": 123})
# Consumer
message = consumer.poll()
process_order(message)
consumer.commit() # Offset commit
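A minimal sketch of the idempotency requirement: the handler records processed message IDs so a redelivery after a lost ACK has no second side effect. The names (`msg_id`, the in-memory stores) are illustrative; production systems persist the dedup set durably.

```python
processed_ids = set()        # in production this lives in a durable store
balance = {"acct": 0}

def handle(message):
    """Idempotent handler: a redelivered message has no second side effect."""
    if message["msg_id"] in processed_ids:
        return               # duplicate delivery detected; skip
    balance["acct"] += message["amount"]
    processed_ids.add(message["msg_id"])

deposit = {"msg_id": "m-1", "amount": 100}
handle(deposit)
handle(deposit)              # redelivery after a lost ACK: state unchanged
```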
Exactly-Once (Transactional)
Guarantee: Each message is processed exactly once, with all side effects atomic.
Implementation: Transactional markers, idempotency keys, and coordinated consumer state.
# Kafka transactional producer
producer.begin_transaction()
producer.send("orders", {"order_id": 123})
producer.send("inventory", {"sku": "A1", "qty": -1})
producer.commit_transaction()
# Broker ensures: either both messages are visible or neither
For consumers, exactly-once is harder. You must ensure:
1. Processing is idempotent (or uses idempotency keys)
2. State is committed atomically with the offset
Example pattern (Kafka Streams):
Input message → Process → Update local state + Commit offset (one transaction)
Cost: 2-3x latency due to transactional overhead, logging, and coordination.
When to use:
– Financial transactions (no double-charging)
– Inventory management (no overselling)
In practice, exactly-once is often not worth the cost; at-least-once plus idempotent consumers is the pragmatic default.
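The consumer-side requirement (state committed atomically with the offset) can be simulated in plain Python. `AtomicStore` is an invented name; a real implementation persists both in one database transaction or uses Kafka Streams.

```python
class AtomicStore:
    """Consumer state and offset live together and commit together."""
    def __init__(self):
        self.state = {"count": 0}
        self.offset = 0

    def process(self, log, crash_before_commit=False):
        while self.offset < len(log):
            event = log[self.offset]
            # Stage the new state and offset without publishing them yet.
            new_state = {"count": self.state["count"] + event}
            new_offset = self.offset + 1
            if crash_before_commit:
                raise RuntimeError("crash")   # nothing was committed
            # The "transaction": both values flip in one step.
            self.state, self.offset = new_state, new_offset

store = AtomicStore()
store.process([1, 2, 3])                       # count=6, offset=3
try:
    store.process([1, 2, 3, 10], crash_before_commit=True)
except RuntimeError:
    pass                                       # state and offset stay consistent
store.process([1, 2, 3, 10])                   # resumes at offset 3; no double-count
```

Because state and offset never diverge, a crash-and-retry can never count an event twice or skip one.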
Dead Letter Queues: Handling Poison Pills
In production, not every message succeeds. A consumer may reject 0.1% of messages due to:
- Deserialization error (malformed JSON)
- Business logic rejection (order from banned customer)
- Transient failure (rate-limited third-party API)
Without a dead letter queue (DLQ), toxic messages loop forever:
Broker → Consumer rejects → Broker requeues (retry) → Consumer rejects → ...
A DLQ is a separate queue (or topic) that collects messages that fail processing after N retries.
Typical flow (RabbitMQ/SQS):
msg = consumer.poll()
try:
    process_message(msg)
    consumer.ack(msg)
except Exception as e:
    if msg.retry_count < 3:
        consumer.nack_with_requeue(msg)       # Requeue, increment count
    else:
        dlq_producer.send("orders-dlq", msg)  # Send to DLQ
        consumer.ack(msg)                     # Remove from original queue
DLQ best practices:
- Separate topic/queue per source queue (e.g., orders-dlq, payments-dlq)
- Include metadata: original message, error reason, timestamp, retry count
- Monitor and alert on DLQ depth (spike indicates systemic failure)
- Manual remediation process: on-call engineer reviews, fixes data, resubmits to original queue
- Logging and tracing for root cause analysis
Example (Kafka):
# Topic: orders
# DLQ Topic: orders-dlq
# Consumer group: order-processor
# On 3 failures: send to orders-dlq with error tag
Back-Pressure and Flow Control: Preventing the Cascade
Back-pressure is the mechanism by which a slow consumer tells the producer (or broker) to slow down.
The Problem: Unbounded Queues
Without back-pressure, a fast producer floods the broker, and the broker floods memory (or disk).
Fast Producer → Broker (memory growing toward OOM) → Slow Consumer
Solutions
1. Consumer-side throttling (pull-based systems like Kafka):
# Consumer manually controls batch size
records = consumer.poll(max_records=100, timeout_ms=1000)
# Broker doesn't push more than 100 records
process_batch(records)
Advantage: Consumer sets its own pace.
2. Producer-side rate limiting:
# Pseudocode: real clients rate-limit via configuration or a token bucket
producer.send(msg, rate_limit_rps=1000)
# Producer waits if it would exceed the limit
3. Broker-side capacity and eviction:
RabbitMQ offers max-length and overflow policies:
# Queue config
x-max-length: 1000000 # Drop oldest if exceeds 1M messages
x-overflow: reject # Or reject new messages instead
4. Acknowledgment-based flow control:
Producer sends → Broker stores → Consumer ACKs (slowly)
(broker throttles the producer via credits)
In AMQP, the broker tracks credits (an allowance of unacknowledged messages) per consumer; once credits are exhausted, delivery pauses. Kafka, being pull-based, achieves the same effect implicitly: a slow consumer simply polls less often.
Example (RabbitMQ manual ACK with prefetch):
channel.basic_qos(prefetch_count=10) # Only 10 unacked messages in flight
msg = channel.basic_get("queue")
# Process msg
channel.basic_ack(msg.delivery_tag) # ACK allows next message
Best practice: Set prefetch_count to balance memory and throughput. Too low (e.g., 1) causes underutilization; too high (e.g., 10000) causes memory bloat.
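All four mechanisms reduce to the same primitive: a bounded buffer that blocks (or rejects) the producer when full. Python's standard `queue.Queue(maxsize=...)` demonstrates the effect in-process; this is a sketch of the principle, not a broker client.

```python
import queue
import threading

buffer = queue.Queue(maxsize=10)  # bounded buffer: the essence of back-pressure
consumed = []

def producer():
    for i in range(100):
        buffer.put(i)             # blocks whenever the buffer is full

def consumer():
    for _ in range(100):
        consumed.append(buffer.get())  # slow consumer drains at its own pace
        buffer.task_done()

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The producer never gets more than 10 items ahead of the consumer, no matter how fast it runs; `prefetch_count` plays the same bounding role between broker and consumer.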
Saga Pattern for Distributed Transactions
In a monolithic system, a multi-step business process (e.g., order → reserve inventory → charge payment → create shipment) is wrapped in a database transaction. In microservices, each step may be a separate service, and there’s no distributed ACID transaction.
The Saga pattern coordinates distributed work across services using either orchestration or choreography.

Orchestration-Based Saga
A Saga Orchestrator (e.g., Temporal, Cadence, Netflix Conductor) explicitly controls the flow:
# Pseudocode (Temporal)
@workflow
async def order_saga(order_id):
    try:
        inventory_result = await activities.reserve_inventory(order_id)
        payment_result = await activities.charge_payment(order_id)
        shipping_result = await activities.create_shipment(order_id)
        return {"status": "success"}
    except Exception as e:
        # Compensating transactions (rollback)
        await activities.release_inventory(order_id)
        await activities.refund_payment(order_id)
        return {"status": "failed", "error": str(e)}
Advantages:
– Explicit, easy to understand control flow
– Centralized error handling and compensation logic
– Built-in retry and timeout policies
Disadvantages:
– Orchestrator is a bottleneck (centralized state)
– Coupling: orchestrator must know about all services
– Requires infrastructure (Temporal, Cadence) or custom code
Choreography-Based Saga (Event-Driven)
Services react to events; no orchestrator.
# Order Service publishes OrderCreated
order_service.publish("OrderCreated", {"order_id": 123})

# Inventory Service listens for OrderCreated, publishes InventoryReserved
@event_handler("OrderCreated")
def on_order_created(event):
    reserve_inventory(event.order_id)
    publish("InventoryReserved", {"order_id": event.order_id})

# Payment Service listens for InventoryReserved, publishes PaymentCharged
@event_handler("InventoryReserved")
def on_inventory_reserved(event):
    charge_payment(event.order_id)
    publish("PaymentCharged", {"order_id": event.order_id})

# On failure, publish compensating events
@event_handler("PaymentFailed")
def on_payment_failed(event):
    release_inventory(event.order_id)
    publish("InventoryReleased", {"order_id": event.order_id})
Advantages:
– Loose coupling: services only know about events, not each other
– Scales horizontally: new services can subscribe to events without changing others
– Naturally event-sourced
Disadvantages:
– Implicit control flow (harder to reason about the happy path and failure scenarios)
– Debugging is complex (distributed across services)
– No global rollback; each service must implement compensating logic
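The choreography above can be made runnable with an in-memory event bus. Everything here is invented for illustration: the `subscribe`/`publish` helpers, the stock and payment stores, and the amount-based failure rule standing in for a card decline.

```python
import collections

handlers = collections.defaultdict(list)
audit = []  # every published event, for tracing

def subscribe(event_type, fn):
    handlers[event_type].append(fn)

def publish(event_type, payload):
    audit.append(event_type)
    for fn in handlers[event_type]:
        fn(payload)

inventory = {"A1": 1}   # hypothetical stock level
payments = set()

def on_order_created(e):
    inventory["A1"] -= 1                 # reserve stock
    publish("InventoryReserved", e)
subscribe("OrderCreated", on_order_created)

def on_inventory_reserved(e):
    if e["amount"] > 50:                 # hypothetical rule: large payments decline
        publish("PaymentFailed", e)
    else:
        payments.add(e["order_id"])
        publish("PaymentCharged", e)
subscribe("InventoryReserved", on_inventory_reserved)

def on_payment_failed(e):
    inventory["A1"] += 1                 # compensating action: release reservation
    publish("InventoryReleased", e)
subscribe("PaymentFailed", on_payment_failed)

publish("OrderCreated", {"order_id": 1, "amount": 100})  # payment fails, compensated
publish("OrderCreated", {"order_id": 2, "amount": 10})   # happy path
```

Notice that no component sees the whole flow: the `audit` list is the only global record, which is exactly why tracing matters in choreographed sagas.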
Hybrid Approach
Many production systems combine both: orchestrator for critical, stateful flows (orders, payments) and choreography for supporting tasks (notifications, analytics).
Idempotency: The Foundation of Resilience
Whether using orchestration or choreography, all saga steps must be idempotent. If a step is retried (due to network timeout, crash, or recovery), it must produce the same outcome.
Idempotency patterns:
- Idempotency key (preferred):

# Client generates unique ID
order_id = "order-123"
idempotency_key = uuid.uuid4()
# Service stores idempotency key + result
result = db.get_idempotency_cache(idempotency_key)
if result:
    return result  # Cached response
result = process_order(order_id)
db.cache_idempotency(idempotency_key, result)
return result

- Natural idempotency (set operations):

# Idempotent: setting a value to the same value is safe
db.set(user_id, {"status": "verified"})
# Can be called 1 or N times; same result

- Temporal idempotency (check before act):

if not db.exists("payment", payment_id):
    db.create("payment", payment_id, amount)
# Duplicate calls see existing payment; no double-charge
Broker Selection: Kafka vs RabbitMQ vs SQS vs NATS
Each broker excels in different scenarios. There’s no universal choice; trade-offs depend on throughput, latency, durability, operational complexity, and cost.

Kafka: The Event Streaming Platform
Architecture: Distributed, partitioned append-only log. Multiple replicas per partition.
Throughput: 1M+ messages per second per broker (clusters scale horizontally).
Durability: Configurable (at-least-once to exactly-once). Replicated to multiple brokers.
Consumer semantics: Pull-based. Consumer maintains offset. Can rewind and replay.
Retention: Time-based or size-based policy. Default: 7 days.
Ordering: Within a partition, total order. Across partitions, no guarantee.
Latency: 10-100ms end-to-end (not real-time; suitable for batch and stream analytics).
Cost: Self-hosted (free software, but ops burden) or managed (Confluent Cloud, AWS MSK).
Best for:
– Event sourcing (append-only audit trail)
– Real-time analytics (streaming joins, aggregations)
– Data pipeline fan-out (one source, many consumers)
– High-volume workloads (billions of messages/day)
Example use case:
IoT sensors → Kafka → Stream processor (Flink, Spark) → Data warehouse
→ Kafka → Real-time dashboard
RabbitMQ: The Messaging Broker
Architecture: Lightweight message broker. Messages routed via exchanges and bindings. Queues can be durable (persisted to disk).
Throughput: 50k–200k messages per second (single node; clusters available).
Durability: Durable queues persisted; replication available (RabbitMQ 3.8+).
Consumer semantics: Push-based. Broker sends messages; consumer acknowledges.
Retention: Explicit TTL or infinite (until consumed).
Ordering: Per queue (not across queues).
Latency: 1-5ms (very low).
Cost: Self-hosted (free; ops burden) or managed (CloudAMQP, AWS MQ).
Best for:
– Task queues (image resizing, email sending, background jobs)
– RPC-style request-reply patterns
– Workflows and orchestration
– Complex routing (multiple exchanges, headers-based routing)
Example use case:
Web app → RabbitMQ → Worker pool (10s of workers) → Process jobs
→ RabbitMQ → Logging/monitoring services
AWS SQS: The Managed Queue
Architecture: Fully managed (no infrastructure). Simple FIFO or standard queues.
Throughput: 300k messages per second per queue (with batching).
Durability: AWS-managed (no explicit configuration).
Consumer semantics: Polling (consumer fetches messages). Visibility timeout (message hidden during processing; re-queued on timeout).
Retention: 1 minute to 14 days (configurable).
Ordering: Standard queue (best-effort); FIFO queue (strict order, ~3k msgs/sec).
Latency: 20-50ms (includes polling overhead).
Cost: Pay per million requests (scales with usage). No fixed ops cost.
Best for:
– Simple, low-complexity workloads
– AWS-native architectures
– Decoupling Lambda functions or ECS tasks
– Avoiding ops burden
Example use case:
API Gateway → Lambda (process request) → SQS → Lambda workers (auto-scale)
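The visibility-timeout mechanics described above can be modeled in a few lines. `VisibilityQueue` is an invented toy, not the SQS API, and the timeout is shortened to milliseconds for the demo.

```python
import time

class VisibilityQueue:
    """Toy SQS-style queue: received messages turn invisible; if not deleted
    before the visibility timeout, they reappear (at-least-once delivery)."""
    def __init__(self, visibility_timeout=0.05):
        self.messages = {}            # receipt -> (body, invisible_until)
        self.timeout = visibility_timeout
        self.next_receipt = 0

    def send(self, body):
        self.next_receipt += 1
        self.messages[self.next_receipt] = (body, 0.0)

    def receive(self):
        now = time.monotonic()
        for receipt, (body, invisible_until) in self.messages.items():
            if invisible_until <= now:                 # currently visible
                self.messages[receipt] = (body, now + self.timeout)
                return receipt, body
        return None

    def delete(self, receipt):
        self.messages.pop(receipt)                     # the explicit ACK

q = VisibilityQueue()
q.send({"job": "resize"})
receipt, body = q.receive()
assert q.receive() is None        # in flight: hidden from other consumers
time.sleep(0.06)                  # consumer "crashed"; timeout elapses
receipt, body = q.receive()       # message reappears and is redelivered
q.delete(receipt)
```

Deleting the message is SQS's equivalent of an ACK; forgetting to delete is the classic cause of duplicate processing.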
NATS: Ultra-High Performance
Architecture: Lightweight pub-sub and message queue. In-memory (with optional persistence via JetStream).
Throughput: 15M+ messages per second (single node).
Durability: Default: in-memory (fast). JetStream (2.0+) adds persistence.
Consumer semantics: Push-based. Also supports pull (JetStream).
Retention: Optional (JetStream). Default: fire-and-forget.
Ordering: Per subject; no ordering across subjects.
Latency: Sub-millisecond.
Cost: Open-source (free). Self-hosted. Managed version (Synadia) available.
Best for:
– Microservices internal communication
– Edge computing and IoT (lightweight)
– Ultra-low latency requirements
– High-frequency trading, real-time gateways
Example use case:
Edge gateway (NATS) → IoT sensors, local microservices → Cloud (NATS bridge)
Decision Framework
| Requirement | Kafka | RabbitMQ | SQS | NATS |
|---|---|---|---|---|
| Event replay & audit trail | ✓✓ | ✗ | ✗ | ✓ (JetStream) |
| Complex routing (headers, routing keys) | ✗ | ✓✓ | ✗ | ✗ |
| Request-reply (RPC) | ~ | ✓✓ | ~ | ✓ |
| Ultra-low latency (<1ms) | ✗ | ~ | ✗ | ✓ |
| Managed (no ops) | ~ | ~ | ✓ | ✗ |
| Massive throughput (1M+ msgs/sec) | ✓ | ~ | ✓ | ✓ |
| Simple to operate | ~ | ~ | ✓ | ✓ |
Idempotency Design: Making Retries Safe
Retries are inevitable in distributed systems. Network timeouts, transient failures, and service restarts all trigger retransmission. Without idempotency, retries cause duplicate work.
The Idempotency Key Pattern
Core idea: Client generates a unique ID for each logical operation. Service memoizes the result keyed by this ID.
# Client
order_data = {"user_id": 123, "items": [...]}
idempotency_key = str(uuid.uuid4())  # e.g., "550e8400-e29b-41d4-a716-446655440000"
response = requests.post(
    "/orders",
    json=order_data,
    headers={"Idempotency-Key": idempotency_key},
)
# If timeout: retry with the same idempotency_key
if response.status_code == 503:
    response = requests.post(
        "/orders",
        json=order_data,
        headers={"Idempotency-Key": idempotency_key},
    )
Server-side implementation:
@app.post("/orders")
def create_order(request: OrderRequest):
    idempotency_key = request.headers.get("Idempotency-Key")
    # Check cache
    cached = db.get_idempotency_cache(idempotency_key)
    if cached:
        return cached  # Return cached response
    # Process the order
    order = db.create_order(request.order_data)
    # Cache the response
    db.cache_idempotency(idempotency_key, order, ttl=86400)  # 24 hours
    return order
Database-Side Idempotency (Upsert)
Some operations are naturally idempotent if you use database upserts:
-- Idempotent: duplicate inserts with unique constraint become upserts
INSERT INTO orders (order_id, user_id, total)
VALUES (123, 456, 99.99)
ON CONFLICT (order_id) DO NOTHING;
-- Multiple executions have the same effect
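The same upsert runs verbatim against SQLite, which makes the idempotent write path easy to test locally (the table and values are the hypothetical ones from the SQL above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
)

def record_order(order_id, user_id, total):
    # A redelivered message hits the conflict clause and becomes a no-op.
    conn.execute(
        "INSERT INTO orders (order_id, user_id, total) VALUES (?, ?, ?) "
        "ON CONFLICT (order_id) DO NOTHING",
        (order_id, user_id, total),
    )
    conn.commit()

record_order(123, 456, 99.99)
record_order(123, 456, 99.99)  # duplicate delivery: no second row
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```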
Idempotency Retention and Cleanup
Challenge: How long do you cache idempotency results?
- Too short: Client retry falls outside cache window; duplicate processing.
- Too long: Memory bloat.
Recommended: 24-48 hours for most systems. Longer for financial transactions (weeks).
# TTL-based cleanup (Redis)
db.setex(f"idempotency:{key}", 86400, result) # 24-hour TTL
Practical Deployment: Putting It Together
Example: E-Commerce Order Processing
System components:
– Order API (stateless, scalable)
– Order Queue (RabbitMQ)
– Order Processor (consumes from queue)
– Inventory Service, Payment Service (async communication)
Flow:
1. POST /orders → Order API
2. API validates, publishes OrderCreated event
3. API returns 202 Accepted (async)
4. Consumer (Order Processor) picks up OrderCreated
5. Calls Inventory Service (async)
6. Calls Payment Service (async)
7. On success: publishes OrderConfirmed
8. On failure: publishes OrderFailed, sends to DLQ
9. Email Service subscribes to OrderConfirmed, sends confirmation
10. Analytics Service subscribes to OrderConfirmed, updates dashboards
Resilience patterns applied:
- At-least-once delivery: Order Processor ACKs after committing order state
- Idempotency: Each order_id processed only once (due to database unique constraint)
- Saga pattern (choreography): Services react to events; no orchestrator
- Dead letter queue: Failed orders sent to manual review queue
- Back-pressure: Order Processor limits prefetch to 50 concurrent orders
- Timeout and retry: Kafka Streams configured with exponential backoff
Key metrics to monitor:
- Consumer lag (orders in queue waiting to be processed)
- Processing latency (time from OrderCreated to OrderConfirmed)
- DLQ depth (number of orders in manual review)
- End-to-end order-to-delivery time
Anti-Patterns and Common Pitfalls
1. Ignoring Ordering Guarantees
Pitfall: Assuming messages arrive in order when they don’t.
Reality: Kafka guarantees order within a partition, not across partitions. RabbitMQ guarantees order per queue, but if you add multiple consumers, they consume in parallel (unordered).
Fix: If order matters, partition by entity (order_id, user_id) so correlated messages go to the same partition/queue.
# Kafka: partition key ensures order per order_id
producer.send("orders", key=order_id, value=event)
# Consumer sees events in order for this order_id
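Key-based partitioning works because the partition function is deterministic: the same key always lands in the same partition, and appends within a partition preserve publish order. A toy sketch (the modular hash here is purely illustrative; Kafka's default partitioner uses murmur2):

```python
NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    """Deterministic toy hash: same key -> same partition, every time."""
    return sum(key.encode()) % NUM_PARTITIONS

partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [("order-1", "created"), ("order-2", "created"),
          ("order-1", "paid"), ("order-1", "shipped"), ("order-2", "paid")]

for key, event in events:                      # producer appends in publish order
    partitions[partition_for(key)].append((key, event))

# All events for a given key share one partition, so per-key order survives.
order1_events = [e for k, e in partitions[partition_for("order-1")] if k == "order-1"]
```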
2. No Monitoring of Consumer Lag
Pitfall: “The queue is working; no need to monitor.”
Reality: Consumer lag creeps up unnoticed until the system is hours behind. By then, recovery is complex.
Fix: Set up alerts on consumer lag. Alert if lag exceeds 5 minutes (or your SLA).
# Kafka
lag = broker_offset - consumer_offset
if lag > lag_threshold:
    alert("High consumer lag")
3. Synchronous Error Handling in Async Systems
Pitfall: Expecting an error response immediately when processing is async.
# Wrong
response = enqueue_order(order_data)
if not response.success:
    return error_page()  # But enqueue_order returned 202; processing hasn't started yet
Fix: Return 202 Accepted immediately. Provide a callback or polling endpoint to check status.
# Right
response = enqueue_order(order_data)
return {
    "status": 202,
    "message": "Order accepted for processing",
    "check_status_url": f"/orders/{order_id}/status",
}
4. Unbounded Message Size
Pitfall: Trying to send a 1GB message through your message broker.
Reality: Brokers enforce message size limits (Kafka defaults to roughly 1MB; RabbitMQ has no strict limit, but large messages create memory pressure). Oversized sends are rejected by the broker rather than delivered.
Fix: Store large payloads in external storage (S3, object store); pass a reference in the message.
# Instead of: producer.send("orders", large_data)
# Do this (claim-check pattern):
key = f"orders/{order_id}.json"
s3.put_object(Bucket="my-bucket", Key=key, Body=large_data)
producer.send("orders", {"order_id": order_id, "data_key": key})
5. No Idempotency Design
Pitfall: Retry mechanisms without idempotent consumers. Duplicate messages cause double-charge, double-booking, or inconsistent state.
Fix: Design every consumer to be idempotent from the start. Use idempotency keys, database constraints, or natural idempotency (set operations).
Conclusion: Building Resilient Async Systems
Asynchronous processing is not optional in modern distributed systems—it’s foundational. The patterns, guarantees, and broker choices determine whether your system degrades gracefully or cascades into failure.
Key takeaways:
- Decouple in time, space, and failure. Use async patterns to isolate services and allow independent scaling.
- Choose the right broker: Kafka for event replay and fan-out; RabbitMQ for workflows; SQS for AWS-native simplicity; NATS for ultra-low latency.
- Make delivery semantics explicit. At-most-once trades loss for speed; at-least-once requires idempotency; exactly-once costs latency.
- Implement dead letter queues and monitoring. Poisoned messages must be visible and actionable.
- Design idempotently. Retries are inevitable; ensure they’re safe.
- Use sagas for multi-step workflows. Orchestration is explicit but couples services; choreography is loose but complex to debug.
- Monitor and alert on lag, throughput, and error rates. Async systems fail silently without visibility.
The goal is a system where failures are contained, recovery is automatic, and the entire system continues moving forward even when individual components stumble. Async processing, done right, makes that possible.
