Asynchronous Operations In Distributed Systems
Async patterns decouple request handling from slow work. Message queues, webhooks, and event streaming let your system respond instantly while processing happens in the background. This is how systems scale.
Asynchronous processing is about decoupling. A user uploads a file. The request returns immediately. The file is processed in the background. The user isn't blocked waiting for slow work.
Synchronous processing forces coupling. If every request must wait for slow work to complete, latency compounds: add a 100ms step and end-to-end latency grows by 100ms. As synchronous steps accumulate, response times become unacceptable.
Asynchronous processing trades immediate response for background work. The user gets a response instantly. The work happens later. This requires different thinking about failure handling, ordering, and consistency.
Why Async Matters
Responsiveness: User actions return quickly. The app feels responsive.
Decoupling: Producer and consumer don't need to be tightly coordinated. The producer publishes work. The consumer processes it whenever ready.
Resilience: If a consumer crashes, the work stays in the queue. When the consumer recovers, it processes the work.
Scalability: A queue can buffer work. If producers are faster than consumers, the queue grows. Consumers can scale independently to catch up.
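These properties show up even inside a single process. A minimal sketch of producer/consumer decoupling using Python's standard library (the doubling step stands in for real background work):

```python
import queue
import threading

work_queue = queue.Queue()  # buffers work between producer and consumer
results = []

def consumer():
    # The consumer drains the queue at its own pace, independent of the producer
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: no more work
            break
        results.append(item * 2)  # stand-in for slow background work
        work_queue.task_done()

worker = threading.Thread(target=consumer)
worker.start()

# The producer publishes and moves on without waiting for processing
for i in range(5):
    work_queue.put(i)
work_queue.put(None)  # signal shutdown
worker.join()
```

If the consumer crashes before the sentinel, the remaining items stay in the queue, which is the resilience property above in miniature.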
Message Queues
A message queue stores messages until a consumer processes them.
RabbitMQ: A traditional message broker. Producers publish messages. Consumers subscribe to queues. Messages are delivered in order.
import json
import pika

# Producer
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='email_queue')
channel.basic_publish(exchange='', routing_key='email_queue',
                      body=json.dumps({'to': 'user@example.com', 'subject': 'Welcome'}))
connection.close()

# Consumer (runs as a separate process)
def callback(ch, method, properties, body):
    message = json.loads(body)
    send_email(message['to'], message['subject'])
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='email_queue', on_message_callback=callback)
channel.start_consuming()
Pros: Simple. In-order delivery per queue. Strong delivery guarantees. Cons: A single broker node is a point of failure; clustering exists, but horizontal scaling is more complex than with a natively distributed system.
Apache Kafka: A distributed event streaming platform. Producers publish to topics. Consumers subscribe to topics. Messages are persisted to disk.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
producer.send('email_topic',
              json.dumps({'to': 'user@example.com', 'subject': 'Welcome'}).encode())

# Consumer
consumer = KafkaConsumer('email_topic', bootstrap_servers=['localhost:9092'])
for message in consumer:
    email_data = json.loads(message.value.decode())
    send_email(email_data['to'], email_data['subject'])
Pros: Distributed. Fault-tolerant. Supports many independent consumers. Cons: More operational complexity. Ordering is guaranteed only within a partition, not across partitions or topics.
AWS SQS: A managed message queue service. You don't run the infrastructure.
import json
import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/email_queue'

# Producer
sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps({...}))

# Consumer
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
for message in response.get('Messages', []):
    process(message)
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message['ReceiptHandle'])
Pros: Fully managed. Scales automatically. Simple API. Cons: Standard queues don't guarantee ordering (FIFO queues do, at lower throughput). Visibility timeout semantics: a received message is hidden from other consumers until it is deleted or the timeout expires.
Event Streaming
Event streaming is like message queues but optimized for many consumers and replaying history.
In a message queue, a consumed message is deleted. In event streaming, messages are immutable. Consumers read at their own pace.
Kafka is the dominant platform here.
Use cases:
- Event Sourcing: Store all events. Reconstruct state by replaying events.
- Change Data Capture (CDC): Stream changes from the database to other systems.
- Real-Time Analytics: Analyze events as they happen.
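The event-sourcing idea above can be sketched as a fold over an append-only log (the event shapes are illustrative):

```python
# An append-only log of events; in production this would live in Kafka or a database
events = [
    {'type': 'deposited', 'amount': 100},
    {'type': 'withdrawn', 'amount': 30},
    {'type': 'deposited', 'amount': 50},
]

def replay(events):
    # Rebuild current state by folding over the full history
    balance = 0
    for event in events:
        if event['type'] == 'deposited':
            balance += event['amount']
        elif event['type'] == 'withdrawn':
            balance -= event['amount']
    return balance
```

Because events are immutable, any consumer can rebuild the same state at any time by replaying from the beginning.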
import json
from kafka import KafkaConsumer

# Stream all user signups to the data warehouse
consumer = KafkaConsumer('user_events', bootstrap_servers=['localhost:9092'],
                         auto_offset_reset='earliest')
for event in consumer:
    user_signup = json.loads(event.value.decode())
    warehouse.insert('users', user_signup)  # Idempotent insert
Async Request-Reply
Some async work requires a reply. "Process this image. Tell me when it's done."
Polling: The client asks repeatedly. "Is it done?" "Not yet." "How about now?"
import time
import requests

# Client
response = requests.post('/process-image', files={'image': image_file})
job_id = response.json()['job_id']

# Poll for result
while True:
    result = requests.get(f'/jobs/{job_id}')
    if result.status_code == 200:
        break  # job finished; result.json() holds the output
    time.sleep(1)
Pros: Simple. Cons: Inefficient (many requests return nothing). Added latency (the result is only noticed on the next poll).
Webhooks: The server calls back when done.
import requests

# Client
response = requests.post('/process-image', json={
    'image_url': 'https://example.com/image.jpg',
    'callback_url': 'https://myclient.com/webhooks/image-processed'
})

# Server, after processing
requests.post(callback_url, json={'job_id': job_id, 'result': result_url})
Pros: Efficient (no polling). Low latency. Cons: Requires the client to be reachable from the server. More complex (the server must retry if the webhook delivery fails).
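On the receiving side, the callback endpoint must tolerate retried deliveries, since the server resends on failure. A sketch of an idempotent handler (field names like job_id are illustrative):

```python
import json

seen_deliveries = set()  # dedupe, since the server retries failed deliveries
processed = []

def handle_webhook(raw_body: bytes) -> int:
    """Process one webhook delivery; returns the HTTP status to respond with."""
    payload = json.loads(raw_body)
    if payload['job_id'] in seen_deliveries:
        return 200  # duplicate retry: acknowledge without reprocessing
    seen_deliveries.add(payload['job_id'])
    processed.append(payload['result'])
    return 200  # a 2xx response tells the sender to stop retrying

# The server may deliver the same event twice
body = json.dumps({'job_id': 'j1', 'result': 'https://cdn.example.com/out.jpg'}).encode()
handle_webhook(body)
handle_webhook(body)  # retried delivery is ignored
```

Returning 2xx only after durable handling, and deduplicating on a delivery ID, keeps retries safe.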
Message Queues with Correlation IDs: A request-reply pattern using queues.
import uuid

# Client publishes a request tagged with a correlation ID
correlation_id = str(uuid.uuid4())
producer.send('request_queue', {
    'correlation_id': correlation_id,
    'task': 'process_image',
    'image_url': '...'
})

# Server processes and publishes the reply with the same ID
consumer.subscribe('request_queue')
for message in consumer:
    result = process(message)
    reply_producer.send('reply_queue', {
        'correlation_id': message['correlation_id'],
        'result': result
    })

# Client filters the reply queue for its own correlation ID
consumer.subscribe('reply_queue')
for message in consumer:
    if message['correlation_id'] == correlation_id:
        break  # message['result'] is the reply
Backpressure
What if producers are faster than consumers? The queue grows. Eventually, the system runs out of memory.
Backpressure is when a slower component signals the faster component to slow down.
# If the queue is getting full, slow down producers
if queue.size() > 10000:
    producer.wait_until_queue_size_decreases()

# Consumers process as fast as they can
for message in consumer:
    process(message)
Without backpressure, the producer can overwhelm the system. With backpressure, the system stays stable.
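A bounded queue gives the simplest form of backpressure: when the queue is full, the producer's put blocks until the consumer catches up. A sketch with Python's standard library:

```python
import queue
import threading
import time

work_queue = queue.Queue(maxsize=3)  # bounded: a full queue blocks the producer
processed = []

def slow_consumer():
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: shut down
            break
        time.sleep(0.01)  # consumer is deliberately slower than the producer
        processed.append(item)

worker = threading.Thread(target=slow_consumer)
worker.start()

# put() blocks whenever the queue holds 3 items, throttling the producer
for i in range(10):
    work_queue.put(i)
work_queue.put(None)
worker.join()
```

The producer never gets more than 3 items ahead, so memory stays bounded no matter how fast it runs.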
Dead Letter Queues
Some messages fail to process. If a consumer crashes while processing a message, or if the message format is invalid, the message should be moved to a dead letter queue.
try:
    process_message(message)
    ack(message)
except Exception as e:
    logger.error(f"Failed to process: {e}")
    dead_letter_queue.send(message)
    ack(message)  # Ack the original so it isn't redelivered; the DLQ copy is kept
Dead letter queues are processed separately. An engineer reviews failures and decides whether to reprocess or discard.
Ordering Guarantees
In-Order: Messages are processed in order (RabbitMQ, single-partition Kafka).
Good for: related operations (edit then delete).
Bad for: parallel processing (can't scale).
Unordered: Messages can be processed in any order (Kafka with multiple partitions, SQS).
Good for: independent operations (can parallelize).
Bad for: related operations (might process delete before edit).
Partitioned Ordering: Messages are ordered within a partition, but partitions are processed independently.
# Send related messages to the same partition
message = {'user_id': 123, 'action': 'edit_profile'}
producer.send('user_actions', message,
              partition=hash(message['user_id']) % num_partitions)
Same user's messages go to the same partition (ordered). Different users' messages go to different partitions (parallel).
Idempotency in Async
Since messages can be delivered multiple times, async operations must be idempotent.
# Idempotent: sending the same email twice has the same effect as once
def send_email(recipient, subject, content, email_id):
    if already_sent(email_id):
        return  # Already sent, don't send again
    email.send(recipient, subject, content)
    mark_as_sent(email_id)
Idempotency keys (here, email_id) prevent duplicate processing.
Async Workflow Patterns
Fan-Out-Fan-In: One event triggers multiple async tasks. Wait for all to complete.
# User signs up
publish('user.signed_up', {'user_id': 123})

# Multiple consumers react independently:
# Consumer 1: send welcome email
# Consumer 2: create account in warehouse
# Consumer 3: send to marketing system
# Fan-in: when done, all have processed the event
Chaining: One async task triggers another.
# Image uploaded
publish('image.uploaded', {'image_id': 456})
# Consumer 1: resize image
#   publish('image.resized', {'image_id': 456})
# Consumer 2: extract metadata
#   publish('metadata.extracted', {'image_id': 456})
# Consumer 3: update search index
Retries: If a task fails, retry with backoff.
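The backoff(attempt_count) helper in the retry snippet is left undefined; a common choice is exponential backoff with full jitter. A sketch (the base and cap values are illustrative):

```python
import random

def backoff(attempt_count, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: the delay envelope doubles with
    each attempt, and the actual delay is randomized within it so that
    failing clients don't all retry in lockstep."""
    envelope = min(cap, base * (2 ** attempt_count))
    return random.uniform(0, envelope)
```

The cap keeps delays bounded; the jitter spreads retries out, which avoids thundering-herd retry storms against a recovering service.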
try:
    process(message)
except TransientError:
    schedule_retry(message, delay=backoff(attempt_count))
except PermanentError:
    send_to_dlq(message)
Async and Consistency
Async introduces eventual consistency. A user signs up. The welcome email is sent asynchronously. For a moment, the user exists but hasn't received the email.
Design for this:
- Show "welcome email will arrive shortly"
- Don't depend on the email arriving immediately
- Log failures so engineers can investigate
- Provide a way to resend the email manually
AI-Generated Code and Async
Code generators often generate synchronous code. They fetch data immediately, process it, return the result. They don't think about decoupling or background work.
Bitloops helps by generating async workflows. Long-running work is automatically queued. Responses are immediate. Processing happens in the background.
Frequently Asked Questions
Should I use message queues or direct API calls?
Use queues when: work is slow, decoupling matters, and eventual processing is acceptable. Use direct calls when: you need the response immediately, ordering is critical, or strong consistency is required.
How do I handle backpressure?
Monitor queue size. If it's growing, slow down producers. This can be automatic (backpressure signals) or manual (monitoring and alerts).
What if a consumer crashes while processing a message?
The message remains in the queue (not acknowledged). When the consumer restarts, it processes the message again. Ensure operations are idempotent.
How do I order messages?
Use single-partition queues (ordered but can't parallelize). Or use partitioned queues with a partition key (ordered within partitions, parallelized across).
What's the difference between webhooks and message queues?
Webhooks: server calls client with result. One-to-one. Message queues: multiple consumers can process the same message. One-to-many. Use webhooks for direct responses. Use queues for fan-out.
How do I debug async workflows?
Add correlation IDs to messages. Track each message through the system. Log at each step. Use distributed tracing.
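A minimal version of that approach: mint a correlation ID at the edge, attach it to every message, and include it in every log line so one search reconstructs the whole flow (the function names here are illustrative):

```python
import json
import logging
import uuid

logging.basicConfig(format='%(message)s')
logger = logging.getLogger('async')

def publish(topic, payload, correlation_id=None):
    # Mint the ID at the edge; downstream steps propagate it unchanged
    correlation_id = correlation_id or str(uuid.uuid4())
    message = {'correlation_id': correlation_id, 'payload': payload}
    logger.info(json.dumps({'step': f'publish:{topic}',
                            'correlation_id': correlation_id}))
    return message

def consume(message, step):
    # Every log line carries the same correlation_id, so grepping for it
    # reconstructs the message's path through the system
    logger.info(json.dumps({'step': step,
                            'correlation_id': message['correlation_id']}))
    return message

msg = publish('user_events', {'user_id': 123})
consume(msg, 'send_welcome_email')  # logs the same correlation_id
```

Distributed tracing tools generalize this pattern with spans and timing, but a propagated ID in structured logs already answers "where did this message go?"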
Primary Sources
- Narkhede, Shapira, and Palino's comprehensive guide to Apache Kafka and stream processing at scale. Kafka: The Definitive Guide
- Videla and Williams' practical guide to RabbitMQ and message queue patterns. RabbitMQ in Action
- Martin Kleppmann's authoritative guide to data-intensive applications and systems. Designing Data-Intensive Applications
- Google Cloud's documentation for Pub/Sub messaging service. Google Cloud Pub/Sub
- Google's Site Reliability Engineering foundational book on systems design. SRE Book
- Google SRE team's practical guide to building reliable systems. SRE Workbook
- Brewer's update on CAP theorem and consistency models in distributed systems. CAP Twelve Years Later
- Apache Kafka's official documentation and architecture guide. Kafka Docs