Consistency Models And Failure Handling
Consistency means different things: transactional consistency, eventual consistency, causal consistency. Strong consistency is expensive at scale. Understanding the trade-offs is how you avoid building systems that fail silently.
Consistency means different things in different contexts. In databases, consistency is a property of transactions. In distributed systems, it's about different replicas seeing the same data. In user experience, it's about the UI showing what actually happened.
These aren't the same. A database can be "consistent" while the user's browser shows stale data. A distributed system can achieve strong consistency while the user perceives inconsistency because they're offline.
Understanding what consistency you need, and at what cost, is central to building reliable systems.
Consistency Models
Strong Consistency: Every read returns the most recent write. If you write a value and immediately read it, you get what you wrote.
Cost: slow. To guarantee strong consistency across replicas, you must synchronize before returning. This takes time.
Use when: data accuracy is critical. Account balances, permissions, medical records.
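One common way to get this guarantee is quorum replication: acknowledge a write only after a majority of replicas have it, and read from a majority, taking the newest version. A minimal in-memory sketch (names like `Replica` and `quorum_write` are illustrative, not a real library; failure handling is elided):

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def store(self, version, value):
        # Only accept writes newer than what we already have
        if version > self.version:
            self.version, self.value = version, value

REPLICAS = [Replica() for _ in range(3)]
QUORUM = len(REPLICAS) // 2 + 1  # majority: 2 of 3
_next_version = 0

def quorum_write(value):
    global _next_version
    _next_version += 1
    acked = 0
    for r in REPLICAS:
        r.store(_next_version, value)
        acked += 1
    return acked >= QUORUM  # only succeed once a majority acknowledged

def quorum_read():
    # Any two majorities overlap, so at least one replica in a read
    # quorum has the latest committed write; take the newest version.
    seen = [(r.version, r.value) for r in REPLICAS[:QUORUM]]
    return max(seen)[1]
```

The synchronization cost is visible here: a write cannot return until a majority has responded, which is exactly the latency strong consistency pays for.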
Eventual Consistency: Reads might return stale data, but eventually all replicas converge to the same state.
Cost: low. Writes return immediately. Consistency happens asynchronously.
Use when: staleness is tolerable. Social media feeds, notifications, non-critical caches.
Causal Consistency: If action A caused action B, you always see A before B. But unrelated actions might be out of order.
Cost: moderate. Requires tracking causality.
Example: You write a comment. Your friend reads the post. They see your comment (because the causality is preserved). But they might not see another comment that was also added (because it's causally unrelated).
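Causal ordering can be enforced by shipping each event with the IDs of the events it depends on, and holding an event back until its dependencies have been delivered. A toy sketch (the `Event` and `Feed` names are hypothetical):

```python
class Event:
    def __init__(self, eid, deps=()):
        self.eid = eid
        self.deps = set(deps)  # IDs of events that causally precede this one

class Feed:
    def __init__(self):
        self.delivered = set()
        self.pending = []

    def receive(self, event):
        self.pending.append(event)
        self._drain()

    def _drain(self):
        # Deliver any pending event whose causes are all visible;
        # repeat until no more progress is possible.
        progressed = True
        while progressed:
            progressed = False
            for ev in list(self.pending):
                if ev.deps <= self.delivered:
                    self.delivered.add(ev.eid)
                    self.pending.remove(ev)
                    progressed = True
```

A comment that arrives before the post it replies to is simply buffered until the post shows up; causally unrelated events are delivered in any order.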
Read Your Own Writes: You see your own writes immediately, but other users might see your changes with a delay.
Cost: low to moderate.
Example: You edit your profile. Immediately, you see your changes. Your friend sees your changes a few seconds later.
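A common way to implement read-your-own-writes is to route a user's reads to the primary for a short window after their own write, while everyone else reads from (possibly lagging) replicas. A sketch, with an illustrative 5-second window and asynchronous replication elided:

```python
import time

class Store:
    def __init__(self):
        self.primary = {}
        self.replica = {}          # lags behind the primary
        self.last_write_at = {}    # user -> time of their last write
        self.window = 5.0          # illustrative pin-to-primary window

    def write(self, user, key, value):
        self.primary[key] = value
        self.last_write_at[user] = time.monotonic()
        # replication to self.replica happens asynchronously (not shown)

    def read(self, user, key):
        # Users who wrote recently read the primary and see their own write;
        # everyone else reads a replica that may be stale.
        recent = time.monotonic() - self.last_write_at.get(user, -1e9) < self.window
        source = self.primary if recent else self.replica
        return source.get(key)
```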
Linearizability
Linearizability is a specific kind of strong consistency. It means the system behaves as if there's a single, globally ordered sequence of operations.
Imagine a bank account. Two people withdraw simultaneously. If linearizability is maintained, one withdrawal happens first, then the other. The balance is consistent. If not, both might think they successfully withdrew, resulting in an inconsistent balance.
Achieving linearizability across geographically distributed systems is expensive or impossible. You have to wait for the slowest replica. This introduces latency.
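Within a single process, a lock is enough to see what linearizability buys you: operations execute in one total order, so two concurrent withdrawals can never both succeed against insufficient funds. Across machines you would need consensus (e.g. Raft or Paxos) rather than a lock; this is only a single-node sketch:

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        with self._lock:  # withdrawals appear in one total order
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False
```

With two threads each trying to withdraw 80 from a balance of 100, exactly one succeeds and the balance stays consistent.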
Distributed Transactions
A transaction spans multiple services. "Transfer $100 from Account A to Account B." If A and B are in different services, you need coordination.
Two-Phase Commit (2PC): A coordinator asks all services to prepare. If all agree, it commits the transaction across all services. If any disagree, it rolls back everywhere.
Coordinator: "Are you ready to transfer $100 from Account A?"
Service A: "Yes, I reserved the funds"
Service B: "Yes, I reserved the capacity"
Coordinator: "Commit"
Service A: "Committed"
Service B: "Committed"

Pros: Strong consistency. Atomic. Either the transaction succeeds everywhere or fails everywhere.
Cons: Slow (requires coordination). Blocks. If one service is down, the whole transaction is blocked.
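The prepare/commit exchange above can be sketched with in-memory participants (hypothetical names; a real 2PC coordinator also persists votes so participants can recover after a crash):

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if we can reserve what's needed
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):  # phase 1: voting
        for p in participants:                  # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:                      # any "no" aborts everywhere
        p.rollback()
    return False
```

The blocking behavior is visible in the structure: nothing commits until every participant has voted, so one slow or dead participant stalls the whole transaction.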
Saga Pattern: Instead of a distributed transaction, break the operation into a sequence of local transactions.
1. Debit Account A: -$100
2. Credit Account B: +$100
3. Log the transfer
If step 2 fails:
2a. Reverse Account A: +$100
2b. Log the reversal

Each step is a local transaction (fast). If a step fails, compensating transactions reverse previous steps.
Pros: No blocking. More resilient. Each step is fast.
Cons: Eventual consistency. For a brief period, Account A is debited but Account B isn't credited. Complexity (need compensating transactions).
Choreography vs. Orchestration:
Choreography: Services react to events. Service A publishes "AccountA.Debited." Service B listens and credits Account B.
Orchestration: A central service (Saga orchestrator) coordinates. "First debit A, then credit B." More explicit control.
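An orchestrated saga can be reduced to a small loop: run each local transaction in order, remember its compensation, and on failure run the recorded compensations in reverse. A sketch, under the assumption that steps are plain callables:

```python
def run_saga(steps):
    """steps: list of (action, compensation) callable pairs."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            # Compensating transactions undo completed steps, newest first
            for undo in reversed(done):
                undo()
            return False
    return True
```

Choreography distributes this same logic across event handlers instead of one loop; the orchestrated form simply makes the ordering and the compensation path explicit in one place.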
Compensating Transactions
When a step fails, undo previous steps. This is a compensating transaction.
def transfer(account_a, account_b, amount):
    try:
        debit(account_a, amount)
        try:
            credit(account_b, amount)
        except Exception:
            # Compensating transaction: undo the debit
            credit(account_a, amount)
            raise
    except Exception:
        log(f"Transfer failed: {account_a} -> {account_b}")
        raise

Compensating transactions must be idempotent (safe to apply multiple times). If the credit fails, you undo the debit. If the undo fails (network timeout), you retry. Retrying the undo shouldn't cause problems.
Idempotency
An operation is idempotent if applying it multiple times has the same effect as applying it once.
"Set username to 'Alice'" is idempotent. Applying it twice leaves the username as 'Alice'.
"Increment balance by 10" is not idempotent. Applying it twice increments twice.
In distributed systems, idempotency is crucial because operations can be retried.
Idempotency Keys: Include a unique key with the operation. The server records which keys have been processed.
POST /transfers
{
  "idempotency_key": "transfer-20240305-001",
  "from": "account_a",
  "to": "account_b",
  "amount": 100
}

If the same request is sent twice:
- First request: Transfer succeeds. Server records the key.
- Second request: Server sees the key was already processed. Returns the same result without re-processing.
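On the server side, idempotency keys boil down to caching the first result per key. An in-memory sketch (a real server persists this table and expires old keys; `handle_transfer` and the dict-shaped request are illustrative):

```python
processed = {}  # idempotency_key -> first result
balances = {"account_a": 500, "account_b": 0}

def handle_transfer(request):
    key = request["idempotency_key"]
    if key in processed:
        return processed[key]  # replay: return cached result, no second transfer
    balances[request["from"]] -= request["amount"]
    balances[request["to"]] += request["amount"]
    result = {"status": "ok", "balances": dict(balances)}
    processed[key] = result
    return result
```

Sending the identical request twice moves the money once and returns the same response both times.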
Failure Scenarios
Real systems fail in ways you don't expect.
Network Partition: Services can't communicate. A thinks B is dead. B thinks A is dead. They diverge.
Slow Service: A service is responding, but very slowly (200ms instead of 10ms). Retries timeout. The system degrades.
Cascading Failures: Service A is slow. Service B waits for A. Service B becomes slow. Service C waits for B. Everything becomes slow.
Data Corruption: A service crashes mid-operation, leaving data in an inconsistent state.
Resilience Patterns
Timeouts: If a service doesn't respond within a timeout, assume it failed. Don't wait forever.
response = requests.get('http://slow-service', timeout=5)

Retries with Backoff: Transient failures might recover. Retry with exponential backoff.
for attempt in range(3):
    try:
        return fetch_data()
    except TransientError:
        if attempt == 2:
            raise  # out of retries; don't swallow the failure
        wait(2 ** attempt)  # wait 1s, then 2s

Circuit Breaker: If a service is returning errors, stop calling it. Return an error immediately instead of waiting.
circuit_breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def call_service():
    if circuit_breaker.is_open():
        raise ServiceUnavailable()
    try:
        response = service.call()
        circuit_breaker.record_success()
        return response
    except Error:
        circuit_breaker.record_failure()
        raise

If 5 calls fail, the circuit opens. Subsequent calls fail immediately for 60 seconds. After 60 seconds, the circuit half-opens. One request is allowed through to test if the service recovered.
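The CircuitBreaker used above could be implemented along these lines. This is a sketch (libraries such as pybreaker offer hardened versions), with an injectable clock so the timeout behavior can be tested without sleeping:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.opened_at = None
        self.clock = clock  # injectable for testing

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.timeout:
            return False  # half-open: let one trial call through
        return True

    def record_success(self):
        # Trial call succeeded (or service is healthy): close the circuit
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```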
Bulkheads: Isolate resources. Different services use different connection pools. If one service exhausts its pool, others aren't affected.
# Service A has its own pool
pool_a = ConnectionPool(max_size=10)

# Service B has its own pool
pool_b = ConnectionPool(max_size=10)

# If A exhausts its pool, B still works

Fallback: When something fails, use a fallback.
def get_user_recommendations():
    try:
        return ml_service.get_recommendations()
    except Exception:
        # Fallback: return static recommendations
        return ["item_1", "item_2", "item_3"]

Graceful Degradation: When the system is struggling, degrade functionality instead of failing completely.
Instead of: "Recommendation service is down. You can't view the product."
Do: "Recommendation service is slow. Showing you related products instead of personalized recommendations."
Optimistic vs. Pessimistic Failures
Optimistic: Assume the operation will succeed. Update the client immediately. If it fails, revert.
const addItem = (item) => {
  store.cart.push(item); // Optimistic
  fetch('/api/cart/items', { method: 'POST', body: JSON.stringify(item) })
    .catch(() => {
      store.cart = store.cart.filter(i => i !== item); // Revert
    });
};

Good for: operations that usually succeed. Responsive UX.
Bad for: operations that might fail. Can show incorrect UI temporarily.
Pessimistic: Assume the operation might fail. Wait for confirmation before updating the client.
const addItem = (item) => {
  fetch('/api/cart/items', { method: 'POST', body: JSON.stringify(item) })
    .then((res) => {
      if (res.ok) store.cart.push(item); // Update only after confirmed success
    });
};

Good for: critical operations. Safe UX.
Bad for: frequent, low-risk operations. Every update waits for the server, so the UI feels slower.
Intent Preservation
When an operation fails, preserve the user's intent. Don't just show an error and forget.
Instead of: "Add to cart failed. Try again."
Do: "Added to cart. Syncing..." (with a spinner), and when the network is down: "Offline. Will sync when you're back online."
Intent preservation means:
- Queue operations that fail due to network issues
- Retry when connectivity returns
- Show the UI as if the operation succeeded (optimistic)
- Sync in the background
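Those behaviors can be sketched as a small queue: operations that fail for network reasons are kept rather than dropped, and replayed when connectivity returns (`IntentQueue` and its `send` callable are illustrative names):

```python
class IntentQueue:
    def __init__(self, send):
        self.send = send      # callable that may raise ConnectionError
        self.pending = []

    def submit(self, op):
        try:
            self.send(op)
        except ConnectionError:
            self.pending.append(op)  # preserve the user's intent

    def on_reconnect(self):
        # Replay queued operations; keep any that still fail
        still_pending = []
        for op in self.pending:
            try:
                self.send(op)
            except ConnectionError:
                still_pending.append(op)
        self.pending = still_pending
```

The UI can show the operation optimistically while `pending` is non-empty and surface a "will sync" indicator, rather than an error.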
Eventually Consistent UIs
When the system is eventually consistent, the UI must reflect that.
Show:
- When data is being fetched: "Loading..."
- When data is stale: "Last updated 5 minutes ago"
- When data is being saved: "Saving..."
- When sync fails: "Not synced. Will retry."
Users can understand these states. What they can't understand is silently stale data.
AI-Generated Code and Failure Handling
Code generators tend to be optimistic. They assume operations succeed. They don't handle timeouts, retries, or fallbacks.
Bitloops helps by generating failure-aware code. Timeouts are set by default. Retries are automatic. Fallbacks are defined. The generated system is resilient.
Frequently Asked Questions
Should I use strong consistency or eventual?
Use strong consistency when correctness is critical (financial transactions, permissions). Use eventual consistency when speed matters (social feeds, recommendations). Hybrid: strong for critical data, eventual for less critical.
How do I handle transactions across services?
Use the Saga pattern. Break into local transactions. Use compensating transactions to undo on failure. Add idempotency keys to handle retries.
What's the difference between a timeout and a retry?
Timeout: how long to wait for a response. If no response, assume failure. Retry: try again if the first attempt fails.
Use both: set a timeout (5 seconds). If timeout, retry with backoff (wait 1s, then 2s, then 4s).
How do I know if data is stale?
Timestamp it. "This data was fetched at 2024-03-05 10:30:00." Show the timestamp to users. They can decide if it's recent enough.
Can I lose data with eventual consistency?
No (if designed properly). With eventual consistency, data is replicated across multiple nodes. As long as one node has the data, it's not lost. But for a brief period, different replicas might have different data.
How do I handle conflicts in eventual consistency?
Last-write-wins: latest timestamp wins. User-resolved: show both versions, let user choose. CRDTs: data structure that merges automatically.
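Last-write-wins can be sketched as a merge over replica states shaped like `{key: (timestamp, value)}`: the newer timestamp wins per key, so replicas converge regardless of the order in which they exchange state (real systems break timestamp ties with a replica ID, omitted here):

```python
def lww_merge(a, b):
    """Merge two replica states of the form {key: (timestamp, value)}."""
    merged = dict(a)
    for key, (ts, val) in b.items():
        # Keep whichever write has the newer timestamp
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, val)
    return merged
```

Because the merge is commutative, both replicas end up with the same state no matter who merges first; that convergence property is the defining feature CRDTs generalize.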
Primary Sources
- Martin Kleppmann's comprehensive guide to data-intensive systems and consistency. Designing Data-Intensive Applications
- Chris Richardson's guide to the Saga pattern for distributed transactions. Saga Pattern
- Google's Site Reliability Engineering book on failure handling and recovery. SRE Book
- Google SRE workbook with practical failure handling strategies and patterns. SRE Workbook
- Brewer's CAP theorem update addressing consistency and partition tolerance. CAP Twelve Years Later
- Apache Kafka documentation covering failure handling and durability guarantees. Kafka Docs