Distributed Systems Fundamentals
Building a distributed system means accepting that things will fail. Networks partition. Servers crash. Data gets out of sync. The question isn't whether your system will experience these failures; it's whether your architecture can tolerate them. The CAP theorem, consensus protocols, and the eight fallacies of distributed computing map out what's actually possible.
Distributed systems fundamentals are about understanding what's possible and what's not. Some problems have proven solutions (consensus, for example). Others have no perfect solution, only tradeoffs: during a network partition, you must choose between consistency and availability. Many developers build distributed systems without understanding these constraints, and end up with systems that either fail silently or make implicit guarantees they can't keep.
The CAP Theorem
The CAP theorem is the most misunderstood result in distributed systems. It's not about choosing between three options. It's about understanding the consequence of network partitions.
Consistency: Every read returns the most recent write. If you write a value and immediately read it, you get back what you wrote.
Availability: Every request receives a response, even if the system is experiencing failures.
Partition Tolerance: The system continues operating even when network messages are delayed or lost (a partition).
The theorem states that in the presence of a network partition, you can guarantee two of the three, but not all three.
The catch: in practice, you always have partition tolerance because networks always partition. You can't avoid it. You can only accept it and choose between consistency and availability.
This reframes the question. It's not "should I build a CP or AP system?" It's "when the network partitions, should I block writes to maintain consistency, or accept writes to maintain availability?"
Consistency-First (CP): During a partition, the minority side refuses writes. Users routed to that side get errors, but consistency is maintained. The system stays available only to the partition holding a majority of nodes.
Example: A distributed lock service. During a partition, you want to maintain the invariant that only one client holds a lock. So you reject lock requests during the partition.
Availability-First (AP): During a partition, the system accepts writes on both sides. Consistency is violated temporarily. When the partition heals, the system reconciles the conflicting writes.
Example: A social media feed. During a partition, both sides accept posts. When the partition heals, you reconcile by showing all posts. Temporary inconsistency is acceptable.
Most production systems aren't purely CP or AP. They're "CP by default with AP fallbacks." Normally, they enforce consistency. During extended partitions, they degrade to AP (accepting writes but maybe showing stale data) until the partition heals.
Network Partitions
Networks partition more often than you think. A fiber cut. A misconfigured router. A cloud region outage. Congestion on the network. Any of these causes nodes to stop talking to each other.
When a partition occurs, there's a critical question: is the other side dead, or is it just not responding? You can't tell the difference. A timeout looks the same whether the other side crashed or is just slow.
This matters because you have to make decisions without certainty. If you assume the other side is dead and start serving requests, you might be running in a partition where both sides are alive and diverging. If you assume the other side is alive and wait, you're unavailable while the partition lasts.
Quorum-based systems handle this by requiring a majority of nodes to agree before accepting writes. If you have 5 nodes and 3 are reachable, you can write. If only 2 are reachable, you can't. This guarantees that at most one partition can accept writes, preventing divergence.
But quorum systems have a cost: a majority must always be reachable. With 5 nodes, quorum is 3, so the cluster tolerates two failures. With 3 nodes, quorum is 2, so it tolerates one. An even number of nodes adds no extra fault tolerance over the next smaller odd number (4 nodes still need 3 for quorum, tolerating only one failure), which is why clusters usually run 3 or 5 nodes.
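The quorum rule itself is just a majority test. A minimal sketch in Python (names are illustrative):

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """A write may proceed only if a strict majority of nodes is reachable."""
    return reachable >= cluster_size // 2 + 1

# With 5 nodes, quorum is 3: the cluster tolerates 2 unreachable nodes.
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False
# With 3 nodes, quorum is 2: one failure is tolerated.
print(has_quorum(2, 3))  # True
```

Because two disjoint majorities of the same cluster cannot exist, at most one side of a partition can pass this test, which is what prevents divergence.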
Consensus Protocols
Consensus is the problem of getting a distributed system to agree on a value despite failures and delays. It's foundational.
Consensus protocols define rules for proposing values, voting, and deciding when a value is accepted. Two main families exist: Paxos (older, more complex) and Raft (newer, simpler).
Raft
Raft is designed for understandability. Leaders coordinate writes. Followers replicate. If the leader dies, followers elect a new leader.
The protocol has three roles:
Leader: Accepts client requests and replicates them to followers. Only the leader can handle writes.
Followers: Replicate the leader's writes. If the leader dies, they participate in electing a new leader.
Candidate: A follower that's attempting to become the new leader.
When a leader dies (detected via a heartbeat timeout), followers become candidates and request votes. The candidate that gains a majority of votes becomes the new leader. Because each node votes at most once per term, at most one leader can win a given term.
Raft guarantees that if a write is committed (replicated to a majority), it will survive any failures going forward. A new leader will never revert committed writes.
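The commit rule can be sketched as a pure function over replication progress. This is an illustrative simplification: real Raft additionally requires that the entry being committed belongs to the leader's current term.

```python
def commit_index(match_indexes: list[int], leader_last_index: int) -> int:
    """Highest log index replicated on a majority of the cluster.

    match_indexes: last replicated log index per follower.
    The leader itself always holds leader_last_index.
    """
    indexes = sorted(match_indexes + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th highest index is present on at least a majority of nodes.
    return indexes[majority - 1]

# 5-node cluster: leader at index 7, followers at 7, 5, 3, 2.
# Index 5 is on three nodes (a majority), so entries up to 5 are committed.
print(commit_index([7, 5, 3, 2], 7))  # 5
```

Once an index is committed this way, any future leader must also hold it, because winning an election requires votes from a majority, and every majority overlaps the one that stored the entry.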
Paxos
Paxos is older and more complex. It has been deployed in production for decades (Google's Chubby lock service is built on it; Apache ZooKeeper uses the closely related ZAB protocol). It's harder to understand than Raft, but the core guarantee is similar: writes replicated to a majority survive failures.
For most engineers, Raft is the more useful mental model. The specifics of Paxos matter less than understanding that consensus protocols exist, they're deterministic (not based on luck), and they require a majority to proceed.
Distributed State and Replication
Consensus tells you how to agree on a single write. But distributed systems must replicate state across multiple nodes. This creates new problems.
If you have a primary database and a replica, the primary accepts writes and the replica follows asynchronously. If the primary dies before replicating to the replica, the replica hasn't seen the write. When you promote the replica to primary, that write is lost.
This is a fundamental tradeoff. Synchronous replication (wait for the replica to acknowledge before returning success) is safe but slow. Asynchronous replication (return success immediately, replicate later) is fast but risky.
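The tradeoff can be sketched with toy in-memory "replicas" (Python lists standing in for real followers; the blocking behavior is simulated, not real network I/O):

```python
class Primary:
    """Toy primary illustrating the sync/async replication tradeoff."""

    def __init__(self, replicas):
        self.log = []
        self.replicas = replicas  # each replica is a list acting as its log

    def write_sync(self, value):
        # Safe but slow: the write reaches every replica before we ack.
        self.log.append(value)
        for replica in self.replicas:
            replica.append(value)  # stands in for waiting on an ack
        return "ok"

    def write_async(self, value):
        # Fast but risky: ack immediately; replication happens later.
        # If the primary dies first, this write is lost on failover.
        self.log.append(value)
        return "ok"

r1, r2 = [], []
p = Primary([r1, r2])
p.write_sync("a")    # "a" is on the primary and both replicas
p.write_async("b")   # "b" is only on the primary so far
print(p.log, r1)     # ['a', 'b'] ['a']
```

If the primary crashes right after `write_async("b")`, promoting `r1` loses "b"; that is the risk the faster path accepts.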
Replication Lag is the time between when a write is accepted and when all replicas see it. During this window, reads from different replicas return different results.
Read Consistency Models describe what you're guaranteed to see:
- Strong Consistency: You always read the most recent write. But achieving this requires waiting for replication, which is slow. Only the primary can safely answer reads (or you must sync replicas before each read).
- Eventual Consistency: Replicas might be behind. Your reads return potentially stale data. But reads are fast because they don't wait for synchronization.
- Causal Consistency: If you see the effect of write A (data that was created by A), you will also see write A itself and any writes that causally preceded A. This is stronger than eventual consistency but weaker than strong consistency.
Most systems use eventual consistency by default and add strong consistency where needed. Quoting prices requires strong consistency; a social media feed doesn't. For a deeper dive into these tradeoffs, see consistency models and failure handling.
Message Ordering
In a distributed system, messages arrive out of order. This breaks assumptions that developers make when writing single-threaded code.
If you send message A, then message B, they might arrive in order B, A. Or they might be processed out of order due to queuing delays. Or one might be delivered, the other lost.
At-Least-Once Delivery: The system guarantees that messages are delivered at least once, but might deliver them multiple times. The receiver must handle duplicates.
Exactly-Once Delivery: The system guarantees that messages are delivered exactly once. This is expensive to implement. It requires the receiver to track which messages have been processed and deduplicate.
At-Most-Once Delivery: The system delivers the message at most once, but might lose it. This is common in real-world systems (UDP, for example). The sender must handle lost messages.
Most systems use at-least-once semantics and ensure their operations are idempotent (safe to apply multiple times).
Idempotency
An operation is idempotent if applying it multiple times has the same effect as applying it once.
Adding 10 to a number is not idempotent: applying it twice adds 20. Setting a value to 10 is idempotent: applying it twice still leaves the value at 10.
In distributed systems, idempotency is crucial because messages can be retried. If a payment message is delivered twice due to a network timeout, the operation must be safe to apply twice.
Design idempotent operations by including identifiers that make the operation uniquely identifiable:
# Not idempotent - applying twice charges twice
POST /payments
{"amount": 100, "account": "alice"}
# Idempotent - applying twice still results in one charge
POST /payments
{"idempotency_key": "user-alice-2024-03-05-0001", "amount": 100, "account": "alice"}

The server records which idempotency keys it has processed. If the same key arrives again, it returns the same result without re-processing.
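Server-side, the idempotency-key bookkeeping might look like this sketch (the handler and store names are illustrative, and an in-memory dict stands in for a durable store):

```python
results = {}   # idempotency_key -> stored response

charges = []   # every charge actually executed, for illustration

def charge(account, amount):
    charges.append((account, amount))
    return f"charged {account} {amount}"

def post_payment(idempotency_key, account, amount):
    """Replay-safe handler: a repeated key returns the stored result."""
    if idempotency_key in results:
        return results[idempotency_key]
    response = charge(account, amount)
    results[idempotency_key] = response
    return response

first = post_payment("user-alice-2024-03-05-0001", "alice", 100)
retry = post_payment("user-alice-2024-03-05-0001", "alice", 100)
print(first == retry, len(charges))  # True 1
```

The retry returns the original response byte-for-byte, and only one charge ever ran.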
Clock Synchronization
Distributed systems often need to know the order of events across different servers. Did event A happen before event B?
Relying on wall-clock time (system clocks) is fragile. Clocks drift. Servers are in different timezones. NTP (Network Time Protocol) keeps clocks roughly synchronized, but not perfectly.
Logical Clocks solve this differently. Instead of relying on wall-clock time, they assign ever-increasing numbers to events. If event A happened before event B, A's clock value is lower than B's. (The converse doesn't hold: a lower value alone doesn't prove one event preceded the other.)
Lamport Timestamps are simple: each server keeps a counter and increments it on every local event. It sends the counter value with each message; on receipt, the recipient sets its counter to the maximum of its own value and the received value, plus one. Breaking ties by server ID yields a total ordering of events.
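A Lamport clock fits in a few lines. This sketch implements the two rules above:

```python
class LamportClock:
    """Logical clock: local events and message receipt both advance time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event (including sending a message)."""
        self.time += 1
        return self.time

    def receive(self, sender_time):
        """On receipt, jump past the sender's clock if it is ahead."""
        self.time = max(self.time, sender_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()           # a's clock: 1
t_recv = b.receive(t_send)  # b's clock: max(0, 1) + 1 = 2
print(t_send, t_recv)       # 1 2
```

Note that the receive event is always stamped later than the send event, preserving the happened-before relationship.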
Vector Clocks extend this to capture causality. Each server tracks a counter for every server, including itself. When an event occurs, the server increments its own entry. When a message is sent, the full vector is attached; the receiver takes the element-wise maximum of its vector and the received one, then increments its own entry. Comparing two vectors tells you whether one event causally preceded another, or whether they were concurrent.
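The causality comparison is an element-wise test. A minimal sketch (dicts stand in for per-server vectors; server names are illustrative):

```python
def happened_before(v1, v2):
    """True if the event stamped v1 causally preceded the one stamped v2:
    every entry of v1 is <= the matching entry of v2, and at least one
    entry is strictly smaller."""
    nodes = set(v1) | set(v2)
    le = all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)
    lt = any(v1.get(n, 0) < v2.get(n, 0) for n in nodes)
    return le and lt

a = {"s1": 2, "s2": 0}
b = {"s1": 3, "s2": 1}   # b has seen all of a's events, plus more
c = {"s1": 1, "s2": 4}   # neither dominates a: the events are concurrent

print(happened_before(a, b))                            # True
print(happened_before(a, c) or happened_before(c, a))   # False
```

When neither vector dominates the other, the events are concurrent, which is exactly the case a single Lamport counter cannot detect.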
In practice, many systems use logical clocks (Lamport or vector) for ordering rather than relying on wall-clock time. It's more predictable.
The Eight Fallacies of Distributed Computing
These are assumptions that developers often make about distributed systems, all of which are false:
1. The network is reliable. Networks fail. Messages are lost. Servers don't respond. Your code must handle this.
2. Latency is zero. Even local network calls have latency. Distant calls have much more. Design for high latency. Don't make synchronous chains of calls.
3. Bandwidth is infinite. Networks have limited bandwidth. Compression matters. Efficient protocols matter. Don't transfer unnecessary data.
4. The network is secure. Your network peers might be compromised. Use authentication and encryption. Don't trust messages from peers.
5. Topology doesn't change. Servers are added and removed. IPs change. DNS records change. Your code must adapt to topology changes, not assume a fixed set of peers.
6. There is one administrator. In a large organization, different teams manage different parts of the system. Dependencies are complex. Assume you don't control the full stack.
7. Transport cost is zero. Sending data has cost—in latency, bandwidth, and computation. Don't treat network calls as free. Batch requests. Cache responses.
8. The network is homogeneous. Different parts of your system use different protocols, versions, and standards. Your code must handle heterogeneity. Don't assume all nodes speak the same language.
Practical Patterns for Handling Failure
Understanding theory is one thing. Building systems that don't fall apart is another.
Timeouts: When you call another service, set a timeout. If no response arrives within the timeout, fail the call. This prevents indefinite waiting.
Retries: Transient failures (network hiccup) are worth retrying. Permanent failures (invalid request) should not be retried. Implement exponential backoff to avoid overwhelming a struggling service.
Circuit Breakers: Part of a broader set of error handling and resilience patterns. If a service keeps returning errors, stop calling it; return an error to your client immediately instead of trying again. When the service recovers, resume calling it. This protects both your service and the failing one.
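A minimal circuit breaker sketch (thresholds and state handling are illustrative; production breakers treat the half-open probe state more carefully and are thread-safe):

```python
import time

class CircuitBreaker:
    """Stop calling a failing service; probe it again after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, op):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: allow one probe
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit fully
        return result

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60)
calls = {"n": 0}

def broken():
    calls["n"] += 1
    raise ConnectionError("service down")

for _ in range(2):
    try:
        breaker.call(broken)
    except ConnectionError:
        pass

try:
    breaker.call(broken)   # circuit is now open: broken() is never invoked
except RuntimeError as e:
    print(e)               # circuit open: failing fast
```

After two failures the breaker opens, so the third call fails fast without touching the struggling service at all.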
Bulkheads: If one dependency fails, it shouldn't take down your whole service. Use separate thread pools or connections for different dependencies. If one exhausts its resources, others continue working.
Degradation: When a dependency is slow or unavailable, gracefully degrade. Show cached data. Show a fallback. Show a message to the user. Never fail silently.
AI-Native Considerations
Systems designed with AI assistance need to account for data access patterns. AI agents often fetch the same data repeatedly, miss optimization opportunities, and make synchronous chains of calls that don't scale.
Bitloops helps by generating code that understands distributed system patterns: circuit breakers, timeouts, retries, and caching are baked in. Generated code doesn't fall for the fallacies; it's built to expect failure.
Frequently Asked Questions
Can I avoid distributed systems?
Not if you want to scale or have high availability. A single server is simpler but has a fixed ceiling. Distributed systems are complex but enable growth.
How do I test distributed systems?
Chaos engineering. Deliberately inject failures (network partitions, latency, errors) and verify the system handles them correctly. Tools like Gremlin and Chaos Mesh automate this.
What's the difference between leader-based and leaderless replication?
Leader-based: one primary node handles all writes, replicas follow. Simpler but the leader is a bottleneck. Leaderless: all nodes accept writes and coordinate with each other. More complex but more resilient.
Should I use consensus for everything?
No. Consensus is expensive (requires a majority to agree, which is slow). Use it for critical state that must be consistent. For data that can be eventually consistent, use simpler replication.
What's the difference between partition tolerance and network reliability?
Partition tolerance means the system keeps operating even when the network is degraded. Network reliability is the (false) assumption that the network always works. Since real networks aren't reliable, you build for partition tolerance.
How do I know if my system is actually distributed?
If you have multiple processes that must coordinate, you have a distributed system. Whether they're on the same machine or different machines, the same problems appear.
Primary Sources
- Martin Kleppmann's comprehensive guide to designing data-intensive systems. Designing Data-Intensive Applications
- Ongaro and Ousterhout's Raft consensus algorithm paper for understandable fault tolerance. Raft Paper
- Google's foundational Site Reliability Engineering book on distributed systems. SRE Book
- Google SRE workbook with practical distributed systems patterns and solutions. SRE Workbook
- Brewer's CAP theorem update addressing consistency in distributed systems. CAP Twelve Years Later
- Peter Deutsch's foundational fallacies all distributed systems engineers must know. Fallacies
- Apache Kafka documentation for distributed message queuing and replication. Kafka Docs