Bitloops - Git captures what changed. Bitloops captures why.

Error Handling and Resilience Patterns

Systems fail: networks time out, services crash, data gets corrupted. Good error handling keeps you running when parts break. Retry patterns, circuit breakers, and bulkheads stop cascading failures and keep users from seeing 500 errors.

8 min read · Updated March 4, 2026 · Engineering Best Practices

Every system experiences failures. Networks time out. Services crash. Databases lock. Third-party APIs go down. These are the realities of distributed systems. The difference between a robust system and a fragile one is how it handles these failures.

Error handling has two levels. First, inside your code: what do you do when something goes wrong? Second, across your system: how do you keep running when components fail? Both matter.

Why This Matters

System reliability depends on it. A system where failures cascade (one component dies, which kills another, which kills another) is brittle. A system that isolates failures and degrades gracefully stays running.

User experience depends on it. A system that shows "Internal Server Error" is worse than one that shows "Payment processing is slow right now, please try again in a moment." The first is hostile. The second is helpful.

Cost depends on it. Every minute of downtime costs money and damages trust. Good error handling and resilience patterns prevent cascading failures and reduce downtime.

Error Handling Philosophy

The central tension is fail fast vs. fail gracefully.

Fail fast: when something is wrong, stop immediately and raise an error. This is good for detecting bugs early. Bad data is immediately visible.

function createUser(data) {
  if (!data.email) throw new ValidationError('Email required');
  if (!data.name) throw new ValidationError('Name required');
  // only continue if valid
  return database.insert(data);
}
javascript

Fail gracefully: when something is wrong, try to continue with degraded functionality. This is good for user experience. Non-critical features can fail without breaking everything.

async function getUser(userId) {
  try {
    const user = await database.query('SELECT * FROM users WHERE id = $1', [userId]);
    return user;
  } catch (error) {
    logger.error('Database error fetching user:', error);
    return { id: userId, name: 'Unknown User', email: null };
  }
}
javascript

Use fail-fast for validation and core logic. Use fail-gracefully for non-critical features and integrations.

Exception Hierarchies

Organize exceptions by type. This lets callers handle different errors differently.

class ApplicationError extends Error {
  constructor(message, statusCode = 500) {
    super(message);
    this.name = 'ApplicationError';
    this.statusCode = statusCode;
  }
}

class ValidationError extends ApplicationError {
  constructor(message) {
    super(message, 400);
    this.name = 'ValidationError';
  }
}

class NotFoundError extends ApplicationError {
  constructor(message) {
    super(message, 404);
    this.name = 'NotFoundError';
  }
}

class PermissionError extends ApplicationError {
  constructor(message) {
    super(message, 403);
    this.name = 'PermissionError';
  }
}

class ExternalServiceError extends ApplicationError {
  constructor(message, retryable = true) {
    super(message, 503);
    this.name = 'ExternalServiceError';
    this.retryable = retryable;
  }
}
javascript

Now callers can handle each type appropriately:

try {
  const user = await getUser(userId);
} catch (error) {
  if (error instanceof ValidationError) {
    res.status(400).json({ error: error.message });
  } else if (error instanceof NotFoundError) {
    res.status(404).json({ error: error.message });
  } else if (error instanceof ExternalServiceError && error.retryable) {
    // retry later
  } else {
    res.status(500).json({ error: 'Internal error' });
  }
}
javascript

Error Boundaries

Error boundaries isolate failures. If one component crashes, it doesn't crash the entire application.

In React, use error boundaries:

class ErrorBoundary extends React.Component {
  state = { hasError: false };

  static getDerivedStateFromError() {
    // update state so the next render shows the fallback UI
    return { hasError: true };
  }

  componentDidCatch(error, errorInfo) {
    logger.error('Component error:', error, errorInfo);
  }

  render() {
    if (this.state.hasError) {
      return <div>Something went wrong. Please refresh.</div>;
    }
    return this.props.children;
  }
}

// Use it
<ErrorBoundary>
  <UserProfile userId={123} />
</ErrorBoundary>
javascript

If UserProfile crashes, the error boundary catches it and prevents the whole app from crashing.

Retry Patterns

Some failures are temporary: network timeouts, a service briefly down, rate limiting. Retrying often succeeds.

Simple retry: Retry immediately, a few times.

async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      // try again
    }
  }
}
javascript

This is too simple. The service might be overloaded. Retrying immediately makes it worse.

Exponential backoff: Wait longer between each retry.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.pow(2, i) * 100; // 100ms, 200ms, 400ms, ...
      await sleep(delay);
    }
  }
}
javascript

Exponential backoff with jitter: Add randomness to prevent thundering herd.

async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const baseDelay = Math.pow(2, i) * 100;
      const jitter = Math.random() * baseDelay;
      const delay = baseDelay + jitter;
      await sleep(delay);
    }
  }
}
javascript

When many clients retry simultaneously, they might all retry at the same time (thundering herd). Jitter spreads them out.

Only retry on retryable errors (timeout, 5xx). Don't retry on validation errors (4xx) because retrying won't help.
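A sketch of that classification, assuming errors expose either an HTTP `status` or a Node-style network `code` (adjust to whatever your HTTP client actually throws):

```javascript
// Decide whether a failure is worth retrying. Network-level failures and
// 5xx/429 responses are usually transient; 4xx client errors will fail again.
function isRetryable(error) {
  if (error.code === 'ECONNRESET' || error.code === 'ETIMEDOUT') return true;
  if (error.status === 429) return true; // rate limited: back off and retry
  if (error.status >= 500) return true;  // server-side, often transient
  return false;                          // 4xx and everything else: don't retry
}

async function fetchWithRetry(url, maxRetries = 3) {
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url);
    } catch (error) {
      if (!isRetryable(error) || i === maxRetries - 1) throw error;
      await sleep(Math.pow(2, i) * 100 * (1 + Math.random())); // backoff + jitter
    }
  }
}
```

The classification belongs in one place so every retry loop in the codebase agrees on what "retryable" means.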

Circuit Breaker

A circuit breaker prevents cascading failures. If a service is down, don't keep calling it. Stop and fail fast.

class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED'; // CLOSED, OPEN, or HALF_OPEN
    this.lastFailureTime = null;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime > this.timeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    // a failed probe in HALF_OPEN reopens immediately; otherwise open at threshold
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.threshold) {
      this.state = 'OPEN';
    }
  }
}

// Usage
const paymentBreaker = new CircuitBreaker(5, 60000);

async function processPayment(amount) {
  return paymentBreaker.call(() => paymentGateway.charge(amount));
}
javascript

States:

  • CLOSED: Normal operation, requests go through
  • OPEN: Too many failures, requests fail immediately without calling the service
  • HALF_OPEN: Timeout expired, try a request. If it succeeds, close. If it fails, reopen.

This prevents hammering a failing service.
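An open breaker still throws; pairing it with a fallback decides what callers see instead. A minimal sketch, assuming the `breaker` argument exposes `call(fn)` like the class above (the rates-API names are hypothetical):

```javascript
// Wrap a breaker-protected call with a fallback so callers never see the
// raw "Circuit breaker is OPEN" error.
async function withFallback(breaker, fn, fallback) {
  try {
    return await breaker.call(fn);
  } catch (error) {
    // fast failure from an open breaker, or a real error from fn
    return fallback(error);
  }
}

// Hypothetical usage: serve cached exchange rates while the live API is down
// const rates = await withFallback(
//   ratesBreaker,
//   () => ratesApi.fetchLatest(),
//   () => getCachedRates()
// );
```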

Bulkhead Pattern

Isolate critical resources. If one feature consumes all database connections, other features fail. Bulkheads partition resources.

class ConnectionPool {
  constructor(size = 10) {
    this.size = size;
    this.connections = [];
    this.available = size;
  }
  // getConnection() waits for a free connection; releaseConnection()
  // returns one to the pool (implementations omitted for brevity)
}

// Create separate pools for different features
const paymentConnections = new ConnectionPool(5);
const reportingConnections = new ConnectionPool(3);
const defaultConnections = new ConnectionPool(10);

async function executeQuery(query, pool = defaultConnections) {
  const conn = await pool.getConnection();
  try {
    return await conn.execute(query);
  } finally {
    pool.releaseConnection(conn);
  }
}
javascript

Now if payment queries consume all 5 connections, reporting still has 3 and default still has 10.
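The same idea works without a real connection pool: a counting semaphore caps how much concurrency each feature may consume. A minimal sketch, assuming nothing beyond standard promises:

```javascript
// At most `size` callers run concurrently; the rest wait in a FIFO queue.
class Bulkhead {
  constructor(size) {
    this.available = size;
    this.waiters = [];
  }

  async run(fn) {
    if (this.available > 0) {
      this.available--;
    } else {
      // no free slot: wait until a running task hands one over
      await new Promise((resolve) => this.waiters.push(resolve));
    }
    try {
      return await fn();
    } finally {
      const next = this.waiters.shift();
      if (next) next();        // pass the slot directly to a waiter
      else this.available++;   // or return it to the pool
    }
  }
}

// Hypothetical usage: reporting can never starve payments of concurrency
// const reportingBulkhead = new Bulkhead(3);
// reportingBulkhead.run(() => runMonthlyReport());
```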

Timeout Strategies

Never make a request without a timeout. Requests without timeouts can hang forever.

async function fetchWithTimeout(url, timeout = 5000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeout);

  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timeoutId);
  }
}
javascript

Cascading timeouts: If A calls B calls C, set timeouts so they compose properly.

async function processOrder(order, timeoutMs) {
  const start = Date.now();

  // Call payment service with the remaining budget
  const remaining = timeoutMs - (Date.now() - start);
  if (remaining <= 0) throw new Error('Deadline exceeded');
  const paymentResult = await paymentService.charge(order.amount, {
    timeout: remaining
  });

  // Call shipping service with whatever budget is left
  const remaining2 = timeoutMs - (Date.now() - start);
  if (remaining2 <= 0) throw new Error('Deadline exceeded');
  const shippingResult = await shippingService.ship(order, {
    timeout: remaining2
  });

  return { paymentResult, shippingResult };
}

processOrder(order, 30000); // 30 second total timeout
javascript

Graceful Degradation

What users see when things fail matters. Show helpful messages, not stack traces.

Critical features: If payment processing is down, tell users and retry automatically in the background.

Non-critical features: If recommendations service is slow, show default recommendations instead.

async function getRecommendations(userId) {
  try {
    return await recommendationService.get(userId, { timeout: 1000 });
  } catch (error) {
    logger.error('Recommendations failed:', error);
    return getDefaultRecommendations(userId);
  }
}
javascript

User sees recommendations either way.

Observability of Errors

You can't fix errors you don't see. Observability is critical.

Structured logging: Log errors with context.

logger.error('Payment processing failed', {
  userId,
  amount,
  error: error.message,
  stack: error.stack,
  retryable: error instanceof ExternalServiceError
});
javascript

Metrics: Track error rates by type.

metrics.increment('error.validation');
metrics.increment('error.external_service');
metrics.increment('error.database');
javascript

Monitoring: Alert when error rates spike.

alert.if(metrics.errorRate > 0.05, 'Error rate > 5%');
javascript

FAQ

Should we catch all errors?

No. Catch specific errors you can handle. Let others bubble up.

// Bad - catches everything
try {
  doSomething();
} catch (error) {
  // might be a real bug
}

// Good - catches specific error
try {
  doSomething();
} catch (error) {
  if (error instanceof ValidationError) {
    return badRequest(error.message);
  }
  throw error; // re-throw others
}
javascript

How many retries is reasonable?

Depends on the error type. 2-3 for transient errors. 0 for user errors. Exponential backoff helps.

When should we use circuit breakers?

For integration points with external services (APIs, databases, message queues). Not for internal functions. In microservices architectures, circuit breakers are widely used at service boundaries.

How do we test error handling?

Don't just test the happy path; test error paths explicitly.

it('retries on transient error', async () => {
  let attempts = 0;
  const fn = jest.fn().mockImplementation(() => {
    attempts++;
    if (attempts < 3) throw new Error('transient');
    return 'success';
  });

  const result = await retry(fn, 3);
  expect(result).toBe('success');
  expect(fn).toHaveBeenCalledTimes(3);
});
javascript

How do we handle cascading failures?

Use circuit breakers, bulkheads, and timeouts. Fail fast when dependencies fail.

Should we log all errors?

Log errors you care about. Errors indicate something went wrong and you should know. But don't log expected validation failures as errors; those are normal.
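One way to put that into practice is a single routing function that maps error types to log levels. The error classes below are minimal stand-ins mirroring the hierarchy defined earlier, and `logger` is assumed to be any leveled logger (pino, winston, etc.):

```javascript
// Minimal stand-ins for the hierarchy defined earlier
class ValidationError extends Error {}
class ExternalServiceError extends Error {
  constructor(message, retryable = true) {
    super(message);
    this.retryable = retryable;
  }
}

function logFailure(logger, error, context = {}) {
  if (error instanceof ValidationError) {
    // expected: a user sent bad input; a signal, not an incident
    logger.info('Validation failed', { ...context, reason: error.message });
  } else if (error instanceof ExternalServiceError && error.retryable) {
    // transient: worth watching, but retries usually recover
    logger.warn('Transient dependency failure', { ...context, reason: error.message });
  } else {
    // unexpected: this is what error-level alerting should fire on
    logger.error('Unhandled failure', {
      ...context,
      error: error.message,
      stack: error.stack,
    });
  }
}
```

Routing through one function keeps the error-rate metric meaningful: only genuinely unexpected failures show up at the error level.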

Primary Sources

  • Building Microservices — Sam Newman's guide to designing resilient microservices and error handling
  • Accelerate — Nicole Forsgren's research-backed approach to building reliable systems
  • Google Eng Practices — Google's engineering practices on error handling and system reliability
  • The Pragmatic Programmer — its approach to resilience and error recovery
  • Code Complete — Steve McConnell's guide to error handling and defensive programming
  • Clean Code — Robert Martin's handbook on writing resilient, maintainable code
  • SRE Workbook — Google SRE practices for reliability, resilience, and error handling
