Bitloops - Git captures what changed. Bitloops captures why.

Error Handling and Resilience Patterns

Systems fail: networks time out, services crash, data gets corrupted. Good error handling keeps you running when parts break. Retry patterns, circuit breakers, and bulkheads stop cascading failures and keep users from seeing 500 errors.

8 min read · Updated March 4, 2026 · Engineering Best Practices

Every system experiences failures. Networks time out. Services crash. Databases lock. Third-party APIs go down. These are the realities of distributed systems. The difference between a robust system and a fragile one is how it handles these failures.

Error handling has two levels. First, inside your code: what do you do when something goes wrong? Second, across your system: how do you keep running when components fail? Both matter.

Why This Matters

System reliability depends on it. A system where failures cascade (one component dies, which kills another, which kills another) is brittle. A system that isolates failures and degrades gracefully stays running.

User experience depends on it. A system that shows "Internal Server Error" is worse than one that shows "Payment processing is slow right now, please try again in a moment." The first is hostile. The second is helpful.

Cost depends on it. Every minute of downtime costs money and damages trust. Good error handling and resilience patterns prevent cascading failures and reduce downtime.

Error Handling Philosophy

The central tension is fail fast vs. fail gracefully.

Fail fast: when something is wrong, stop immediately and raise an error. This is good for detecting bugs early. Bad data is immediately visible.

function createUser(data) {
  if (!data.email) throw new ValidationError('Email required');
  if (!data.name) throw new ValidationError('Name required');
  // only continue if valid
  return database.insert(data);
}
javascript

Fail gracefully: when something is wrong, try to continue with degraded functionality. This is good for user experience. Non-critical features can fail without breaking everything.

async function getUser(userId) {
  try {
    const user = await database.query('SELECT * FROM users WHERE id = $1', [userId]);
    return user;
  } catch (error) {
    logger.error('Database error fetching user:', error);
    return { id: userId, name: 'Unknown User', email: null };
  }
}
javascript

Use fail-fast for validation and core logic. Use fail-gracefully for non-critical features and integrations.

Exception Hierarchies

Organize exceptions by type. This lets callers handle different errors differently.

class ApplicationError extends Error {
  constructor(message, statusCode = 500) {
    super(message);
    this.name = 'ApplicationError';
    this.statusCode = statusCode;
  }
}

class ValidationError extends ApplicationError {
  constructor(message) {
    super(message, 400);
    this.name = 'ValidationError';
  }
}

class NotFoundError extends ApplicationError {
  constructor(message) {
    super(message, 404);
    this.name = 'NotFoundError';
  }
}

class PermissionError extends ApplicationError {
  constructor(message) {
    super(message, 403);
    this.name = 'PermissionError';
  }
}

class ExternalServiceError extends ApplicationError {
  constructor(message, retryable = true) {
    super(message, 503);
    this.name = 'ExternalServiceError';
    this.retryable = retryable;
  }
}
javascript

Now callers can handle each type appropriately:

try {
  const user = await getUser(userId);
} catch (error) {
  if (error instanceof ValidationError) {
    res.status(400).json({ error: error.message });
  } else if (error instanceof NotFoundError) {
    res.status(404).json({ error: error.message });
  } else if (error instanceof ExternalServiceError && error.retryable) {
    // retry later
  } else {
    res.status(500).json({ error: 'Internal error' });
  }
}
javascript

Error Boundaries

Error boundaries isolate failures. If one component crashes, it doesn't crash the entire application.

In React, use error boundaries:

class ErrorBoundary extends React.Component {
  state = { hasError: false };

  static getDerivedStateFromError() {
    // update state so the next render shows the fallback UI
    return { hasError: true };
  }

  componentDidCatch(error, errorInfo) {
    logger.error('Component error:', error, errorInfo);
  }

  render() {
    if (this.state.hasError) {
      return <div>Something went wrong. Please refresh.</div>;
    }
    return this.props.children;
  }
}

// Use it
<ErrorBoundary>
  <UserProfile userId={123} />
</ErrorBoundary>
javascript

If UserProfile crashes, the error boundary catches it and prevents the whole app from crashing.

Retry Patterns

Some failures are temporary: network timeouts, a service briefly down, rate limiting. Retrying often succeeds.

Simple retry: Retry immediately, a few times.

async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      // try again
    }
  }
}
javascript

This is too simple. The service might be overloaded. Retrying immediately makes it worse.

Exponential backoff: Wait longer between each retry.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.pow(2, i) * 100; // 100ms, 200ms, 400ms, ...
      await sleep(delay);
    }
  }
}
javascript

Exponential backoff with jitter: Add randomness to prevent thundering herd.

async function fetchWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const baseDelay = Math.pow(2, i) * 100;
      const jitter = Math.random() * baseDelay;
      const delay = baseDelay + jitter;
      await sleep(delay);
    }
  }
}
javascript

When many clients retry simultaneously, they might all retry at the same time (thundering herd). Jitter spreads them out.

Only retry on retryable errors (timeout, 5xx). Don't retry on validation errors (4xx) because retrying won't help.
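A sketch of that classification, assuming errors expose either an HTTP `status` or a Node-style network `code` (adjust to whatever your HTTP client actually throws):

```javascript
// Decide whether a failure is worth retrying. Network-level failures and
// 5xx/429 responses are usually transient; 4xx client errors will fail again.
function isRetryable(error) {
  if (error.code === 'ECONNRESET' || error.code === 'ETIMEDOUT') return true;
  if (error.status === 429) return true; // rate limited: back off and retry
  if (error.status >= 500) return true;  // server-side, often transient
  return false;                          // 4xx and everything else: don't retry
}

async function fetchWithRetry(url, maxRetries = 3) {
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url);
    } catch (error) {
      if (!isRetryable(error) || i === maxRetries - 1) throw error;
      await sleep(Math.pow(2, i) * 100 * (1 + Math.random())); // backoff + jitter
    }
  }
}
```

The classification belongs in one place so every retry loop in the codebase agrees on what "retryable" means.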

Circuit Breaker

A circuit breaker prevents cascading failures. If a service is down, don't keep calling it. Stop and fail fast.

class CircuitBreaker {
  constructor(threshold = 5, timeout = 60000) {
    this.failureCount = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.state = 'CLOSED'; // CLOSED, OPEN, or HALF_OPEN
    this.lastFailureTime = null;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime > this.timeout) {
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    // a failed probe in HALF_OPEN reopens immediately; otherwise open at threshold
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.threshold) {
      this.state = 'OPEN';
    }
  }
}

// Usage
const paymentBreaker = new CircuitBreaker(5, 60000);

async function processPayment(amount) {
  return paymentBreaker.call(() => paymentGateway.charge(amount));
}
javascript

States:

  • CLOSED: Normal operation, requests go through
  • OPEN: Too many failures, requests fail immediately without calling the service
  • HALF_OPEN: Timeout expired, try a request. If it succeeds, close. If it fails, reopen.

This prevents hammering a failing service.
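An open breaker still throws; pairing it with a fallback decides what callers see instead. A minimal sketch, assuming the `breaker` argument exposes `call(fn)` like the class above (the rates-API names are hypothetical):

```javascript
// Wrap a breaker-protected call with a fallback so callers never see the
// raw "Circuit breaker is OPEN" error.
async function withFallback(breaker, fn, fallback) {
  try {
    return await breaker.call(fn);
  } catch (error) {
    // fast failure from an open breaker, or a real error from fn
    return fallback(error);
  }
}

// Hypothetical usage: serve cached exchange rates while the live API is down
// const rates = await withFallback(
//   ratesBreaker,
//   () => ratesApi.fetchLatest(),
//   () => getCachedRates()
// );
```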

Bulkhead Pattern

Isolate critical resources. If one feature consumes all database connections, other features fail. Bulkheads partition resources.

class ConnectionPool {
  constructor(size = 10) {
    this.size = size;
    this.connections = [];
    this.available = size;
  }
  // getConnection() waits for a free connection; releaseConnection()
  // returns one to the pool (implementations omitted for brevity)
}

// Create separate pools for different features
const paymentConnections = new ConnectionPool(5);
const reportingConnections = new ConnectionPool(3);
const defaultConnections = new ConnectionPool(10);

async function executeQuery(query, pool = defaultConnections) {
  const conn = await pool.getConnection();
  try {
    return await conn.execute(query);
  } finally {
    pool.releaseConnection(conn);
  }
}
javascript

Now if payment queries consume all 5 connections, reporting still has 3 and default still has 10.
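The same idea works without a real connection pool: a counting semaphore caps how much concurrency each feature may consume. A minimal sketch, assuming nothing beyond standard promises:

```javascript
// At most `size` callers run concurrently; the rest wait in a FIFO queue.
class Bulkhead {
  constructor(size) {
    this.available = size;
    this.waiters = [];
  }

  async run(fn) {
    if (this.available > 0) {
      this.available--;
    } else {
      // no free slot: wait until a running task hands one over
      await new Promise((resolve) => this.waiters.push(resolve));
    }
    try {
      return await fn();
    } finally {
      const next = this.waiters.shift();
      if (next) next();        // pass the slot directly to a waiter
      else this.available++;   // or return it to the pool
    }
  }
}

// Hypothetical usage: reporting can never starve payments of concurrency
// const reportingBulkhead = new Bulkhead(3);
// reportingBulkhead.run(() => runMonthlyReport());
```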

Timeout Strategies

Never make a request without a timeout. Requests without timeouts can hang forever.

async function fetchWithTimeout(url, timeout = 5000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeout);

  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timeoutId);
  }
}
javascript

Cascading timeouts: If A calls B calls C, set timeouts so they compose properly.

async function processOrder(order, timeoutMs) {
  const start = Date.now();

  // Call payment service with the remaining budget
  const remaining = timeoutMs - (Date.now() - start);
  if (remaining <= 0) throw new Error('Deadline exceeded');
  const paymentResult = await paymentService.charge(order.amount, {
    timeout: remaining
  });

  // Call shipping service with whatever budget is left
  const remaining2 = timeoutMs - (Date.now() - start);
  if (remaining2 <= 0) throw new Error('Deadline exceeded');
  const shippingResult = await shippingService.ship(order, {
    timeout: remaining2
  });

  return { paymentResult, shippingResult };
}

processOrder(order, 30000); // 30 second total timeout
javascript

Graceful Degradation

What users see when things fail matters. Show helpful messages, not stack traces.

Critical features: If payment processing is down, tell users and retry automatically in the background.

Non-critical features: If recommendations service is slow, show default recommendations instead.

async function getRecommendations(userId) {
  try {
    return await recommendationService.get(userId, { timeout: 1000 });
  } catch (error) {
    logger.error('Recommendations failed:', error);
    return getDefaultRecommendations(userId);
  }
}
javascript

User sees recommendations either way.

Observability of Errors

You can't fix errors you don't see. Observability is critical.

Structured logging: Log errors with context.

logger.error('Payment processing failed', {
  userId,
  amount,
  error: error.message,
  stack: error.stack,
  retryable: error instanceof ExternalServiceError
});
javascript

Metrics: Track error rates by type.

metrics.increment('error.validation');
metrics.increment('error.external_service');
metrics.increment('error.database');
javascript

Monitoring: Alert when error rates spike.

alert.if(metrics.errorRate > 0.05, 'Error rate > 5%');
javascript

FAQ

Should we catch all errors?

No. Catch specific errors you can handle. Let others bubble up.

// Bad - catches everything
try {
  doSomething();
} catch (error) {
  // might be a real bug
}

// Good - catches specific error
try {
  doSomething();
} catch (error) {
  if (error instanceof ValidationError) {
    return badRequest(error.message);
  }
  throw error; // re-throw others
}
javascript

How many retries is reasonable?

Depends on the error type. 2-3 for transient errors. 0 for user errors. Exponential backoff helps.

When should we use circuit breakers?

For integration points with external services (APIs, databases, message queues). Not for internal functions. In microservices architectures, circuit breakers are widely used at service boundaries.

How do we test error handling?

Don't just test the happy path; test error paths explicitly.

it('retries on transient error', async () => {
  let attempts = 0;
  const fn = jest.fn().mockImplementation(() => {
    attempts++;
    if (attempts < 3) throw new Error('transient');
    return 'success';
  });

  const result = await retry(fn, 3);
  expect(result).toBe('success');
  expect(fn).toHaveBeenCalledTimes(3);
});
javascript

How do we handle cascading failures?

Use circuit breakers, bulkheads, and timeouts. Fail fast when dependencies fail.

Should we log all errors?

Log errors you care about. Errors indicate something went wrong and you should know. But don't log expected validation failures as errors; those are normal.
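One way to put that into practice is a single routing function that maps error types to log levels. The error classes below are minimal stand-ins mirroring the hierarchy defined earlier, and `logger` is assumed to be any leveled logger (pino, winston, etc.):

```javascript
// Minimal stand-ins for the hierarchy defined earlier
class ValidationError extends Error {}
class ExternalServiceError extends Error {
  constructor(message, retryable = true) {
    super(message);
    this.retryable = retryable;
  }
}

function logFailure(logger, error, context = {}) {
  if (error instanceof ValidationError) {
    // expected: a user sent bad input; a signal, not an incident
    logger.info('Validation failed', { ...context, reason: error.message });
  } else if (error instanceof ExternalServiceError && error.retryable) {
    // transient: worth watching, but retries usually recover
    logger.warn('Transient dependency failure', { ...context, reason: error.message });
  } else {
    // unexpected: this is what error-level alerting should fire on
    logger.error('Unhandled failure', {
      ...context,
      error: error.message,
      stack: error.stack,
    });
  }
}
```

Routing through one function keeps the error-rate metric meaningful: only genuinely unexpected failures show up at the error level.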

Primary Sources

  • Building Microservices — Sam Newman's guide to designing resilient microservices and error handling
  • Accelerate — Nicole Forsgren's research-backed approach to building reliable systems
  • Google Eng Practices — Google's engineering practices on error handling and system reliability
  • The Pragmatic Programmer — its approach to resilience and error recovery
  • Code Complete — Steve McConnell's guide to error handling and defensive programming
  • Clean Code — Robert Martin's handbook on writing resilient, maintainable code
  • SRE Workbook — Google SRE practices for reliability, resilience, and error handling
