Testing Strategies for Large Systems
Test smart: many unit tests, fewer integration tests, and only a handful of end-to-end tests. Use property-based testing and mutation testing to catch bugs that conventional example-based tests miss. The wrong test mix wastes effort without catching real problems.
Testing at scale is a different beast than testing a small application. When you have a million lines of code with dozens of services, a failed test might be a real bug or a flaky test. Tests that ran in seconds might take minutes. Dependencies become impossible to mock. Coverage metrics become meaningless if tests aren't actually catching bugs.
The testing pyramid is your foundation: many fast unit tests at the base, fewer integration tests in the middle, and a handful of critical end-to-end tests at the top. This structure inverts how many teams actually test—they skip units, write integration tests, and have a massive e2e suite that takes forever. Then they wonder why bugs slip through and deployments are slow.
Why This Matters
Tests are your insurance policy against regression. They're not about achieving a coverage percentage. They're about catching mistakes before they hit production. A system with 50% coverage and the right tests catches more bugs than one with 90% coverage and bad tests.
Speed is critical. If your test suite takes an hour to run, developers stop running it locally. They run a subset, miss problems, and break the build. If your test suite runs in seconds, developers run it constantly. Bugs get caught during development, not in code review or production.
Maintenance burden grows with codebase size. A test written for a simple function is easy to maintain. A test that spans ten services and mocks everything is fragile. Change one service and multiple tests break. You're maintaining tests, not running them.
Isolation prevents cascading failures. When tests are properly isolated, one broken test tells you exactly what's wrong. When tests have hidden dependencies, one failure can mask ten problems. Isolation is a property you design for from the start, not something you add later. This is why SOLID principles matter for testability.
The Testing Pyramid
Unit tests are fast, focused, and numerous. They test a single function or class in isolation. Dependencies are mocked. Database calls are replaced. Network calls don't happen. A unit test runs in milliseconds.
Example unit test:
describe('calculateDiscount', () => {
it('applies percentage discount correctly', () => {
const discount = calculateDiscount(100, 0.1);
expect(discount).toBe(10);
});
it('applies maximum discount cap', () => {
const discount = calculateDiscount(100, 0.9); // 90% requested
expect(discount).toBe(20); // but max is 20%
});
});
Good unit tests are specific about what they test. They test the happy path, the edge cases, and the error cases. A function that handles three scenarios should have three tests minimum.
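The function under test never appears above. A minimal implementation consistent with those two cases might look like this (the 20% cap is an assumption taken from the second test, not a rule from the original):

```javascript
// Hypothetical implementation matching the tests above.
// The 0.2 default cap comes from the "maximum discount" test case.
function calculateDiscount(amount, rate, maxRate = 0.2) {
  const effectiveRate = Math.min(rate, maxRate);
  return amount * effectiveRate;
}

// calculateDiscount(100, 0.1) → 10
// calculateDiscount(100, 0.9) → 20 (capped at 20%)
```

Writing the tests first, as above, forces decisions like the cap to be explicit before the implementation exists.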
Integration tests exercise multiple units working together. A service talks to a real database. An API endpoint processes a request. A payment processor integrates with the payment gateway. These tests are slower but catch integration bugs—things that work in isolation but fail together.
Example integration test:
describe('UserService integration', () => {
it('creates user and indexes in search', async () => {
const user = await userService.createUser({ name: 'Alice' });
const found = await searchService.findUser('Alice');
expect(found.id).toBe(user.id);
});
});
Integration tests use real implementations but test against a test database. They don't need to mock everything—in fact, mocking too much defeats the purpose.
End-to-end tests exercise the entire system. A user interacts with your UI. Data flows through multiple services. The payment gateway processes a real (test) transaction. E2E tests are slow and brittle, so you keep them minimal. They test critical user journeys that would be catastrophic if they broke.
Example e2e flow:
- User navigates to checkout
- Enters credit card
- Completes purchase
- Receives confirmation email
- Order appears in their account
That's one e2e test. You might have five to ten total. Not fifty.
Test Isolation Strategies
Tests fail when dependencies aren't isolated. The classic problem: a test modifies shared state that other tests depend on.
Setup and teardown ensure each test starts clean. Before each test, create fresh data. After each test, clean it up.
describe('OrderService', () => {
let database;
beforeEach(async () => {
database = await createTestDatabase();
});
afterEach(async () => {
await database.clear();
});
it('creates orders', async () => {
const order = await orderService.create({ items: [] });
expect(order.id).toBeDefined();
});
});
Mocking replaces external dependencies with test doubles. A mock doesn't hit the database or call the API. It returns test data you control.
it('charges the gateway on checkout', async () => {
const mockPaymentGateway = {
charge: jest.fn().mockResolvedValue({ success: true, id: 'txn_123' })
};
const service = new CheckoutService(mockPaymentGateway);
await service.checkout({ amount: 100 });
expect(mockPaymentGateway.charge).toHaveBeenCalledWith(100);
});
Use mocks for external systems (APIs, payment gateways, email services). Don't mock everything—mock only the things that make the test slow or unreliable.
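jest.fn() is not magic: a mock is just a function that records its calls and returns canned data. A hand-rolled sketch of the same idea, useful for understanding what the framework does for you:

```javascript
// Minimal stand-in for jest.fn(): records arguments, returns a canned value.
function makeMock(returnValue) {
  const fn = (...args) => {
    fn.calls.push(args);
    return returnValue;
  };
  fn.calls = [];
  return fn;
}

// Usage: inject the mock in place of the real dependency, then inspect fn.calls.
const charge = makeMock(Promise.resolve({ success: true, id: 'txn_123' }));
charge(100);
// charge.calls is now [[100]]
```

Real mocking libraries add matchers, call ordering, and automatic restoration on top of this core idea.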
Fakes are fully functional implementations used only for testing. A fake database that lives in memory. A fake email service that stores emails in a list instead of sending them. Fakes are better than mocks when you need realistic behavior.
class FakeEmailService {
constructor() {
this.sent = [];
}
send(email) {
this.sent.push(email);
return Promise.resolve({ id: 'email_123' });
}
}
Test containers let you run real services in isolated containers for integration tests. Start a real PostgreSQL instance for your test, destroy it when the test finishes. Tools like testcontainers make this practical.
const container = await new PostgreSqlContainer().start();
const db = new Database(container.getConnectionUri());
// test here
await container.stop();
Dealing with External Dependencies
External dependencies—third-party APIs, payment processors, email services—are tricky. You can't make real calls in tests. You can't wait for responses. You need isolation.
Spy on HTTP calls to verify requests without making real requests. A library like nock intercepts HTTP calls and returns fake responses.
nock('https://api.payment.com')
.post('/charge')
.reply(200, { success: true, id: 'txn_123' });
const result = await chargeCard('4111111111111111', 100);
expect(result.success).toBe(true);
Use contracts to formalize what you expect from external services. Consumer-driven contract testing (with a tool like Pact) records your expectations as a contract file and shares it with the service provider, so both sides verify against the same agreement.
describe('PaymentAPI contract', () => {
it('charges card successfully', async () => {
await expect(paymentAPI.charge({
amount: 100,
card: '4111111111111111'
})).resolves.toEqual({
success: true,
transactionId: expect.any(String)
});
});
});
Property-Based Testing
Property-based testing generates test inputs automatically and checks that properties hold across all inputs. Instead of writing ten test cases, you define a property and the framework generates hundreds of test cases.
Example property: "sorting a list should produce a list of the same length with the same elements."
fc.assert(
fc.property(fc.array(fc.integer()), (input) => {
const sorted = quickSort(input);
expect(sorted.length).toBe(input.length);
expect([...sorted].sort((a, b) => a - b)).toEqual([...input].sort((a, b) => a - b));
})
);
The framework generates random arrays, sorts them, and checks the property. If it fails on any input, it shrinks the input to the minimal failing case.
Property-based testing finds edge cases humans miss. It's particularly useful for algorithms and mathematical operations.
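To see how the generate-and-check loop works without any framework, here is a toy version in plain JavaScript. Real tools like fast-check add shrinking and reproducible seeds; this sketch only does random generation, and the generator bounds are arbitrary:

```javascript
// Toy property-based check: generate random inputs, verify a property holds.
function randomArray(maxLen = 20) {
  const len = Math.floor(Math.random() * maxLen);
  return Array.from({ length: len }, () => Math.floor(Math.random() * 200) - 100);
}

function checkProperty(property, runs = 100) {
  for (let i = 0; i < runs; i++) {
    const input = randomArray();
    if (!property(input)) {
      // A real framework would now shrink this to a minimal failing case.
      return { ok: false, counterexample: input };
    }
  }
  return { ok: true };
}

// Property: sorting preserves length.
const result = checkProperty((input) => {
  const sorted = [...input].sort((a, b) => a - b);
  return sorted.length === input.length;
});
```

The value of the framework is everything this sketch omits: shrinking, seed replay, and rich generators for nested data.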
Mutation Testing
Mutation testing verifies that your tests actually catch bugs. It modifies your code (mutates it) and checks whether your tests fail. If a test passes despite a mutation, your test isn't catching that bug.
Example: Your code increments a counter. A mutation changes counter++ to counter--. If your tests pass, they're not actually checking the counter value.
// Original code
counter++;
// Mutation 1
counter--; // tests should catch this
// Mutation 2
// do nothing // tests should catch this too
A good test suite kills most mutations. If your mutation score is 40%, your tests are leaving gaps.
Tools like Stryker generate mutations and run your tests:
npx stryker run
Output shows which mutations survived, revealing test gaps.
Performance Testing
Performance tests catch regressions in speed. They're not unit or integration tests—they're separate and run less frequently.
describe('performance', () => {
it('queries 100k records within 500ms', async () => {
const start = Date.now();
const results = await database.query('SELECT * FROM users LIMIT 100000');
const duration = Date.now() - start;
expect(duration).toBeLessThan(500);
});
});
Performance tests run against realistic data sizes. A query that takes 10ms on 1000 records might take seconds on a million.
Practical Strategies for Large Codebases
Test in layers. Run unit tests on every commit. Run integration tests in pre-merge CI. Run e2e tests nightly. This catches most bugs quickly while keeping developer feedback fast.
Use test sharding. Split your test suite across machines. If you have 10,000 unit tests, run them in parallel across ten machines. Feedback stays under five minutes.
Keep e2e tests minimal. Pick your most critical user journeys. Test those. Don't test every button click through the UI.
Mock at boundaries. Boundaries are where your code meets external systems. Mock there. Don't mock in the middle of your business logic.
Maintain test quality. Review tests like you review code. Flaky tests are worse than no tests. Track test failures and fix flaky ones immediately.
FAQ
What's a good code coverage percentage?
There's no universal answer. 80% is reasonable for most teams. The important metric is whether your tests catch bugs. A codebase with 50% coverage and good tests beats one with 90% coverage and bad tests.
How do we deal with flaky tests?
Flaky tests fail intermittently, usually due to timing issues or external dependencies. Fix them immediately—they're worse than no test. If a test depends on timing, make it deterministic. If it depends on an external service, mock it.
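For timing-dependent code, injecting a clock is the usual fix: production code calls an injected function instead of Date.now() directly, and tests pass a fixed one. A minimal sketch (createOrder is a hypothetical example):

```javascript
// Production code takes a clock parameter so tests can control time.
function createOrder(items, clock = () => Date.now()) {
  return { items, createdAt: clock() };
}

// In a test, pass a fixed clock; the timestamp is deterministic on every run.
const order = createOrder([], () => 1700000000000);
// order.createdAt === 1700000000000
```

The same injection pattern works for random number generators and UUIDs, the other common sources of nondeterminism.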
Should we test private methods?
No. Test the public interface. If you find yourself testing private methods, the method probably belongs in a separate class.
How do we test asynchronous code?
Return promises from tests. The test framework waits for the promise to resolve.
it('fetches data', () => {
return service.fetchData().then(data => {
expect(data).toBeDefined();
});
});
Or use async/await:
it('fetches data', async () => {
const data = await service.fetchData();
expect(data).toBeDefined();
});
What's the difference between mocking and spying?
A mock replaces a function entirely with a test double. A spy wraps a real function and tracks calls but lets it execute.
const mock = jest.fn().mockReturnValue(42);
const spy = jest.spyOn(Math, 'floor');
How often should we refactor tests?
As often as you refactor code. Tests are code. When the codebase changes, tests change too. Outdated tests are misleading.
Primary Sources
- Robert Martin's handbook on writing testable, clean code and test strategies. Clean Code
- Google's testing practices and guidelines for large-scale systems. Google Eng Practices
- The Pragmatic Programmer's approach to testing strategies and quality assurance. Pragmatic Programmer
- Steve McConnell's comprehensive guide to software testing and quality assurance. Code Complete
- John Ousterhout's philosophy on designing testable, modular systems. Philosophy of Design
- Google SRE practices for testing and reliability in production systems. SRE Workbook