
Observability In Distributed Systems

Monitoring tells you something's broken. Observability tells you why. In distributed systems, you need both—logs, metrics, traces—to debug across services you can't see into.

8 min read · Updated March 4, 2026 · Systems Design & Performance

You deploy a service. It works. Then at 3 AM, something goes wrong. Latency spikes. Errors occur. You have no idea why. This is monitoring without observability.

Monitoring answers "is it working?" Observability answers "why isn't it working?" Monitoring tells you something is wrong. Observability helps you find the root cause.

A monolith is somewhat observable by default. You can see logs and add print statements. A distributed system is opaque. A user action spans 10 services. A database query fails in a way that cascades through the system. You need built-in observability or you'll be blind.

The Three Pillars

Observability has three components: logs, metrics, and traces. All three are necessary.

Logs

Logs are the most familiar. Print statements. Event records. "User ID 123 logged in." "Query took 150ms." "Error: database connection timeout."

Structured Logging is critical. Don't log free-form strings. Log structured data so you can query and analyze.

# Bad: a free-form string you can only grep, not query
logging.info(f"User {user_id} logged in from {ip_address} at {timestamp}")

# Good: structured fields via `extra`; a JSON formatter (or a structured
# logging library such as structlog) then makes each field queryable
logging.info("user_login", extra={
    "user_id": user_id,
    "ip_address": ip_address,
    "timestamp": timestamp,
    "region": geoip.lookup(ip_address),
})
Python

Structured logging lets you query: "how many logins from China in the last hour?" Without structure, you're searching text.
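To make that query concrete, here is a minimal in-memory sketch. Real systems run this against a log store (Elasticsearch, Loki, BigQuery); the record fields mirror the structured example above, and the sample data is invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical structured log records, as a log store would return them.
logs = [
    {"event": "user_login", "region": "CN",
     "timestamp": datetime.now(timezone.utc)},
    {"event": "user_login", "region": "US",
     "timestamp": datetime.now(timezone.utc)},
]

# "How many logins from China in the last hour?"
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
logins_from_china = sum(
    1 for rec in logs
    if rec["event"] == "user_login"
    and rec["region"] == "CN"
    and rec["timestamp"] >= cutoff
)
```

With free-form strings, the same question would require a fragile regex over message text.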

Log levels matter. Use them consistently:

  • DEBUG: Low-level information for developers. Query times, cache hits/misses.
  • INFO: High-level events. Login, logout, significant state changes.
  • WARN: Something unexpected but recoverable. A retry, a fallback, high latency.
  • ERROR: Something broke. A query failed, an API returned 500.
  • CRITICAL: System-breaking failures. A database is down, we're out of memory.

Don't log everything at INFO level or you'll be drowned in noise.
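A quick sketch of how the levels behave with Python's standard `logging` module; the logger name and messages are illustrative. With the level set to INFO, DEBUG messages are dropped before they reach any handler.

```python
import logging

# In production the level usually comes from configuration, not code.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payments")
log.setLevel(logging.INFO)  # messages below INFO are dropped

log.debug("query took 4ms (cache hit)")  # suppressed: developer detail
log.info("user_login user_id=123")       # significant event
log.warning("payment call retried")      # unexpected but recoverable
log.error("payment API returned 500")    # something broke
log.critical("database unreachable")     # system-breaking failure
```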

Sampling is important at scale. Logging every request to a busy service generates gigabytes of logs per day. Instead, sample: log 1% of requests, plus all errors.

# Keep a 1% sample of requests, plus every error
if random.random() < 0.01 or error:
    logger.info("request", extra={
        "path": request.path,
        "method": request.method,
        "duration_ms": duration,
        "status": response.status,
    })
Python

Metrics

Metrics are numbers. Request count. Error rate. CPU usage. Latency (p50, p95, p99).

Metrics are aggregated data, not individual events. You don't store a metric for every request (that's a log). You store a counter: "1000 requests in the last minute." "Error rate: 0.5%."

Metrics are cheap to store and fast to query. You can keep them forever and analyze trends over months.

Metric Types:

  • Counters: Only go up. Request count, error count. "How many requests have we served since startup?"
  • Gauges: Go up and down. Current CPU usage, active connections. "What's the CPU right now?"
  • Histograms: Distribution of values. Request latencies. "What's the p99 latency?"
  • Summaries: Like histograms, but quantiles are precomputed in the client instead of at query time.
# Counter
request_count.inc()

# Gauge
active_connections.set(len(connection_pool))

# Histogram
request_duration.observe(duration_ms)
Python

Naming conventions matter for queryability.

http_requests_total{service="api", method="POST", path="/users", status="200"}
http_request_duration_seconds{service="api", method="GET", quantile="0.99"}
database_query_duration_seconds{service="api", query="select_user", quantile="0.95"}
Text
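To show how a metric name plus label key/values map to stored time series, here is a minimal labeled counter. This is a sketch, not a real metrics client; in practice you would use a library such as prometheus_client.

```python
from collections import defaultdict

class LabeledCounter:
    """Minimal labeled counter: one stored value per unique label set."""

    def __init__(self, name: str):
        self.name = name
        self._series = defaultdict(float)

    def inc(self, amount: float = 1, **labels):
        # Sort labels so {"a": 1, "b": 2} and {"b": 2, "a": 1} hit the same series.
        self._series[tuple(sorted(labels.items()))] += amount

    def get(self, **labels) -> float:
        return self._series[tuple(sorted(labels.items()))]

requests_total = LabeledCounter("http_requests_total")
requests_total.inc(service="api", method="POST", path="/users", status="200")
requests_total.inc(service="api", method="POST", path="/users", status="200")
```

Each distinct label combination is its own series, which is why high-cardinality labels (like raw user IDs) blow up metric storage.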

Traces

Traces track a single user action through the system. A user requests /checkout. This spawns:

  1. Auth service validates the token (10ms)
  2. Cart service fetches the cart (50ms)
  3. Inventory service checks stock (100ms)
  4. Order service creates the order (40ms)
  5. Payment service processes payment (200ms)
  6. Notification service sends email (30ms)

Total: 430ms

A trace shows this entire flow. You can see that the Payment service is slow, not the others.

Distributed tracing tools (Jaeger, Zipkin, Honeycomb) collect these traces. Each service emits trace data. A central collector assembles them.

OpenTelemetry is the standard. Instrumentation libraries emit traces. A collector gathers them.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("user_id", user_id)

    with tracer.start_as_current_span("validate_token"):
        # Validation logic
        pass

    with tracer.start_as_current_span("fetch_cart"):
        # Cart logic
        pass
Python

Traces are expensive to store (detailed data, high volume). Sample them: trace 1% of requests, plus all errors.
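A sketch of the sampling decision, with an assumed 1% rate. Keying the decision off the trace ID (rather than a fresh random number per service) means every service in the request path keeps or drops the same trace. Note that "plus all errors" needs tail-based sampling in practice, because a request's error status is only known at the end.

```python
SAMPLE_RATE = 0.01  # keep 1% of traces

def should_sample(trace_id: int, is_error: bool) -> bool:
    # Errors are always kept (requires buffering spans until the
    # outcome is known, i.e. tail-based sampling).
    if is_error:
        return True
    # Deterministic head-based sampling: the same trace_id always
    # produces the same keep/drop decision across services.
    return trace_id % 10_000 < SAMPLE_RATE * 10_000
```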

Alerts and On-Call

Alerting is the handoff from systems to people. When something's wrong, notify the person responsible.

Bad Alerts:

  • Alert on every error. Even healthy systems have occasional errors (network timeouts, user input errors).
  • Alert on low-level metrics. A single slow query isn't alertable. A trend of slow queries is.
  • Alert on everything. Engineers ignore noisy alert systems (alert fatigue).

Good Alerts:

  • Alert on symptoms, not causes. "Error rate is 5%" (symptom) instead of "CPU is 80%" (cause, but might be fine if error rate is still 0.1%).
  • Alert when user experience is affected. "p99 latency > 1 second" affects user experience.
  • Alert with context. Include relevant data. "Error rate spiked from 0.1% to 5% in the last 5 minutes."
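A symptom-based check like the ones above can be sketched as follows; the threshold, window, and runbook path are illustrative assumptions, not a real alerting API.

```python
from typing import Optional

def evaluate_error_rate(errors: int, total: int,
                        threshold: float = 0.05) -> Optional[str]:
    """Fire on the user-visible symptom (error rate), not a cause like CPU."""
    if total == 0:
        return None
    rate = errors / total
    if rate >= threshold:
        # Include context and a runbook so the on-call engineer can act.
        return (f"Error rate {rate:.1%} over the last 5 minutes exceeds "
                f"{threshold:.0%}; see runbook: runbooks/api-errors.md")
    return None
```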

Alert Fatigue: Too many alerts leads to alert fatigue. Engineers stop responding. Critical alerts get ignored.

Combat fatigue with:

  • Few, high-signal alerts
  • Intelligent grouping (related alerts together)
  • Runbooks (what do you do when this alert fires?)
  • Escalation (if the alert isn't acknowledged in 5 minutes, escalate to the manager)

SLIs, SLOs, and SLAs

These terms are often confused.

SLI (Service Level Indicator): A metric that measures service quality.

  • "API latency p99 is 100ms"
  • "Error rate is 0.1%"
  • "Data freshness is < 1 minute"

An SLI is a measurement. It's what you can observe.

SLO (Service Level Objective): A target for the SLI.

  • "API latency p99 should be < 100ms"
  • "Error rate should be < 0.5%"
  • "Data should be fresh within 5 minutes"

An SLO is a goal. It's what you commit to internally.

SLA (Service Level Agreement): A contract with customers about SLOs.

  • "We guarantee 99.9% uptime"
  • "API latency p99 will be < 500ms"

An SLA is a business commitment. If you violate it, customers can get refunds.

Error Budgets: If your SLO is 99.9% uptime, you have an error budget of 43 minutes of downtime per month.

99.9% uptime = 0.1% downtime
0.1% × 43,200 minutes in a 30-day month ≈ 43.2 minutes of allowed downtime

Downtime so far this month: 30 minutes
Remaining budget: ~13.2 minutes
Text

Use error budgets to make deployment decisions. If you have no budget left, don't deploy risky changes. If you have lots of budget, you can take more risks.
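The budget arithmetic above is simple enough to encode directly; this sketch assumes a 30-day month.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def remaining_budget_minutes(uptime_slo: float,
                             downtime_so_far: float) -> float:
    """Downtime we can still 'spend' this month before breaching the SLO."""
    allowed = (1 - uptime_slo) * MINUTES_PER_MONTH
    return allowed - downtime_so_far
```

For a 99.9% SLO with 30 minutes already used, roughly 13 minutes of budget remain.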

Dashboards

Dashboards visualize metrics. They answer "is the system healthy?" at a glance.

Good Dashboards:

  • Show high-level health first. Is the system up? Is latency acceptable?
  • Surface trends. Is error rate increasing? Is memory usage growing?
  • Provide drill-down. Click on a spike to see details.
  • Are actionable. Show metrics that lead to decisions.

Bad Dashboards:

  • Show every metric (overwhelming)
  • Focus on low-level details (CPU, disk, memory) instead of user impact
  • Don't provide drill-down
  • Change layout so often that week-over-week trends become hard to follow

A good dashboard for an API:

  1. Overview: Uptime, request rate, error rate, p99 latency
  2. Trends: Latency over time, error rate over time
  3. Breakdown: Latency by endpoint, errors by reason
  4. Resources: CPU, memory, connections (drill-down only)

Debugging with Observability

When something goes wrong, observability helps you find the cause.

Steps:

  1. Check alerts. What triggered?
  2. Check dashboards. What was the system doing?
  3. Check recent changes. What deployed recently?
  4. Look at traces. Which service is slow?
  5. Look at logs. What errors are happening?
  6. Analyze metrics. When did the problem start?

With logs, metrics, and traces, you can reconstruct what happened. Without them, you're guessing.

Observability in Production

You can't test everything. Some bugs only appear under load. Observability finds them.

Canary Deployments: Deploy to a small percentage of users. Monitor metrics. If error rate spikes, rollback immediately.
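The rollback decision can be sketched as a comparison between the canary's metrics and the stable fleet's; the tolerance and noise floor here are illustrative assumptions, not values from any particular tool.

```python
def canary_is_healthy(canary_error_rate: float,
                      baseline_error_rate: float,
                      tolerance: float = 2.0,
                      noise_floor: float = 0.001) -> bool:
    """Roll back when the canary errors noticeably more than the baseline.

    The noise floor stops tiny canary sample sizes (where one failed
    request dominates the rate) from triggering spurious rollbacks.
    """
    return canary_error_rate <= max(baseline_error_rate * tolerance,
                                    noise_floor)
```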

Feature Flags: Deploy code but disable features. Gradually enable for more users.

Synthetic Monitoring: Automated tests that check critical paths. "Every 5 minutes, test checkout flow." Alerts if it fails.

Real User Monitoring (RUM): Collect metrics from real users. Where are they slow? Where do they have errors? This data is more valuable than synthetic tests.

AI-Generated Code and Observability

Code generators tend to produce code without observability. No logging. No metrics. No traces. The generated code works, but you can't debug it in production.

Bitloops helps by generating code with observability built-in. Every data fetch is logged and traced. Errors are captured with context. The system is observable by default.

Frequently Asked Questions

Should I log everything?

No. Log important events (errors, state changes, API calls). Sample high-volume events. Avoid logging sensitive data (passwords, payment info).

How much tracing overhead is acceptable?

Tracing adds latency and CPU. Start with 1% sampling. If performance impact is acceptable, increase. If not, stay at 1%.

What metrics should I track?

Start with: request count, error rate, latency (p50, p95, p99), resource usage (CPU, memory, connections). Add domain-specific metrics later.

When should I alert?

When something affects users. High latency affects users. One slow query doesn't. Alert on user-impacting metrics, not low-level signals.

How do I reduce alert fatigue?

Alert on high-signal metrics only. Use thresholds that rarely trigger when everything is normal. Improve alert precision by correlating signals.

Should I monitor internal services differently?

Yes. External services carry SLAs and direct user expectations, so they warrant stricter alerting. Internal services can run with looser targets, but you should still track any of their metrics that feed into customer-facing SLOs.


Get Started with Bitloops.

Apply what you learn in these hubs to real AI-assisted delivery workflows with shared context, traceable reasoning, and architecture-aware engineering practices.

curl -sSL https://bitloops.com/install.sh | bash