What Is Context Engineering? The Discipline Behind Effective AI Coding Agents
Context engineering is how you assemble and deliver codebase knowledge to AI agents so they produce correct code on the first attempt. Without it, agents operate blind—violating conventions, breaking dependencies, and duplicating work. It's the difference between smart agents and lucky ones.
Why This Matters Now
Every AI coding agent — Claude Code, Cursor, Gemini CLI, GitHub Copilot — does some form of codebase analysis. They scan files, grep for symbols, and try to pull relevant information into the context window. That's not nothing. But it's also not enough.
The analysis is shallow and session-scoped. The agent builds a rough snapshot of the files it can see, infers relationships from naming patterns, and starts generating. It doesn't have a real dependency graph. It doesn't have persistent memory of previous changes and reasoning. It doesn't know that calculateShipping() was refactored three weeks ago to fix a production bug, or that the team agreed not to introduce direct database access from the presentation layer. Every session starts from a blank slate.
The consequences are predictable. Multi-file changes look plausible but routinely break things the agent was never meant to touch. Refactors miss downstream consumers because the dependency map was inferred, not computed. Code review gets slower because reviewers have to reconstruct the context the agent never had. The same mistakes recur because nothing from previous sessions carries forward.
Think about what a senior engineer does before writing a single line of code. They don't open the target file and start typing. They load a mental picture of the surrounding system first — which modules depend on this component, what changed in this area recently and why, whether there was a production issue here six months ago that introduced a constraint the code doesn't explain. That accumulated understanding, built over months, is what separates a senior engineer from a capable junior on an unfamiliar task.
A language model has none of that. No matter how capable, every session begins the same way: scan the files in the context window, infer what you can from patterns, start generating. Structurally, every agent starts every session as a junior engineer on their first day.
Context engineering exists to close that gap.
The Two Types of Context
Codebase context splits into two fundamentally different categories. Confusing them — or treating them as interchangeable — is one of the most common mistakes teams make when building context systems.
Structural Context
Structural context is the algorithmic, deterministic understanding of what the code is and how it connects. What does this symbol depend on? What depends on it? Where does it sit in the module hierarchy? What are the cross-file relationships?
The best way to get this is Abstract Syntax Tree (AST) parsing — not pattern matching, not embedding similarity. AST analysis gives you precise dependency graphs, symbol definitions, scope hierarchies, and call chains. It's verifiable and reproducible.
Here's a critical design decision that most teams get wrong: structural context should be computed on demand, not pre-stored. Codebases change constantly. A pre-built structural index goes stale the moment someone commits. On-the-fly AST parsing means the structural picture always reflects the current state of the code at the exact moment the agent needs it. Yes, there's a computation cost at call time — but that cost is justified precisely because it occurs when the information is needed and when accuracy matters most.
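As a concrete illustration, here is a minimal sketch of on-demand structural analysis using Python's built-in ast module. The sample source and function names are invented for the example; a production system would also resolve methods, imports, and cross-file references.

```python
import ast


def call_graph(source: str) -> dict[str, set[str]]:
    """Compute a fresh call map from source text: function name -> names it calls.

    Parsed on demand, so the result always reflects the code as it is right now,
    never a stale pre-built index.
    """
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = {
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            }
    return graph


# Illustrative source text, as an agent's tooling might read it from disk.
source = """
def calculate_tax(total): ...

def calculate_shipping(order):
    return calculate_tax(order) + 5
"""
print(call_graph(source))
```

Because the graph is recomputed from the current source at call time, a commit that renames or removes a function is reflected in the very next query.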
Semantic Context
Semantic context is the accumulated understanding of what the code means within the system — not its syntactic definition, but its role, purpose, and history. Why was this function written this way? What domain concept does this symbol represent? What patterns does this team use when working in this area? What was tried and rejected the last time this code was modified?
You can't compute semantic context from the code alone. It requires persistent storage — a knowledge store that captures reasoning, decisions, and usage patterns over time and makes them retrievable. Where structural context is always current by computation, semantic context is always rich by accumulation.
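A minimal sketch of what such a store might look like, assuming a simple JSON file keyed by symbol name. This is illustrative only; a real knowledge store would version entries, attach timestamps and authorship, and support richer retrieval.

```python
import json
from pathlib import Path


class KnowledgeStore:
    """Minimal persistent store: semantic notes accumulate per symbol across sessions."""

    def __init__(self, path: str):
        self.path = Path(path)
        # Reload whatever previous sessions recorded; start empty otherwise.
        self.notes = json.loads(self.path.read_text()) if self.path.exists() else {}

    def record(self, symbol: str, note: str) -> None:
        """Append a reasoning note for a symbol and persist immediately."""
        self.notes.setdefault(symbol, []).append(note)
        self.path.write_text(json.dumps(self.notes, indent=2))

    def recall(self, symbol: str) -> list[str]:
        """Return every note ever recorded for this symbol."""
        return self.notes.get(symbol, [])


store = KnowledgeStore("knowledge.json")
store.record("calculateShipping", "Refactored to fix rounding bug in production.")
print(store.recall("calculateShipping"))
```

The important property is the asymmetry with structural context: nothing here is computed from the code, and everything here survives the end of the session.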
The two types complement each other. Structural context tells the agent what a symbol is and how it connects. Semantic context tells it what it means and why it exists. An agent with only structural context produces code that's syntactically valid but potentially incoherent with the team's intentions. An agent with only semantic context produces code that's conceptually right but may misunderstand the current state of dependencies. You need both.
The Loading Strategy
Here's where most teams go wrong. They treat context as a single thing that should be maximised. One big instructions file. One comprehensive system prompt. Everything loaded into every session.
This works about as well as keeping your entire reference library piled on your desk: wasteful, cluttered, expensive. Your PR review checklist is sitting in context while the agent is debugging a memory leak. Every irrelevant token loaded is a token unavailable for the actual work.
Your context window is precious. Use it wisely.
Context engineering requires a loading strategy — a deliberate design for what loads when, and at what cost. Different knowledge should load differently.
Always-Loaded Context
The structural picture of the codebase: key symbols, dependency relationships, architectural patterns, project conventions. Present every session because it's always relevant. High cost — it consumes tokens on every interaction — but justified because it applies to every task.
The question this tier answers: "What does the agent need regardless of what it's working on?" Keep it lean. If something is only relevant to specific features, it doesn't belong here.
On-Demand Context
The decision and conversation history for a specific file, feature, or module. A small description sits in the always-loaded index so the agent knows the context exists. The full detail — previous conversations, reasoning traces, past decisions, bug history — loads only when the agent is actually working on that code.
The question: "What does the agent need about this specific area?" Low cost by design: the expensive knowledge is only paged in when needed.
Delegated Context
Heavy analytical work that would eat the agent's primary context window if done inline. A sub-agent or isolated process handles the analysis — scanning a large set of files, comparing implementation patterns across modules, analysing test coverage — and returns only a structured summary to the main session.
The question: "What analysis is too expensive to do in the main conversation?" Near-zero cost to the primary session because only the conclusion enters the context window, not the work.
Automated Context
Context captured without model involvement at all. A post-commit hook fires on every commit, recording what changed, which files were touched, and which symbols were modified. This runs outside the model's context entirely — no token cost, no model invocation. Deterministic by design.
The question: "What should be recorded without ever entering the model's context?" Zero cost to the agent. The value accumulates over time as the knowledge store grows.
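As a sketch of that deterministic capture step, the function below parses the output of `git show --name-only --pretty=format:%H|%s` into a structured record that a post-commit hook could append to the knowledge store. The record shape is an assumption for illustration, not a prescribed schema.

```python
def parse_commit_record(raw: str) -> dict:
    """Turn `git show --name-only --pretty=format:%H|%s` output into a
    structured record: hash, subject line, and the files the commit touched.

    No model invocation anywhere; this is deterministic text processing,
    so it costs zero tokens.
    """
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    commit_hash, _, subject = lines[0].partition("|")
    return {"hash": commit_hash, "subject": subject, "files": lines[1:]}


# Illustrative output of the git command above.
raw = """a1b2c3d|Fix rounding in shipping calculation
src/shipping.ts
src/shipping.test.ts"""
print(parse_commit_record(raw))
```

A hook installed at `.git/hooks/post-commit` would run this on every commit and write the record to the knowledge store, growing the semantic layer without the agent ever being involved.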
The Design Principle
The right information, at the right time, at the right cost.
If context is always loaded but rarely relevant, it's in the wrong tier. If context is never loaded but frequently needed, the agent is operating blind. Every context decision should be evaluated against this standard.
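One way to make the tier rules concrete is a small loader predicate. The tier names follow the sections above; the item shape and file sets are invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    ALWAYS = "always"        # conventions, key dependencies: every session
    ON_DEMAND = "on_demand"  # per-area history: paged in when relevant
    DELEGATED = "delegated"  # heavy analysis: only summaries enter context
    AUTOMATED = "automated"  # hook-captured records: never enter context directly


@dataclass
class ContextItem:
    name: str
    tier: Tier
    relevant_files: set = field(default_factory=set)


def should_load(item: ContextItem, task_files: set) -> bool:
    """Apply the tier rules: always-loaded items load unconditionally;
    on-demand items load only when the task touches their files;
    delegated and automated items never enter the main window directly."""
    if item.tier is Tier.ALWAYS:
        return True
    if item.tier is Tier.ON_DEMAND:
        return bool(item.relevant_files & task_files)
    return False


conventions = ContextItem("team-conventions", Tier.ALWAYS)
history = ContextItem("shipping-history", Tier.ON_DEMAND, {"src/shipping.ts"})

print(should_load(conventions, {"src/auth.ts"}))  # always relevant, loads anyway
print(should_load(history, {"src/auth.ts"}))      # task doesn't touch its files
```

The predicate is trivial on purpose: the engineering effort goes into assigning items to the right tier, not into the loading logic itself.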
How Context Delivery Works in Practice
The mechanics follow a consistent pattern:
1. Task Identification. When an agent begins work, the system identifies which parts of the codebase are relevant. This isn't keyword search — it's a structured determination based on the files being modified, their dependencies, and the nature of the change.
2. Context Assembly. The system assembles a context bundle from multiple sources: the structural graph (computed on demand via AST parsing), the relevant semantic history (retrieved from the knowledge store), and any always-loaded conventions or constraints. This bundle is ranked by relevance — recency, proximity to the task, similarity to past decisions.
3. Token Budget Management. The assembled context gets pruned and packed to fit within the agent's available token budget. This is where context ranking and token budgeting become critical. Every piece of context is ranked by expected value to the current task, and the system cuts from the bottom. Maximum signal density within the available space.
4. Delivery. One tool call. One structured payload. Everything the agent needs to begin work with genuine codebase understanding. No multiple round-trips, no iterative retrieval loops.
The agent receives what a senior engineer would carry into the same task: what the code is, what it depends on, why it was built the way it was, and what constraints apply.
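The budgeting step above can be sketched as a greedy rank-and-pack over the token budget. The candidates, scores, and token counts below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    tokens: int
    score: float  # expected value to the current task


def pack_context(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Greedy rank-and-pack: sort by expected value and cut from the bottom
    once the token budget is exhausted."""
    packed, used = [], 0
    for c in sorted(candidates, key=lambda c: c.score, reverse=True):
        if used + c.tokens <= budget:
            packed.append(c)
            used += c.tokens
    return packed


bundle = [
    Candidate("dependency graph for shipping module", 400, 0.9),
    Candidate("PR review checklist", 600, 0.1),
    Candidate("last refactor's reasoning trace", 300, 0.8),
]
chosen = pack_context(bundle, budget=800)
print([c.text for c in chosen])
```

Note what falls out: the low-value checklist is cut even though it would fit on its own, because two higher-value items consume the budget first. That is the "cut from the bottom" behaviour in miniature.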
Context Engineering vs. RAG
Retrieval-Augmented Generation (RAG) is a well-established pattern: embed documents, store them in a vector database, retrieve similar chunks at query time, inject them into the prompt. RAG works well for knowledge bases, documentation search, and question-answering systems.
Context engineering for codebases needs more than this. The differences are structural.
RAG retrieves text chunks by similarity. Context engineering retrieves structured knowledge — dependency graphs, symbol definitions, decision histories, constraint sets — that has internal relationships and hierarchy. A vector similarity search can't reliably surface the fact that Function A depends on Function B which was refactored last week to fix a production bug. That requires a graph, not an embedding.
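The graph-versus-embedding point can be made concrete in a few lines. Finding every transitive downstream consumer of a symbol is a reverse-graph walk, not a similarity query; the dependency map below is invented for illustration.

```python
def downstream_consumers(deps: dict[str, set[str]], target: str) -> set[str]:
    """Walk the reverse dependency graph to find every symbol that transitively
    depends on `target`: exactly the set a refactor can break, and exactly the
    set that embedding similarity cannot reliably surface."""
    reverse: dict[str, set[str]] = {}
    for symbol, uses in deps.items():
        for used in uses:
            reverse.setdefault(used, set()).add(symbol)

    found, stack = set(), [target]
    while stack:
        for consumer in reverse.get(stack.pop(), set()):
            if consumer not in found:
                found.add(consumer)
                stack.append(consumer)
    return found


# symbol -> symbols it depends on (illustrative graph)
deps = {
    "checkout": {"calculateShipping"},
    "calculateShipping": {"calculateTax"},
    "invoicePdf": {"checkout"},
}
print(downstream_consumers(deps, "calculateTax"))
```

A vector search for text similar to `calculateTax` might surface `calculateShipping`, but it has no principled way to return `invoicePdf`, which shares no vocabulary with the target yet breaks if the target's behaviour changes.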
RAG treats all retrieved content as equal. Context engineering ranks and stratifies by type, relevance tier, and cost. Structural context is computed fresh. Semantic context is retrieved from persistent storage. Always-loaded context is always present. On-demand context is paged in based on the task. These are fundamentally different retrieval mechanisms serving different purposes.
RAG operates at query time. Context engineering operates across the full lifecycle: capture at commit time, indexing at install time, structural computation at call time, delivery at task time. It's not a retrieval step — it's a system design.
Vector retrieval absolutely has a role, particularly for semantic similarity search within the knowledge store. But treating codebase context engineering as "just RAG" misses the structural, temporal, and hierarchical dimensions that make it work.
Common Pitfalls
Loading everything into every session. This is the prompt-stuffing trap. One big markdown file with all the rules, all the conventions, all the patterns. Avoiding context overload requires tier-based loading. Context engineering is about precision, not volume.
Relying on embedding similarity alone. Vector similarity finds content that's textually related. It doesn't reliably find content that's structurally related — upstream dependencies, downstream consumers, cross-module constraints. If you're doing dependency analysis with embeddings, you're guessing. Structural context requires structural retrieval.
Pre-building static indexes for structural context. A structural index of the codebase is stale the moment someone commits. On-demand AST parsing costs more at call time but is always accurate. For structural context, freshness is non-negotiable. This is the one place where you genuinely can't trade freshness for speed.
Ignoring the capture side. Context engineering isn't just about delivering context to agents. It's equally about capturing the reasoning and decisions that agents produce. Without capture, the knowledge store never grows, and every future session starts from scratch. This is a loop, not a pipeline — and most teams only build the delivery half.
Treating context as a prompt engineering problem. Better prompts can't compensate for missing context. If the agent doesn't know that a function was refactored last week to fix a critical production bug, no amount of prompt optimisation will prevent it from reverting that fix. Context engineering addresses the knowledge gap. Prompt engineering addresses the instruction gap. Different problems, different solutions.
Building delivery without ranking. Retrieving relevant information is necessary but not sufficient. The retrieved context has to be ranked by expected value to the current task and packed efficiently into the available budget. Without ranking, the window fills with marginally relevant information while critical knowledge gets truncated.
The Bigger Picture
Context engineering is the foundational discipline for AI-native software development. Every other concern — governance, quality enforcement, code review, team scaling — depends on agents having sufficient context to produce coherent output in the first place. You can't govern what you can't understand, and you can't understand AI-generated code without the context that produced it.
The discipline is emerging now because AI coding tools have crossed from assistants to agents. When a tool suggests a single line of code, the consequences of missing context are minor — you spot the issue, you fix it. When a tool autonomously plans and executes multi-file changes across your architecture, the consequences of missing context are serious. The shift from suggestion to agency is what makes context engineering a requirement, not a nice-to-have.
Tools like Bitloops address this by sitting between any AI coding agent and the codebase — assembling structured context bundles before agents work and capturing the full reasoning trace back into a persistent knowledge store after every commit. This is context engineering implemented as infrastructure: agent-agnostic, local-first, and designed to compound in value with every interaction.
Frequently Asked Questions
What's the difference between context engineering and prompt engineering?
Prompt engineering is about how you phrase instructions — the wording, structure, and framing of what you ask the model to do. Context engineering is about what knowledge accompanies those instructions — dependency relationships, historical decisions, domain constraints, usage patterns. Think of it this way: prompt engineering determines the quality of the question. Context engineering determines the quality of the evidence the model reasons over.
Why can't AI coding agents just read the whole codebase?
Context windows are finite. Even the largest models can't fit a meaningful codebase into a single prompt. But more importantly, raw code files without structural analysis, historical context, or semantic understanding produce shallow pattern matching, not genuine comprehension. It's the difference between giving someone a phone book and giving them a map. Context engineering delivers the right subset of codebase knowledge in a structured, ranked format.
How is context engineering different from RAG?
RAG retrieves text chunks by embedding similarity. Context engineering retrieves structured knowledge — dependency graphs, symbol definitions, decision histories — using multiple retrieval mechanisms (AST parsing, vector similarity, graph traversal) stratified by type and cost tier. RAG is one component that can serve the semantic retrieval layer, but it can't replace structural analysis or temporal context capture.
What is structural context vs. semantic context?
Structural context is the algorithmic understanding of what code is and how it connects: dependency graphs, symbol definitions, call chains, module hierarchies. Computed on demand via AST parsing, so it's always fresh. Semantic context is the accumulated understanding of what code means: its purpose, the reasoning behind it, the domain concepts it represents, how it's evolved. Stored persistently and retrieved from a knowledge store. You need both.
How does context engineering relate to AI code governance?
Governance depends on context. You can't meaningfully review AI-generated code without understanding the reasoning and constraints that produced it. Context engineering provides both the capture side — recording why decisions were made — and the delivery side — surfacing that reasoning during review. Without it, code review becomes reviewing diffs without intent. That's an act of faith, not a review process.
What does a context loading strategy look like?
Four tiers, each with different cost and relevance: always-loaded (present every session — conventions, key dependencies), on-demand (loaded when working on specific code — decision history, past reasoning), delegated (heavy analysis by sub-processes — only summaries enter the main session), and automated (captured deterministically outside the model — post-commit hooks, zero token cost).
Can you do context engineering without persistent storage?
Partially. You can compute structural context on demand without any storage. But semantic context — the reasoning behind past decisions, the patterns the team has reinforced, the constraints that emerged from previous work — requires persistent storage. Without it, every session starts from scratch. Persistent storage is what makes context engineering compound over time rather than reset every session.
How does context engineering affect token costs?
Done well, it reduces total token consumption. Without a loading strategy, agents waste tokens on irrelevant context in every session and spend additional tokens on rework caused by missing context. A stratified loading strategy delivers high-signal context at minimum cost — avoiding both the waste of over-loading and the rework cost of under-loading. Most teams find that the token savings from fewer retries and less rework more than offset the cost of context delivery.
Primary Sources
- RAG Paper: Combines retrieval systems with sequence-to-sequence models to access external knowledge for question answering.
- ReAct: Demonstrates how LLMs can interleave reasoning traces with tool-calling actions for complex task solving.
- Model Context Protocol: Official protocol specification for exposing tools and resources to AI agents through standardized interfaces.
- Anthropic Tool Use: Anthropic's guide to function calling, demonstrating how models can invoke external tools for extended capabilities.
- Attention Is All You Need: Core transformer architecture using attention mechanisms, foundational to all modern large language models.
- Lost in the Middle: Analysis of how transformer-based models attend to information positioned differently in long contexts.
Get Started with Bitloops.
Apply what you learn in these hubs to real AI-assisted delivery workflows with shared context, traceable reasoning, and architecture-aware engineering practices.
curl -sSL https://bitloops.com/install.sh | bash