Context Windows vs External Memory: When to Keep Knowledge In-Context
Context windows are expensive and finite. Some knowledge always matters (load it once). Some matters rarely (fetch on demand). Learn which is which, and you'll build agents that are cheaper, faster, and way less likely to hallucinate.
A context window is your model's working memory — a finite token budget where everything the model reasons about must fit simultaneously. Once you exceed that limit, information gets discarded, and the model can't see it anymore. The fundamental question isn't "how big should my context window be?" but rather "what knowledge actually needs to live in there?"
This distinction matters because context windows have three hard constraints: they cost tokens (money), they degrade attention over distance (the model doesn't reason as well about distant content), and they're finite (no matter how large, you'll eventually hit the ceiling). External memory — databases, vector stores, file systems, APIs — solves the size problem but introduces retrieval latency and requires the agent to know when to fetch.
Most teams make this backwards. They build a RAG pipeline because it's trendy, stuff everything into a vector database, and then wonder why their agents are slower and less accurate than if they'd just kept the critical stuff in context.
Why This Distinction Matters
The choice between in-context and external memory isn't about capacity. It's about the structure of your reasoning task. This mirrors the relationship between capturing semantic context (which requires persistent storage) and computing structural context on-demand.
Some knowledge is always needed. When you're building a code agent that modifies a codebase, certain architectural patterns, naming conventions, and design decisions are relevant to almost every decision the agent makes. If the agent has to fetch those patterns from a database every time it considers touching a file, you've optimized for the wrong thing. You pay latency costs, you introduce failure modes (what if the retrieval fails or returns stale data?), and you waste tokens on repeated context gathering.
Other knowledge is sometimes needed, in specific contexts. A particular function's implementation matters only when you're modifying that function or functions that call it. A test file is relevant only when you're debugging a failing test. These are high-value retrieval candidates — you fetch them when triggered by specific signals, not preemptively.
And some knowledge is rarely needed and huge. Dependency documentation, historical commit logs, archived feature branches — these are retrieval-only territory. Loading them into context wastes your token budget for almost no benefit.
The cost structure is different too. In-context knowledge is paid upfront: you load it once, and every token used in the context window, whether the model reads it or not, costs the same. External retrieval is paid per access: you pay when you fetch, not when you load. This matters. If you need to access something 50% of the time, keeping it in context might be cheaper than retrieving it every time. If you need it 2% of the time, retrieval wins decisively.
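That break-even logic is easy to sketch in code. The function below compares the per-request cost of always-loading a piece of knowledge against retrieving it on demand; the prices and the `retrieval_overhead_tokens` parameter (modeling tool-call framing, wrong guesses, and retries) are illustrative assumptions, not measured values.

```python
def cost_per_request(tokens: int, price_per_1k: float) -> float:
    """Input-token cost of placing `tokens` in the prompt once."""
    return tokens / 1000 * price_per_1k

def compare_strategies(knowledge_tokens: int, access_rate: float,
                       retrieval_overhead_tokens: int,
                       price_per_1k: float = 0.003) -> tuple[float, float]:
    """Per-request cost of (always in-context, retrieve on demand).

    Retrieval pays only when the knowledge is actually needed, but
    each fetch carries overhead tokens on top of the content itself.
    """
    in_context = cost_per_request(knowledge_tokens, price_per_1k)
    retrieval = access_rate * cost_per_request(
        knowledge_tokens + retrieval_overhead_tokens, price_per_1k)
    return in_context, retrieval
```

With a 2,000-token document needed 2% of the time, retrieval comes out far cheaper; push the access rate toward 90% with heavy per-fetch overhead and the ordering flips.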
What a Context Window Actually Is
Under the hood, a context window is the maximum sequence length a transformer model can process. Every input — instructions, context, conversation history, examples — gets tokenized into individual tokens. The model's attention mechanism can reference any token in the window, but can't reference anything outside it.
When you're at position 10,000 in a 128K token window, the model can still theoretically attend to token 1. In practice, it doesn't attend equally. The well-documented "lost in the middle" problem describes this: models attend poorly to information in the middle of long contexts, performing worse than when that same information is placed at the beginning or end. This isn't a bug in your prompt; it's a property of how attention is trained and deployed.
This matters because it changes what "fitting in context" actually means. Sure, 128K tokens fits a lot of code. But if the model attends poorly to most of it, you've paid the token cost without getting the reasoning benefit. You'd be better off with 32K of high-signal context than 128K of diluted noise.
The context window also isn't free storage. Every token in the window is processed by the model's attention mechanism, and full attention scales quadratically with sequence length: a 128K context implies on the order of 128K-squared attention operations during inference (optimized attention variants reduce this, but it remains superlinear). That's not just a dollar cost; it's a latency cost. Bigger windows are slower windows. If you're building interactive agents, this matters.
Token cost has two components: input cost (you pay for every token you load) and output cost (you pay for every token the model generates, usually at a higher rate). Input cost scales directly with context size. Either way, context decisions are cost decisions.
In-Context Learning vs Retrieval
In-context learning means the model learns patterns from examples provided in the prompt. Show it one example of how to format XML, and it internalizes the pattern. Show it three examples, and it's more confident. This is powerful because the model doesn't need to have seen this pattern during training — it can learn it from your specific examples.
This works because the model's weights don't change; the examples just guide its reasoning about this specific task. You're not fine-tuning. You're not modifying the model. You're shaping how the model uses its general knowledge for your specific problem.
Retrieval-based memory means you store information in a system outside the model and fetch it when needed. Vector databases, file systems, API calls — these are all retrieval systems. The model doesn't learn from them in the in-context learning sense. Instead, the model decides it needs information, makes a tool call, and incorporates the retrieved content into its reasoning.
The tradeoff: in-context learning is immediate and doesn't require tool calls, but it's limited by window size and attention degradation. Retrieval scales to arbitrary amounts of data and lets you keep information updated, but it requires the agent to know when to fetch and introduces failure modes.
Most teams miss the hybrid approach. You don't have to choose. You can keep lightweight summaries, architectural overviews, and decision trees in context, then have the agent fetch full details on demand. An index could live in context: "Request ID mapping is in /docs/api.md lines 45-60. Module dependency tree is in /architecture.txt. Test utilities are in /test/helpers.ts." The agent uses the index to decide when full retrieval is worth the cost.
When to Keep Knowledge In-Context
Put knowledge in-context if it's:
Small enough to fit: This seems obvious, but it's worth stating explicitly. If the knowledge is 50K tokens and your context window is 128K, you have room. If it's 90K tokens, you don't. Measure this. Don't estimate.
Frequently referenced: The more often you need it, the less the retrieval cost amortizes. If every action needs the architectural patterns, context loading wins. If one action in a hundred needs specific legacy code patterns, retrieval wins.
Relevant to almost all decisions: System design docs, naming conventions, module boundaries, the coding style guide. These shape every decision the agent makes. They belong in context.
Slow or unreliable to retrieve: If your retrieval system has latency, is prone to failures, or returns inconsistent results, that's a reason to keep more in context. The certainty of having it right there matters.
Requires cross-referencing: If the agent needs to reason about how multiple pieces relate to each other, having them all in context avoids round-trip retrieval costs. "How does this function interact with that module?" is answerable immediately if both are loaded.
Stable and unlikely to change: Configuration that shifts frequently should be retrieved. Configuration that's static can live in context.
Examples for a code agent:
- The project's architectural structure: in-context
- The coding standards and naming conventions: in-context
- The list of files in the project: in-context (or a lightweight index)
- The actual implementation of every function: external retrieval
- The test suite structure: in-context
- The contents of individual test files: external retrieval
When to Keep Knowledge External
Store knowledge externally if it's:
Large and mostly irrelevant: A codebase with 500 files of which you typically touch 5-10. Keep those 5-10 in context when working on them. Keep the other 485 in external storage.
Frequently updated: If configuration changes hourly or test fixtures are regenerated, retrieval ensures you get fresh data. In-context means stale data from the moment you load it.
Specific to particular tasks: Domain-specific knowledge that's not relevant to most decisions should be fetched when needed. The agent asks for it, gets it, uses it, and moves on.
Expensive to tokenize: Some data formats are token-inefficient. A large JSON object with deep nesting uses more tokens than the information density would suggest. If you only need parts of it sometimes, retrieval with filtering is cheaper.
Has temporal dimensions: Historical data, logs, version history. Current code belongs in context. Last year's code belongs in external storage, retrieved for migration or archaeology tasks.
Would degrade attention: If loading it pushes other important context into the "lost in the middle" zone, it's not worth it. Better to have less total context with good attention than more context with poor attention.
Examples for a code agent:
- Old versions of functions: external
- Dependencies and their full documentation: external (just keep a reference in context)
- Historical test runs and their outputs: external
- The full content of every file in the project: external (selective loading only)
- Team communication and decision records: external or selective loading
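The two checklists above can be folded into a rough placement heuristic. This is a sketch with illustrative thresholds, not tuned values:

```python
def placement(tokens: int, access_rate: float,
              change_freq_per_day: float, window_budget: int) -> str:
    """Rough placement decision for one piece of knowledge.

    All thresholds here are assumptions for illustration; tune them
    against your own task mix and token budget.
    """
    if tokens > window_budget:
        return "external"            # doesn't fit at all
    if change_freq_per_day > 1:
        return "external"            # stale the moment you load it
    if access_rate > 0.5 and tokens < window_budget // 10:
        return "in-context"          # small and constantly referenced
    if access_rate < 0.05:
        return "external"            # rarely needed: retrieval wins
    return "index-in-context"        # keep a summary, fetch details
```

A 200K-token dump never fits a 128K window; a small style guide referenced on most actions stays loaded; hourly-changing fixtures get retrieved fresh.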
The Hybrid Approach: Indexes in Context
Most effective memory systems don't choose. They put lightweight indexes in context and full content in external storage.
An index is a summary or reference that fits in token budget:
[project structure index]
Core modules:
- api/: Request handling, HTTP layer
- database/: Data models and queries
- auth/: Authentication and authorization
- utils/: Shared utilities
File reference:
- /src/api/handlers.ts - HTTP request routing (lines 1-150)
- /src/auth/tokens.ts - JWT validation (lines 1-80)
- /src/database/models.ts - Data schemas (lines 1-200)
Architectural decision: Handlers delegate to service layer
Service layer handles business logic, returns plain objects
Database layer handles persistence
[end index]

The agent uses the index to navigate: "I need to understand JWT validation. That's in /src/auth/tokens.ts lines 1-80." The agent then fetches exactly that range. No wasted tokens on content it doesn't need.
This works because human reasoning works the same way. You don't keep the entire codebase in your head. You know where things are, and you fetch them when needed: you grep a file or flip to the right doc. An agent does the same thing with a tool call, paying network latency instead of a glance.
The hybrid approach requires good tooling. Your retrieval system needs to support:
- Precise range fetching (lines 10-20 of a file, not the whole file)
- Semantic search (find functions that implement a pattern, not just keyword matching)
- Cross-cutting retrieval (find all references to a function, not just its definition)
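A minimal sketch of the first requirement, precise range fetching. The function name and signature are assumptions for illustration, not a standard tool API:

```python
from pathlib import Path

def fetch_range(path: str, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive) of a file.

    The agent pays tokens only for the slice it actually needs,
    never the whole file.
    """
    lines = Path(path).read_text().splitlines()
    return "\n".join(lines[start - 1:end])
```

Exposed as a tool, this is what lets the agent act on an index entry like "JWT validation: /src/auth/tokens.ts lines 1-80" without loading the rest of the file.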
Token Cost Analysis
Let's get concrete about cost.
A modern large context model costs roughly:
- Input: $0.001 to $0.003 per 1K tokens (varies by provider and model)
- Output: $0.003 to $0.015 per 1K tokens
Source code tokenizes at roughly 3-4 characters per token, so a 2,000-token file is on the order of 6,000-8,000 characters of code. Measure with your actual tokenizer rather than estimating.
Loading a codebase into context:
- 50 files × 2,000 tokens = 100K tokens loaded
- At $0.003 per 1K input tokens = $0.30 per request
- If you make 100 requests over a day = $30
Retrieving on demand:
- Retrieve 5 files average × 2,000 tokens each = 10K tokens per request
- At $0.003 per 1K input tokens = $0.03 per request
- If you make 100 requests over a day = $3
But wait. The on-demand retrieval cost assumes the agent knows exactly which files to fetch and fetches efficiently. If the agent guesses wrong and fetches the wrong files, it wastes tokens. If the agent makes multiple round-trips to gather context, costs multiply.
The real comparison is:
- Retrieval: tokens per fetch × number of retrieval round-trips × wrong-guess penalty, plus tool-call latency
- In-context: the cost of carrying the large always-loaded window on every single request
If your agent is good at deciding what to fetch (high signal/low waste) and your retrieval system is fast (low latency overhead), retrieval wins. If your agent makes poor decisions or your retrieval system is slow, in-context wins.
Most teams underestimate the retrieval decision cost. An agent that makes 10 tool calls to gather context for a task that could have been pre-loaded wastes money and time. An agent that loads a 500K token context window to do a 50K token task wastes money and attention.
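That calculation can be made concrete. A minimal sketch, assuming wrong guesses force repeat fetches and reusing the illustrative prices from above:

```python
def retrieval_cost(tokens_per_fetch: int, round_trips: int,
                   wrong_guess_rate: float,
                   price_per_1k: float = 0.003) -> float:
    """Expected input-token cost of gathering context via retrieval.

    A wrong guess is modeled as a wasted fetch that must be repeated,
    so the effective fetch count is inflated by 1 / (1 - wrong_guess_rate).
    Parameter names and the model itself are illustrative assumptions.
    """
    effective_fetches = round_trips / (1 - wrong_guess_rate)
    return effective_fetches * tokens_per_fetch / 1000 * price_per_1k
```

Five 2,000-token fetches with a 20% wrong-guess rate cost about $0.0375 in input tokens, still an order of magnitude below the $0.30 of pre-loading 100K tokens, but the gap narrows fast as guesses get worse.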
The Growing Problem: Windows vs Codebases
Context windows are getting larger. Claude 3.5 offers 200K. Some providers offer 1M-token windows. This seems to solve the problem.
It doesn't.
Codebases grow faster than context windows. A moderately complex project has:
- 100K+ lines of application code
- 50K+ lines of test code
- 30K+ lines of configuration
- 20K+ lines of documentation
- Dependencies, generated code, build artifacts
That's 200K+ lines easily. At roughly 8-12 tokens per line of code, you're in the 1.6M-2.4M token range. A 1M-token window sounds huge until you realize it doesn't even hold half of it.
And you can't load just the code. You need context about what you're doing (task description, examples, reasoning steps, conversation history). You need error messages, stack traces, test outputs. The tokens actually available for domain knowledge shrink as the task gets more complex.
Moreover, the "lost in the middle" problem doesn't disappear with larger windows. It gets worse. A 1M-token window with poor attention to the middle 500K tokens is worse than a 128K-token window where everything sits in high-attention zones.
The right framing: context windows got bigger, so you have more tokens to work with. But the problem isn't going away. You still can't fit everything. You still need to make choices about what to load. You still pay attention degradation costs. The window size is a constraint that's becoming less binding, but it's not becoming irrelevant.
The "Just Make the Window Bigger" Fallacy
Some teams assume the solution to context problems is to buy a bigger window. Use the 200K model instead of the 128K model. Use the 1M-token model instead of the 200K model.
This trades one problem for another. Bigger windows mean:
- Slower inference: Larger context = more attention operations = slower response times. If your agent needs to respond in 2 seconds, a 1M-token window might not fit the latency budget.
- Higher cost: Every token costs. A 500K context window is more expensive than a 100K one, even if you use smaller models.
- Worse reasoning at distance: The attention degradation problem means your model reasons worse about content in the middle of huge contexts.
- Harder debugging: When outputs are wrong, was it because of missing context, or because the model didn't attend to relevant context that was there?
The real solution is proportional to your actual problem:
- If your issue is capacity (too much important knowledge), bigger windows help.
- If your issue is signal degradation (good knowledge buried in noise), retrieval helps more.
- If your issue is cost (too many tokens), better selection of what to load helps.
- If your issue is latency (models too slow), smaller contexts help.
Most teams have multiple issues at once. Throwing a bigger window at all of them is like trying to fix a car that runs too hot and costs too much to fill by buying a bigger gas tank. It addresses one part of the problem while making others worse.
Designing Your Memory Architecture
Start with this framework:
Tier 1: Always-loaded — System prompt, task description, architectural overview, naming conventions, critical patterns. This fits in a small token budget (10K-20K tokens) and is relevant to almost every decision.
Tier 2: On-demand per task — When the agent starts a specific task (e.g., "implement feature X"), load the files and modules directly involved. This is selective but complete. Maybe 20K-40K tokens depending on task scope.
Tier 3: Triggered retrieval — When the agent encounters something it doesn't understand (unfamiliar symbol, cross-file dependency), it fetches that specific context. Lightweight, targeted, reactive.
Tier 4: API/system access — When knowledge doesn't live in your codebase (external APIs, real-time data), the agent calls APIs or tools. Cost is tool-call latency, not context tokens.
This tiered approach means your context loading is a design question, not a throughput question. You design what goes where based on relevance and frequency, not based on "how much can I fit?"
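The four tiers can be described as data, with a loader that assembles the prompt from the always-loaded tiers and hands the on-demand tiers to the agent's tool registry. A minimal sketch; the `Tier` type, its field names, and the ~4 characters-per-token cap are all assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    budget_tokens: int
    load: Callable[[], str]   # produces the tier's text
    on_demand: bool = False   # tiers 3-4: fetched via tool calls later

def build_context(tiers: list[Tier]) -> tuple[str, list[Tier]]:
    """Concatenate always-loaded tiers into one prompt block and
    return on-demand tiers for later, triggered retrieval."""
    loaded, deferred = [], []
    for tier in tiers:
        if tier.on_demand:
            deferred.append(tier)
        else:
            # crude cap: ~4 chars per token keeps each tier in budget
            loaded.append(tier.load()[: tier.budget_tokens * 4])
    return "\n\n".join(loaded), deferred
```

The point of the shape is that what loads eagerly versus lazily is declared up front, as a design decision, rather than scattered through retrieval code.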
This tiered approach also connects to how agents decide when to fetch additional context and how memory systems persist knowledge for long-term learning. Bitloops' Context Engineering approach supports this by giving you primitives for describing context tiers and rules for managing transitions between them. Rather than hand-rolling retrieval logic, you describe "when the agent encounters an unfamiliar module identifier, fetch its definition from the module index," and the system handles the fetching, caching, and token accounting.
Common Pitfalls
Pitfall 1: Everything-in-context-because-we-can
Just because you have a 200K-token window doesn't mean you should use all of it. A 50K-token focused context often outperforms a 150K-token diluted context. Token budget is a design constraint, not a reason to be lazy.
Pitfall 2: Treating retrieval as the answer
Retrieval systems aren't magic. A slow or unreliable retrieval system is worse than no retrieval system. Before building a vector database, ask: what's the actual problem? Is it speed? Is it token cost? Is it relevance? Different problems have different solutions.
Pitfall 3: Ignoring retrieval decision cost
Every tool call costs. If your agent calls a retrieval tool five times to gather context that could have been pre-loaded, you've wasted five tool-call rounds. Tool calls have latency. They fail sometimes. Design with that cost in mind.
Pitfall 4: Keeping stale in-context knowledge
If you load context once per session and the code changes, the agent has stale knowledge. It might make decisions based on outdated information. Either refresh the context when relevant code changes, or use retrieval to stay fresh. You can't have it both ways.
Pitfall 5: Perfect retrieval seeking
Some teams build elaborate retrieval systems that perfectly rank documents by relevance. Then they realize their agent doesn't use the ranking efficiently. It still fetches the top result and gets stuck if that's not the right one. Perfect retrieval systems solve the wrong problem. Good-enough retrieval systems that are fast and reliable are more valuable.
Pitfall 6: Token-counting theater
Some teams obsess over token counts without understanding attention dynamics. "We cut context from 100K to 80K tokens" sounds good until the agent's reasoning actually gets worse because the 20K you cut was in high-attention positions. Measure what matters: accuracy and speed, not token count.
FAQ
How do I know if my context is too big?
Measure your model's performance on tasks where you vary context size. If performance plateaus and then degrades as you add more context, you've found your optimal window. If output latency is unacceptable, your context is too big. If token costs are growing faster than accuracy is improving, your context is too big.
Should I use vector embeddings or keyword search for retrieval?
Depends on what you're retrieving. For code, semantic search (embeddings) is often worse than keyword search because code is syntactically precise. "Find functions that handle authentication" works better as a keyword search for "authenticate" than as an embedding search. For documentation and prose, embeddings work well. Consider hybrid approaches that use both.
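One common way to combine both without choosing is reciprocal rank fusion, which merges a keyword-ranked list and an embedding-ranked list in a few lines. The constant k=60 is the value commonly used for RRF; everything else here is illustrative:

```python
def rrf_merge(keyword_ranked: list[str], semantic_ranked: list[str],
              k: int = 60) -> list[str]:
    """Merge two ranked result lists with reciprocal rank fusion:
    score(doc) = sum over lists of 1 / (k + rank). Documents that
    rank well in either list surface near the top of the merge."""
    scores: dict[str, float] = {}
    for ranked in (keyword_ranked, semantic_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A file that tops both the keyword hits for "authenticate" and the semantic hits for "login flow" wins the merged ranking, which is usually the behavior you want for code retrieval.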
How often should I refresh in-context knowledge?
If the domain is static (architectural patterns, style guides), refresh rarely or never. If the domain is dynamic (file structure, code content), refresh before important tasks. Some systems refresh per-task, some per-request, some per-decision. Measure the staleness-cost tradeoff.
Can I use the same context window size for all tasks?
No. Some tasks need more context (understanding a complex module before refactoring it) and some need less (writing a small utility function). Design context loading per-task-type, not globally.
What's a reasonable token budget for in-context knowledge?
A reasonable baseline: 20% for system prompt and task description, 60% for domain knowledge, 20% for reasoning and output. Adjust based on your task. More complex tasks need more reasoning tokens. More knowledge-heavy domains need more context tokens.
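That baseline is easy to turn into hard numbers for a given window. The 20/60/20 split is this article's suggested starting point, not a standard:

```python
def split_budget(window_tokens: int, system: float = 0.2,
                 knowledge: float = 0.6,
                 reasoning: float = 0.2) -> dict[str, int]:
    """Divide a context window into system / knowledge / reasoning
    budgets using the baseline fractions suggested above."""
    assert abs(system + knowledge + reasoning - 1.0) < 1e-9, \
        "fractions must sum to 1"
    return {
        "system": round(window_tokens * system),
        "knowledge": round(window_tokens * knowledge),
        "reasoning": round(window_tokens * reasoning),
    }
```

For a 128K window that yields roughly 25.6K for the system prompt and task, 76.8K for domain knowledge, and 25.6K for reasoning and output.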
How do I measure whether retrieval is helping?
Compare task success rates with retrieval vs without. Compare token costs. Compare latency. A retrieval system that improves accuracy but doubles latency might not be a win for your use case. A system that reduces token cost by 50% but reduces accuracy by 10% might not be a win either. You're optimizing a multi-variable system.
Should I retrieve context proactively (before the agent asks) or reactively (when needed)?
Proactively if you can predict what's needed (structured tasks with consistent context requirements). Reactively if the path is unpredictable (exploratory tasks, debugging). Hybrid is often best: pre-load likely context, retrieve on-demand when surprised.
Primary Sources
- Foundational work on transformer architecture showing how attention mechanisms process sequential inputs. Attention Is All You Need
- Empirical study demonstrating models attend poorly to information positioned in middle of long contexts. Lost in the Middle
- Combines retrieval systems with generation for improved performance on knowledge-intensive tasks. RAG Paper
- Framework for interleaving reasoning traces with tool-calling actions for complex task solving. ReAct
- Tree-based prompting enabling exploration of multiple reasoning paths for difficult problems. Tree of Thoughts
- Reference documentation for implementing retrieval patterns in language model applications. LangChain Retrieval