Context Ranking and Token Budgeting
You have more context than fits in the window. Context ranking determines which pieces matter most—using signals like recency, proximity, and semantic similarity—then packs them efficiently into your token budget. It's how you get agents to succeed with less, not more.
You have more relevant context than fits in the window. Your codebase is 50,000 files. The user asks "refactor the auth module." Everything is potentially relevant—the auth module itself, tests, dependencies, examples, architecture docs, conventions, security policies. But you have maybe 100,000 tokens to work with. What gets in?
This is the context ranking problem, and it's where agents either succeed or fail. Get the ranking wrong and the agent produces garbage despite having "enough context." Get it right and the agent produces excellent work using less than half the available budget.
Token budgeting isn't just optimization. It's a fundamental architectural problem that affects agent quality, cost, latency, and user experience. This article teaches you how to think about it.
The Core Problem: Value Per Token
The fundamental question is: which context has the highest value per token?
Not "which context is most relevant?" That's easy—fetch everything relevant.
Not "which context is largest?" That's the wrong signal.
The real question is: "which context, if included, most improves the agent's ability to complete this task?"
This is an information theory problem. You're optimizing signal-to-noise. Each token you include costs money and potentially adds noise (distracting the model from important information). Each token you exclude costs nothing but might cause the agent to miss a critical constraint.
The goal is to maximize the probability that the agent succeeds with a fixed token budget.
Context Ranking Signals
There are multiple signals that indicate whether context is valuable for a task. A sophisticated ranking system combines these signals into a single relevance score.
Signal 1: Recency
How recently was this context modified or accessed?
Scoring:
- Modified in the last commit: 100 points
- Modified in the last week: 80 points
- Modified in the last month: 60 points
- Modified in the last 6 months: 40 points
- Hasn't changed in over 6 months: 20 points
Why it works: Recently modified code is more likely to be relevant. If someone changed the auth module last week and you're now asked to refactor it, that's no coincidence.
Edge case: Don't weight recency too heavily. Old code can be critical. The HTTP library might not change for years but it's essential for any network request.
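A minimal sketch of this tiering. The `in_last_commit` flag and the last-modified timestamp are assumed inputs your VCS layer would supply:

```python
from datetime import datetime, timedelta

def recency_score(last_modified, now=None, in_last_commit=False):
    """Map a file's modification time onto the tiered scores above.
    `in_last_commit` is a hypothetical flag from your VCS layer."""
    if in_last_commit:
        return 100
    now = now or datetime.now()
    age = now - last_modified
    if age <= timedelta(days=7):
        return 80   # modified in the last week
    if age <= timedelta(days=30):
        return 60   # last month
    if age <= timedelta(days=180):
        return 40   # last 6 months
    return 20       # older than 6 months
```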
Signal 2: Proximity to Task
How close is this context to what the user asked for?
Scoring (for a task about "auth module"):
- Exact match (auth module itself): 100 points
- Direct dependency (uses auth module): 80 points
- Reverse dependency (imported by auth module): 70 points
- Same package/module: 60 points
- One level removed: 40 points
- Two+ levels removed: 20 points
Why it works: The auth module itself is obviously relevant. Code that calls the auth module is probably relevant. Code that the auth module calls might be relevant. Code that imports the auth module three levels up is probably not.
How to calculate: Use the dependency graph. If you don't have one, build one by parsing imports.
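One way to compute this over a parsed import graph. The `deps` mapping (module to the modules it imports) is an assumed structure; the "same package" tier is omitted here because it needs package metadata the graph alone doesn't carry:

```python
from collections import deque

def proximity_score(target, item, deps):
    """deps: module -> set of modules it imports (assumed structure
    built by parsing import statements)."""
    if item == target:
        return 100                        # exact match
    if target in deps.get(item, set()):
        return 80                         # direct dependency: item uses target
    if item in deps.get(target, set()):
        return 70                         # reverse dependency: target imports item
    # Walk the graph undirected to find how many levels removed the item is
    neighbors = {}
    for mod, imported in deps.items():
        for other in imported:
            neighbors.setdefault(mod, set()).add(other)
            neighbors.setdefault(other, set()).add(mod)
    seen, queue = {target}, deque([(target, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == item:
            return 40 if dist == 2 else 20   # one level removed vs two+
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 20                             # unreachable: treat as far away
```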
Signal 3: Structural Distance
How far away is this in the codebase structure?
Scoring (for editing src/auth/login.js):
- Same directory: 100 points
- Same package (src/auth/): 90 points
- Same parent (src/): 70 points
- Related package (src/models/): 50 points
- Different package: 30 points
- Config/infrastructure: 40 points
Why it works: Code physically near the target file is more likely to interact with it, follow the same conventions, and be part of the same logical unit.
How to calculate: Count directory traversals. The fewer traversals needed, the higher the score.
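A sketch of hop counting over slash-separated paths. The band mapping is an assumption; note that hop count alone can't separate "same parent" from "related package", so both land in the same band, and the config/infrastructure special case would need a filename check on top:

```python
def traversals(a, b):
    """Directory hops between the folders of two slash-separated paths."""
    dir_a, dir_b = a.split("/")[:-1], b.split("/")[:-1]
    common = 0
    for x, y in zip(dir_a, dir_b):
        if x != y:
            break
        common += 1
    return (len(dir_a) - common) + (len(dir_b) - common)

def structural_score(target, item):
    # Assumed banding of hop counts onto the tiers above
    bands = {0: 100, 1: 90, 2: 70, 3: 50}
    return bands.get(traversals(target, item), 30)
```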
Signal 4: Semantic Similarity
How similar is the context to the task semantically? This connects to how agents learn meaning from their environment through semantic context.
This requires semantic search (usually embedding-based):
Example: Task is "add OAuth2 authentication"
- Existing OAuth2 implementation: 95% similarity
- Other auth methods: 80% similarity
- Token handling code: 70% similarity
- User model: 60% similarity
- HTTP utilities: 40% similarity
- Database migrations: 30% similarity
Why it works: If you're implementing OAuth2, studying an existing OAuth2 implementation teaches you more than a generic auth pattern.
How to implement: Use embeddings (OpenAI, Anthropic, or open-source embedding models). Embed the task description and all context. Rank by cosine similarity.
Cost: Embedding everything is expensive upfront, but you do it once and cache the results.
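Once the vectors exist, ranking is just a cosine-similarity sort. A dependency-free sketch, assuming the embeddings are already computed and cached:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_similarity(task_vector, context_vectors):
    """context_vectors: {name: embedding}; highest similarity first."""
    scored = [(name, cosine_similarity(task_vector, vec))
              for name, vec in context_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```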
Signal 5: Historical Value
How often has this context been useful for similar past tasks?
Scoring (based on past usage):
- Used in 100% of similar tasks: 90 points
- Used in 80% of similar tasks: 80 points
- Used in 50% of similar tasks: 60 points
- Used in 20% of similar tasks: 40 points
- Rarely used: 20 points
Why it works: If your team always fetches the style guide when working on auth, the style guide is valuable. If the style guide is rarely fetched for other tasks, it's less valuable.
How to implement: Log what context is fetched for each task and its outcome. After 100 tasks, measure which context was most useful. Use that to weight future decisions.
This is your feedback loop. It's how your system learns.
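That feedback loop can be sketched in a few lines. The log schema and the mapping from usage rate to the tiers above are assumptions:

```python
from collections import defaultdict

class ContextUsageLog:
    """Record which context items were fetched for successful tasks of a
    given type, then score future candidates by their usage rate."""
    def __init__(self):
        self.tasks = defaultdict(int)   # task_type -> successful task count
        self.uses = defaultdict(int)    # (task_type, item) -> times fetched

    def record(self, task_type, items_fetched, succeeded):
        if not succeeded:
            return                      # only learn from successes in this sketch
        self.tasks[task_type] += 1
        for item in items_fetched:
            self.uses[(task_type, item)] += 1

    def historical_score(self, task_type, item):
        total = self.tasks[task_type]
        if total == 0:
            return 20                   # no data: neutral-low default
        rate = self.uses[(task_type, item)] / total
        # Assumed mapping of usage rate onto the tiers above
        if rate >= 1.0: return 90
        if rate >= 0.8: return 80
        if rate >= 0.5: return 60
        if rate >= 0.2: return 40
        return 20
```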
Signal 6: Constraint or Pattern Density
How much constraint or pattern information is in this context?
Scoring:
- Configuration file that defines behavior: 95 points
- Code with unusual error handling: 85 points
- Architecture document: 80 points
- Security policy document: 85 points
- Example code showing a pattern: 70 points
- Generic utility function: 40 points
- Comment explaining why something works: 75 points
Why it works: Context that defines how things must be done is more valuable than context that shows one way to do it.
How to identify: Look for configuration files, architecture docs, security guidelines, comments explaining constraints.
Weighted Scoring: Combining Signals
No single signal is perfect. Recency matters but old code can be critical. Proximity matters but you need examples from elsewhere. Semantic similarity matters but embeddings are expensive.
The solution: weight multiple signals and combine them.
def rank_context(item, task, history):
    score = 0

    # Recency (20% weight)
    recency_score = calculate_recency(item)
    score += recency_score * 0.20

    # Proximity (30% weight)
    proximity_score = calculate_proximity(item, task)
    score += proximity_score * 0.30

    # Structural distance (20% weight)
    structural_score = calculate_structural_distance(item, task)
    score += structural_score * 0.20

    # Semantic similarity (20% weight)
    semantic_score = calculate_semantic_similarity(item, task)
    score += semantic_score * 0.20

    # Historical value (10% weight, if available)
    if history:
        historical_score = calculate_historical_value(item, history)
        score += historical_score * 0.10

    return score

The weights depend on your domain. For rapidly changing codebases, increase recency weight. For legacy systems where you're mostly reading existing code, decrease it. For highly structured systems, increase structural distance weight.
Don't guess at weights. Measure them. Log which items were actually used by successful agents and weight accordingly.
Token Budgeting: Allocating Portions of the Budget
You have, say, 100,000 tokens to work with. How do you allocate them?
The naive approach: rank all context by score and take the top items until you run out of tokens.
The sophisticated approach: allocate the budget across different context types, then rank within each type.
Budget Allocation Strategy
Total budget: 100,000 tokens
Allocation:
1. Task context: 20,000 tokens (20%)
- The user's request, previous context about this task
2. Core code context: 40,000 tokens (40%)
- The files being modified, the functions being called
3. Pattern/example context: 20,000 tokens (20%)
- Similar code, examples of how to do things right
4. Constraint context: 15,000 tokens (15%)
- Architecture docs, style guides, security requirements
5. Fallback/debugging: 5,000 tokens (5%)
- Reserved for unexpected needs
Why allocate by type? Different types of context serve different purposes:
- Task context helps the agent understand what to do
- Core code context helps it understand what exists
- Pattern context helps it match existing style
- Constraint context prevents mistakes
If you allocate 80,000 tokens to core code and only 5,000 to constraints, you get an agent that understands the code but violates your security requirements.
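The type-first strategy can be sketched as bucketed greedy packing. The item shape and the share map are assumptions for this sketch:

```python
def allocate_and_pack(items, budget, shares):
    """Give each context type its own token bucket, then fill each
    bucket greedily by relevance score."""
    packed = []
    for ctype, share in shares.items():
        bucket = int(budget * share)
        used = 0
        candidates = sorted((i for i in items if i["type"] == ctype),
                            key=lambda i: i["score"], reverse=True)
        for item in candidates:
            if used + item["tokens"] <= bucket:
                packed.append(item["name"])
                used += item["tokens"]
    return packed
```

An oversized item can't starve other types: it only competes within its own bucket.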
Budget Allocation for Different Task Types
Different tasks need different allocations:
Bug fix (agent needs to understand the problem):
Core code context: 50% (understand the bug)
Pattern context: 15%
Constraint context: 25% (avoid reintroducing the bug)
Fallback: 10%
Feature addition (agent needs to understand patterns):
Core code context: 35% (understand the interface)
Pattern context: 35% (follow existing patterns)
Constraint context: 20%
Fallback: 10%
Refactoring (agent needs to understand dependencies):
Core code context: 45% (understand dependencies)
Pattern context: 20%
Constraint context: 20%
Fallback: 15%Architectural change (agent needs constraints first):
Core code context: 30%
Pattern context: 10%
Constraint context: 50% (understand the new architecture)
Fallback: 10%
Measure which allocation works for your typical tasks. Then use that as a template.
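These breakdowns can live as plain data, so the packer just looks up a template by task type. The task-type names are illustrative; the fractions are the ones above:

```python
# Allocation templates from the breakdowns above, as fractions of the budget
ALLOCATION_TEMPLATES = {
    "bug_fix":      {"core": 0.50, "pattern": 0.15, "constraint": 0.25, "fallback": 0.10},
    "feature":      {"core": 0.35, "pattern": 0.35, "constraint": 0.20, "fallback": 0.10},
    "refactor":     {"core": 0.45, "pattern": 0.20, "constraint": 0.20, "fallback": 0.15},
    "architecture": {"core": 0.30, "pattern": 0.10, "constraint": 0.50, "fallback": 0.10},
}

def token_allocation(task_type, budget):
    """Turn a template into absolute per-type token budgets."""
    shares = ALLOCATION_TEMPLATES[task_type]
    return {ctype: round(budget * share) for ctype, share in shares.items()}
```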
The Packing Problem: Maximizing Total Value
Once you've ranked context and allocated budgets, the question becomes: which items actually go in?
This is the knapsack problem. You have a knapsack of 100,000 tokens. You have items (files, docs, examples) each with a weight (token count) and value (relevance score). Maximize total value within the weight constraint.
The naive solution: rank by value and take items in order until the budget is full.
# Naive approach
ranked_items = rank_all_context(task)
packed = []
total_tokens = 0
for item in ranked_items:
    if total_tokens + item.tokens <= budget:
        packed.append(item)
        total_tokens += item.tokens

This works but it's suboptimal. A high-scoring item that's 20,000 tokens might leave you unable to fit five medium-scoring items that total 15,000 tokens.
The better solution: use dynamic programming or greedy algorithms that account for value-to-token ratio.
# Better approach: value per token
ranked_items = rank_all_context(task)
items_by_efficiency = sorted(
    ranked_items,
    key=lambda x: x.score / x.tokens,
    reverse=True
)
packed = []
total_tokens = 0
for item in items_by_efficiency:
    if total_tokens + item.tokens <= budget:
        packed.append(item)
        total_tokens += item.tokens

This prioritizes items that give the most value per token, which is usually closer to optimal.
For really sophisticated systems, use full knapsack-solving algorithms. But the value-to-token ratio approach is "good enough" and much simpler.
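For reference, a compact 0/1 knapsack sketch. Token counts are bucketed to a granularity to keep the dynamic-programming table small; the item shape is the same assumption as in the snippets above:

```python
def knapsack_pack(items, budget, granularity=100):
    """Exact 0/1 knapsack over (tokens, score) pairs, with token counts
    bucketed to `granularity` to bound the DP table size."""
    scaled = [(max(1, it["tokens"] // granularity), it) for it in items]
    cap = budget // granularity
    best = [(0.0, [])] * (cap + 1)   # best[w] = (total score, chosen names)
    for weight, it in scaled:
        # Iterate capacities downward so each item is used at most once
        for w in range(cap, weight - 1, -1):
            candidate = best[w - weight][0] + it["score"]
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - weight][1] + [it["name"]])
    return best[cap][1]
```

On the failure case above, greedy-by-score takes the single big item and strands the rest of the budget; the DP finds the two medium items whose combined score is higher.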
Truncation vs Summarization vs Omission
When an item is important but too large, you have three options:
Option 1: Truncation
Just take the first N lines:
Full file: 5,000 tokens
Truncated to 500 tokens: Take lines 1-50
Pros: Dead simple, preserves code structure
Cons: Loses important details at the end, breaks functions mid-wayUse truncation for: log files, large lists, repetitive structures
Option 2: Summarization
Create a summary of the full item:
Full file (5,000 tokens):

class AuthManager:
    def login(user, password):
        # 50 lines of complex logic
    def logout(user):
        # 20 lines
    def refresh_token(token):
        # 40 lines

Summary (200 tokens):
"AuthManager handles user login/logout and token refresh.
Login validates credentials against the user DB and returns a signed JWT.
Logout invalidates the token in Redis. Refresh regenerates an expired token if <24h old."
Pros: Preserves the important parts, compresses efficiently
Cons: Requires generating summaries (cost/latency), loses implementation details
Use summarization for: large files, modules you understand conceptually but don't need to read, architecture documentation
Option 3: Omission
Just don't include it if it won't fit:
Pros: No trade-offs, simplest
Cons: Might miss important context
Use omission for: low-scoring items, redundant context, nice-to-have examples
Dynamic Budget Allocation: Adjusting Based on Task Complexity
More complex tasks need larger budgets. Simple tasks don't.
def calculate_budget(task, base_budget=100000):
    complexity = estimate_task_complexity(task)
    if complexity == "simple":           # Single file change
        return int(base_budget * 0.5)
    elif complexity == "moderate":       # Multi-file change
        return base_budget
    elif complexity == "complex":        # Cross-module change
        return int(base_budget * 1.5)
    elif complexity == "architectural":  # Major refactoring
        return int(base_budget * 2.0)

def estimate_task_complexity(task):
    # Count how many files the task likely touches
    # Count how many modules
    # Check if it involves database changes
    # Check if it involves API changes
    # etc.
    ...

More complex tasks need more context. Allow the budget to grow. But measure whether that actually helps—you might find that more budget doesn't improve outcomes, which means you need better ranking, not more tokens.
Measuring Context Quality: How to Know If Your Ranking Works
Don't guess whether your ranking is good. Measure it.
Measurement 1: Task Success Rate
The core metric: does the agent succeed?
Baseline: Without intelligent ranking
- Success rate: 65%
- Average retries per task: 2.3
With intelligent ranking:
- Success rate: 85%
- Average retries per task: 1.1
If success rate improves, your ranking is working.
Measurement 2: Context Efficiency
Are you using your token budget well?
Track, per task:
- Total tokens used
- Total tokens available
- Success rate
High-efficiency tasks:
- Used 40,000 tokens
- Budget was 100,000
- Success rate: 90%
Low-efficiency tasks:
- Used 95,000 tokens
- Budget was 100,000
- Success rate: 60%
Insight: You might not need a larger budget. You need better ranking.
Measurement 3: Context Reuse
How often is the same context useful across multiple tasks?
Log what context each task fetches:
- auth.js: fetched in 95% of auth tasks
- models.py: fetched in 45% of all tasks
- style_guide.md: fetched in 80% of tasks
- obsolete_api.js: fetched in 5% of tasks
Insight: style_guide.md is consistently valuable, should be ranked higher.
obsolete_api.js is rarely useful, can be ranked lower.
Measurement 4: Signal Correlation
Which ranking signals actually predict success?
For your past 100 tasks, correlate:
Recency vs success: -0.3 (weak negative - recency isn't important)
Proximity vs success: 0.8 (strong positive - proximity matters a lot)
Semantic similarity vs success: 0.6 (moderate positive)
Historical value vs success: 0.7 (strong positive)
Insight: Adjust your weights. Reduce recency weight, increase proximity weight.
This is how you improve your ranking over time. Every task teaches you something about what signals matter.
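Computing those correlations needs nothing beyond Pearson's r over your task log. The log schema here is an assumption (per-task mean signal scores plus a 0/1 success outcome):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def signal_correlations(task_log):
    """task_log: list of dicts of per-task signal scores plus 'success'."""
    signals = [k for k in task_log[0] if k != "success"]
    outcomes = [t["success"] for t in task_log]
    return {s: round(pearson([t[s] for t in task_log], outcomes), 2)
            for s in signals}
```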
The Lost In The Middle Effect and Ranking Order
There's an interesting quirk of LLM attention: information in the middle of a long context gets less attention than information at the beginning or end. This is the "lost in the middle" effect, which is also a key consideration when avoiding context overload.
When you rank context, consider not just value but position:
Token position analysis for an LLM:
- First 20% of tokens: ~90% attention
- Middle 60% of tokens: ~40% attention
- Last 20% of tokens: ~85% attention
Implication:
- Put the most critical context at the beginning or end
- Put nice-to-have context in the middle
- If you have to choose between more context and better positioning, choose better positioning
In practice:
def arrange_context(items):
    # Separate by importance
    critical = [i for i in items if i.score > 80]
    important = [i for i in items if 50 < i.score <= 80]
    nice_to_have = [i for i in items if i.score <= 50]

    # Arrange: critical at the start and end, nice-to-have buried in the
    # middle, important just inside the critical items
    arranged = critical[:len(critical)//2]         # First half of critical
    arranged += important[:len(important)//2]
    arranged += nice_to_have
    arranged += important[len(important)//2:]
    arranged += critical[len(critical)//2:]        # Second half of critical
    return arranged

This is a small optimization but it compounds. If critical context gets 90% attention instead of 40%, the agent is substantially better.
Real-World Example: Context Ranking for a Refactoring Task
Let's trace through a real example: "Refactor the payment processing module to support async operations."
Step 1: Identify Context Types
What could be relevant?
Type: Core Code
- payment.js (the file being modified): 2,000 tokens
- checkout.js (calls payment): 1,500 tokens
- order.js (uses checkout): 1,200 tokens
Type: Pattern
- async_utils.js (async patterns): 800 tokens
- database.js (async DB): 1,000 tokens
- existing_async_refactor.md (docs): 400 tokens
Type: Constraint
- architecture_docs.md (module boundaries): 600 tokens
- performance_requirements.md (latency targets): 400 tokens
- testing_strategy.md (how to test async): 500 tokens
Type: Example
- other_async_modules/*.js (5 files): 5,000 tokens
- old_payment_tests.js (existing tests): 2,000 tokens
Step 2: Score Each Item
Using our scoring signals:
Proximity + Recency + Semantic Similarity + Historical Value:
payment.js: 100 + 80 + 95 + 90 = 365
checkout.js: 80 + 60 + 80 + 75 = 295
order.js: 70 + 40 + 70 + 60 = 240
async_utils.js: 70 + 50 + 90 + 85 = 295
database.js: 60 + 70 + 85 + 80 = 295
existing_async_refactor.md: 50 + 80 + 80 + 70 = 280
architecture_docs.md: 40 + 30 + 70 + 60 = 200
performance_requirements.md: 50 + 20 + 75 + 65 = 210
testing_strategy.md: 60 + 30 + 85 + 80 = 255
other_async_modules: 50 + 60 + 75 + 70 = 255
old_payment_tests.js: 90 + 80 + 70 + 80 = 320
Step 3: Allocate Budget
Task: Refactoring (complex, multi-file change):
Budget: 100,000 tokens
Core code: 40% = 40,000 tokens
Pattern: 25% = 25,000 tokens
Constraint: 20% = 20,000 tokens
Examples: 15% = 15,000 tokens
Step 4: Pack the Budget
Within each category, rank by efficiency and pack:
CORE CODE (40,000 tokens available):
1. payment.js (2,000 tokens, score 365) ← Include
2. checkout.js (1,500 tokens, score 295) ← Include
3. order.js (1,200 tokens, score 240) ← Include
Subtotal: 4,700 tokens. Still have 35,300.
Include other high-value core code...
PATTERN (25,000 tokens available):
1. async_utils.js (800 tokens, score 295) ← Include
2. database.js (1,000 tokens, score 295) ← Include
3. existing_async_refactor.md (400 tokens, score 280) ← Include
Subtotal: 2,200 tokens. Include examples...
CONSTRAINT (20,000 tokens available):
1. testing_strategy.md (500 tokens, score 255) ← Include
2. performance_requirements.md (400 tokens, score 210) ← Include
3. architecture_docs.md (600 tokens, score 200) ← Include
Subtotal: 1,500 tokens. Still have 18,500.
EXAMPLES (15,000 tokens available):
1. old_payment_tests.js (2,000 tokens, score 320) ← Include
2. other_async_modules (5,000 tokens, score 255) ← Include
Subtotal: 7,000 tokens.
Final context: ~15,400 tokens out of 100,000 available.
The agent has plenty of room for deep reasoning, and you've included only the highest-value items.
Integration With Bitloops
Ranking and token budgeting is where Bitloops shines. Instead of building ranking logic for each agent independently, Bitloops provides:
- Signal extraction: Automatically compute recency, proximity, semantic similarity across your codebase
- Scoring models: Pre-trained models that combine signals into relevance scores, tuned for coding tasks
- Budget allocation: Automatically allocate tokens based on task complexity and historical performance
- Measurement: Continuous measurement of what context is actually useful, feeding back into ranking improvements
- Integration: Standard interfaces that any agent (Claude, GPT, open-source) can use to fetch ranked context
Instead of each agent team inventing their own ranking system, you build once with Bitloops and every agent benefits from continuous learning about what matters.
FAQ
Don't you need embeddings for semantic similarity scoring?
It helps but isn't required. You can rank effectively with just proximity, recency, and structural distance. Embeddings are an optimization that improves accuracy in complex codebases.
Should I weight all signals equally?
No. Measure which signals correlate with task success in your codebase, then weight accordingly. A legacy system might weight recency less. A fast-moving system might weight it more.
What if my budget is very small (say, 20,000 tokens)?
Increase the weights on structural distance and constraint context. Decrease pattern examples. Prioritize high-confidence items. You'll need tighter ranking, but it's still better than random selection.
How often should I re-measure and adjust weights?
After every 50 tasks, recalculate signal correlations and adjust weights. After 100 tasks, you'll have enough data to feel confident. After that, measure quarterly or when you notice changes in task success rates.
What if two items have the same score but I can only fit one?
Use a tiebreaker: prefer items that are smaller (value per token), or items that appeared in more past successful tasks. Or include both partially (truncate or summarize).
Can I automate this? Do I really need to manually rank everything?
Yes, absolutely automate it. Your tool calling system should automatically rank context before fetching. The human role is to measure outcomes and adjust weights, not to manually rank each request.
What's the minimum number of tasks needed to trust my ranking weights?
100 tasks gives you reasonable confidence. 500 tasks gives you very high confidence. Before 50 tasks, your weights are mostly guesses—don't trust them fully.
How does ranking interact with agent planning?
Good agents don't just accept whatever context you give them. They look at what context they have, identify gaps, and ask for more. Ranking gets them 80% of the way there, but the agent's judgment matters for the last 20%.
Should I rank differently based on the agent model (Claude vs GPT vs open source)?
Maybe. Different models have different attention patterns and different strengths. If you notice that GPT succeeds with less constraint context but more examples, adjust the allocation for GPT. But start with one ranking scheme and adjust based on measurement.
Primary Sources
- Lost in the Middle: analysis of attention patterns showing models perform worse on information in the middle of long contexts.
- Attention Is All You Need: foundational work on the transformer architecture, essential to understanding context window mechanics.
- RAG Paper: combines retrieval with generation to augment model knowledge for knowledge-intensive tasks.
- ReAct: demonstrates interleaving reasoning and acting as a framework for agent task solving.
- Tree of Thoughts: explores multiple reasoning paths through tree-structured prompting for complex problem solving.
- LangChain Retrieval: practical guide for implementing retrieval systems in language model applications.