Context Ranking and Token Budgeting
You have more context than fits in the window. Context ranking determines which pieces matter most—using signals like recency, proximity, and semantic similarity—then packs them efficiently into your token budget. It's how you get agents to succeed with less, not more.
You have more relevant context than fits in the window. Your codebase is 50,000 files. The user asks "refactor the auth module." Everything is potentially relevant—the auth module itself, tests, dependencies, examples, architecture docs, conventions, security policies. But you have maybe 100,000 tokens to work with. What gets in?
This is the context ranking problem, and it's where agents either succeed or fail. Get the ranking wrong and the agent produces garbage despite having "enough context." Get it right and the agent produces excellent work using less than half the available budget.
Token budgeting isn't just optimization. It's a fundamental architectural problem that affects agent quality, cost, latency, and user experience. This article teaches you how to think about it.
The Core Problem: Value Per Token
The fundamental question is: which context has the highest value per token?
Not "which context is most relevant?" That's easy—fetch everything relevant.
Not "which context is largest?" That's the wrong signal.
The real question is: "which context, if included, most improves the agent's ability to complete this task?"
This is an information theory problem. You're optimizing signal-to-noise. Each token you include costs money and potentially adds noise (distracting the model from important information). Each token you exclude costs nothing but might cause the agent to miss a critical constraint.
The goal is to maximize the probability that the agent succeeds with a fixed token budget.
Context Ranking Signals
There are multiple signals that indicate whether context is valuable for a task. A sophisticated ranking system combines these signals into a single relevance score.
Signal 1: Recency
How recently was this context modified or accessed?
Scoring:
- Modified in the last commit: 100 points
- Modified in the last week: 80 points
- Modified in the last month: 60 points
- Modified in the last 6 months: 40 points
- Hasn't changed in over 6 months: 20 points
Why it works: Recently modified code is more likely to be relevant. If someone changed the auth module last week and you're now asked to refactor it, that's no coincidence.
Edge case: Don't weight recency too heavily. Old code can be critical. The HTTP library might not change for years but it's essential for any network request.
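A minimal sketch of this tiering. The `in_last_commit` flag and the last-modified timestamp are assumed inputs your VCS layer would supply:

```python
from datetime import datetime, timedelta

def recency_score(last_modified, now=None, in_last_commit=False):
    """Map a file's modification time onto the tiered scores above.
    `in_last_commit` is a hypothetical flag from your VCS layer."""
    if in_last_commit:
        return 100
    now = now or datetime.now()
    age = now - last_modified
    if age <= timedelta(days=7):
        return 80   # modified in the last week
    if age <= timedelta(days=30):
        return 60   # last month
    if age <= timedelta(days=180):
        return 40   # last 6 months
    return 20       # older than 6 months
```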
Signal 2: Proximity to Task
How close is this context to what the user asked for?
Scoring (for a task about "auth module"):
- Exact match (auth module itself): 100 points
- Direct dependency (uses auth module): 80 points
- Reverse dependency (imported by auth module): 70 points
- Same package/module: 60 points
- One level removed: 40 points
- Two+ levels removed: 20 points
Why it works: The auth module itself is obviously relevant. Code that calls the auth module is probably relevant. Code that the auth module calls might be relevant. Code that imports the auth module three levels up is probably not.
How to calculate: Use the dependency graph. If you don't have one, build one by parsing imports.
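One way to compute this over a parsed import graph. The `deps` mapping (module to the modules it imports) is an assumed structure; the "same package" tier is omitted here because it needs package metadata the graph alone doesn't carry:

```python
from collections import deque

def proximity_score(target, item, deps):
    """deps: module -> set of modules it imports (assumed structure
    built by parsing import statements)."""
    if item == target:
        return 100                        # exact match
    if target in deps.get(item, set()):
        return 80                         # direct dependency: item uses target
    if item in deps.get(target, set()):
        return 70                         # reverse dependency: target imports item
    # Walk the graph undirected to find how many levels removed the item is
    neighbors = {}
    for mod, imported in deps.items():
        for other in imported:
            neighbors.setdefault(mod, set()).add(other)
            neighbors.setdefault(other, set()).add(mod)
    seen, queue = {target}, deque([(target, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == item:
            return 40 if dist == 2 else 20   # one level removed vs two+
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 20                             # unreachable: treat as far away
```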
Signal 3: Structural Distance
How far away is this in the codebase structure?
Scoring (for editing src/auth/login.js):
- Same directory: 100 points
- Same package (src/auth/): 90 points
- Same parent (src/): 70 points
- Related package (src/models/): 50 points
- Different package: 30 points
- Config/infrastructure: 40 points
Why it works: Code physically near the target file is more likely to interact with it, follow the same conventions, and be part of the same logical unit.
How to calculate: Count directory traversals. The fewer traversals needed, the higher the score.
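A sketch of hop counting over slash-separated paths. The band mapping is an assumption; note that hop count alone can't separate "same parent" from "related package", so both land in the same band, and the config/infrastructure special case would need a filename check on top:

```python
def traversals(a, b):
    """Directory hops between the folders of two slash-separated paths."""
    dir_a, dir_b = a.split("/")[:-1], b.split("/")[:-1]
    common = 0
    for x, y in zip(dir_a, dir_b):
        if x != y:
            break
        common += 1
    return (len(dir_a) - common) + (len(dir_b) - common)

def structural_score(target, item):
    # Assumed banding of hop counts onto the tiers above
    bands = {0: 100, 1: 90, 2: 70, 3: 50}
    return bands.get(traversals(target, item), 30)
```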
Signal 4: Semantic Similarity
How similar is the context to the task semantically? This connects to how agents learn meaning from their environment through semantic context.
This requires semantic search (usually embedding-based):
Example: Task is "add OAuth2 authentication"
- Existing OAuth2 implementation: 95% similarity
- Other auth methods: 80% similarity
- Token handling code: 70% similarity
- User model: 60% similarity
- HTTP utilities: 40% similarity
- Database migrations: 30% similarity
Why it works: If you're implementing OAuth2, studying an existing OAuth2 implementation teaches you more than a generic auth pattern.
How to implement: Use embeddings (OpenAI, Anthropic, or open-source embedding models). Embed the task description and all context. Rank by cosine similarity.
Cost: Embedding everything is expensive upfront, but you do it once and cache the results.
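Once the vectors exist, ranking is just a cosine-similarity sort. A dependency-free sketch, assuming the embeddings are already computed and cached:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_similarity(task_vector, context_vectors):
    """context_vectors: {name: embedding}; highest similarity first."""
    scored = [(name, cosine_similarity(task_vector, vec))
              for name, vec in context_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```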
Signal 5: Historical Value
How often has this context been useful for similar past tasks?
Scoring (based on past usage):
- Used in 100% of similar tasks: 90 points
- Used in 80% of similar tasks: 80 points
- Used in 50% of similar tasks: 60 points
- Used in 20% of similar tasks: 40 points
- Rarely used: 20 points
Why it works: If your team always fetches the style guide when working on auth, the style guide is valuable. If the style guide is rarely fetched for other tasks, it's less valuable.
How to implement: Log what context is fetched for each task and its outcome. After 100 tasks, measure which context was most useful. Use that to weight future decisions.
This is your feedback loop. It's how your system learns.
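That feedback loop can be sketched in a few lines. The log schema and the mapping from usage rate to the tiers above are assumptions:

```python
from collections import defaultdict

class ContextUsageLog:
    """Record which context items were fetched for successful tasks of a
    given type, then score future candidates by their usage rate."""
    def __init__(self):
        self.tasks = defaultdict(int)   # task_type -> successful task count
        self.uses = defaultdict(int)    # (task_type, item) -> times fetched

    def record(self, task_type, items_fetched, succeeded):
        if not succeeded:
            return                      # only learn from successes in this sketch
        self.tasks[task_type] += 1
        for item in items_fetched:
            self.uses[(task_type, item)] += 1

    def historical_score(self, task_type, item):
        total = self.tasks[task_type]
        if total == 0:
            return 20                   # no data: neutral-low default
        rate = self.uses[(task_type, item)] / total
        # Assumed mapping of usage rate onto the tiers above
        if rate >= 1.0: return 90
        if rate >= 0.8: return 80
        if rate >= 0.5: return 60
        if rate >= 0.2: return 40
        return 20
```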
Signal 6: Constraint or Pattern Density
How much constraint or pattern information is in this context?
Scoring:
- Configuration file that defines behavior: 95 points
- Code with unusual error handling: 85 points
- Architecture document: 80 points
- Security policy document: 85 points
- Example code showing a pattern: 70 points
- Generic utility function: 40 points
- Comment explaining why something works: 75 points
Why it works: Context that defines how things must be done is more valuable than context that shows one way to do it.
How to identify: Look for configuration files, architecture docs, security guidelines, comments explaining constraints.
Weighted Scoring: Combining Signals
No single signal is perfect. Recency matters but old code can be critical. Proximity matters but you need examples from elsewhere. Semantic similarity matters but embeddings are expensive.
The solution: weight multiple signals and combine them.
def rank_context(item, task, history):
    score = 0

    # Recency (20% weight)
    recency_score = calculate_recency(item)
    score += recency_score * 0.20

    # Proximity (30% weight)
    proximity_score = calculate_proximity(item, task)
    score += proximity_score * 0.30

    # Structural distance (20% weight)
    structural_score = calculate_structural_distance(item, task)
    score += structural_score * 0.20

    # Semantic similarity (20% weight)
    semantic_score = calculate_semantic_similarity(item, task)
    score += semantic_score * 0.20

    # Historical value (10% weight, if available)
    if history:
        historical_score = calculate_historical_value(item, history)
        score += historical_score * 0.10

    return score

The weights depend on your domain. For rapidly changing codebases, increase recency weight. For legacy systems where you're mostly reading existing code, decrease it. For highly structured systems, increase structural distance weight.
Don't guess at weights. Measure them. Log which items were actually used by successful agents and weight accordingly.
Token Budgeting: Allocating Portions of the Budget
You have, say, 100,000 tokens to work with. How do you allocate them?
The naive approach: rank all context by score and take the top items until you run out of tokens.
The sophisticated approach: allocate the budget across different context types, then rank within each type.
Budget Allocation Strategy
Total budget: 100,000 tokens
Allocation:
1. Task context: 20,000 tokens (20%)
- The user's request, previous context about this task
2. Core code context: 40,000 tokens (40%)
- The files being modified, the functions being called
3. Pattern/example context: 20,000 tokens (20%)
- Similar code, examples of how to do things right
4. Constraint context: 15,000 tokens (15%)
- Architecture docs, style guides, security requirements
5. Fallback/debugging: 5,000 tokens (5%)
- Reserved for unexpected needs
Why allocate by type? Different types of context serve different purposes:
- Task context helps the agent understand what to do
- Core code context helps it understand what exists
- Pattern context helps it match existing style
- Constraint context prevents mistakes
If you allocate 80,000 tokens to core code and only 5,000 to constraints, you get an agent that understands the code but violates your security requirements.
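The type-first strategy can be sketched as bucketed greedy packing. The item shape and the share map are assumptions for this sketch:

```python
def allocate_and_pack(items, budget, shares):
    """Give each context type its own token bucket, then fill each
    bucket greedily by relevance score."""
    packed = []
    for ctype, share in shares.items():
        bucket = int(budget * share)
        used = 0
        candidates = sorted((i for i in items if i["type"] == ctype),
                            key=lambda i: i["score"], reverse=True)
        for item in candidates:
            if used + item["tokens"] <= bucket:
                packed.append(item["name"])
                used += item["tokens"]
    return packed
```

An oversized item can't starve other types: it only competes within its own bucket.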
Budget Allocation for Different Task Types
Different tasks need different allocations:
Bug fix (agent needs to understand the problem):
Core code context: 50% (understand the bug)
Pattern context: 15%
Constraint context: 25% (avoid reintroducing the bug)
Fallback: 10%
Feature addition (agent needs to understand patterns):
Core code context: 35% (understand the interface)
Pattern context: 35% (follow existing patterns)
Constraint context: 20%
Fallback: 10%
Refactoring (agent needs to understand dependencies):
Core code context: 45% (understand dependencies)
Pattern context: 20%
Constraint context: 20%
Fallback: 15%Architectural change (agent needs constraints first):
Core code context: 30%
Pattern context: 10%
Constraint context: 50% (understand the new architecture)
Fallback: 10%
Measure which allocation works for your typical tasks. Then use that as a template.
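These breakdowns can live as plain data, so the packer just looks up a template by task type. The task-type names are illustrative; the fractions are the ones above:

```python
# Allocation templates from the breakdowns above, as fractions of the budget
ALLOCATION_TEMPLATES = {
    "bug_fix":      {"core": 0.50, "pattern": 0.15, "constraint": 0.25, "fallback": 0.10},
    "feature":      {"core": 0.35, "pattern": 0.35, "constraint": 0.20, "fallback": 0.10},
    "refactor":     {"core": 0.45, "pattern": 0.20, "constraint": 0.20, "fallback": 0.15},
    "architecture": {"core": 0.30, "pattern": 0.10, "constraint": 0.50, "fallback": 0.10},
}

def token_allocation(task_type, budget):
    """Turn a template into absolute per-type token budgets."""
    shares = ALLOCATION_TEMPLATES[task_type]
    return {ctype: round(budget * share) for ctype, share in shares.items()}
```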
The Packing Problem: Maximizing Total Value
Once you've ranked context and allocated budgets, the question becomes: which items actually go in?
This is the knapsack problem. You have a knapsack of 100,000 tokens. You have items (files, docs, examples) each with a weight (token count) and value (relevance score). Maximize total value within the weight constraint.
The naive solution: rank by value and take items in order until the budget is full.
# Naive approach
ranked_items = rank_all_context(task)
packed = []
total_tokens = 0
for item in ranked_items:
    if total_tokens + item.tokens <= budget:
        packed.append(item)
        total_tokens += item.tokens

This works but it's suboptimal. A high-scoring item that's 20,000 tokens might leave you unable to fit five medium-scoring items that total 15,000 tokens.
The better solution: use dynamic programming or greedy algorithms that account for value-to-token ratio.
# Better approach: value per token
ranked_items = rank_all_context(task)
items_by_efficiency = sorted(
    ranked_items,
    key=lambda x: x.score / x.tokens,
    reverse=True
)
packed = []
total_tokens = 0
for item in items_by_efficiency:
    if total_tokens + item.tokens <= budget:
        packed.append(item)
        total_tokens += item.tokens

This prioritizes items that give the most value per token, which is usually closer to optimal.
For really sophisticated systems, use full knapsack-solving algorithms. But the value-to-token ratio approach is "good enough" and much simpler.
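For reference, a compact 0/1 knapsack sketch. Token counts are bucketed to a granularity to keep the dynamic-programming table small; the item shape is the same assumption as in the snippets above:

```python
def knapsack_pack(items, budget, granularity=100):
    """Exact 0/1 knapsack over (tokens, score) pairs, with token counts
    bucketed to `granularity` to bound the DP table size."""
    scaled = [(max(1, it["tokens"] // granularity), it) for it in items]
    cap = budget // granularity
    best = [(0.0, [])] * (cap + 1)   # best[w] = (total score, chosen names)
    for weight, it in scaled:
        # Iterate capacities downward so each item is used at most once
        for w in range(cap, weight - 1, -1):
            candidate = best[w - weight][0] + it["score"]
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - weight][1] + [it["name"]])
    return best[cap][1]
```

On the failure case above, greedy-by-score takes the single big item and strands the rest of the budget; the DP finds the two medium items whose combined score is higher.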
Truncation vs Summarization vs Omission
When an item is important but too large, you have three options:
Option 1: Truncation
Just take the first N lines:
Full file: 5,000 tokens
Truncated to 500 tokens: Take lines 1-50
Pros: Dead simple, preserves code structure
Cons: Loses important details at the end, breaks functions mid-wayUse truncation for: log files, large lists, repetitive structures
Option 2: Summarization
Create a summary of the full item:
Full file (5,000 tokens):

class AuthManager:
    def login(user, password):
        # 50 lines of complex logic
    def logout(user):
        # 20 lines
    def refresh_token(token):
        # 40 lines

Summary (200 tokens):
"AuthManager handles user login/logout and token refresh.
Login validates credentials against the user DB and returns a signed JWT.
Logout invalidates the token in Redis. Refresh regenerates an expired token if <24h old."
Pros: Preserves the important parts, compresses efficiently
Cons: Requires generating summaries (cost/latency), loses implementation details
Use summarization for: large files, modules you understand conceptually but don't need to read, architecture documentation
Option 3: Omission
Just don't include it if it won't fit:
Pros: No trade-offs, simplest
Cons: Might miss important context
Use omission for: low-scoring items, redundant context, nice-to-have examples
Dynamic Budget Allocation: Adjusting Based on Task Complexity
More complex tasks need larger budgets. Simple tasks don't.
def calculate_budget(task, base_budget=100000):
    complexity = estimate_task_complexity(task)
    if complexity == "simple":           # Single file change
        return int(base_budget * 0.5)
    elif complexity == "moderate":       # Multi-file change
        return base_budget
    elif complexity == "complex":        # Cross-module change
        return int(base_budget * 1.5)
    elif complexity == "architectural":  # Major refactoring
        return int(base_budget * 2.0)

def estimate_task_complexity(task):
    # Count how many files the task likely touches
    # Count how many modules
    # Check if it involves database changes
    # Check if it involves API changes
    # etc.
    ...

More complex tasks need more context. Allow the budget to grow. But measure whether that actually helps—you might find that more budget doesn't improve outcomes, which means you need better ranking, not more tokens.
Measuring Context Quality: How to Know If Your Ranking Works
Don't guess whether your ranking is good. Measure it.
Measurement 1: Task Success Rate
The core metric: does the agent succeed?
Baseline: Without intelligent ranking
- Success rate: 65%
- Average retries per task: 2.3
With intelligent ranking:
- Success rate: 85%
- Average retries per task: 1.1
If success rate improves, your ranking is working.
Measurement 2: Context Efficiency
Are you using your token budget well?
Track, per task:
- Total tokens used
- Total tokens available
- Success rate
High-efficiency tasks:
- Used 40,000 tokens
- Budget was 100,000
- Success rate: 90%
Low-efficiency tasks:
- Used 95,000 tokens
- Budget was 100,000
- Success rate: 60%
Insight: You might not need a larger budget. You need better ranking.
Measurement 3: Context Reuse
How often is the same context useful across multiple tasks?
Log what context each task fetches:
- auth.js: fetched in 95% of auth tasks
- models.py: fetched in 45% of all tasks
- style_guide.md: fetched in 80% of tasks
- obsolete_api.js: fetched in 5% of tasks
Insight: style_guide.md is consistently valuable, should be ranked higher.
obsolete_api.js is rarely useful, can be ranked lower.
Measurement 4: Signal Correlation
Which ranking signals actually predict success?
For your past 100 tasks, correlate:
Recency vs success: -0.3 (weak negative - recency isn't important)
Proximity vs success: 0.8 (strong positive - proximity matters a lot)
Semantic similarity vs success: 0.6 (moderate positive)
Historical value vs success: 0.7 (strong positive)
Insight: Adjust your weights. Reduce recency weight, increase proximity weight.
This is how you improve your ranking over time. Every task teaches you something about what signals matter.
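Computing those correlations needs nothing beyond Pearson's r over your task log. The log schema here is an assumption (per-task mean signal scores plus a 0/1 success outcome):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def signal_correlations(task_log):
    """task_log: list of dicts of per-task signal scores plus 'success'."""
    signals = [k for k in task_log[0] if k != "success"]
    outcomes = [t["success"] for t in task_log]
    return {s: round(pearson([t[s] for t in task_log], outcomes), 2)
            for s in signals}
```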
The Lost In The Middle Effect and Ranking Order
There's an interesting quirk of LLM attention: information in the middle of a long context gets less attention than information at the beginning or end. This is the "lost in the middle" effect, which is also a key consideration when avoiding context overload.
When you rank context, consider not just value but position:
Token position analysis for an LLM:
- First 20% of tokens: ~90% attention
- Middle 60% of tokens: ~40% attention
- Last 20% of tokens: ~85% attention
Implication:
- Put the most critical context at the beginning or end
- Put nice-to-have context in the middle
- If you have to choose between more context and better positioning, choose better positioning
In practice:
def arrange_context(items):
    # Separate by importance
    critical = [i for i in items if i.score > 80]
    important = [i for i in items if 50 < i.score <= 80]
    nice_to_have = [i for i in items if i.score <= 50]

    # Arrange: critical at the start and end, nice-to-have buried in the
    # middle, important just inside the critical items
    arranged = critical[:len(critical)//2]         # First half of critical
    arranged += important[:len(important)//2]
    arranged += nice_to_have
    arranged += important[len(important)//2:]
    arranged += critical[len(critical)//2:]        # Second half of critical
    return arranged

This is a small optimization but it compounds. If critical context gets 90% attention instead of 40%, the agent is substantially better.
Real-World Example: Context Ranking for a Refactoring Task
Let's trace through a real example: "Refactor the payment processing module to support async operations."
Step 1: Identify Context Types
What could be relevant?
Type: Core Code
- payment.js (the file being modified): 2,000 tokens
- checkout.js (calls payment): 1,500 tokens
- order.js (uses checkout): 1,200 tokens
Type: Pattern
- async_utils.js (async patterns): 800 tokens
- database.js (async DB): 1,000 tokens
- existing_async_refactor.md (docs): 400 tokens
Type: Constraint
- architecture_docs.md (module boundaries): 600 tokens
- performance_requirements.md (latency targets): 400 tokens
- testing_strategy.md (how to test async): 500 tokens
Type: Example
- other_async_modules/*.js (5 files): 5,000 tokens
- old_payment_tests.js (existing tests): 2,000 tokens
Step 2: Score Each Item
Using our scoring signals:
Proximity + Recency + Semantic Similarity + Historical Value:
payment.js: 100 + 80 + 95 + 90 = 365
checkout.js: 80 + 60 + 80 + 75 = 295
order.js: 70 + 40 + 70 + 60 = 240
async_utils.js: 70 + 50 + 90 + 85 = 295
database.js: 60 + 70 + 85 + 80 = 295
existing_async_refactor.md: 50 + 80 + 80 + 70 = 280
architecture_docs.md: 40 + 30 + 70 + 60 = 200
performance_requirements.md: 50 + 20 + 75 + 65 = 210
testing_strategy.md: 60 + 30 + 85 + 80 = 255
other_async_modules: 50 + 60 + 75 + 70 = 255
old_payment_tests.js: 90 + 80 + 70 + 80 = 320
Step 3: Allocate Budget
Task: Refactoring (complex, multi-file change):
Budget: 100,000 tokens
Core code: 40% = 40,000 tokens
Pattern: 25% = 25,000 tokens
Constraint: 20% = 20,000 tokens
Examples: 15% = 15,000 tokens
Step 4: Pack the Budget
Within each category, rank by efficiency and pack:
CORE CODE (40,000 tokens available):
1. payment.js (2,000 tokens, score 365) ← Include
2. checkout.js (1,500 tokens, score 295) ← Include
3. order.js (1,200 tokens, score 240) ← Include
Subtotal: 4,700 tokens. Still have 35,300.
Include other high-value core code...
PATTERN (25,000 tokens available):
1. async_utils.js (800 tokens, score 295) ← Include
2. database.js (1,000 tokens, score 295) ← Include
3. existing_async_refactor.md (400 tokens, score 280) ← Include
Subtotal: 2,200 tokens. Include examples...
CONSTRAINT (20,000 tokens available):
1. testing_strategy.md (500 tokens, score 255) ← Include
2. performance_requirements.md (400 tokens, score 210) ← Include
3. architecture_docs.md (600 tokens, score 200) ← Include
Subtotal: 1,500 tokens. Still have 18,500.
EXAMPLES (15,000 tokens available):
1. old_payment_tests.js (2,000 tokens, score 320) ← Include
2. other_async_modules (5,000 tokens, score 255) ← Include
Subtotal: 7,000 tokens.
Final context: ~15,400 tokens out of 100,000 available.
The agent has plenty of room for deep reasoning, and you've included only the highest-value items.
Integration With Bitloops
Ranking and token budgeting is where Bitloops shines. Instead of building ranking logic for each agent independently, Bitloops provides:
- Signal extraction: Automatically compute recency, proximity, semantic similarity across your codebase
- Scoring models: Pre-trained models that combine signals into relevance scores, tuned for coding tasks
- Budget allocation: Automatically allocate tokens based on task complexity and historical performance
- Measurement: Continuous measurement of what context is actually useful, feeding back into ranking improvements
- Integration: Standard interfaces that any agent (Claude, GPT, open-source) can use to fetch ranked context
Instead of each agent team inventing their own ranking system, you build once with Bitloops and every agent benefits from continuous learning about what matters.
FAQ
Don't you need embeddings for semantic similarity scoring?
It helps but isn't required. You can rank effectively with just proximity, recency, and structural distance. Embeddings are an optimization that improves accuracy in complex codebases.
Should I weight all signals equally?
No. Measure which signals correlate with task success in your codebase, then weight accordingly. A legacy system might weight recency less. A fast-moving system might weight it more.
What if my budget is very small (say, 20,000 tokens)?
Increase the weights on structural distance and constraint context. Decrease pattern examples. Prioritize high-confidence items. You'll need tighter ranking, but it's still better than random selection.
How often should I re-measure and adjust weights?
After every 50 tasks, recalculate signal correlations and adjust weights. After 100 tasks, you'll have enough data to feel confident. After that, measure quarterly or when you notice changes in task success rates.
What if two items have the same score but I can only fit one?
Use a tiebreaker: prefer items that are smaller (value per token), or items that appeared in more past successful tasks. Or include both partially (truncate or summarize).
Can I automate this? Do I really need to manually rank everything?
Yes, absolutely automate it. Your tool calling system should automatically rank context before fetching. The human role is to measure outcomes and adjust weights, not to manually rank each request.
What's the minimum number of tasks needed to trust my ranking weights?
100 tasks gives you reasonable confidence. 500 tasks gives you very high confidence. Before 50 tasks, your weights are mostly guesses—don't trust them fully.
How does ranking interact with agent planning?
Good agents don't just accept whatever context you give them. They look at what context they have, identify gaps, and ask for more. Ranking gets them 80% of the way there, but the agent's judgment matters for the last 20%.
Should I rank differently based on the agent model (Claude vs GPT vs open source)?
Maybe. Different models have different attention patterns and different strengths. If you notice that GPT succeeds with less constraint context but more examples, adjust the allocation for GPT. But start with one ranking scheme and adjust based on measurement.
Primary Sources
- Lost in the Middle: analysis of attention patterns showing models perform worse on information in the middle of long contexts.
- Attention Is All You Need: foundational work on the transformer architecture, essential to understanding context window mechanics.
- RAG Paper: combines retrieval with generation to augment model knowledge for knowledge-intensive tasks.
- ReAct: demonstrates interleaving reasoning and acting as a framework for agent task solving.
- Tree of Thoughts: explores multiple reasoning paths through tree-structured prompting for complex problem solving.
- LangChain Retrieval: practical guide for implementing retrieval systems in language model applications.