The Modern AI Development Stack: From Models to Production Agent Infrastructure
Every AI agent runs on a stack: models, inference, tool calling, context management, orchestration, governance, observability. Most teams assemble theirs accidentally. This article maps the layers, what's mature versus what's emerging, and the architectural choices that matter.
The Stack Exists Whether You Acknowledge It Or Not
Every time you run an AI agent, you're using a software stack. It has layers—from the raw model at the bottom, through tool-calling infrastructure, context management, orchestration, governance, and monitoring at the top. Most teams don't consciously design this stack; they just fall into it by using tools that already exist.
The problem is that accidental stacks are slow and fragile. You're using Cursor because it has IDE integration, Claude Code because it has better reasoning, Anthropic's API for the underlying model, and a custom script you wrote to connect everything. None of these components were designed to work together; they're cobbled together.
Understanding the AI development stack lets you make intentional architectural choices instead of falling into whatever existed first.
The Layers of the Stack
From bottom to top:
Layer 1: Models
The foundation. The large language models that power everything.
Current State:
- Proprietary models dominate (Claude, GPT-4, Gemini)
- Open-source models exist but require serious infrastructure (Llama 2, Mistral, code-specific models)
- Inference costs are dropping but still material
- Model capabilities are becoming commoditized—soon you won't choose based on raw quality
Architectural choices:
- Hosted vs. Local: Use hosted APIs (Anthropic, OpenAI, Google) for convenience and access to the latest models. Use local models for data privacy, lower latency, and cost at scale.
- Single model vs. Multi-model: Use one model and get good at prompt engineering. Or use specialized models for specialized tasks (one for code generation, one for analysis, one for testing).
- Proprietary vs. Open-Source: Proprietary models are ahead on capability but locked in. Open-source models are behind on capability but you own the infrastructure.
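The multi-model choice above can be sketched as a simple router that dispatches each task type to a model suited for it. The model names, the routing table, and the `call_model` stub are all illustrative, not a real vendor API:

```python
# Hypothetical routing table: task type -> model name. The names here are
# placeholders; a real table would hold actual model identifiers.
ROUTES = {
    "codegen": "code-model-large",
    "analysis": "general-model-large",
    "testing": "general-model-small",
}

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real API client call (e.g. an Anthropic or OpenAI SDK).
    return f"[{model}] response to: {prompt}"

def route(task_type: str, prompt: str) -> str:
    # Unknown task types fall back to a general-purpose model.
    model = ROUTES.get(task_type, "general-model-large")
    return call_model(model, prompt)
```

Starting with a single-model table and adding rows only when a task demonstrably needs a different model keeps the complexity of multi-model setups proportional to the benefit.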
What's mature: Hosted APIs. They work reliably, handle scale, and you don't think about it.
What's emerging: Open-source models reaching parity on specific tasks (code generation, code analysis). Local inference infrastructure (vLLM, TGI) making it feasible to run models on-premises.
Layer 2: Inference Infrastructure
Getting the model to produce outputs quickly and reliably.
Current State:
- If you're using hosted APIs, this is handled for you
- If you're running local models, you're using inference engines (vLLM, Ollama, Text Generation Inference)
- Optimizations are becoming important (quantization, speculative decoding, KV cache optimization)
- Cost is directly tied to inference efficiency
Architectural choices:
- Edge vs. Cloud: Run inference on devices (edge) for latency and privacy, or in cloud for scale and simplicity
- Optimization level: Standard inference vs. quantized vs. speculative decoding. More optimization = lower cost but potentially worse quality
- Batching: Single request at a time vs. batching requests for efficiency
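The batching choice can be illustrated with a minimal buffer that collects prompts and sends them as one batch once full. `send_batch` is a stand-in for a real batched-inference call (for example, handing a list of prompts to a vLLM engine in one pass); the buffer size is arbitrary:

```python
class RequestBatcher:
    """Collect prompts and send them as one batch once the buffer is full.
    send_batch is a placeholder for a real batched-inference call."""

    def __init__(self, send_batch, max_size=8):
        self.send_batch = send_batch
        self.max_size = max_size
        self.buffer = []

    def submit(self, prompt):
        self.buffer.append(prompt)
        if len(self.buffer) >= self.max_size:
            return self.flush()
        return None  # still buffering

    def flush(self):
        # Send whatever has accumulated, even a partial batch.
        if not self.buffer:
            return []
        batch, self.buffer = self.buffer, []
        return self.send_batch(batch)
```

Production batchers also flush on a time window so a lone request never waits forever; that's omitted here to keep the sketch short.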
What's mature: Cloud hosted inference (APIs). Edge inference for certain models (smaller LLMs on devices). Batching for high-volume scenarios.
What's emerging: Speculative decoding (predicting tokens in parallel). Efficient inference frameworks that make local deployment more viable.
Layer 3: Tool Calling and Agent Scaffolding
How agents decide what to do and invoke tools.
Current State:
- Every major model supports tool calling (function calling, tool use, whatever you call it)
- Tool definitions are standardized (JSON schema)
- The Model Context Protocol (MCP) is becoming the standard for tool integration
- Each agent implementation has slightly different tool-calling patterns
Architectural choices:
- In-context Tools vs. Parameter Tools: Define tools in the prompt (in-context) or in a structured format (parameter tools). Parameter is cleaner and more reliable.
- MCP vs. Custom Tools: Use MCP for interoperability and standard tool definitions. Use custom tools if you need special semantics. See Designing Pluggable Tools for best practices.
- Tool Validation: Validate tool arguments before execution. Don't let the agent call tools with invalid arguments.
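Tool validation can be as simple as checking arguments against the tool's schema before executing anything. This sketch uses a JSON-schema-like definition with a hypothetical `read_file` tool; a production setup would use a full JSON Schema validator:

```python
# Illustrative tool definition: required/optional parameters with basic types.
TOOL_SCHEMA = {
    "name": "read_file",
    "parameters": {
        "path": {"type": str, "required": True},
        "max_bytes": {"type": int, "required": False},
    },
}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of validation errors; empty list means the call is safe
    to execute. Checks required fields, basic types, and unknown arguments."""
    errors = []
    params = schema["parameters"]
    for name, spec in params.items():
        if spec["required"] and name not in args:
            errors.append(f"missing required argument: {name}")
        elif name in args and not isinstance(args[name], spec["type"]):
            errors.append(f"wrong type for {name}")
    for name in args:
        if name not in params:
            errors.append(f"unknown argument: {name}")
    return errors
```

Rejecting the call and feeding the error list back to the model usually lets it correct its own arguments on the next attempt.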
What's mature: Basic tool calling works reliably. MCP is stabilizing as a standard.
What's emerging: Advanced agent control (preventing agent loop thrashing, better reasoning about which tools to use). Better error handling and tool failure recovery.
Layer 4: Context Engines
Managing what information the agent knows about and can use.
Current State:
- No standard yet (this is the gap)
- Each agent maintains context independently
- Context is expensive (large context windows, expensive inference on long context)
- Retrieval-augmented generation (RAG) is the dominant pattern for adding context
Architectural choices:
- Stateless vs. Stateful: Stateless agents are simple but require full context per request. Stateful agents maintain context but are more complex.
- Vector Search vs. Structured Retrieval: Use semantic search (vectors, embeddings) to find relevant context. Use structured retrieval (database queries) when you know what you're looking for.
- Context Compression: Summarize or compress context to fit in smaller windows (cheaper, faster).
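Vector search reduces to one operation: score every stored chunk against the query embedding and keep the top matches. The toy two-dimensional embeddings below are illustrative; in practice the vectors come from an embedding model and the search runs in a vector database:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k=2):
    """corpus: list of (text, embedding) pairs. Returns the k texts whose
    embeddings are most similar to the query embedding."""
    scored = sorted(corpus, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]
```

The retrieved texts are then concatenated into the prompt; context compression means summarizing them first so more chunks fit in the window.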
What's mature: Vector search (embeddings + similarity search). Basic RAG.
What's emerging: Agentic context management (agents deciding what context they need vs. static retrieval). Shared context engines like Bitloops. Context versioning and rollback.
Layer 5: Agent Frameworks
Orchestrating model invocation, tool calling, and reasoning loops.
Current State:
- LangChain is the dominant framework (Python)
- LangGraph adds better reasoning and planning (still Python-focused)
- CrewAI for multi-agent orchestration (newer, less mature)
- Many agent implementations are tied to specific IDEs (Cursor, Claude Code) rather than frameworks
Architectural choices:
- Chain vs. Graph: Chains are simple sequences of steps. Graphs allow branching, loops, and conditional logic. Start with chains, graduate to graphs as complexity grows.
- Function Calling vs. Reasoning Loops: Function calling means the agent immediately calls a tool. Reasoning loops mean the agent thinks before deciding. Reasoning loops produce better decisions at the cost of more tokens.
- Abstractions Over APIs: Use framework abstractions that work with multiple models, or tightly couple to one model's API. Abstractions are more portable; tight coupling is more powerful.
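A reasoning loop is just this cycle: the model decides, the runtime either executes a tool and feeds the result back or stops with an answer. In this sketch `fake_model` is a hand-written stand-in for a real LLM call, so the example stays runnable; the tool and its output are equally illustrative:

```python
def fake_model(history):
    # Stand-in for an LLM call: decide the next action from what the
    # agent has observed so far.
    if not any(step[0] == "tool_result" for step in history):
        return {"action": "call_tool", "tool": "lookup", "args": {"q": "status"}}
    return {"action": "finish", "answer": "status is green"}

TOOLS = {"lookup": lambda q: f"result for {q}"}

def run_agent(max_steps=5):
    history = []
    for _ in range(max_steps):
        decision = fake_model(history)
        if decision["action"] == "finish":
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])
        history.append(("tool_result", result))
    return "gave up after max_steps"  # guard against loop thrashing
```

The `max_steps` guard is the minimal defense against the agent loop thrashing mentioned above; frameworks like LangGraph add branching and state on top of this same cycle.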
What's mature: Basic agent loops. LangChain and LangGraph are production-ready.
What's emerging: Better reasoning integration (chain-of-thought, tree-of-thought). Multi-agent coordination. Agent memory and learning.
Layer 6: Orchestration
Coordinating multiple agents, managing workflows, handling failures.
Current State:
- Workflow engines (Temporal, Prefect, Airflow) exist but weren't designed for agents
- Agent-specific orchestration is still being built
- Most teams use simple patterns (sequential steps, no branching)
Architectural choices:
- Deterministic vs. Non-Deterministic: Traditional workflows are deterministic (same input, same path). Agent workflows are non-deterministic (same input, different reasoning path). Handle this with built-in replay and idempotency.
- Approval Gates: Add human review at critical points (before deploying code, before production database modifications)
- Retry Logic: Agents sometimes fail because they got unlucky. Automatic retries with different context can help.
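Retries and approval gates compose naturally: wrap the flaky step in a retry loop, and route critical actions through an approval check first. Both `run_step` and `approver` here are stand-ins: any agent task that may fail, and any human or policy check, respectively:

```python
def run_with_retries(run_step, attempts=3):
    """Call run_step(attempt) up to `attempts` times, re-raising only after
    the final failure. A real version might vary the context per attempt."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_step(attempt)
        except RuntimeError as exc:
            last_error = exc  # the agent "got unlucky"; try again
    raise RuntimeError(f"all {attempts} attempts failed") from last_error

def gated(action, approver):
    # Approval gate: the action only executes if the approver signs off.
    if not approver(action):
        raise PermissionError(f"action rejected: {action}")
    return f"executed: {action}"
```

In a real workflow engine the approver would be a human-in-the-loop task that pauses the workflow; here it is reduced to a predicate for clarity.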
What's mature: Traditional workflow orchestration (Temporal, Prefect). Using these for agent workflows is a matter of wrapping agents as tasks.
What's emerging: Agent-first orchestration (building for agent semantics rather than adapting existing tools). Better handling of agent failures and non-determinism.
Layer 7: Governance and Safety
Policies, permissions, compliance, preventing bad outcomes.
Current State:
- Limited governance tooling. Most teams implement this per-agent.
- Observability platforms are starting to add governance features
- Audit and compliance are manual in most organizations
- Security models are ad-hoc
Architectural choices:
- Centralized vs. Distributed Policy: Central policy engine (easier to maintain) vs. policies distributed to each agent (easier to customize)
- Allow vs. Deny: Allowlist (only allow specific actions) vs. blocklist (block known bad actions). Allowlist is more secure but more restrictive.
- Static vs. Dynamic Policy: Policies that don't change during execution vs. policies that can evolve based on agent behavior
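The allow-versus-deny choice is easy to see in code. An allowlist denies by default, which is why it is the more secure option; the action names below are illustrative:

```python
# Allowlist policy: anything not explicitly listed is denied by default.
# The set contents are placeholders for whatever your policy permits.
ALLOWED_ACTIONS = {
    "read_file",
    "run_tests",
    "open_pull_request",
}

def check_policy(action: str) -> bool:
    return action in ALLOWED_ACTIONS
```

A blocklist inverts the default (`action not in BLOCKED_ACTIONS`), which is more permissive and fails open on any action the policy author didn't anticipate.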
What's mature: Not much. This is the frontier.
What's emerging: Governance platforms and policy engines. Integration with observability for enforcement.
Layer 8: Observability and Monitoring
Seeing what agents do, measuring performance, debugging failures.
Current State:
- Traditional observability (logs, metrics, traces) doesn't cut it for agents
- Agent-specific observability is being built (LangSmith, WhyLabs, custom platforms)
- Observability is fragmented by platform (each agent has its own logging)
Architectural choices:
- Centralized vs. Distributed Observability: Central platform (easier to correlate across agents) vs. agent-native logging (easier for each agent to implement)
- What to Observe: Log all tool calls? Only failures? Agent reasoning? All are important but have different costs.
- Data Retention: How long to keep logs? Cheaper to delete quickly; harder to debug old issues.
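Whatever you decide to observe, the useful primitive is one structured record per tool call. The field names below are illustrative; a real setup would ship these records to a tracing backend (OpenTelemetry, LangSmith, or similar) rather than return JSON strings:

```python
import json
import time
import uuid

def log_tool_call(agent_id, tool, args, result, ok):
    """Build one structured record per tool call and serialize it as JSON.
    Truncates the result so log size stays bounded."""
    record = {
        "trace_id": str(uuid.uuid4()),  # correlate steps within one run
        "ts": time.time(),
        "agent_id": agent_id,
        "tool": tool,
        "args": args,
        "result_preview": str(result)[:200],  # cap stored payload size
        "ok": ok,
    }
    return json.dumps(record)
```

Logging the arguments and a truncated result for every call is usually cheap; the expensive decision is whether to also capture the agent's reasoning between calls.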
What's mature: Logs and basic metrics. Tracing for individual requests.
What's emerging: Agent-specific observability (decision tracing, reasoning assessment). Cross-agent visibility.
How This Maps to Traditional Development Stacks
The AI stack has parallels to traditional development stacks.
Traditional Stack (top to bottom):
- Application Logic
- Framework
- Runtime / Language
- Operating System
- Hardware
AI Stack (top to bottom):
- Observability
- Governance
- Orchestration
- Agent Frameworks
- Context Engines
- Tool Calling
- Inference Infra
- Models
The principle is the same: each layer provides abstractions to the layer above. The model layer doesn't care how inference is optimized. The tool calling layer doesn't care which model is underneath. Stacking layers lets teams focus on different concerns.
What's Mature vs. What's Still Emerging
Mature (use with confidence):
- Model APIs (Claude, GPT-4, Gemini) work reliably
- Tool calling works well enough for most cases
- Basic agent loops are solid
- Traditional observability is mature
- Vector search for context retrieval works
Approaching Mature:
- Tool calling standardization via MCP
- Agent frameworks (LangChain, LangGraph)
- Orchestration platforms (Temporal)
- Context management and RAG
Still Emerging:
- Governance and policy enforcement
- Agent-specific observability (reasoning tracing, decision quality)
- Shared context engines
- Multi-agent coordination
- Non-deterministic workflow semantics
This matters because you should only bet on immature technology if you have time to adapt. If you need stability, stick to mature layers.
The Key Architectural Decisions Teams Face
When building on the AI stack, you face these choices:
1. Hosted vs. Self-Hosted Models
Hosted (OpenAI, Anthropic, Google):
- Pros: Latest models, handles scale, no ops burden
- Cons: Vendor lock-in, recurring cost, latency, data privacy
Self-Hosted (Local Models):
- Pros: No vendor lock-in, better latency, data privacy, lower cost at scale
- Cons: Infrastructure burden, slightly lower quality, limited to smaller models
Decision: Use hosted for speed and capability. Use self-hosted for cost and privacy at scale. Most teams use hosted initially, graduate to self-hosted or hybrid as volume grows.
2. Proprietary vs. Open-Source Agents
Proprietary (Claude Code, Cursor, Copilot):
- Pros: Best in class, tight IDE integration, support
- Cons: Vendor lock-in, less visibility into behavior
Open-Source (Aider, Continue, local agents):
- Pros: No lock-in, full visibility, customizable
- Cons: Fewer capabilities, more infrastructure work
Decision: Use proprietary agents for the best capabilities. Use open-source for customization and avoiding lock-in. Most teams use multiple agents.
3. Centralized vs. Distributed Infrastructure
Centralized:
- Pros: Consistent policies, easier to manage, single source of truth
- Cons: Bottleneck, slower to innovate, one outage affects everything
Distributed:
- Pros: Flexibility, faster iteration, isolated failures
- Cons: Harder to coordinate, inconsistent policies, more complexity
Decision: Start centralized for simplicity. Move to distributed when bottlenecks appear (usually at 10+ agents).
4. Build vs. Buy for Each Layer
Buy (use existing tools):
- Faster to value
- Lower operational burden
- Less control
- Vendor lock-in risk
Build (custom infrastructure):
- Full control and customization
- Higher operational burden
- Higher upfront cost
- Better long-term flexibility
Decision: For most layers, buy first (models, inference, basic agent frameworks). Build at the layers where you have unique requirements (context management, governance, internal tools).
Where the Stack Is Heading
Near term (6-12 months):
- Models continue improving but become commoditized (GPT-4 capability becomes standard)
- Tool calling standardization via MCP accelerates
- Open-source models reach parity for specific domains (code, analysis)
- Governance and observability platforms emerge
Medium term (1-2 years):
- Inference infrastructure becomes more efficient and cheaper
- Context management becomes more sophisticated (semantic understanding of what context is relevant)
- Multi-agent coordination patterns solidify
- Shared context engines become standard (Bitloops-like infrastructure)
Long term (2+ years):
- Models are a commodity service (like cloud storage)
- Differentiation happens at the orchestration and context layers
- Agents become more specialized (different agents for different tasks)
- The stack consolidates around standards (MCP for tools, shared context engines, standardized observability)
Where Your Organization Should Invest
If you're just starting:
- Use hosted models and APIs (no infrastructure burden)
- Use existing agent frameworks (LangChain, LangGraph)
- Focus on context and agent design (garbage in, garbage out)
- Skip custom governance until you hit compliance requirements
If you have 3-5 agents:
- Consider whether you need shared context infrastructure (probably yes)
- Start thinking about observability (you'll need it)
- Use open-source orchestration (Temporal) if you need sophisticated workflows
- Invest in tool design and documentation
If you have 10+ agents:
- Build internal platform infrastructure (tool registry, orchestration, observability)
- Invest in governance and compliance automation
- Consider hybrid model strategy (hosted for breadth, self-hosted for cost)
- Build shared context layer
If you're all-in on agents:
- You're likely building custom at multiple layers (specialized models, custom context engines, domain-specific orchestration)
- You're optimizing for cost and latency
- You have dedicated teams for platform infrastructure
- You're probably using some open-source components and building the rest
The stack will commoditize bottom-up. Models will become commodity first. Then inference. Then tool calling. The valuable differentiation will move up the stack—to context management, orchestration, and governance.
Practical Guidance for Choosing Your Stack Today
Start here. This will work.
Layer 1 (Models): Use Claude, GPT-4, or Gemini via APIs.
Layer 2 (Inference): Use the hosted APIs, no infrastructure decision needed.
Layer 3 (Tool Calling): Use MCP servers when available, custom tools when necessary.
Layer 4 (Context): Start with vector search (embeddings). Graduate to Bitloops or custom when you need shared context.
Layer 5 (Agent Frameworks): Use LangChain or LangGraph. Don't build custom unless you have very specific needs.
Layer 6 (Orchestration): Start with sequential execution. Upgrade to Temporal when you need complex workflows.
Layer 7 (Governance): Implement basic policies (allowlists, permissions). Upgrade to a policy engine when you hit compliance requirements.
Layer 8 (Observability): Log everything. Use an observability platform that supports LLMs (Datadog, New Relic, or custom). Invest in agent-specific observability as you scale.
As your needs grow, you'll replace components. Maybe you swap models. Maybe you build a custom context layer. Maybe you switch from LangChain to a custom framework. That's fine. The important thing is that you understand the stack well enough to make these choices consciously.
FAQ
Should I build a custom model?
No. Not yet. The gap between best-in-class models and everything else is too large. Invest in prompt engineering and context instead. Maybe revisit this in 2 years.
Can I use multiple agents from different vendors?
Yes, if you use MCP to standardize tool definitions. Without standardization, it's a mess.
What if I choose the wrong stack?
You'll find out within 6 months. Migrating stacks is expensive but doable. Start with something simple and upgrade as needed.
How much does the stack cost?
With hosted APIs, probably $500-5,000 per month for reasonable volume (thousands of agent executions). Self-hosted could be lower, but you pay in infrastructure costs.
Which layer matters most?
Context. Everything else is secondary. Good context makes agents better; good agent frameworks can't compensate for bad context.
Should I use the same model for all agents?
Start with one model, understand it deeply, get good at prompting. Then diversify if you have specific needs (one model for reasoning, one for speed, one for cost efficiency).
How do I avoid lock-in?
Use abstraction layers (LangChain for frameworks, MCP for tools). Keep data portable. Don't let one vendor control the whole stack.
What about open-source?
Use it where it's mature (orchestration, observability, agent frameworks). Be careful with emerging components (context engines, governance layers).
Primary Sources
- MCP Specification: standard specification for connecting agents to tools via the Model Context Protocol
- OpenTelemetry Docs: documentation for OpenTelemetry instrumentation and observability in distributed systems
- Temporal Documentation: documentation for durable workflow orchestration in distributed systems
- ML Systems Design: Chip Huyen's guide to designing reliable and scalable machine learning systems
- Toolformer Paper: foundational paper on teaching language models to select and use tools during inference
- ReAct Paper: framework combining reasoning and acting for improved agent task execution