The Modern AI Development Stack: From Models to Production Agent Infrastructure
Every AI agent runs on a stack: models, inference, tool calling, context management, orchestration, governance, observability. Most teams assemble theirs accidentally. This article maps the layers, what's mature versus what's emerging, and the architectural choices that matter.
The Stack Exists Whether You Acknowledge It Or Not
Every time you run an AI agent, you're using a software stack. It has layers—from the raw model at the bottom, through tool-calling infrastructure, context management, orchestration, governance, and monitoring at the top. Most teams don't consciously design this stack; they just fall into it by using tools that already exist.
The problem is that accidental stacks are slow and fragile. You're using Cursor because it has IDE integration, Claude Code because it has better reasoning, Anthropic's API for the underlying model, and a custom script you wrote to connect everything. None of these components were designed to work together; they're cobbled together.
Understanding the AI development stack lets you make intentional architectural choices instead of falling into whatever existed first.
The Layers of the Stack
From bottom to top:
Layer 1: Models
The foundation. The large language models that power everything.
Current State:
- Proprietary models dominate (Claude, GPT-4, Gemini)
- Open-source models exist but require serious infrastructure (Llama 2, Mistral, code-specific models)
- Inference costs are dropping but still material
- Model capabilities are becoming commoditized—soon you won't choose based on raw quality
Architectural choices:
- Hosted vs. Local: Use hosted APIs (Anthropic, OpenAI, Google) for convenience and access to the latest models. Use local models for data privacy, lower latency, and cost at scale.
- Single model vs. Multi-model: Use one model and get good at prompt engineering. Or use specialized models for specialized tasks (one for code generation, one for analysis, one for testing).
- Proprietary vs. Open-Source: Proprietary models are ahead on capability but locked in. Open-source models are behind on capability but you own the infrastructure.
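The multi-model choice above can be sketched as a simple router that dispatches each task type to a model suited for it. The model names, the routing table, and the `call_model` stub are all illustrative, not a real vendor API:

```python
# Hypothetical routing table: task type -> model name. The names here are
# placeholders; a real table would hold actual model identifiers.
ROUTES = {
    "codegen": "code-model-large",
    "analysis": "general-model-large",
    "testing": "general-model-small",
}

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real API client call (e.g. an Anthropic or OpenAI SDK).
    return f"[{model}] response to: {prompt}"

def route(task_type: str, prompt: str) -> str:
    # Unknown task types fall back to a general-purpose model.
    model = ROUTES.get(task_type, "general-model-large")
    return call_model(model, prompt)
```

Starting with a single-model table and adding rows only when a task demonstrably needs a different model keeps the complexity of multi-model setups proportional to the benefit.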
What's mature: Hosted APIs. They work reliably, handle scale, and you don't think about it.
What's emerging: Open-source models reaching parity on specific tasks (code generation, code analysis). Local inference infrastructure (vLLM, TGI) making it feasible to run models on-premises.
Layer 2: Inference Infrastructure
Getting the model to produce outputs quickly and reliably.
Current State:
- If you're using hosted APIs, this is handled for you
- If you're running local models, you're using inference engines (vLLM, Ollama, Text Generation Inference)
- Optimizations are becoming important (quantization, speculative decoding, KV cache optimization)
- Cost is directly tied to inference efficiency
Architectural choices:
- Edge vs. Cloud: Run inference on devices (edge) for latency and privacy, or in cloud for scale and simplicity
- Optimization level: Standard inference vs. quantized vs. speculative decoding. More optimization = lower cost but potentially worse quality
- Batching: Single request at a time vs. batching requests for efficiency
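The batching choice can be illustrated with a minimal buffer that collects prompts and sends them as one batch once full. `send_batch` is a stand-in for a real batched-inference call (for example, handing a list of prompts to a vLLM engine in one pass); the buffer size is arbitrary:

```python
class RequestBatcher:
    """Collect prompts and send them as one batch once the buffer is full.
    send_batch is a placeholder for a real batched-inference call."""

    def __init__(self, send_batch, max_size=8):
        self.send_batch = send_batch
        self.max_size = max_size
        self.buffer = []

    def submit(self, prompt):
        self.buffer.append(prompt)
        if len(self.buffer) >= self.max_size:
            return self.flush()
        return None  # still buffering

    def flush(self):
        # Send whatever has accumulated, even a partial batch.
        if not self.buffer:
            return []
        batch, self.buffer = self.buffer, []
        return self.send_batch(batch)
```

Production batchers also flush on a time window so a lone request never waits forever; that's omitted here to keep the sketch short.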
What's mature: Cloud hosted inference (APIs). Edge inference for certain models (smaller LLMs on devices). Batching for high-volume scenarios.
What's emerging: Speculative decoding (predicting tokens in parallel). Efficient inference frameworks that make local deployment more viable.
Layer 3: Tool Calling and Agent Scaffolding
How agents decide what to do and invoke tools.
Current State:
- Every major model supports tool calling (function calling, tool use, whatever you call it)
- Tool definitions are standardized (JSON schema)
- The Model Context Protocol (MCP) is becoming the standard for tool integration
- Each agent implementation has slightly different tool-calling patterns
Architectural choices:
- In-context Tools vs. Parameter Tools: Define tools in the prompt (in-context) or in a structured format (parameter tools). Parameter is cleaner and more reliable.
- MCP vs. Custom Tools: Use MCP for interoperability and standard tool definitions. Use custom tools if you need special semantics. See Designing Pluggable Tools for best practices.
- Tool Validation: Validate tool arguments before execution. Don't let the agent call tools with invalid arguments.
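Tool validation can be as simple as checking arguments against the tool's schema before executing anything. This sketch uses a JSON-schema-like definition with a hypothetical `read_file` tool; a production setup would use a full JSON Schema validator:

```python
# Illustrative tool definition: required/optional parameters with basic types.
TOOL_SCHEMA = {
    "name": "read_file",
    "parameters": {
        "path": {"type": str, "required": True},
        "max_bytes": {"type": int, "required": False},
    },
}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of validation errors; empty list means the call is safe
    to execute. Checks required fields, basic types, and unknown arguments."""
    errors = []
    params = schema["parameters"]
    for name, spec in params.items():
        if spec["required"] and name not in args:
            errors.append(f"missing required argument: {name}")
        elif name in args and not isinstance(args[name], spec["type"]):
            errors.append(f"wrong type for {name}")
    for name in args:
        if name not in params:
            errors.append(f"unknown argument: {name}")
    return errors
```

Rejecting the call and feeding the error list back to the model usually lets it correct its own arguments on the next attempt.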
What's mature: Basic tool calling works reliably. MCP is stabilizing as a standard.
What's emerging: Advanced agent control (preventing agent loop thrashing, better reasoning about which tools to use). Better error handling and tool failure recovery.
Layer 4: Context Engines
Managing what information the agent knows about and can use.
Current State:
- No standard yet (this is the gap)
- Each agent maintains context independently
- Context is expensive (large context windows, expensive inference on long context)
- Retrieval-augmented generation (RAG) is the dominant pattern for adding context
Architectural choices:
- Stateless vs. Stateful: Stateless agents are simple but require full context per request. Stateful agents maintain context but are more complex.
- Vector Search vs. Structured Retrieval: Use semantic search (vectors, embeddings) to find relevant context. Use structured retrieval (database queries) when you know what you're looking for.
- Context Compression: Summarize or compress context to fit in smaller windows (cheaper, faster).
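Vector search reduces to one operation: score every stored chunk against the query embedding and keep the top matches. The toy two-dimensional embeddings below are illustrative; in practice the vectors come from an embedding model and the search runs in a vector database:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k=2):
    """corpus: list of (text, embedding) pairs. Returns the k texts whose
    embeddings are most similar to the query embedding."""
    scored = sorted(corpus, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]
```

The retrieved texts are then concatenated into the prompt; context compression means summarizing them first so more chunks fit in the window.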
What's mature: Vector search (embeddings + similarity search). Basic RAG.
What's emerging: Agentic context management (agents deciding what context they need vs. static retrieval). Shared context engines like Bitloops. Context versioning and rollback.
Layer 5: Agent Frameworks
Orchestrating model invocation, tool calling, and reasoning loops.
Current State:
- LangChain is the dominant framework (Python)
- LangGraph adds better reasoning and planning (still Python-focused)
- CrewAI for multi-agent orchestration (newer, less mature)
- Many agent implementations are tied to specific IDEs (Cursor, Claude Code) rather than frameworks
Architectural choices:
- Chain vs. Graph: Chains are simple sequences of steps. Graphs allow branching, loops, and conditional logic. Start with chains, graduate to graphs as complexity grows.
- Function Calling vs. Reasoning Loops: Function calling means the agent immediately calls a tool. Reasoning loops mean the agent thinks before deciding. Reasoning loops produce better decisions at the cost of more tokens.
- Abstractions Over APIs: Use framework abstractions that work with multiple models, or tightly couple to one model's API. Abstractions are more portable; tight coupling is more powerful.
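A reasoning loop is just this cycle: the model decides, the runtime either executes a tool and feeds the result back or stops with an answer. In this sketch `fake_model` is a hand-written stand-in for a real LLM call, so the example stays runnable; the tool and its output are equally illustrative:

```python
def fake_model(history):
    # Stand-in for an LLM call: decide the next action from what the
    # agent has observed so far.
    if not any(step[0] == "tool_result" for step in history):
        return {"action": "call_tool", "tool": "lookup", "args": {"q": "status"}}
    return {"action": "finish", "answer": "status is green"}

TOOLS = {"lookup": lambda q: f"result for {q}"}

def run_agent(max_steps=5):
    history = []
    for _ in range(max_steps):
        decision = fake_model(history)
        if decision["action"] == "finish":
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])
        history.append(("tool_result", result))
    return "gave up after max_steps"  # guard against loop thrashing
```

The `max_steps` guard is the minimal defense against the agent loop thrashing mentioned above; frameworks like LangGraph add branching and state on top of this same cycle.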
What's mature: Basic agent loops. LangChain and LangGraph are production-ready.
What's emerging: Better reasoning integration (chain-of-thought, tree-of-thought). Multi-agent coordination. Agent memory and learning.
Layer 6: Orchestration
Coordinating multiple agents, managing workflows, handling failures.
Current State:
- Workflow engines (Temporal, Prefect, Airflow) exist but weren't designed for agents
- Agent-specific orchestration is still being built
- Most teams use simple patterns (sequential steps, no branching)
Architectural choices:
- Deterministic vs. Non-Deterministic: Traditional workflows are deterministic (same input, same path). Agent workflows are non-deterministic (same input, different reasoning path). Handle this with built-in replay and idempotency.
- Approval Gates: Add human review at critical points (before deploying code, before production database modifications)
- Retry Logic: Agents sometimes fail because they got unlucky. Automatic retries with different context can help.
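Retries and approval gates compose naturally: wrap the flaky step in a retry loop, and route critical actions through an approval check first. Both `run_step` and `approver` here are stand-ins: any agent task that may fail, and any human or policy check, respectively:

```python
def run_with_retries(run_step, attempts=3):
    """Call run_step(attempt) up to `attempts` times, re-raising only after
    the final failure. A real version might vary the context per attempt."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_step(attempt)
        except RuntimeError as exc:
            last_error = exc  # the agent "got unlucky"; try again
    raise RuntimeError(f"all {attempts} attempts failed") from last_error

def gated(action, approver):
    # Approval gate: the action only executes if the approver signs off.
    if not approver(action):
        raise PermissionError(f"action rejected: {action}")
    return f"executed: {action}"
```

In a real workflow engine the approver would be a human-in-the-loop task that pauses the workflow; here it is reduced to a predicate for clarity.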
What's mature: Traditional workflow orchestration (Temporal, Prefect). Using these for agent workflows is a matter of wrapping agents as tasks.
What's emerging: Agent-first orchestration (building for agent semantics rather than adapting existing tools). Better handling of agent failures and non-determinism.
Layer 7: Governance and Safety
Policies, permissions, compliance, preventing bad outcomes.
Current State:
- Limited governance tooling. Most teams implement this per-agent.
- Observability platforms are starting to add governance features
- Audit and compliance are manual in most organizations
- Security models are ad-hoc
Architectural choices:
- Centralized vs. Distributed Policy: Central policy engine (easier to maintain) vs. policies distributed to each agent (easier to customize)
- Allow vs. Deny: Allowlist (only allow specific actions) vs. blocklist (block known bad actions). Allowlist is more secure but more restrictive.
- Static vs. Dynamic Policy: Policies that don't change during execution vs. policies that can evolve based on agent behavior
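The allow-versus-deny choice is easy to see in code. An allowlist denies by default, which is why it is the more secure option; the action names below are illustrative:

```python
# Allowlist policy: anything not explicitly listed is denied by default.
# The set contents are placeholders for whatever your policy permits.
ALLOWED_ACTIONS = {
    "read_file",
    "run_tests",
    "open_pull_request",
}

def check_policy(action: str) -> bool:
    return action in ALLOWED_ACTIONS
```

A blocklist inverts the default (`action not in BLOCKED_ACTIONS`), which is more permissive and fails open on any action the policy author didn't anticipate.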
What's mature: Not much. This is the frontier.
What's emerging: Governance platforms and policy engines. Integration with observability for enforcement.
Layer 8: Observability and Monitoring
Seeing what agents do, measuring performance, debugging failures.
Current State:
- Traditional observability (logs, metrics, traces) doesn't cut it for agents
- Agent-specific observability is being built (LangSmith, WhyLabs, custom platforms)
- Observability is fragmented by platform (each agent has its own logging)
Architectural choices:
- Centralized vs. Distributed Observability: Central platform (easier to correlate across agents) vs. agent-native logging (easier for each agent to implement)
- What to Observe: Log all tool calls? Only failures? Agent reasoning? All are important but have different costs.
- Data Retention: How long to keep logs? Cheaper to delete quickly; harder to debug old issues.
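Whatever you decide to observe, the useful primitive is one structured record per tool call. The field names below are illustrative; a real setup would ship these records to a tracing backend (OpenTelemetry, LangSmith, or similar) rather than return JSON strings:

```python
import json
import time
import uuid

def log_tool_call(agent_id, tool, args, result, ok):
    """Build one structured record per tool call and serialize it as JSON.
    Truncates the result so log size stays bounded."""
    record = {
        "trace_id": str(uuid.uuid4()),  # correlate steps within one run
        "ts": time.time(),
        "agent_id": agent_id,
        "tool": tool,
        "args": args,
        "result_preview": str(result)[:200],  # cap stored payload size
        "ok": ok,
    }
    return json.dumps(record)
```

Logging the arguments and a truncated result for every call is usually cheap; the expensive decision is whether to also capture the agent's reasoning between calls.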
What's mature: Logs and basic metrics. Tracing for individual requests.
What's emerging: Agent-specific observability (decision tracing, reasoning assessment). Cross-agent visibility.
How This Maps to Traditional Development Stacks
The AI stack has parallels to traditional development stacks.
Traditional Stack (top to bottom):
- Application Logic
- Framework
- Runtime / Language
- Operating System
- Hardware
AI Stack (top to bottom):
- Observability
- Governance
- Orchestration
- Agent Frameworks
- Context Engines
- Tool Calling
- Inference Infra
- Models
The principle is the same: each layer provides abstractions to the layer above. The model layer doesn't care how inference is optimized. The tool calling layer doesn't care which model is underneath. Stacking layers lets teams focus on different concerns.
What's Mature vs. What's Still Emerging
Mature (use with confidence):
- Model APIs (Claude, GPT-4, Gemini) work reliably
- Tool calling works well enough for most cases
- Basic agent loops are solid
- Traditional observability is mature
- Vector search for context retrieval works
Approaching Mature:
- Tool calling standardization via MCP
- Agent frameworks (LangChain, LangGraph)
- Orchestration platforms (Temporal)
- Context management and RAG
Still Emerging:
- Governance and policy enforcement
- Agent-specific observability (reasoning tracing, decision quality)
- Shared context engines
- Multi-agent coordination
- Non-deterministic workflow semantics
This matters because you should only bet on immature technology if you have time to adapt. If you need stability, stick to mature layers.
The Key Architectural Decisions Teams Face
When building on the AI stack, you face these choices:
1. Hosted vs. Self-Hosted Models
Hosted (OpenAI, Anthropic, Google):
- Pros: Latest models, handles scale, no ops burden
- Cons: Vendor lock-in, recurring cost, latency, data privacy
Self-Hosted (Local Models):
- Pros: No vendor lock-in, better latency, data privacy, lower cost at scale
- Cons: Infrastructure burden, slightly lower quality, limited to smaller models
Decision: Use hosted for speed and capability. Use self-hosted for cost and privacy at scale. Most teams use hosted initially, graduate to self-hosted or hybrid as volume grows.
2. Proprietary vs. Open-Source Agents
Proprietary (Claude Code, Cursor, Copilot):
- Pros: Best in class, tight IDE integration, support
- Cons: Vendor lock-in, less visibility into behavior
Open-Source (Aider, Continue, local agents):
- Pros: No lock-in, full visibility, customizable
- Cons: Fewer capabilities, more infrastructure work
Decision: Use proprietary agents for the best capabilities. Use open-source for customization and avoiding lock-in. Most teams use multiple agents.
3. Centralized vs. Distributed Infrastructure
Centralized:
- Pros: Consistent policies, easier to manage, single source of truth
- Cons: Bottleneck, slower to innovate, one outage affects everything
Distributed:
- Pros: Flexibility, faster iteration, isolated failures
- Cons: Harder to coordinate, inconsistent policies, more complexity
Decision: Start centralized for simplicity. Move to distributed when bottlenecks appear (usually at 10+ agents).
4. Build vs. Buy for Each Layer
Buy (use existing tools):
- Faster to value
- Lower operational burden
- Less control
- Vendor lock-in risk
Build (custom infrastructure):
- Full control and customization
- Higher operational burden
- Higher upfront cost
- Better long-term flexibility
Decision: For most layers, buy first (models, inference, basic agent frameworks). Build at the layers where you have unique requirements (context management, governance, internal tools).
Where the Stack Is Heading
Near term (6-12 months):
- Models continue improving but become commoditized (GPT-4 capability becomes standard)
- Tool calling standardization via MCP accelerates
- Open-source models reach parity for specific domains (code, analysis)
- Governance and observability platforms emerge
Medium term (1-2 years):
- Inference infrastructure becomes more efficient and cheaper
- Context management becomes more sophisticated (semantic understanding of what context is relevant)
- Multi-agent coordination patterns solidify
- Shared context engines become standard (Bitloops-like infrastructure)
Long term (2+ years):
- Models are a commodity service (like cloud storage)
- Differentiation happens at the orchestration and context layers
- Agents become more specialized (different agents for different tasks)
- The stack consolidates around standards (MCP for tools, shared context engines, standardized observability)
Where Your Organization Should Invest
If you're just starting:
- Use hosted models and APIs (no infrastructure burden)
- Use existing agent frameworks (LangChain, LangGraph)
- Focus on context and agent design (garbage in, garbage out)
- Skip custom governance until you hit compliance requirements
If you have 3-5 agents:
- Consider whether you need shared context infrastructure (probably yes)
- Start thinking about observability (you'll need it)
- Use open-source orchestration (Temporal) if you need sophisticated workflows
- Invest in tool design and documentation
If you have 10+ agents:
- Build internal platform infrastructure (tool registry, orchestration, observability)
- Invest in governance and compliance automation
- Consider hybrid model strategy (hosted for breadth, self-hosted for cost)
- Build shared context layer
If you're all-in on agents:
- You're likely building custom at multiple layers (specialized models, custom context engines, domain-specific orchestration)
- You're optimizing for cost and latency
- You have dedicated teams for platform infrastructure
- You're probably using some open-source components and building the rest
The stack will commoditize bottom-up. Models will become commodity first. Then inference. Then tool calling. The valuable differentiation will move up the stack—to context management, orchestration, and governance.
Practical Guidance for Choosing Your Stack Today
Start here. This will work.
Layer 1 (Models): Use Claude, GPT-4, or Gemini via APIs.
Layer 2 (Inference): Use the hosted APIs, no infrastructure decision needed.
Layer 3 (Tool Calling): Use MCP servers when available, custom tools when necessary.
Layer 4 (Context): Start with vector search (embeddings). Graduate to Bitloops or custom when you need shared context.
Layer 5 (Agent Frameworks): Use LangChain or LangGraph. Don't build custom unless you have very specific needs.
Layer 6 (Orchestration): Start with sequential execution. Upgrade to Temporal when you need complex workflows.
Layer 7 (Governance): Implement basic policies (allowlists, permissions). Upgrade to a policy engine when you hit compliance requirements.
Layer 8 (Observability): Log everything. Use an observability platform that supports LLMs (Datadog, New Relic, or custom). Invest in agent-specific observability as you scale.
As your needs grow, you'll replace components. Maybe you swap models. Maybe you build a custom context layer. Maybe you switch from LangChain to a custom framework. That's fine. The important thing is that you understand the stack well enough to make these choices consciously.
FAQ
Should I build a custom model?
No. Not yet. The gap between best-in-class models and everything else is too large. Invest in prompt engineering and context instead. Maybe revisit this in 2 years.
Can I use multiple agents from different vendors?
Yes, if you use MCP to standardize tool definitions. Without standardization, it's a mess.
What if I choose the wrong stack?
You'll find out within 6 months. Migrating stacks is expensive but doable. Start with something simple and upgrade as needed.
How much does the stack cost?
With hosted APIs, probably $500-5,000 per month for reasonable volume (thousands of agent executions). Self-hosted could be lower, but you pay in infrastructure costs.
Which layer matters most?
Context. Everything else is secondary. Good context makes agents better; good agent frameworks can't compensate for bad context.
Should I use the same model for all agents?
Start with one model, understand it deeply, get good at prompting. Then diversify if you have specific needs (one model for reasoning, one for speed, one for cost efficiency).
How do I avoid lock-in?
Use abstraction layers (LangChain for frameworks, MCP for tools). Keep data portable. Don't let one vendor control the whole stack.
What about open-source?
Use it where it's mature (orchestration, observability, agent frameworks). Be careful with emerging components (context engines, governance layers).
Primary Sources
- MCP Specification: standard specification for connecting agents to tools via the Model Context Protocol
- OpenTelemetry Docs: documentation for OpenTelemetry instrumentation and observability in distributed systems
- Temporal Documentation: documentation for durable workflow orchestration in distributed systems
- ML Systems Design: Chip Huyen's guide to designing reliable and scalable machine learning systems
- Toolformer Paper: foundational paper on teaching language models to select and use tools during inference
- ReAct Paper: framework combining reasoning and acting for improved agent task execution