
The Modern AI Development Stack: From Models to Production Agent Infrastructure

Every AI agent runs on a stack: models, inference, tool calling, context management, orchestration, governance, observability. Most teams assemble theirs accidentally. This guide maps the layers, what's mature vs. emerging, and the architectural choices that matter.

13 min read · Updated March 4, 2026 · Agent Tooling & Infrastructure

The Stack Exists Whether You Acknowledge It Or Not

Every time you run an AI agent, you're using a software stack. It has layers—from the raw model at the bottom, through tool-calling infrastructure, context management, orchestration, governance, and monitoring at the top. Most teams don't consciously design this stack; they just fall into it by using tools that already exist.

The problem is that accidental stacks are slow and fragile. You're using Cursor because it has IDE integration, Claude Code because it has better reasoning, Anthropic's API for the underlying model, and a custom script you wrote to connect everything together. None of these components were designed to work together; they're cobbled together. Understanding tooling ecosystems helps you make better architectural choices.

Understanding the AI development stack lets you make intentional architectural choices instead of falling into whatever existed first.

The Layers of the Stack

From bottom to top:

Layer 1: Models

The foundation. The large language models that power everything.

Current State:

  • Proprietary models dominate (Claude, GPT-4, Gemini)
  • Open-source models exist but require serious infrastructure (Llama 2, Mistral, code-specific models)
  • Inference costs are dropping but still material
  • Model capabilities are becoming commoditized—soon you won't choose based on raw quality

Architectural choices:

  • Hosted vs. Local: Use hosted APIs (Anthropic, OpenAI, Google) for convenience and access to the latest models. Use local models for data privacy, lower latency, and cost at scale.
  • Single model vs. Multi-model: Use one model and get good at prompt engineering. Or use specialized models for specialized tasks (one for code generation, one for analysis, one for testing).
  • Proprietary vs. Open-Source: Proprietary models are ahead on capability but locked in. Open-source models are behind on capability but you own the infrastructure.

What's mature: Hosted APIs. They work reliably, handle scale, and you don't have to think about them.

What's emerging: Open-source models reaching parity on specific tasks (code generation, code analysis). Local inference infrastructure (vLLM, TGI) making it feasible to run models on premise.
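The hosted-vs-local choice is much easier to revisit later if the rest of your stack never imports a vendor SDK directly. A minimal sketch of that abstraction in Python; the client classes and their `complete` method are illustrative stubs, not any real SDK:

```python
from dataclasses import dataclass
from typing import Protocol


class ModelClient(Protocol):
    """The only model interface the rest of the stack sees."""
    def complete(self, prompt: str) -> str: ...


@dataclass
class HostedClient:
    """Would wrap a hosted API (Anthropic, OpenAI, Google); stubbed here."""
    model: str

    def complete(self, prompt: str) -> str:
        return f"[{self.model}] response to: {prompt}"


@dataclass
class LocalClient:
    """Would wrap a local inference engine (vLLM, Ollama); stubbed here."""
    model_path: str

    def complete(self, prompt: str) -> str:
        return f"[local:{self.model_path}] response to: {prompt}"


def answer(client: ModelClient, question: str) -> str:
    # Callers depend only on the protocol, so hosted vs. local is configuration.
    return client.complete(question)
```

With this shape, swapping providers (or going hybrid) becomes a configuration change rather than a refactor.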

Layer 2: Inference Infrastructure

Getting the model to produce outputs quickly and reliably.

Current State:

  • If you're using hosted APIs, this is handled for you
  • If you're running local models, you're using inference engines (vLLM, Ollama, Text Generation Inference)
  • Optimizations are becoming important (quantization, speculative decoding, KV cache optimization)
  • Cost is directly tied to inference efficiency

Architectural choices:

  • Edge vs. Cloud: Run inference on devices (edge) for latency and privacy, or in cloud for scale and simplicity
  • Optimization level: Standard inference vs. quantized vs. speculative decoding. More optimization = lower cost but potentially worse quality
  • Batching: Single request at a time vs. batching requests for efficiency

What's mature: Cloud hosted inference (APIs). Edge inference for certain models (smaller LLMs on devices). Batching for high-volume scenarios.

What's emerging: Speculative decoding (predicting tokens in parallel). Efficient inference frameworks that make local deployment more viable.
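Of these choices, batching is the easiest to reason about: grouping requests lets the engine amortize each forward pass across many prompts. A toy sketch of the pattern, where the `infer_batch` callable stands in for a real engine such as vLLM:

```python
from typing import Callable


def batched(items: list, size: int):
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def run_inference(prompts: list[str],
                  infer_batch: Callable[[list[str]], list[str]],
                  batch_size: int = 8) -> list[str]:
    """Group prompts so the engine amortizes each forward pass over a batch."""
    outputs: list[str] = []
    for batch in batched(prompts, batch_size):
        outputs.extend(infer_batch(batch))
    return outputs
```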

Layer 3: Tool Calling and Agent Scaffolding

How agents decide what to do and invoke tools.

Current State:

  • Every major model supports tool calling (function calling, tool use, whatever you call it)
  • Tool definitions are standardized (JSON schema)
  • The Model Context Protocol (MCP) is becoming the standard for tool integration
  • Each agent implementation has slightly different tool-calling patterns

Architectural choices:

  • In-context Tools vs. Parameter Tools: Define tools in the prompt (in-context) or in a structured format (parameter tools). Parameter is cleaner and more reliable.
  • MCP vs. Custom Tools: Use MCP for interoperability and standard tool definitions. Use custom tools if you need special semantics. See Designing Pluggable Tools for best practices.
  • Tool Validation: Validate tool arguments before execution. Don't let the agent call tools with invalid arguments.
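Tool validation is cheap to add because tool definitions are already JSON-schema-shaped. A minimal hand-rolled checker; the `search_code` tool and its schema are hypothetical, and a production system would use a full JSON Schema validator instead:

```python
SEARCH_TOOL = {
    "name": "search_code",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "max_results": {"type": "integer"},
        },
        "required": ["query"],
    },
}


def validate_args(tool: dict, args: dict) -> list[str]:
    """Reject a tool call before execution instead of letting it fail inside the tool."""
    schema = tool["parameters"]
    errors: list[str] = []
    for name in schema["required"]:
        if name not in args:
            errors.append(f"missing required argument: {name}")
    type_map = {"string": str, "integer": int}
    for name, value in args.items():
        prop = schema["properties"].get(name)
        if prop is None:
            errors.append(f"unknown argument: {name}")
        elif not isinstance(value, type_map[prop["type"]]):
            errors.append(f"{name}: expected {prop['type']}")
    return errors
```

An empty list means the call is safe to dispatch; anything else goes back to the agent as an error message it can correct.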

What's mature: Basic tool calling works reliably. MCP is stabilizing as a standard.

What's emerging: Advanced agent control (preventing agent loop thrashing, better reasoning about which tools to use). Better error handling and tool failure recovery.

Layer 4: Context Engines

Managing what information the agent knows about and can use.

Current State:

  • No standard yet (this is the gap)
  • Each agent maintains context independently
  • Context is expensive (large context windows, expensive inference on long context)
  • Retrieval-augmented generation (RAG) is the dominant pattern for adding context

Architectural choices:

  • Stateless vs. Stateful: Stateless agents are simple but require full context per request. Stateful agents maintain context but are more complex.
  • Vector Search vs. Structured Retrieval: Use semantic search (vectors, embeddings) to find relevant context. Use structured retrieval (database queries) when you know what you're looking for.
  • Context Compression: Summarize or compress context to fit in smaller windows (cheaper, faster).

What's mature: Vector search (embeddings + similarity search). Basic RAG.

What's emerging: Agentic context management (agents deciding what context they need vs. static retrieval). Shared context engines like Bitloops. Context versioning and rollback.
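The mature core of this layer, vector search, fits in a few lines: embed the query, rank stored chunks by cosine similarity, return the top k. A sketch using pre-computed toy embeddings; a real system would call an embedding model and a vector store:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec: list[float],
             store: list[tuple[str, list[float]]],
             k: int = 2) -> list[str]:
    """Rank stored (text, embedding) chunks by similarity; return the top k texts."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```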

Layer 5: Agent Frameworks

Orchestrating model invocation, tool calling, and reasoning loops.

Current State:

  • LangChain is the dominant framework (Python)
  • LangGraph adds better reasoning and planning (still Python-focused)
  • CrewAI for multi-agent orchestration (newer, less mature)
  • Many agent implementations are tied to specific IDEs (Cursor, Claude Code) rather than frameworks

Architectural choices:

  • Chain vs. Graph: Chains are simple sequences of steps. Graphs allow branching, loops, and conditional logic. Start with chains, graduate to graphs as complexity grows.
  • Function Calling vs. Reasoning Loops: Function calling means the agent immediately calls a tool. Reasoning loops mean the agent thinks before deciding. Reasoning loops produce better decisions at the cost of more tokens.
  • Abstractions Over APIs: Use framework abstractions that work with multiple models, or tightly couple to one model's API. Abstractions are more portable; tight coupling is more powerful.
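Stripped of framework machinery, the loop every agent framework implements is the same: ask the model for a decision, execute the tool it picked, feed the observation back, repeat. A minimal sketch; the `model` callable and its decision tuples are illustrative, not a real framework API:

```python
def agent_loop(model, tools: dict, task: str, max_steps: int = 5):
    """Minimal reason-act loop: each step, the model either calls a tool or finishes.

    `model` takes the history and returns ("call", tool_name, argument)
    or ("finish", answer) -- a stand-in for real tool-calling output.
    """
    history = [f"task: {task}"]
    for _ in range(max_steps):
        decision = model(history)
        if decision[0] == "finish":
            return decision[1]
        _, name, argument = decision
        observation = tools[name](argument)      # execute the chosen tool
        history.append(f"{name}({argument}) -> {observation}")
    return None  # step budget exhausted without an answer
```

The `max_steps` budget is the simplest guard against loop thrashing; frameworks add planning, memory, and error recovery on top of this skeleton.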

What's mature: Basic agent loops. LangChain and LangGraph are production-ready.

What's emerging: Better reasoning integration (chain-of-thought, tree-of-thought). Multi-agent coordination. Agent memory and learning.

Layer 6: Orchestration

Coordinating multiple agents, managing workflows, handling failures.

Current State:

  • Workflow engines (Temporal, Prefect, Airflow) exist but weren't designed for agents
  • Agent-specific orchestration is still being built
  • Most teams use simple patterns (sequential steps, no branching)

Architectural choices:

  • Deterministic vs. Non-Deterministic: Traditional workflows are deterministic (same input, same path). Agent workflows are non-deterministic (same input, different reasoning path). Handle this with built-in replay and idempotency.
  • Approval Gates: Add human review at critical points (before deploying code, before production database modifications)
  • Retry Logic: Agents sometimes fail because they got unlucky. Automatic retries with different context can help.
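Retries and non-determinism combine naturally with idempotency: cache each step's result under a key so a replayed workflow skips steps that already succeeded instead of re-running their side effects. A sketch of that pattern; the in-memory cache stands in for the durable state an engine like Temporal keeps:

```python
import functools


def idempotent_retry(max_attempts: int = 3):
    """Retry a step on failure; cache results by key so replays skip completed work."""
    results: dict = {}  # stand-in for durable workflow state

    def decorator(step):
        @functools.wraps(step)
        def wrapper(key: str, *args):
            if key in results:           # already succeeded: don't re-run side effects
                return results[key]
            last_error = None
            for _ in range(max_attempts):
                try:
                    results[key] = step(key, *args)
                    return results[key]
                except RuntimeError as err:  # treat RuntimeError as transient here
                    last_error = err
            raise last_error
        return wrapper
    return decorator
```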

What's mature: Traditional workflow orchestration (Temporal, Prefect). Using these for agent workflows is a matter of wrapping agents as tasks.

What's emerging: Agent-first orchestration (building for agent semantics rather than adapting existing tools). Better handling of agent failures and non-determinism.

Layer 7: Governance and Safety

Policies, permissions, compliance, preventing bad outcomes.

Current State:

  • Limited governance tooling. Most teams implement this per-agent.
  • Observability platforms are starting to add governance features
  • Audit and compliance are manual in most organizations
  • Security models are ad-hoc

Architectural choices:

  • Centralized vs. Distributed Policy: Central policy engine (easier to maintain) vs. policies distributed to each agent (easier to customize)
  • Allow vs. Deny: Allowlist (only allow specific actions) vs. blocklist (block known bad actions). Allowlist is more secure but more restrictive.
  • Static vs. Dynamic Policy: Policies that don't change during execution vs. policies that can evolve based on agent behavior
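Even before adopting a policy engine, an allowlist with approval gates is a few lines. A sketch; the policy table and tool names are hypothetical:

```python
POLICY = {
    "allowed_tools": {"read_file", "run_tests"},
    "needs_approval": {"deploy", "write_db"},
}


def check_action(policy: dict, tool: str, approved: bool = False) -> str:
    """Allowlist check: anything not explicitly listed is denied by default."""
    if tool in policy["allowed_tools"]:
        return "allow"
    if tool in policy["needs_approval"]:
        return "allow" if approved else "pending_approval"
    return "deny"
```

The deny-by-default final branch is what makes this an allowlist rather than a blocklist: new or unknown tools fail closed.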

What's mature: Not much. This is the frontier.

What's emerging: Governance platforms and policy engines. Integration with observability for enforcement.

Layer 8: Observability and Monitoring

Seeing what agents do, measuring performance, debugging failures.

Current State:

  • Traditional observability (logs, metrics, traces) doesn't cut it for agents
  • Agent-specific observability is being built (LangSmith, Whylabs, custom platforms)
  • Observability is fragmented by platform (each agent has its own logging)

Architectural choices:

  • Centralized vs. Distributed Observability: Central platform (easier to correlate across agents) vs. agent-native logging (easier for each agent to implement)
  • What to Observe: Log all tool calls? Only failures? Agent reasoning? All are important but have different costs.
  • Data Retention: How long to keep logs? Cheaper to delete quickly; harder to debug old issues.
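A practical starting point is wrapping every tool in a decorator that emits one structured event per call: arguments, outcome, latency. A sketch; the event shape is illustrative, not any platform's schema:

```python
import time


def traced(log: list, tool_name: str, fn):
    """Wrap a tool so every call appends a structured event: args, outcome, latency."""
    def wrapper(*args):
        start = time.perf_counter()
        status = "ok"
        try:
            return fn(*args)
        except Exception:
            status = "error"
            raise                       # observability never swallows the failure
        finally:
            log.append({
                "tool": tool_name,
                "args": list(args),
                "status": status,
                "ms": round((time.perf_counter() - start) * 1000, 2),
            })
    return wrapper
```

Shipping these events to a central sink instead of a local list is what turns per-agent logging into cross-agent visibility.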

What's mature: Logs and basic metrics. Tracing for individual requests.

What's emerging: Agent-specific observability (decision tracing, reasoning assessment). Cross-agent visibility.

How This Maps to Traditional Development Stacks

The AI stack has parallels to traditional development stacks.

Traditional Stack (top to bottom):

  • Application Logic
  • Framework
  • Runtime / Language
  • Operating System
  • Hardware

AI Stack (top to bottom):

  • Observability
  • Governance
  • Orchestration
  • Agent Frameworks
  • Context Engines
  • Tool Calling
  • Inference Infra
  • Models

The principle is the same: each layer provides abstractions to the layer above. The model layer doesn't care how inference is optimized. The tool calling layer doesn't care which model is underneath. Stacking layers lets teams focus on different concerns.

What's Mature vs. What's Still Emerging

Mature (use with confidence):

  • Model APIs (Claude, GPT-4, Gemini) work reliably
  • Tool calling works well enough for most cases
  • Basic agent loops are solid
  • Traditional observability is mature
  • Vector search for context retrieval works

Approaching Mature:

  • Tool calling standardization via MCP
  • Agent frameworks (LangChain, LangGraph)
  • Orchestration platforms (Temporal)
  • Context management and RAG

Still Emerging:

  • Governance and policy enforcement
  • Agent-specific observability (reasoning tracing, decision quality)
  • Shared context engines
  • Multi-agent coordination
  • Non-deterministic workflow semantics

This matters because you should only bet on immature technology if you have time to adapt. If you need stability, stick to mature layers.

The Key Architectural Decisions Teams Face

When building on the AI stack, you face these choices:

1. Hosted vs. Self-Hosted Models

Hosted (OpenAI, Anthropic, Google):

  • Pros: Latest models, handles scale, no ops burden
  • Cons: Vendor lock-in, recurring cost, latency, data privacy

Self-Hosted (Local Models):

  • Pros: No vendor lock-in, better latency, data privacy, lower cost at scale
  • Cons: Infrastructure burden, slightly lower quality, limited to smaller models

Decision: Use hosted for speed and capability. Use self-hosted for cost and privacy at scale. Most teams start hosted, then graduate to self-hosted or hybrid as volume grows.

2. Proprietary vs. Open-Source Agents

Proprietary (Claude Code, Cursor, Copilot):

  • Pros: Best in class, tight IDE integration, support
  • Cons: Vendor lock-in, less visibility into behavior

Open-Source (Aider, Continue, local agents):

  • Pros: No lock-in, full visibility, customizable
  • Cons: Fewer capabilities, more infrastructure work

Decision: Use proprietary agents for the best capabilities. Use open-source for customization and avoiding lock-in. Most teams use multiple agents.

3. Centralized vs. Distributed Infrastructure

Centralized:

  • Pros: Consistent policies, easier to manage, single source of truth
  • Cons: Bottleneck, slower to innovate, one outage affects everything

Distributed:

  • Pros: Flexibility, faster iteration, isolated failures
  • Cons: Harder to coordinate, inconsistent policies, more complexity

Decision: Start centralized for simplicity. Move to distributed when bottlenecks appear (usually at 10+ agents).

4. Build vs. Buy for Each Layer

Buy (use existing tools):

  • Faster to value
  • Lower operational burden
  • Less control
  • Vendor lock-in risk

Build (custom infrastructure):

  • Full control and customization
  • Higher operational burden
  • Higher upfront cost
  • Better long-term flexibility

Decision: For most layers, buy first (models, inference, basic agent frameworks). Build at the layers where you have unique requirements (context management, governance, internal tools).

Where the Stack Is Heading

Near term (6-12 months):

  • Models continue improving but become commoditized (GPT-4 capability becomes standard)
  • Tool calling standardization via MCP accelerates
  • Open-source models reach parity for specific domains (code, analysis)
  • Governance and observability platforms emerge

Medium term (1-2 years):

  • Inference infrastructure becomes more efficient and cheaper
  • Context management becomes more sophisticated (semantic understanding of what context is relevant)
  • Multi-agent coordination patterns solidify
  • Shared context engines become standard (Bitloops-like infrastructure)

Long term (2+ years):

  • Models are a commodity service (like cloud storage)
  • Differentiation happens at the orchestration and context layers
  • Agents become more specialized (different agents for different tasks)
  • The stack consolidates around standards (MCP for tools, shared context engines, standardized observability)

Where Your Organization Should Invest

If you're just starting:

  • Use hosted models and APIs (no infrastructure burden)
  • Use existing agent frameworks (LangChain, LangGraph)
  • Focus on context and agent design (garbage in, garbage out)
  • Skip custom governance until you hit compliance requirements

If you have 3-5 agents:

  • Consider whether you need shared context infrastructure (probably yes)
  • Start thinking about observability (you'll need it)
  • Use open-source orchestration (Temporal) if you need sophisticated workflows
  • Invest in tool design and documentation

If you have 10+ agents:

  • Build internal platform infrastructure (tool registry, orchestration, observability)
  • Invest in governance and compliance automation
  • Consider hybrid model strategy (hosted for breadth, self-hosted for cost)
  • Build shared context layer

If you're all-in on agents:

  • You're likely building custom at multiple layers (specialized models, custom context engines, domain-specific orchestration)
  • You're optimizing for cost and latency
  • You have dedicated teams for platform infrastructure
  • You're probably using some open-source components and building the rest

The stack will commoditize bottom-up. Models will become commodity first. Then inference. Then tool calling. The valuable differentiation will move up the stack—to context management, orchestration, and governance.

Practical Guidance for Choosing Your Stack Today

Start here. This will work.

Layer 1 (Models): Use Claude, GPT-4, or Gemini via APIs.

Layer 2 (Inference): Use the hosted APIs, no infrastructure decision needed.

Layer 3 (Tool Calling): Use MCP servers when available, custom tools when necessary.

Layer 4 (Context): Start with vector search (embeddings). Graduate to Bitloops or custom when you need shared context.

Layer 5 (Agent Frameworks): Use LangChain or LangGraph. Don't build custom unless you have very specific needs.

Layer 6 (Orchestration): Start with sequential execution. Upgrade to Temporal when you need complex workflows.

Layer 7 (Governance): Implement basic policies (allowlists, permissions). Upgrade to a policy engine when you hit compliance requirements.

Layer 8 (Observability): Log everything. Use an observability platform that supports LLMs (Datadog, New Relic, or custom). Invest in agent-specific observability as you scale.

As your needs grow, you'll replace components. Maybe you swap models. Maybe you build a custom context layer. Maybe you switch from LangChain to a custom framework. That's fine. The important thing is that you understand the stack well enough to make these choices consciously.

FAQ

Should I build a custom model?

No. Not yet. The gap between best-in-class models and everything else is too large. Invest in prompt engineering and context instead. Maybe revisit this in 2 years.

Can I use multiple agents from different vendors?

Yes, if you use MCP to standardize tool definitions. Without standardization, it's a mess.

What if I choose the wrong stack?

You'll find out within 6 months. Migrating stacks is expensive but doable. Start with something simple and upgrade as needed.

How much does the stack cost?

With hosted APIs, probably $500-5,000 per month for reasonable volume (thousands of agent executions). Self-hosted could be lower, but you pay in infrastructure costs.

Which layer matters most?

Context. Everything else is secondary. Good context makes agents better, and good agent frameworks can't compensate for bad context.

Should I use the same model for all agents?

Start with one model, understand it deeply, get good at prompting. Then diversify if you have specific needs (one model for reasoning, one for speed, one for cost efficiency).

How do I avoid lock-in?

Use abstraction layers (LangChain for frameworks, MCP for tools). Keep data portable. Don't let one vendor control the whole stack.

What about open-source?

Use it where it's mature (orchestration, observability, agent frameworks). Be careful with emerging components (context engines, governance layers).

Primary Sources

  • Standard specification for connecting agents to tools via the Model Context Protocol. MCP Specification
  • Documentation for OpenTelemetry instrumentation and observability in distributed systems. OpenTelemetry Docs
  • Temporal documentation for durable workflow orchestration in distributed systems. Temporal Documentation
  • Chip Huyen's guide to designing reliable and scalable machine learning systems. ML Systems Design
  • Foundational paper on teaching language models to select and use tools during inference. Toolformer Paper
  • ReAct framework combining reasoning and acting for improved agent task execution. ReAct Paper

Get Started with Bitloops.

Apply what you learn in these hubs to real AI-assisted delivery workflows with shared context, traceable reasoning, and architecture-aware engineering practices.

curl -sSL https://bitloops.com/install.sh | bash