From Experiment to Infrastructure: Building Internal Agent Platforms
Buying works for 1-3 simple agents. At 5+ agents with overlapping needs, you'll likely build. This covers the architecture: tool registry, context management, orchestration, governance. When to build it, what goes in it, how to avoid the pitfalls.
When to Build vs. Buy (And Why Most Teams Get This Wrong)
Here's what happens: your team builds an AI agent for one task. It works great. Someone asks, "can we build an agent for X?" and "an agent for Y?" Suddenly you need five agents, and they're doing overlapping things, and nobody can explain how they work together, and your infrastructure is chaos.
At this point, you face a choice. You can buy a platform (hosted or open-source), or you can build your own infrastructure. Most teams think "we're engineers, we can build this." Some teams are right. Most are wrong.
When to buy:
- You have fewer than 3 agents
- Your agents have simple requirements (one tool, one context type)
- Your compliance and security requirements are standard
- You want something working today, not 6 months from now
- You don't want operational overhead
When to build:
- You have 5+ agents or plans for many more
- Your agents need to cooperate on shared tasks
- Your compliance requirements are unusual (air-gapped, highly regulated)
- Your infrastructure has constraints that off-the-shelf solutions don't accommodate
- You have a dedicated platform team with 2-3 engineers
Most teams should buy (or start with buying and transition to building later). But if you're serious about agents as infrastructure, you'll eventually need to build. Let's talk about what that looks like.
What an Internal Agent Platform Looks Like
An internal platform has several key components:
1. Tool Registry
Your agents need to know what tools exist and how to use them. The registry is the source of truth.
Tool Registry Entry:
{
name: "execute_python",
description: "Execute Python code in a sandboxed environment",
parameters: {
code: "The Python code to execute (string, required)",
timeout: "Max execution time in seconds (int, default 30)",
packages: "List of pip packages to install (array)"
},
permissions_required: ["code_execution"],
sandbox_config: {
cpu_limit: 2,
memory_limit: "4GB",
network: "none"
},
audit: true,
cost_per_call: 0.01,
owner: "platform-team",
version: "1.2",
deprecated: false
}
The registry tracks:
- What the tool does and how to use it
- What permissions are required
- What resources it uses
- Whether it's sandboxed and how
- Cost and audit requirements
- Who owns it and whether it's maintained
Agents look up tools in the registry to know what's available and how to call them. This prevents tool confusion, standardizes interfaces, and gives you a single place to manage deprecation and versioning.
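A registry can start as little more than a dictionary keyed by tool name, with a lookup that checks permissions and deprecation before an agent ever sees the tool. A minimal sketch (the class and field names here are illustrative, not a real API):

```python
from dataclasses import dataclass, field

@dataclass
class ToolEntry:
    name: str
    description: str
    permissions_required: list = field(default_factory=list)
    deprecated: bool = False

class ToolRegistry:
    """Single source of truth for which tools exist and who may call them."""
    def __init__(self):
        self._tools = {}

    def register(self, entry: ToolEntry):
        self._tools[entry.name] = entry

    def lookup(self, name: str, agent_permissions: set) -> ToolEntry:
        entry = self._tools.get(name)
        if entry is None or entry.deprecated:
            raise KeyError(f"tool not available: {name}")
        missing = set(entry.permissions_required) - agent_permissions
        if missing:
            raise PermissionError(f"missing permissions: {missing}")
        return entry

registry = ToolRegistry()
registry.register(ToolEntry(
    name="execute_python",
    description="Execute Python code in a sandboxed environment",
    permissions_required=["code_execution"],
))
tool = registry.lookup("execute_python", agent_permissions={"code_execution"})
```

The same lookup path is where you later hang cost tracking and audit logging, so centralizing it early pays off.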
2. Agent Orchestration
When multiple agents are working on related tasks, you need a way to coordinate them.
Workflow: "refactor_and_test"
Step 1: refactor_agent
Input: codebase
Task: "Refactor the authentication module"
Output: modified_code
Step 2: test_agent
Input: modified_code (from step 1)
Task: "Write tests for the refactored code"
Output: test_cases
Step 3: human_review
Input: [refactored code, test cases] (from steps 1-2)
Task: "Review both and approve or request changes"
Output: approval / feedback
Step 4: merge_agent (conditional on step 3)
Input: [modified_code, test_cases] (from steps 1-2)
Task: "Merge the code and tests into the repository"
Output: commit_hash
The orchestration engine handles:
- Sequential execution (step 2 waits for step 1)
- Conditional logic (only merge if approved)
- Data flow (pass outputs from one step to the next)
- Error handling (what if the agent fails?)
- Monitoring (track progress, retry logic)
Without orchestration, coordinating multiple agents is manual and error-prone.
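The workflow above can be sketched as a small sequential runner with conditional steps and retries. This is a toy illustration with lambda stand-ins for agents, not how a production engine like Temporal works:

```python
def run_workflow(steps, context):
    """Run steps in order; each step reads and writes the shared context."""
    for step in steps:
        # Conditional logic: skip e.g. the merge step if review was rejected.
        if "condition" in step and not step["condition"](context):
            continue
        # Error handling: retry up to step["retries"] times, then re-raise.
        retries = step.get("retries", 1)
        for attempt in range(retries):
            try:
                context[step["output"]] = step["run"](context)
                break
            except Exception:
                if attempt == retries - 1:
                    raise
    return context

# Toy stand-ins for agents; real steps would invoke models and tools.
steps = [
    {"run": lambda c: "refactored " + c["codebase"], "output": "modified_code"},
    {"run": lambda c: "tests for " + c["modified_code"], "output": "test_cases"},
    {"run": lambda c: True, "output": "approved"},  # human review stub
    {"run": lambda c: "abc123", "output": "commit_hash",
     "condition": lambda c: c["approved"], "retries": 3},
]
result = run_workflow(steps, {"codebase": "auth module"})
```

Data flow falls out naturally: each step writes its output into the shared context that the next step reads.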
3. Shared Memory / Context Layer
All agents need access to shared context: the codebase, recent decisions, shared data.
Shared Context:
{
project: "auth-refactor",
files: [
{ path: "src/auth/login.py", state: "modified_by: refactor_agent", version: 3 },
{ path: "src/auth/tokens.py", state: "unchanged", version: 1 }
],
decisions: [
{ timestamp: "2026-03-04T10:23:00Z", agent: "refactor_agent", decision: "Move JWT validation to separate module", status: "implemented" }
],
status: "in_progress",
next_step: "test_agent"
}
The shared context:
- Prevents agents from working on stale information
- Lets agents learn from each other's decisions
- Provides a single source of truth for the current state
- Enables rollback and recovery
This is where tools like Bitloops come in. Instead of each agent maintaining its own context, there's a centralized context engine that all agents read and write to. This becomes even more important when managing multi-agent collaboration at scale.
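One way to keep agents off stale information is per-key versioning with optimistic writes: an agent must present the version it read, and the store rejects writes based on outdated state. A minimal in-memory sketch (a real context layer would persist to a database; this is not the Bitloops API):

```python
import threading

class SharedContext:
    """Single source of truth with per-key versions; rejects stale writes."""
    def __init__(self):
        self._data, self._versions = {}, {}
        self._lock = threading.Lock()

    def read(self, key):
        with self._lock:
            return self._data.get(key), self._versions.get(key, 0)

    def write(self, key, value, expected_version):
        with self._lock:
            current = self._versions.get(key, 0)
            if expected_version != current:
                raise RuntimeError(
                    f"stale write to {key}: wrote against v{expected_version}, "
                    f"store is at v{current}")
            self._data[key] = value
            self._versions[key] = current + 1
            return current + 1

ctx = SharedContext()
v = ctx.write("src/auth/login.py", "refactored contents", expected_version=0)
# A second agent that read v0 earlier now fails loudly instead of silently
# clobbering the refactor_agent's work.
```

The version history also gives you the rollback and recovery hooks mentioned above almost for free.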
4. Governance and Compliance
As you scale from one agent to five to fifty, you need policies that apply across all of them.
Policy: code_execution_limits
Applies To: all agents with "code_execution" permission
Rules:
- Max execution time: 30 seconds
- Max memory: 4GB
- No network access
- Log all executions
- Audit trail required
Policy: production_access
Applies To: agents with "production_database" permission
Rules:
- Read-only access unless explicitly approved
- All queries logged and auditable
- Require approval for writes
- Staging environment only for testing
- Automatic rollback after 24 hours
Policies like these:
- Enforce consistency across agents
- Enable compliance (audit, regulatory requirements)
- Define what agents can and can't do
- Provide guardrails so platform teams don't have to reinvent security per-agent
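Enforcing such policies can be a pure function evaluated before every tool call: given the agent's permissions, the tool's requirements, and the request, return the list of violations. A hedged sketch (policy fields are illustrative):

```python
# Platform-wide policies keyed by the permission they govern.
POLICIES = {
    "code_execution": {"max_timeout_s": 30, "max_memory_gb": 4, "network": False},
}

def check_tool_call(agent_permissions, tool_permissions, request):
    """Return a list of violations; an empty list means the call is allowed."""
    violations = []
    for perm in tool_permissions:
        if perm not in agent_permissions:
            violations.append(f"agent lacks permission: {perm}")
            continue
        policy = POLICIES.get(perm, {})
        if request.get("timeout", 0) > policy.get("max_timeout_s", float("inf")):
            violations.append("timeout exceeds policy limit")
        if request.get("network") and not policy.get("network", True):
            violations.append("network access denied by policy")
    return violations

violations = check_tool_call(
    agent_permissions={"code_execution"},
    tool_permissions=["code_execution"],
    request={"timeout": 60, "network": True},
)
```

Because the check is a pure function, it is trivial to unit-test and to log alongside the audit trail.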
5. Observability and Monitoring
You need visibility into agent behavior across your organization.
Dashboard: Agent Operations
- Total agents: 23
- Agents running now: 5
- Agents succeeded today: 1,247
- Agents failed today: 3 (0.24% failure rate)
- Average cost per agent: $0.47
- Top tools by usage: execute_python (32%), read_file (28%), call_api (18%)
- Cost trend: +12% week-over-week (needs investigation)
Observability includes:
- What agents are doing right now
- Failure rates and common failure modes
- Cost and resource usage
- Performance trends
- Audit trails for compliance
- Alerts for unusual behavior
Without this, you can't operate agents reliably at scale.
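The dashboard numbers above all derive from one thing: a structured event per agent run. A minimal sketch of the instrumentation side (in production you would ship these events to an observability platform rather than an in-memory list):

```python
import json
import time

events = []

def record_run(agent, tool, ok, cost_usd):
    """Emit one structured event per agent run."""
    events.append({"ts": time.time(), "agent": agent, "tool": tool,
                   "ok": ok, "cost_usd": cost_usd})

def summary():
    """Aggregate events into the kind of numbers a dashboard shows."""
    total = len(events)
    failed = sum(1 for e in events if not e["ok"])
    return {
        "runs": total,
        "failure_rate": failed / total if total else 0.0,
        "total_cost_usd": round(sum(e["cost_usd"] for e in events), 2),
    }

record_run("refactor_agent", "execute_python", ok=True, cost_usd=0.01)
record_run("test_agent", "execute_python", ok=False, cost_usd=0.01)
print(json.dumps(summary()))
```

Once events are structured, failure rates, cost trends, and per-tool usage are all just aggregations over the same stream.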
Architecture Patterns
Pattern 1: Centralized Platform
One central team owns all the infrastructure. All agents, all tools, all governance.
Central Agent Platform (layered architecture)
┌─────────────────────────────────────┐
│ Tool Registry                       │
│ Orchestration Engine                │
│ Shared Context                      │
│ Governance Enforcement              │
│ Observability & Monitoring          │
└─────────────────────────────────────┘
Advantages:
- Consistent policies and tooling
- Easier to manage and evolve
- Clear ownership
- Efficient resource sharing
Disadvantages:
- Central team becomes a bottleneck
- Hard to customize for different domains
- One outage affects all agents
- Slower to respond to specific team needs
Pattern 2: Federated Model
Multiple teams own their own agents and tools, with a minimal central platform for shared infrastructure.
Shared Infrastructure (layered architecture)
- Context Layer (Bitloops or similar)
- Orchestration Bus
- Audit & Observability
        │                │                │                │
  Backend Team     Frontend Team   Integration Team   Mobile Team
  (agents, tools)  (agents, tools) (agents, tools)    (agents, tools)
Advantages:
- Teams move fast independently
- Customized tooling per domain
- Natural scaling with organization
- Less likely to be a single point of failure
Disadvantages:
- Risk of inconsistent patterns
- Harder to enforce governance across teams
- More operational complexity
- Can lead to tool fragmentation
The best approach is often a hybrid: centralized platform for critical infrastructure (observability, governance, context) and federated ownership of domain-specific tools and agents.
The Platform Team's Responsibilities
If you're building an internal platform, the platform team owns:
- Tool Curation: What tools are available? Who maintains them? When do they get deprecated?
- Security and Compliance: Permission models, audit trails, data access controls, encryption, regulatory compliance.
- Cost Management: Tracking what agents cost to run, enforcing budgets per team, optimizing expensive operations.
- Observability and Monitoring: Dashboards, alerts, failure investigation, performance tracking.
- Documentation and Runbooks: How do teams use the platform? What do they do when something breaks?
- Governance: Policies for what agents can and can't do, approval processes for sensitive operations.
- Operational Stability: Keeping the platform running, updating dependencies, handling failures gracefully.
- Evolution: Making the platform better over time, responding to team feedback, adopting new capabilities.
This is not a 1-person job. You need:
- At least one person for operations/reliability
- At least one person for tools and integrations
- At least one person for observability and tooling
- Part-time support from users/teams
If you don't have this team, you're not ready to build an internal platform. You should buy instead.
The Practical Build Path
Here's how to actually build this without boiling the ocean:
Phase 1: Proof of Concept (Weeks 1-4)
Pick one agent and one use case. Build just enough infrastructure to make it work.
You have:
- One agent (code generation)
- One context source (the user's codebase)
- One tool (execute_python)
- Manual orchestration (you run the agent, review output, run next step)
- Logging to a file
You don't have:
- Multiple agents
- Shared context
- Governance policies
- Dashboards
Goal: prove that agents can add value. Build confidence that this is worth investing in.
Phase 2: Generalization (Weeks 5-12)
Take what you learned and build the minimum viable platform for 3-5 agents.
You have:
- Tool registry (simple, probably a YAML file)
- Basic orchestration (chaining agents)
- Shared context (read-write to a database)
- Simple observability (CSV logs, Excel dashboard)
- Minimal governance (allowlist of tools)
You don't have:
- Advanced orchestration (branching, retries)
- Complex policies
- Real-time dashboards
- Advanced audit trails
Goal: make it possible for another team to add an agent without talking to you first.
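A Phase 2 registry really can be one flat file checked into the repo. A sketch of loading and validating it, shown with JSON to stay dependency-free (with PyYAML installed, the same shape would come from `yaml.safe_load`; the field names are assumptions):

```python
import json

# The whole Phase 2 tool registry, checked into version control.
REGISTRY_FILE = """
{
  "execute_python": {"permissions": ["code_execution"], "deprecated": false},
  "read_file":      {"permissions": [], "deprecated": false},
  "old_scraper":    {"permissions": [], "deprecated": true}
}
"""

def load_registry(raw):
    """Parse the registry, validate required fields, and drop deprecated tools."""
    tools = json.loads(raw)
    for name, spec in tools.items():
        assert "permissions" in spec, f"{name}: missing permissions list"
    return {n: s for n, s in tools.items() if not s["deprecated"]}

registry = load_registry(REGISTRY_FILE)
```

Validation at load time (rather than at call time) means a broken registry entry fails CI instead of failing an agent in production.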
Phase 3: Scaling (Months 4-6)
As you hit 5-10 agents, build the infrastructure to scale.
You add:
- Proper database for context
- Orchestration framework (Temporal, Prefect, or custom)
- Policy engine
- Real observability platform
- Team dashboards
- Self-service agent deployment
Phase 4: Maturity (Months 6+)
Settle into operations. Focus on:
- Cost optimization
- Performance optimization
- Security hardening
- Compliance and audit
- Documentation
The build path has a natural rhythm. You start simple and add complexity only when you hit limits. Don't pre-optimize.
Common Mistakes to Avoid
Mistake 1: Over-Engineering Before Proving Value
You design the "perfect" platform architecture before any agents exist. You build fancy orchestration, advanced governance, beautiful dashboards. Then you find out that agents aren't as useful as you thought, or the problem you solved isn't actually your problem.
Instead: Build the minimum viable platform first. Prove value. Then invest in infrastructure.
Mistake 2: Ignoring Security Until It's Too Late
You build the platform with wide-open permissions. Everything can call everything. Then you deploy agents to production and realize you can't control what they do.
Instead: Security and compliance should be part of the design from day one. Not perfect security, but thoughtful security.
Mistake 3: Not Measuring ROI
You build 10 agents and spend 6 months on platform infrastructure. You never measure whether agents actually save time or money. You can't justify continued investment.
Instead: Measure from the beginning. How much time do agents save per task? What's the cost per task? Is the math working?
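The math can be this simple. A sketch with purely illustrative numbers (the rates and savings are assumptions, not benchmarks):

```python
def roi_per_task(minutes_saved, hourly_rate_usd, agent_cost_usd):
    """Value of engineer time saved per task, minus what the agent run costs."""
    value_of_time = minutes_saved / 60 * hourly_rate_usd
    return round(value_of_time - agent_cost_usd, 2)

# Illustrative: 12 minutes saved per task, $100/h engineer, $0.47 per agent run.
net = roi_per_task(minutes_saved=12, hourly_rate_usd=100, agent_cost_usd=0.47)
```

If the net value per task multiplied by task volume doesn't cover the platform team's cost, the investment case isn't there yet.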
Mistake 4: Building Without a Dedicated Platform Team
You ask engineers to "own the platform" as a side project. They don't, because they're busy with other work. The platform stagnates.
Instead: Dedicate engineers to the platform team. Make it their primary responsibility. You need at least 1.5 FTE for the platform to stay healthy.
Mistake 5: Not Involving Teams Until You're Done
You build the platform in isolation. Then you launch it and teams hate it because you didn't ask what they needed.
Instead: Involve teams early and often. Gather feedback. Iterate based on what you learn.
Mistake 6: Treating Agents as Black Boxes
You deploy agents but you don't understand how they work or why they fail. When something breaks, you can't debug it.
Instead: Build observability into the platform from the start. Make agent decision-making visible. Invest in debugging tools.
How Open-Source Infrastructure Fits In
You don't need to build everything from scratch. Open-source tools can form the foundation:
- Bitloops: Context engine and observability layer for multi-agent systems
- Temporal: Workflow orchestration (mature, battle-tested)
- LangChain: Agent framework and tool abstraction (Python)
- MCP Servers: Standardized tool definitions (becoming widespread)
- OpenTelemetry: Observability instrumentation (standard)
A smart build path uses open-source for the hard parts (orchestration, observability) and builds custom infrastructure for your specific needs (tool registry, domain-specific policies).
Bitloops in particular is useful because it solves the context problem in an agent-agnostic way. You can use Bitloops to manage context, then plug any agent into it. This reduces the amount of custom infrastructure you need to build. For security considerations when deploying agents at scale, see Secure Tool Invocation.
FAQ
How many agents do I need before I should build a platform?
5-7 agents is the inflection point. Before that, point solutions and manual orchestration work. After that, fragmentation becomes a real problem.
Should I build the platform or hire it out?
You need internal ownership either way. You can hire contractors to help, but the platform team needs to include your own engineers who understand your business.
How long does it take to build a basic platform?
8-12 weeks for the MVP (tool registry, basic orchestration, minimal observability). 6 months for something production-ready. Don't believe anyone who says shorter.
Can I build a platform with one engineer?
Maybe for 3-5 agents. Beyond that, you need at least 1.5-2 engineers dedicated to the platform. Everything else suffers.
What if I pick the wrong architecture?
You'll know within a few months. If centralized isn't working, move to federated. If federated is chaos, move to centralized. Architectures aren't permanent.
How do I handle upgrades to the underlying agent frameworks?
Plan for it. When Claude Code updates, when Cursor updates, your platform might need changes. This is why abstraction layers matter.
What about compliance and audit?
Build audit logging into the platform from day one. Make it cheap to add governance policies. When compliance requirements come (and they will), you're ready.
Should each team have their own agents or should we share agents?
Share agents where it makes sense (code analysis, testing, documentation). Keep teams owning domain-specific agents (code generation for their stack). This balances efficiency and autonomy.
Primary Sources
- Documentation for Temporal workflow engine enabling durable, scalable orchestration of microservices. Temporal Documentation
- Martin Fowler's article on platform engineering prerequisites and organizational structures. Platform Prerequisites
- Foundational paper on teaching language models to select and use tools during inference. Toolformer Paper
- ReAct framework combining reasoning and acting for enhanced agent task execution. ReAct Paper
- Standard specification for connecting agents to tools via the Model Context Protocol. MCP Specification
- OpenAI's comprehensive guide to function calling for structured tool invocation in GPT models. OpenAI Tool Use
Get Started with Bitloops.
Apply what you learn in these hubs to real AI-assisted delivery workflows with shared context, traceable reasoning, and architecture-aware engineering practices.
curl -sSL https://bitloops.com/install.sh | bash