Building AI Agents That Write and Deploy Their Own Tools: Self-Extending Agent Systems
Discover how to build AI agents that can write new Python tools at runtime, validate them in a sandbox, register them dynamically, and use them in subsequent reasoning — creating truly self-extending agent systems.
The Limitation of Static Tool Sets
Every agent framework requires you to pre-define tools. You write Python functions, decorate them, and register them with the agent at initialization time. The agent can only do what its tools allow. If a user asks for something no tool covers, the agent either hallucinates an answer or says "I cannot do that."
Self-extending agents break this limitation. When the agent encounters a task that its current tools cannot handle, it writes a new tool — a Python function — validates it in a sandbox, registers it, and immediately uses it. The next time a similar task appears, the tool is already available.
Architecture of a Self-Extending Agent
The system has four components: a code generation module that writes tool functions, a sandbox that executes untrusted code safely, a tool registry that manages dynamic tools, and the agent loop that ties them together.
import ast
from typing import Any, Callable

class SecurityError(Exception):
    """Raised when generated code fails static validation."""

class ToolRegistry:
    """Manages both static and dynamically created tools."""

    def __init__(self):
        self.tools: dict[str, Callable] = {}
        self.tool_source: dict[str, str] = {}

    def register_static(self, name: str, fn: Callable):
        self.tools[name] = fn

    def register_dynamic(self, name: str, source_code: str):
        """Compile and register a dynamically generated tool."""
        # Validate the code is safe before execution
        self._validate_code(source_code)
        # Compile and execute in a restricted namespace
        namespace: dict[str, Any] = {}
        exec(compile(source_code, f"<dynamic:{name}>", "exec"), namespace)
        if name not in namespace:
            raise ValueError(f"Source code must define a function named '{name}'")
        self.tools[name] = namespace[name]
        self.tool_source[name] = source_code

    def _validate_code(self, source: str):
        """Static analysis to block dangerous operations."""
        blocked = ("os", "subprocess", "shutil", "sys")
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    if alias.name.split(".")[0] in blocked:
                        raise SecurityError(f"Import of '{alias.name}' is blocked")
            if isinstance(node, ast.ImportFrom):
                # Block the from-import bypass ("from os import system") too
                if node.module and node.module.split(".")[0] in blocked:
                    raise SecurityError(f"Import from '{node.module}' is blocked")
            if isinstance(node, ast.Call):
                if isinstance(node.func, ast.Name):
                    if node.func.id in ("exec", "eval", "compile", "__import__"):
                        raise SecurityError(f"Call to '{node.func.id}' is blocked")

    def list_tools(self) -> list[str]:
        return list(self.tools.keys())

    def call(self, name: str, **kwargs) -> Any:
        if name not in self.tools:
            raise KeyError(f"Tool '{name}' not found")
        return self.tools[name](**kwargs)
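To see the registration path in isolation, here is a minimal standalone sketch of the same validate-then-exec flow. The `slugify` source string stands in for LLM-generated code, and `validate_and_compile` is an illustrative helper, not part of the registry above:

```python
import ast

BLOCKED_IMPORTS = {"os", "subprocess", "shutil", "sys"}

def validate_and_compile(name: str, source: str):
    """AST-validate untrusted source, exec it in a fresh namespace, return the function."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in BLOCKED_IMPORTS:
                    raise RuntimeError(f"Import of '{alias.name}' is blocked")
    namespace: dict = {}
    exec(compile(source, f"<dynamic:{name}>", "exec"), namespace)
    return namespace[name]

SOURCE = '''
def slugify(text: str) -> str:
    """Lowercase a string and replace spaces with hyphens."""
    return text.strip().lower().replace(" ", "-")
'''

slugify = validate_and_compile("slugify", SOURCE)
print(slugify("Hello World"))  # hello-world
```

The key detail is that `exec` runs against a fresh dict, so the generated function never sees (or pollutes) the host module's globals.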
The Code Generation Prompt
The agent needs a specialized tool that generates other tools. The prompt engineering here is critical — the LLM must produce well-structured, safe Python functions.
TOOL_GENERATION_PROMPT = '''You are a tool-writing assistant. When asked to create a new tool,
output ONLY a Python function with the following requirements:

1. The function must have a clear docstring describing what it does
2. All parameters must have type annotations
3. The function must return a value (not print)
4. Only use these allowed imports: math, json, re, datetime, collections, statistics
5. The function name must be snake_case
6. Include input validation

Example format:

import math

def calculate_compound_interest(principal: float, rate: float, years: int) -> float:
    """Calculate compound interest given principal, annual rate, and years."""
    if principal < 0 or rate < 0 or years < 0:
        raise ValueError("All values must be non-negative")
    return principal * math.pow(1 + rate, years)
'''
Sandboxed Execution with Resource Limits
Never run LLM-generated code in your main process without sandboxing. Use subprocess isolation with resource limits.
import json
import subprocess
import sys
import tempfile

class Sandbox:
    """Execute untrusted code in an isolated subprocess."""

    def __init__(self, timeout: int = 5, max_memory_mb: int = 128):
        self.timeout = timeout
        self.max_memory_mb = max_memory_mb

    def test_tool(self, source_code: str, test_cases: list[dict]) -> list[dict]:
        """Run tool code against test cases in isolation."""
        # Crude but workable: the first `def` in the source names the tool
        fn_name = source_code.split("def ")[1].split("(")[0].strip()
        mem_bytes = self.max_memory_mb * 1024 * 1024
        wrapper = f"""
import json, resource, sys

# Set memory limit (the resource module is POSIX-only)
resource.setrlimit(resource.RLIMIT_AS, ({mem_bytes}, {mem_bytes}))

# Load the tool
{source_code}

# Run test cases
test_cases = {json.dumps(test_cases)}
results = []
for tc in test_cases:
    try:
        result = {fn_name}(**tc["inputs"])
        results.append({{"passed": result == tc["expected"], "output": str(result)}})
    except Exception as e:
        results.append({{"passed": False, "error": str(e)}})
print(json.dumps(results))
"""
        with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
            f.write(wrapper)
            f.flush()
        try:
            proc = subprocess.run(
                [sys.executable, f.name],
                capture_output=True, text=True,
                timeout=self.timeout,
            )
            return json.loads(proc.stdout)
        except subprocess.TimeoutExpired:
            return [{"passed": False, "error": "Execution timed out"}]
The Self-Extension Loop
Here is the complete flow: the agent receives a request, determines it needs a new tool, generates it, tests it, registers it, and uses it.
import json

from agents import Agent, Runner, function_tool

registry = ToolRegistry()
sandbox = Sandbox()

@function_tool
async def create_tool(
    tool_name: str,
    tool_description: str,
    source_code: str,
    test_cases: str,
) -> str:
    """Create and register a new tool from generated Python code."""
    cases = json.loads(test_cases)
    # Step 1: Validate in sandbox
    results = sandbox.test_tool(source_code, cases)
    if not all(r.get("passed") for r in results):
        return f"Tool failed tests: {results}. Fix and retry."
    # Step 2: Register the tool
    registry.register_dynamic(tool_name, source_code)
    return f"Tool '{tool_name}' created and registered successfully."

@function_tool
async def use_dynamic_tool(tool_name: str, arguments: str) -> str:
    """Call a previously created dynamic tool."""
    kwargs = json.loads(arguments)
    result = registry.call(tool_name, **kwargs)
    return json.dumps({"result": result})

agent = Agent(
    name="Self-Extending Agent",
    instructions="""You can create new tools when needed. Before creating a tool,
    check if an existing tool can handle the request. When creating tools,
    always include at least 2 test cases to validate correctness.""",
    tools=[create_tool, use_dynamic_tool],
)
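The `create_tool` contract is easiest to see without the SDK or sandbox in the way. This sketch simulates the same path — parse the JSON `test_cases` string the model supplies, run each case against the compiled function, register only on a clean pass. The `add_nums` source and the bare `registered` dict are illustrative stand-ins:

```python
import json

registered: dict = {}  # stand-in for the real ToolRegistry

def create_tool(tool_name: str, source_code: str, test_cases: str) -> str:
    """Simulated tool-creation path: test first, register only on success."""
    cases = json.loads(test_cases)
    namespace: dict = {}
    exec(source_code, namespace)
    fn = namespace[tool_name]
    for tc in cases:
        if fn(**tc["inputs"]) != tc["expected"]:
            return "Tool failed tests. Fix and retry."
    registered[tool_name] = fn
    return f"Tool '{tool_name}' created and registered successfully."

src = "def add_nums(a: int, b: int) -> int:\n    return a + b"
cases = json.dumps([{"inputs": {"a": 2, "b": 3}, "expected": 5}])
print(create_tool("add_nums", src, cases))
```

Note that test cases travel as a JSON string rather than a Python object — most function-calling APIs only pass primitive argument types, so serializing structured data is the portable choice.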
Persisting Tools Across Sessions
Store generated tools in a database so they survive restarts.
import sqlite3

class PersistentToolRegistry(ToolRegistry):
    def __init__(self, db_path: str = "tools.db"):
        super().__init__()
        self.db = sqlite3.connect(db_path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS dynamic_tools (
                name TEXT PRIMARY KEY,
                source_code TEXT,
                description TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self._load_persisted_tools()

    def register_dynamic(self, name: str, source_code: str):
        """Validate and register as usual, then write the tool back to the database."""
        super().register_dynamic(name, source_code)
        self.db.execute(
            "INSERT OR REPLACE INTO dynamic_tools (name, source_code) VALUES (?, ?)",
            (name, source_code),
        )
        self.db.commit()

    def _load_persisted_tools(self):
        for name, source in self.db.execute("SELECT name, source_code FROM dynamic_tools"):
            # Use the parent's method so loading does not rewrite rows
            super().register_dynamic(name, source)
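The persistence round trip can be verified with nothing but the `sqlite3` stdlib module. This sketch uses an in-memory database to stay self-contained (the article uses `tools.db` on disk), and the `triple` tool is an illustrative stand-in:

```python
import sqlite3

# "Session 1": store a generated tool
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE IF NOT EXISTS dynamic_tools (name TEXT PRIMARY KEY, source_code TEXT)")
db.execute(
    "INSERT INTO dynamic_tools VALUES (?, ?)",
    ("triple", "def triple(x: int) -> int:\n    return 3 * x"),
)
db.commit()

# "Session 2": reload every stored tool into a fresh namespace,
# as _load_persisted_tools does on startup
tools = {}
for name, source in db.execute("SELECT name, source_code FROM dynamic_tools"):
    namespace: dict = {}
    exec(source, namespace)
    tools[name] = namespace[name]

print(tools["triple"](7))  # 21
```

One caveat worth keeping in mind: reloading re-executes stored source, so the database is now part of your trust boundary — run the same AST validation on load, not just on first registration.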
FAQ
Is it safe to let an LLM write executable code?
Not inherently — that is why sandboxing is non-negotiable. The combination of static analysis (AST validation to block dangerous imports and built-in calls), subprocess isolation with resource limits, and test-case validation before registration creates a defense-in-depth strategy. In production, use container-based sandboxes like gVisor or Firecracker for stronger isolation.
How do you prevent the agent from creating redundant tools?
Include a list_tools function tool that lets the agent inspect what is already registered. Add semantic descriptions to each tool and instruct the agent to search existing tools before generating new ones. You can also add an LLM-based similarity check that compares the new tool description against existing descriptions.
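Before paying for an LLM similarity call, a cheap lexical filter can catch the obvious duplicates. This is a sketch of one possible pre-check (word-set Jaccard overlap with an illustrative 0.3 threshold — not from the article), escalating to the LLM only on a hit:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two tool descriptions, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

existing = {
    "calculate_compound_interest": "compute compound interest from principal rate and years",
}
proposed = "compute compound interest given a principal and annual rate"

# Any existing tool above the threshold is a candidate duplicate
hits = [name for name, desc in existing.items() if jaccard(desc, proposed) >= 0.3]
print(hits)
```

On a hit, return the existing tool's name to the agent instead of registering a near-copy; on a miss, let creation proceed.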
What happens when a dynamically created tool has a subtle bug?
The test-case validation catches many bugs, but edge cases can slip through. Implement runtime monitoring that tracks tool call success rates. If a dynamic tool starts failing above a threshold, automatically quarantine it and alert the agent to regenerate it with additional test cases covering the failure scenarios.
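The quarantine idea above can be sketched in a few lines: count calls and failures per tool, and flag a tool once it has enough traffic and its failure rate crosses a threshold. Class and tool names here are illustrative:

```python
from collections import defaultdict

class ToolMonitor:
    """Track per-tool success rates and quarantine tools that fail too often."""

    def __init__(self, min_calls: int = 5, max_failure_rate: float = 0.5):
        self.calls: dict[str, int] = defaultdict(int)
        self.failures: dict[str, int] = defaultdict(int)
        self.min_calls = min_calls
        self.max_failure_rate = max_failure_rate

    def record(self, tool: str, ok: bool):
        self.calls[tool] += 1
        if not ok:
            self.failures[tool] += 1

    def quarantined(self, tool: str) -> bool:
        # Require a minimum sample size before judging a tool
        if self.calls[tool] < self.min_calls:
            return False
        return self.failures[tool] / self.calls[tool] > self.max_failure_rate

mon = ToolMonitor()
for ok in (True, False, False, False, True):
    mon.record("parse_dates", ok)
print(mon.quarantined("parse_dates"))  # True: 3 failures in 5 calls
```

Wire `record` into `ToolRegistry.call`, and have the registry raise or reroute when `quarantined` returns True, prompting the agent to regenerate the tool with test cases covering the observed failures.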
#SelfExtendingAI #DynamicTools #CodeGeneration #AIAgents #Sandboxing #PythonMetaprogramming #AgentArchitecture #ToolCreation
CallSphere Team
Expert insights on AI voice agents and customer communication automation.