Building AI Agents That Write and Deploy Their Own Tools: Self-Extending Agent Systems
Discover how to build AI agents that can write new Python tools at runtime, validate them in a sandbox, register them dynamically, and use them in subsequent reasoning — creating truly self-extending agent systems.
The Limitation of Static Tool Sets
Every agent framework requires you to pre-define tools. You write Python functions, decorate them, and register them with the agent at initialization time. The agent can only do what its tools allow. If a user asks for something no tool covers, the agent either hallucinates an answer or says "I cannot do that."
Self-extending agents break this limitation. When the agent encounters a task that its current tools cannot handle, it writes a new tool — a Python function — validates it in a sandbox, registers it, and immediately uses it. The next time a similar task appears, the tool is already available.
Architecture of a Self-Extending Agent
The system has four components: a code generation module that writes tool functions, a sandbox that executes untrusted code safely, a tool registry that manages dynamic tools, and the agent loop that ties them together.
import ast
from typing import Any, Callable

class SecurityError(Exception):
    """Raised when generated code fails static validation."""

class ToolRegistry:
    """Manages both static and dynamically created tools."""

    def __init__(self):
        self.tools: dict[str, Callable] = {}
        self.tool_source: dict[str, str] = {}

    def register_static(self, name: str, fn: Callable):
        self.tools[name] = fn

    def register_dynamic(self, name: str, source_code: str):
        """Compile and register a dynamically generated tool."""
        # Validate the code is safe before execution
        self._validate_code(source_code)
        # Compile and execute in a restricted namespace
        namespace: dict[str, Any] = {}
        exec(compile(source_code, f"<dynamic:{name}>", "exec"), namespace)
        if name not in namespace:
            raise ValueError(f"Source code must define a function named '{name}'")
        self.tools[name] = namespace[name]
        self.tool_source[name] = source_code

    def _validate_code(self, source: str):
        """Static analysis to block dangerous operations."""
        blocked = ("os", "subprocess", "shutil", "sys")
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    if alias.name.split(".")[0] in blocked:
                        raise SecurityError(f"Import of '{alias.name}' is blocked")
            if isinstance(node, ast.ImportFrom):
                # Block the from-import bypass ("from os import system") too
                if node.module and node.module.split(".")[0] in blocked:
                    raise SecurityError(f"Import from '{node.module}' is blocked")
            if isinstance(node, ast.Call):
                if isinstance(node.func, ast.Name):
                    if node.func.id in ("exec", "eval", "compile", "__import__"):
                        raise SecurityError(f"Call to '{node.func.id}' is blocked")

    def list_tools(self) -> list[str]:
        return list(self.tools.keys())

    def call(self, name: str, **kwargs) -> Any:
        if name not in self.tools:
            raise KeyError(f"Tool '{name}' not found")
        return self.tools[name](**kwargs)
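To see the registration path in isolation, here is a minimal standalone sketch of the same validate-then-exec flow. The `slugify` source string stands in for LLM-generated code, and `validate_and_compile` is an illustrative helper, not part of the registry above:

```python
import ast

BLOCKED_IMPORTS = {"os", "subprocess", "shutil", "sys"}

def validate_and_compile(name: str, source: str):
    """AST-validate untrusted source, exec it in a fresh namespace, return the function."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in BLOCKED_IMPORTS:
                    raise RuntimeError(f"Import of '{alias.name}' is blocked")
    namespace: dict = {}
    exec(compile(source, f"<dynamic:{name}>", "exec"), namespace)
    return namespace[name]

SOURCE = '''
def slugify(text: str) -> str:
    """Lowercase a string and replace spaces with hyphens."""
    return text.strip().lower().replace(" ", "-")
'''

slugify = validate_and_compile("slugify", SOURCE)
print(slugify("Hello World"))  # hello-world
```

The key detail is that `exec` runs against a fresh dict, so the generated function never sees (or pollutes) the host module's globals.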
The Code Generation Prompt
The agent needs a specialized tool that generates other tools. The prompt engineering here is critical — the LLM must produce well-structured, safe Python functions.
TOOL_GENERATION_PROMPT = '''You are a tool-writing assistant. When asked to create a new tool,
output ONLY a Python function with the following requirements:

1. The function must have a clear docstring describing what it does
2. All parameters must have type annotations
3. The function must return a value (not print)
4. Only use these allowed imports: math, json, re, datetime, collections, statistics
5. The function name must be snake_case
6. Include input validation

Example format:

import math

def calculate_compound_interest(principal: float, rate: float, years: int) -> float:
    """Calculate compound interest given principal, annual rate, and years."""
    if principal < 0 or rate < 0 or years < 0:
        raise ValueError("All values must be non-negative")
    return principal * math.pow(1 + rate, years)
'''
Sandboxed Execution with Resource Limits
Never run LLM-generated code in your main process without sandboxing. Use subprocess isolation with resource limits.
import json
import subprocess
import sys
import tempfile

class Sandbox:
    """Execute untrusted code in an isolated subprocess."""

    def __init__(self, timeout: int = 5, max_memory_mb: int = 128):
        self.timeout = timeout
        self.max_memory_mb = max_memory_mb

    def test_tool(self, source_code: str, test_cases: list[dict]) -> list[dict]:
        """Run tool code against test cases in isolation."""
        # Crude but workable: the first `def` in the source names the tool
        fn_name = source_code.split("def ")[1].split("(")[0].strip()
        mem_bytes = self.max_memory_mb * 1024 * 1024
        wrapper = f"""
import json, resource, sys

# Set memory limit (the resource module is POSIX-only)
resource.setrlimit(resource.RLIMIT_AS, ({mem_bytes}, {mem_bytes}))

# Load the tool
{source_code}

# Run test cases
test_cases = {json.dumps(test_cases)}
results = []
for tc in test_cases:
    try:
        result = {fn_name}(**tc["inputs"])
        results.append({{"passed": result == tc["expected"], "output": str(result)}})
    except Exception as e:
        results.append({{"passed": False, "error": str(e)}})
print(json.dumps(results))
"""
        with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
            f.write(wrapper)
            f.flush()
        try:
            proc = subprocess.run(
                [sys.executable, f.name],
                capture_output=True, text=True,
                timeout=self.timeout,
            )
            return json.loads(proc.stdout)
        except subprocess.TimeoutExpired:
            return [{"passed": False, "error": "Execution timed out"}]
The Self-Extension Loop
Here is the complete flow: the agent receives a request, determines it needs a new tool, generates it, tests it, registers it, and uses it.
import json

from agents import Agent, Runner, function_tool

registry = ToolRegistry()
sandbox = Sandbox()

@function_tool
async def create_tool(
    tool_name: str,
    tool_description: str,
    source_code: str,
    test_cases: str,
) -> str:
    """Create and register a new tool from generated Python code."""
    cases = json.loads(test_cases)
    # Step 1: Validate in sandbox
    results = sandbox.test_tool(source_code, cases)
    if not all(r.get("passed") for r in results):
        return f"Tool failed tests: {results}. Fix and retry."
    # Step 2: Register the tool
    registry.register_dynamic(tool_name, source_code)
    return f"Tool '{tool_name}' created and registered successfully."

@function_tool
async def use_dynamic_tool(tool_name: str, arguments: str) -> str:
    """Call a previously created dynamic tool."""
    kwargs = json.loads(arguments)
    result = registry.call(tool_name, **kwargs)
    return json.dumps({"result": result})

agent = Agent(
    name="Self-Extending Agent",
    instructions="""You can create new tools when needed. Before creating a tool,
    check if an existing tool can handle the request. When creating tools,
    always include at least 2 test cases to validate correctness.""",
    tools=[create_tool, use_dynamic_tool],
)
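The `create_tool` contract is easiest to see without the SDK or sandbox in the way. This sketch simulates the same path — parse the JSON `test_cases` string the model supplies, run each case against the compiled function, register only on a clean pass. The `add_nums` source and the bare `registered` dict are illustrative stand-ins:

```python
import json

registered: dict = {}  # stand-in for the real ToolRegistry

def create_tool(tool_name: str, source_code: str, test_cases: str) -> str:
    """Simulated tool-creation path: test first, register only on success."""
    cases = json.loads(test_cases)
    namespace: dict = {}
    exec(source_code, namespace)
    fn = namespace[tool_name]
    for tc in cases:
        if fn(**tc["inputs"]) != tc["expected"]:
            return "Tool failed tests. Fix and retry."
    registered[tool_name] = fn
    return f"Tool '{tool_name}' created and registered successfully."

src = "def add_nums(a: int, b: int) -> int:\n    return a + b"
cases = json.dumps([{"inputs": {"a": 2, "b": 3}, "expected": 5}])
print(create_tool("add_nums", src, cases))
```

Note that test cases travel as a JSON string rather than a Python object — most function-calling APIs only pass primitive argument types, so serializing structured data is the portable choice.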
Persisting Tools Across Sessions
Store generated tools in a database so they survive restarts.
import sqlite3

class PersistentToolRegistry(ToolRegistry):
    def __init__(self, db_path: str = "tools.db"):
        super().__init__()
        self.db = sqlite3.connect(db_path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS dynamic_tools (
                name TEXT PRIMARY KEY,
                source_code TEXT,
                description TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self._load_persisted_tools()

    def register_dynamic(self, name: str, source_code: str):
        """Validate and register as usual, then write the tool back to the database."""
        super().register_dynamic(name, source_code)
        self.db.execute(
            "INSERT OR REPLACE INTO dynamic_tools (name, source_code) VALUES (?, ?)",
            (name, source_code),
        )
        self.db.commit()

    def _load_persisted_tools(self):
        for name, source in self.db.execute("SELECT name, source_code FROM dynamic_tools"):
            # Use the parent's method so loading does not rewrite rows
            super().register_dynamic(name, source)
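The persistence round trip can be verified with nothing but the `sqlite3` stdlib module. This sketch uses an in-memory database to stay self-contained (the article uses `tools.db` on disk), and the `triple` tool is an illustrative stand-in:

```python
import sqlite3

# "Session 1": store a generated tool
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE IF NOT EXISTS dynamic_tools (name TEXT PRIMARY KEY, source_code TEXT)")
db.execute(
    "INSERT INTO dynamic_tools VALUES (?, ?)",
    ("triple", "def triple(x: int) -> int:\n    return 3 * x"),
)
db.commit()

# "Session 2": reload every stored tool into a fresh namespace,
# as _load_persisted_tools does on startup
tools = {}
for name, source in db.execute("SELECT name, source_code FROM dynamic_tools"):
    namespace: dict = {}
    exec(source, namespace)
    tools[name] = namespace[name]

print(tools["triple"](7))  # 21
```

One caveat worth keeping in mind: reloading re-executes stored source, so the database is now part of your trust boundary — run the same AST validation on load, not just on first registration.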
FAQ
Is it safe to let an LLM write executable code?
Not inherently — that is why sandboxing is non-negotiable. The combination of static analysis (AST validation to block dangerous imports and built-in calls), subprocess isolation with resource limits, and test-case validation before registration creates a defense-in-depth strategy. In production, use container-based sandboxes like gVisor or Firecracker for stronger isolation.
How do you prevent the agent from creating redundant tools?
Include a list_tools function tool that lets the agent inspect what is already registered. Add semantic descriptions to each tool and instruct the agent to search existing tools before generating new ones. You can also add an LLM-based similarity check that compares the new tool description against existing descriptions.
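Before paying for an LLM similarity call, a cheap lexical filter can catch the obvious duplicates. This is a sketch of one possible pre-check (word-set Jaccard overlap with an illustrative 0.3 threshold — not from the article), escalating to the LLM only on a hit:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two tool descriptions, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

existing = {
    "calculate_compound_interest": "compute compound interest from principal rate and years",
}
proposed = "compute compound interest given a principal and annual rate"

# Any existing tool above the threshold is a candidate duplicate
hits = [name for name, desc in existing.items() if jaccard(desc, proposed) >= 0.3]
print(hits)
```

On a hit, return the existing tool's name to the agent instead of registering a near-copy; on a miss, let creation proceed.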
What happens when a dynamically created tool has a subtle bug?
The test-case validation catches many bugs, but edge cases can slip through. Implement runtime monitoring that tracks tool call success rates. If a dynamic tool starts failing above a threshold, automatically quarantine it and alert the agent to regenerate it with additional test cases covering the failure scenarios.
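The quarantine idea above can be sketched in a few lines: count calls and failures per tool, and flag a tool once it has enough traffic and its failure rate crosses a threshold. Class and tool names here are illustrative:

```python
from collections import defaultdict

class ToolMonitor:
    """Track per-tool success rates and quarantine tools that fail too often."""

    def __init__(self, min_calls: int = 5, max_failure_rate: float = 0.5):
        self.calls: dict[str, int] = defaultdict(int)
        self.failures: dict[str, int] = defaultdict(int)
        self.min_calls = min_calls
        self.max_failure_rate = max_failure_rate

    def record(self, tool: str, ok: bool):
        self.calls[tool] += 1
        if not ok:
            self.failures[tool] += 1

    def quarantined(self, tool: str) -> bool:
        # Require a minimum sample size before judging a tool
        if self.calls[tool] < self.min_calls:
            return False
        return self.failures[tool] / self.calls[tool] > self.max_failure_rate

mon = ToolMonitor()
for ok in (True, False, False, False, True):
    mon.record("parse_dates", ok)
print(mon.quarantined("parse_dates"))  # True: 3 failures in 5 calls
```

Wire `record` into `ToolRegistry.call`, and have the registry raise or reroute when `quarantined` returns True, prompting the agent to regenerate the tool with test cases covering the observed failures.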
#SelfExtendingAI #DynamicTools #CodeGeneration #AIAgents #Sandboxing #PythonMetaprogramming #AgentArchitecture #ToolCreation
CallSphere Team
Expert insights on AI voice agents and customer communication automation.