OpenAI Codex Agent Mode: Autonomous Coding with GPT-5.4 in Production

How Codex uses GPT-5.4 for autonomous coding tasks including subagent architecture with GPT-5.4 mini, practical patterns for building production code generation agents.

Codex Is More Than Code Completion

OpenAI Codex has evolved from an autocomplete engine into a full autonomous coding agent. In its 2026 incarnation, Codex operates as an agentic system that can read codebases, plan changes, write code, run tests, and iterate on failures — all without human intervention. The underlying architecture uses GPT-5.4 as the primary reasoning model and GPT-5.4 mini as a subagent for fast, parallel subtasks.

Understanding how Codex works internally is valuable not just for using the tool but for learning architectural patterns you can apply to your own coding agents.

The Codex Agent Architecture

Codex's architecture follows a supervisor-worker pattern. The main agent (powered by GPT-5.4) handles high-level planning, code understanding, and complex reasoning. Subagents (powered by GPT-5.4 mini) handle parallelizable tasks like file reading, test execution, and simple code transformations.

# Conceptual architecture of a Codex-style coding agent

from agents import Agent, Runner, function_tool, handoff
import subprocess
import os

# ─── File System Tools ───
@function_tool
def read_file(path: str) -> str:
    """Read a file from the workspace."""
    try:
        with open(path, 'r') as f:
            content = f.read()
        lines = content.split('\n')
        numbered = [f"{i+1}: {line}" for i, line in enumerate(lines)]
        return '\n'.join(numbered)
    except FileNotFoundError:
        return f"File not found: {path}"

@function_tool
def write_file(path: str, content: str) -> str:
    """Write content to a file in the workspace."""
    parent = os.path.dirname(path)
    if parent:  # avoid makedirs("") when path is a bare filename
        os.makedirs(parent, exist_ok=True)
    with open(path, 'w') as f:
        f.write(content)
    return f"Wrote {len(content)} characters to {path}"

@function_tool
def list_directory(path: str) -> str:
    """List files and directories at the given path."""
    try:
        entries = os.listdir(path)
        return '\n'.join(sorted(entries))
    except FileNotFoundError:
        return f"Directory not found: {path}"

# ─── Execution Tools ───
def _run_command(command: str, cwd: str = ".") -> str:
    """Plain helper so other tools can reuse command execution directly."""
    try:
        result = subprocess.run(
            command,
            shell=True,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=30
        )
        output = ""
        if result.stdout:
            output += f"STDOUT:\n{result.stdout}\n"
        if result.stderr:
            output += f"STDERR:\n{result.stderr}\n"
        output += f"Exit code: {result.returncode}"
        return output
    except subprocess.TimeoutExpired:
        return "Command timed out after 30 seconds"

@function_tool
def run_command(command: str, cwd: str = ".") -> str:
    """Run a shell command and return stdout/stderr."""
    return _run_command(command, cwd)

@function_tool
def run_tests(test_path: str = "") -> str:
    """Run the project's test suite."""
    return _run_command(f"python -m pytest {test_path} -v --tb=short")

# ─── Search Tools ───
@function_tool
def grep_codebase(pattern: str, file_glob: str = "*.py") -> str:
    """Search for a pattern across the codebase."""
    return _run_command(f'grep -rn "{pattern}" --include="{file_glob}" .')

The Planning Phase

Before writing any code, a Codex-style agent performs a planning phase. This is where GPT-5.4's deep reasoning capabilities shine. The agent reads relevant files, understands the existing architecture, and produces a step-by-step plan.

# The main coding agent - uses GPT-5.4 for reasoning
from agents import ModelSettings

coding_agent = Agent(
    name="Codex Main Agent",
    instructions="""You are an autonomous coding agent. When given a task:

    PHASE 1 - UNDERSTAND:
    1. Read the relevant files to understand current code structure
    2. Search for related patterns in the codebase (grep)
    3. Identify the specific files that need changes

    PHASE 2 - PLAN:
    4. Create a step-by-step plan for the changes
    5. Consider edge cases and potential breaking changes
    6. Identify which tests need to be added or updated

    PHASE 3 - IMPLEMENT:
    7. Make the code changes file by file
    8. Follow existing code patterns and conventions
    9. Add proper error handling and type hints

    PHASE 4 - VERIFY:
    10. Run the test suite
    11. If tests fail, read the errors and fix them
    12. Iterate until all tests pass

    Always explain your reasoning before making changes.
    Never modify files outside the scope of the task.""",
    tools=[
        read_file,
        write_file,
        list_directory,
        run_command,
        run_tests,
        grep_codebase
    ],
    model="gpt-5.4",
    model_settings=ModelSettings(temperature=0.1)  # near-deterministic edits
)

The Subagent Pattern

The key architectural innovation in Codex is the use of subagents for parallel work. When the main agent needs to understand a codebase, it does not read every file sequentially. Instead, it dispatches GPT-5.4 mini subagents to read and summarize files in parallel.

from agents import Agent, Runner
import asyncio

# Subagent for fast file analysis
file_analyzer = Agent(
    name="File Analyzer",
    instructions="""Analyze the provided source file and return a structured
    summary:
    - Purpose of the file (1 sentence)
    - Key classes/functions with their signatures
    - External dependencies imported
    - Public API surface

    Be concise. No more than 200 words.""",
    model="gpt-5.4-mini"
)

async def analyze_codebase(file_paths: list[str]) -> dict[str, str]:
    """Analyze multiple files in parallel using subagents."""

    async def analyze_one(path: str) -> tuple[str, str]:
        with open(path, 'r') as f:
            content = f.read()

        result = await Runner.run(
            file_analyzer,
            f"Analyze this file ({path}):\n\n{content}"
        )
        return path, result.final_output

    # Run all analyses in parallel
    tasks = [analyze_one(path) for path in file_paths]
    results = await asyncio.gather(*tasks)

    return dict(results)

# Usage: analyze 20 files in ~2 seconds instead of ~20 seconds
summaries = asyncio.run(analyze_codebase([
    "app/main.py",
    "app/models.py",
    "app/routes/users.py",
    "app/routes/orders.py",
    "app/services/payment.py",
    # ...
]))

# Feed summaries to the main agent for planning
context = "\n\n".join(
    f"=== {path} ===\n{summary}"
    for path, summary in summaries.items()
)

This pattern cuts wall-clock analysis time from linear in the number of files to roughly the latency of the single slowest call, dramatically accelerating the planning phase. Total token cost is unchanged, since every file is still summarized once.
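One caveat: naively firing one request per file can trip API rate limits on a large repository. A small sketch of bounding the fan-out with an asyncio.Semaphore, using stand-in coroutines in place of real model calls:

```python
import asyncio

async def gather_bounded(coros, limit: int = 8):
    """Run coroutines concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

# Stand-in for a per-file analysis call, so the sketch runs without an API key
async def fake_analyze(path: str) -> tuple[str, str]:
    await asyncio.sleep(0.01)  # simulate model latency
    return path, f"summary of {path}"

summaries = dict(asyncio.run(gather_bounded(
    [fake_analyze(p) for p in ["a.py", "b.py", "c.py"]]
)))
```

Swapping `fake_analyze` for the real per-file coroutine keeps the parallelism benefit while capping concurrent requests at `limit`.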

Sandboxed Execution: Security for Autonomous Coding

A critical aspect of production coding agents is sandboxing. Codex executes all code in isolated containers with no network access and restricted filesystem permissions. Here is how to implement a similar pattern:

import docker
import tempfile
import os

class SandboxedExecutor:
    def __init__(self, workspace_path: str):
        self.client = docker.from_env()
        self.workspace = workspace_path
        self.image = "python:3.12-slim"

    def execute(self, command: str, timeout: int = 30) -> dict:
        """Run a command in an isolated Docker container."""
        try:
            container = self.client.containers.run(
                self.image,
                # List form avoids shell-quoting bugs when `command` contains quotes
                command=["bash", "-c", command],
                volumes={
                    self.workspace: {
                        "bind": "/workspace",
                        "mode": "rw"
                    }
                },
                working_dir="/workspace",
                network_mode="none",  # No network access
                mem_limit="512m",
                cpu_period=100000,
                cpu_quota=50000,  # 50% CPU
                detach=True  # detach so the timeout can be enforced via wait()
            )
            try:
                status = container.wait(timeout=timeout)
                logs = container.logs(stdout=True, stderr=True).decode("utf-8")
                return {
                    "stdout": logs,
                    "exit_code": status["StatusCode"]
                }
            except Exception:  # wait() raises when the timeout expires
                return {
                    "stderr": f"Command timed out after {timeout} seconds",
                    "exit_code": -1
                }
            finally:
                container.remove(force=True)
        except docker.errors.APIError as e:
            return {
                "stderr": str(e),
                "exit_code": -1
            }

# Integration with the coding agent
sandbox = SandboxedExecutor("/tmp/agent_workspace")

@function_tool
def sandboxed_run(command: str) -> str:
    """Execute a command in a sandboxed environment."""
    result = sandbox.execute(command)
    output = result.get("stdout", "") + result.get("stderr", "")
    return f"{output}\nExit code: {result['exit_code']}"

Practical Patterns for Production Coding Agents

Pattern 1: Test-Driven Agent Loop

The most reliable pattern for coding agents is test-driven development. The agent writes tests first, then implements code, then iterates until tests pass.

tdd_agent = Agent(
    name="TDD Coding Agent",
    instructions="""Follow strict test-driven development:

    1. FIRST write failing tests that define the expected behavior
    2. Run the tests to confirm they fail for the right reason
    3. Write the minimal implementation to make tests pass
    4. Run tests again - if they pass, you are done
    5. If tests fail, read the error, fix the code, and repeat from step 4

    Maximum 5 iterations of the fix-and-test loop. If tests still fail
    after 5 attempts, report what is failing and why.""",
    tools=[read_file, write_file, run_tests, grep_codebase],
    model="gpt-5.4"
)

Pattern 2: Diff-Based Output

Instead of rewriting entire files, instruct the agent to produce targeted diffs. This reduces token usage and makes changes easier to review.

diff_agent = Agent(
    name="Diff Agent",
    instructions="""When modifying code, output your changes as unified
    diffs. For each file you change, provide:

    1. The file path
    2. The exact lines being replaced (with line numbers for context)
    3. The replacement lines

    Use the write_file tool only after you have planned all changes.
    Read the file first, apply your diffs mentally, and write the complete
    updated file.""",
    tools=[read_file, write_file, grep_codebase],
    model="gpt-5.4"
)
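A concrete way to get diff-sized edits is an exact-match replace tool. The apply_edit helper below is a hypothetical illustration, not part of the Agents SDK: the agent quotes the snippet to change plus its replacement, and the tool refuses ambiguous matches.

```python
def apply_edit(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new` in a file."""
    with open(path, "r") as f:
        content = f.read()
    count = content.count(old)
    if count == 0:
        return f"Snippet not found in {path}"
    if count > 1:
        return f"Snippet appears {count} times in {path}; include more surrounding context"
    with open(path, "w") as f:
        f.write(content.replace(old, new, 1))
    return f"Edited {path}"
```

In this article's convention it would be decorated with @function_tool and added to the agent's tool list in place of (or alongside) write_file.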

Pattern 3: Codebase Indexing for Large Projects

For large codebases, build an index that the agent can query instead of reading files directly:

import hashlib
import json

class CodebaseIndex:
    def __init__(self):
        self.index: dict[str, dict] = {}

    def add_file(self, path: str, summary: str, symbols: list[str]):
        with open(path, 'rb') as f:
            file_hash = hashlib.md5(f.read()).hexdigest()
        self.index[path] = {
            "summary": summary,
            "symbols": symbols,
            "hash": file_hash  # lets the index detect stale entries later
        }

    def search(self, query: str) -> list[str]:
        """Find files relevant to a query based on summaries and symbols."""
        results = []
        query_lower = query.lower()
        for path, info in self.index.items():
            score = 0
            if query_lower in info["summary"].lower():
                score += 2
            for symbol in info["symbols"]:
                if query_lower in symbol.lower():
                    score += 1
            if score > 0:
                results.append((score, path))

        results.sort(reverse=True)
        return [path for _, path in results[:10]]

codebase_index = CodebaseIndex()

@function_tool
def search_codebase_index(query: str) -> str:
    """Search the codebase index for relevant files."""
    relevant_files = codebase_index.search(query)
    return json.dumps(relevant_files, indent=2)
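The index also needs a symbols list per file. For Python sources, one lightweight option — an illustrative helper using only the standard library `ast` module — is to pull top-level definition names:

```python
import ast

def extract_symbols(source: str) -> list[str]:
    """Return top-level function and class names from Python source code."""
    tree = ast.parse(source)
    return [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

symbols = extract_symbols("class Cart:\n    pass\n\ndef checkout():\n    pass\n")
# symbols == ["Cart", "checkout"]
```

Combined with the subagent summaries from earlier, this gives each index entry both a natural-language summary and an exact symbol list to score against.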

Measuring Coding Agent Quality

Track these metrics to evaluate your coding agent's performance:

Resolve rate: Percentage of tasks where the agent produces code that passes all tests. Target 50% or above for production use.

Iteration count: Average number of fix-and-test cycles needed. Lower is better — one-shot success is the gold standard.

Token efficiency: Total tokens consumed per successful task completion. Monitor this to control costs.

Regression rate: How often the agent's changes break existing tests. Should be under 5% in a well-configured system.

import subprocess
import time
from dataclasses import dataclass

@dataclass
class AgentMetrics:
    task_id: str
    resolved: bool
    iterations: int
    total_tokens: int
    duration_seconds: float
    tests_broken: int

def evaluate_coding_agent(agent, tasks: list[dict]) -> list[AgentMetrics]:
    metrics = []
    for task in tasks:
        start = time.time()

        result = Runner.run_sync(agent, task["description"])

        # Run the suite directly; pytest exits 0 only when all tests pass
        proc = subprocess.run(
            f"python -m pytest {task.get('test_path', '')} -v --tb=short",
            shell=True, capture_output=True, text=True
        )
        test_result = proc.stdout + proc.stderr
        resolved = proc.returncode == 0

        metrics.append(AgentMetrics(
            task_id=task["id"],
            resolved=resolved,
            # One raw model response per turn of the agent loop
            iterations=len(result.raw_responses),
            total_tokens=sum(r.usage.total_tokens for r in result.raw_responses),
            duration_seconds=time.time() - start,
            tests_broken=test_result.count("FAILED")
        ))

    return metrics
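The per-task records then roll up into the headline numbers above. A minimal sketch, repeating the AgentMetrics shape so the snippet is self-contained:

```python
from dataclasses import dataclass

@dataclass
class AgentMetrics:  # same fields as defined above
    task_id: str
    resolved: bool
    iterations: int
    total_tokens: int
    duration_seconds: float
    tests_broken: int

def summarize_metrics(metrics: list[AgentMetrics]) -> dict:
    """Aggregate per-task records into resolve rate, iteration count, etc."""
    n = len(metrics)
    return {
        "resolve_rate": sum(m.resolved for m in metrics) / n,
        "avg_iterations": sum(m.iterations for m in metrics) / n,
        "avg_tokens": sum(m.total_tokens for m in metrics) / n,
        "regression_rate": sum(m.tests_broken > 0 for m in metrics) / n,
    }
```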

FAQ

How does Codex handle large codebases that exceed the context window?

Codex uses a multi-phase approach. First, it builds an index of the codebase using GPT-5.4 mini subagents that summarize each file. Then, the main agent queries this index to identify the relevant files for a task. Only the relevant files are loaded into context. For very large changes spanning many files, Codex processes files in batches, maintaining a running state of what has been changed.

Can I build a Codex-like agent using the OpenAI Agents SDK?

Yes, and the patterns in this article give you the building blocks. The Agents SDK provides the agent loop, tool calling, and handoff infrastructure. You add the file system tools, sandboxed execution, and codebase indexing. The main architectural decisions are around sandboxing (use Docker), tool design (read/write/execute/search), and the planning-implementation-verification loop.

What prevents the coding agent from introducing security vulnerabilities?

Multiple layers of defense: sandboxed execution prevents the agent from accessing production systems, output guardrails can scan generated code for common vulnerability patterns (SQL injection, hardcoded secrets, insecure deserialization), and test suites catch functional regressions. In production systems, all agent-generated code goes through a human review step before merging.
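As a sketch of that guardrail layer, a regex-based scanner can flag the obvious cases before code reaches review. The patterns below are illustrative examples, not a complete rule set:

```python
import re

VULN_PATTERNS = {
    "hardcoded secret": r"(?i)(api_key|password|secret)\s*=\s*['\"][^'\"]+['\"]",
    "SQL string formatting": r"execute\(\s*f?['\"].*(\+|%s|\{)",
    "insecure deserialization": r"pickle\.loads\(",
}

def scan_generated_code(code: str) -> list[str]:
    """Return the names of vulnerability patterns found in the code."""
    return [
        name for name, pattern in VULN_PATTERNS.items()
        if re.search(pattern, code)
    ]

findings = scan_generated_code('password = "hunter2"\nimport pickle\npickle.loads(blob)')
# findings flags the hardcoded secret and the pickle.loads call
```

A real guardrail would typically combine pattern matching like this with a model-based review pass, since regexes alone miss context-dependent vulnerabilities.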

How do I handle tasks that require changes across multiple repositories?

This is an active area of development. The current best practice is to structure each repository as a separate workspace with its own agent instance, and use a coordinator agent that plans the cross-repo changes and orchestrates the individual agents. The coordinator ensures that interface contracts between repositories remain consistent.

Written by

CallSphere Team
