Building a Code Generation Agent: From Prompt to Working Code
Learn how to build an AI agent that transforms natural language requirements into working, tested code. Covers prompt decomposition, language selection, code validation, and automatic test generation.
Why Code Generation Agents Matter
Writing code from scratch is time-consuming. A code generation agent takes a natural language description of what you need, decomposes it into implementable steps, produces syntactically correct code, and validates the result by running tests. Unlike simple autocomplete tools, a true code generation agent reasons about architecture, selects appropriate patterns, and iterates until the output actually works.
The key difference between a naive "generate code" prompt and an agent is the loop: an agent generates, validates, receives feedback, and regenerates until quality criteria are met.
Architecture of a Code Generation Agent
A well-structured code generation agent has four stages: requirement parsing, code generation, validation, and iteration. Each stage feeds into the next, creating a self-correcting pipeline.
```python
import ast
import subprocess
import tempfile
from dataclasses import dataclass, field

from openai import OpenAI

client = OpenAI()


@dataclass
class CodeGenResult:
    code: str
    tests: str
    language: str
    passed: bool
    errors: list[str] = field(default_factory=list)


class CodeGenerationAgent:
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.max_iterations = 3

    def generate(self, requirement: str) -> CodeGenResult:
        language = self._detect_language(requirement)
        code = self._generate_code(requirement, language)
        tests = self._generate_tests(requirement, code, language)
        result = CodeGenResult(
            code=code, tests=tests,
            language=language, passed=False,
        )
        for attempt in range(self.max_iterations):
            validation = self._validate(result)
            if validation["passed"]:
                result.passed = True
                break
            # Retain the errors so callers can inspect the final failure.
            result.errors = validation["errors"]
            # Only spend a fix attempt if there is another validation pass left.
            if attempt < self.max_iterations - 1:
                result = self._fix_code(result, validation["errors"])
        return result
```
The generate method orchestrates the full pipeline. Notice the iteration loop: if validation fails, the agent feeds errors back into the LLM and tries again, up to a maximum number of attempts.
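The `_fix_code` helper is referenced but not shown above. A minimal sketch might feed the failing code and captured errors back to the model; here the client is passed in explicitly so the function can be exercised without a live API key, and the prompt wording is an assumption rather than a fixed recipe:

```python
def fix_code(client, model: str, code: str, errors: list[str]) -> str:
    """Ask the model to repair code given validation errors (illustrative sketch)."""
    error_report = "\n".join(errors)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": (
                "You are a code repair assistant. Fix the provided code so the "
                "reported errors no longer occur. Output ONLY the corrected code, "
                "no markdown fences."
            )},
            {"role": "user", "content": f"Code:\n{code}\n\nErrors:\n{error_report}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
```

Keeping the client as a parameter also makes the repair step easy to unit test with a stubbed response object.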
Requirement Parsing and Language Detection
Before generating any code, the agent must understand what is being asked and in which language the solution should be written.
```python
def _detect_language(self, requirement: str) -> str:
    response = client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "system", "content": (
                "Determine the programming language for this task. "
                "Respond with only the language name in lowercase. "
                "If not specified, default to python."
            )},
            {"role": "user", "content": requirement},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```
Code Generation with Structured Prompting
The core generation step uses a carefully structured system prompt that enforces coding standards and produces clean, documented output.
```python
def _generate_code(self, requirement: str, language: str) -> str:
    system_prompt = f"""You are an expert {language} developer.
Generate production-quality code for the given requirement.
Rules:
- Include type hints and docstrings
- Handle edge cases and errors
- Follow {language} conventions and idioms
- Do NOT include test code in your output
- Output ONLY the code, no markdown fences"""
    response = client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": requirement},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
```
A low temperature keeps the output consistent across runs and reduces hallucinated imports or nonexistent APIs.
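Even with "no markdown fences" in the prompt, models occasionally wrap their output in ``` anyway. A small defensive strip (an addition beyond the pipeline shown above) keeps a stray fence from reaching the validator as a syntax error:

```python
def strip_markdown_fences(text: str) -> str:
    """Remove a wrapping ``` fence if the model added one despite instructions."""
    stripped = text.strip()
    if stripped.startswith("```"):
        lines = stripped.splitlines()
        # Drop the opening fence, which may carry a language tag like ```python ...
        lines = lines[1:]
        # ... and the closing fence if one is present.
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        return "\n".join(lines).strip()
    return stripped
```

Applying this to every model response before validation is cheap insurance; it is a no-op when the model followed instructions.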
Automatic Test Generation
The agent generates tests that exercise the generated code, covering happy paths and edge cases.
```python
def _generate_tests(self, requirement: str, code: str, language: str) -> str:
    system_prompt = f"""Write {language} unit tests for the provided code.
Use pytest conventions. Cover:
- Normal inputs and expected outputs
- Edge cases (empty input, None, boundary values)
- Error conditions
Output ONLY test code, no markdown fences."""
    response = client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Requirement: {requirement}\n\nCode:\n{code}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
```
Validation and Self-Correction
The validation step actually runs the generated code and tests, capturing any errors for the next iteration.
```python
def _validate(self, result: CodeGenResult) -> dict:
    if result.language != "python":
        return self._syntax_check(result)

    with tempfile.TemporaryDirectory() as tmpdir:
        code_path = f"{tmpdir}/solution.py"
        test_path = f"{tmpdir}/test_solution.py"
        with open(code_path, "w") as f:
            f.write(result.code)
        with open(test_path, "w") as f:
            f.write(f"from solution import *\n\n{result.tests}")

        try:
            proc = subprocess.run(
                ["python", "-m", "pytest", test_path, "-v", "--tb=short"],
                capture_output=True, text=True, timeout=30,
                cwd=tmpdir,
            )
        except subprocess.TimeoutExpired:
            # A hung test run counts as a failure, not a crash of the agent.
            return {"passed": False, "errors": ["test run timed out after 30 seconds"]}

        passed = proc.returncode == 0
        errors = [] if passed else [proc.stdout + proc.stderr]
        return {"passed": passed, "errors": errors}
```
This is the crucial piece that separates an agent from a simple prompt. The code runs in an isolated temporary directory with a timeout to prevent runaway processes.
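The `_syntax_check` branch for non-Python output is not shown above. A minimal standalone sketch might shell out to whatever checker is on PATH (`node --check` is a real flag for validating JavaScript syntax) and pass through with a recorded warning when no toolchain is available; treating a missing checker as a pass is an assumption you may want to tighten in production:

```python
import shutil
import subprocess
import tempfile


def syntax_check(code: str, language: str) -> dict:
    """Best-effort syntax validation for non-Python output (sketch)."""
    if language == "javascript" and shutil.which("node"):
        with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
            f.write(code)
            path = f.name
        proc = subprocess.run(
            ["node", "--check", path],
            capture_output=True, text=True, timeout=10,
        )
        passed = proc.returncode == 0
        return {"passed": passed, "errors": [] if passed else [proc.stderr]}

    # No checker available for this language: pass through, but record the gap.
    return {"passed": True,
            "errors": [f"warning: no syntax checker available for {language}"]}
```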
FAQ
How do I prevent the agent from generating unsafe code like file deletions or network calls?
Use a sandboxed execution environment. Run validation inside a Docker container or a restricted subprocess with limited permissions. You can also add a static analysis step before execution that scans for dangerous imports like os.system, subprocess, or shutil.rmtree.
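The static analysis step mentioned above can be sketched with the standard library `ast` module, which the pipeline already imports. This checks module-level imports only (it will not catch `__import__` or `getattr` tricks), and the blocklist is illustrative, not exhaustive:

```python
import ast

# Illustrative blocklist: extend to match your threat model.
BLOCKED_MODULES = {"os", "subprocess", "shutil", "socket", "ctypes"}


def find_dangerous_imports(code: str) -> list[str]:
    """Return the names of blocklisted modules imported by the code."""
    findings = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            findings += [alias.name for alias in node.names
                         if alias.name.split(".")[0] in BLOCKED_MODULES]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in BLOCKED_MODULES:
                findings.append(node.module)
    return findings
```

Running this before any execution lets the agent reject a generation outright instead of discovering the problem inside the sandbox.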
What if the agent keeps failing after the maximum iterations?
Return the best attempt along with the remaining errors so a human can intervene. Log each iteration's code and errors for debugging. In production, you would also track failure rates per requirement type to identify systematic weaknesses in your prompts.
Can this approach work for languages other than Python?
Yes, but validation becomes harder. For compiled languages like Go or Rust, you need their toolchains available in the execution environment. For JavaScript, you can use Node.js. The generation and test creation prompts work across languages with minor adjustments.
CallSphere Team