Building a Code Generation Agent: From Prompt to Working Code
Learn how to build an AI agent that transforms natural language requirements into working, tested code. Covers prompt decomposition, language selection, code validation, and automatic test generation.
Why Code Generation Agents Matter
Writing code from scratch is time-consuming. A code generation agent takes a natural language description of what you need, decomposes it into implementable steps, produces syntactically correct code, and validates the result by running tests. Unlike simple autocomplete tools, a true code generation agent reasons about architecture, selects appropriate patterns, and iterates until the output actually works.
The key difference between a naive "generate code" prompt and an agent is the loop: an agent generates, validates, receives feedback, and regenerates until quality criteria are met.
Architecture of a Code Generation Agent
A well-structured code generation agent has four stages: requirement parsing, code generation, validation, and iteration. Each stage feeds into the next, creating a self-correcting pipeline.
```python
import ast
import subprocess
import tempfile
from dataclasses import dataclass, field

from openai import OpenAI

client = OpenAI()


@dataclass
class CodeGenResult:
    code: str
    tests: str
    language: str
    passed: bool
    errors: list[str] = field(default_factory=list)


class CodeGenerationAgent:
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.max_iterations = 3

    def generate(self, requirement: str) -> CodeGenResult:
        language = self._detect_language(requirement)
        code = self._generate_code(requirement, language)
        tests = self._generate_tests(requirement, code, language)
        result = CodeGenResult(
            code=code, tests=tests,
            language=language, passed=False,
        )
        for attempt in range(self.max_iterations):
            validation = self._validate(result)
            if validation["passed"]:
                result.passed = True
                break
            # Retain the errors so callers can inspect the final failure.
            result.errors = validation["errors"]
            # Only spend a fix attempt if there is another validation pass left.
            if attempt < self.max_iterations - 1:
                result = self._fix_code(result, validation["errors"])
        return result
```
The generate method orchestrates the full pipeline. Notice the iteration loop: if validation fails, the agent feeds errors back into the LLM and tries again, up to a maximum number of attempts.
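The `_fix_code` helper is referenced but not shown above. A minimal sketch might feed the failing code and captured errors back to the model; here the client is passed in explicitly so the function can be exercised without a live API key, and the prompt wording is an assumption rather than a fixed recipe:

```python
def fix_code(client, model: str, code: str, errors: list[str]) -> str:
    """Ask the model to repair code given validation errors (illustrative sketch)."""
    error_report = "\n".join(errors)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": (
                "You are a code repair assistant. Fix the provided code so the "
                "reported errors no longer occur. Output ONLY the corrected code, "
                "no markdown fences."
            )},
            {"role": "user", "content": f"Code:\n{code}\n\nErrors:\n{error_report}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
```

Keeping the client as a parameter also makes the repair step easy to unit test with a stubbed response object.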
Requirement Parsing and Language Detection
Before generating any code, the agent must understand what is being asked and in which language the solution should be written.
```python
def _detect_language(self, requirement: str) -> str:
    response = client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "system", "content": (
                "Determine the programming language for this task. "
                "Respond with only the language name in lowercase. "
                "If not specified, default to python."
            )},
            {"role": "user", "content": requirement},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```
Code Generation with Structured Prompting
The core generation step uses a carefully structured system prompt that enforces coding standards and produces clean, documented output.
```python
def _generate_code(self, requirement: str, language: str) -> str:
    system_prompt = f"""You are an expert {language} developer.
Generate production-quality code for the given requirement.
Rules:
- Include type hints and docstrings
- Handle edge cases and errors
- Follow {language} conventions and idioms
- Do NOT include test code in your output
- Output ONLY the code, no markdown fences"""
    response = client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": requirement},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
```
A low temperature keeps the output consistent across runs and reduces hallucinated imports or nonexistent APIs.
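Even with "no markdown fences" in the prompt, models occasionally wrap their output in ``` anyway. A small defensive strip (an addition beyond the pipeline shown above) keeps a stray fence from reaching the validator as a syntax error:

```python
def strip_markdown_fences(text: str) -> str:
    """Remove a wrapping ``` fence if the model added one despite instructions."""
    stripped = text.strip()
    if stripped.startswith("```"):
        lines = stripped.splitlines()
        # Drop the opening fence, which may carry a language tag like ```python ...
        lines = lines[1:]
        # ... and the closing fence if one is present.
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        return "\n".join(lines).strip()
    return stripped
```

Applying this to every model response before validation is cheap insurance; it is a no-op when the model followed instructions.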
Automatic Test Generation
The agent generates tests that exercise the generated code, covering happy paths and edge cases.
```python
def _generate_tests(self, requirement: str, code: str, language: str) -> str:
    system_prompt = f"""Write {language} unit tests for the provided code.
Use pytest conventions. Cover:
- Normal inputs and expected outputs
- Edge cases (empty input, None, boundary values)
- Error conditions
Output ONLY test code, no markdown fences."""
    response = client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Requirement: {requirement}\n\nCode:\n{code}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
```
Validation and Self-Correction
The validation step actually runs the generated code and tests, capturing any errors for the next iteration.
```python
def _validate(self, result: CodeGenResult) -> dict:
    if result.language != "python":
        return self._syntax_check(result)

    with tempfile.TemporaryDirectory() as tmpdir:
        code_path = f"{tmpdir}/solution.py"
        test_path = f"{tmpdir}/test_solution.py"
        with open(code_path, "w") as f:
            f.write(result.code)
        with open(test_path, "w") as f:
            f.write(f"from solution import *\n\n{result.tests}")

        try:
            proc = subprocess.run(
                ["python", "-m", "pytest", test_path, "-v", "--tb=short"],
                capture_output=True, text=True, timeout=30,
                cwd=tmpdir,
            )
        except subprocess.TimeoutExpired:
            # A hung test run counts as a failure, not a crash of the agent.
            return {"passed": False, "errors": ["test run timed out after 30 seconds"]}

        passed = proc.returncode == 0
        errors = [] if passed else [proc.stdout + proc.stderr]
        return {"passed": passed, "errors": errors}
```
This is the crucial piece that separates an agent from a simple prompt. The code runs in an isolated temporary directory with a timeout to prevent runaway processes.
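The `_syntax_check` branch for non-Python output is not shown above. A minimal standalone sketch might shell out to whatever checker is on PATH (`node --check` is a real flag for validating JavaScript syntax) and pass through with a recorded warning when no toolchain is available; treating a missing checker as a pass is an assumption you may want to tighten in production:

```python
import shutil
import subprocess
import tempfile


def syntax_check(code: str, language: str) -> dict:
    """Best-effort syntax validation for non-Python output (sketch)."""
    if language == "javascript" and shutil.which("node"):
        with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
            f.write(code)
            path = f.name
        proc = subprocess.run(
            ["node", "--check", path],
            capture_output=True, text=True, timeout=10,
        )
        passed = proc.returncode == 0
        return {"passed": passed, "errors": [] if passed else [proc.stderr]}

    # No checker available for this language: pass through, but record the gap.
    return {"passed": True,
            "errors": [f"warning: no syntax checker available for {language}"]}
```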
FAQ
How do I prevent the agent from generating unsafe code like file deletions or network calls?
Use a sandboxed execution environment. Run validation inside a Docker container or a restricted subprocess with limited permissions. You can also add a static analysis step before execution that scans for dangerous imports like os.system, subprocess, or shutil.rmtree.
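The static analysis step mentioned above can be sketched with the standard library `ast` module, which the pipeline already imports. This checks module-level imports only (it will not catch `__import__` or `getattr` tricks), and the blocklist is illustrative, not exhaustive:

```python
import ast

# Illustrative blocklist: extend to match your threat model.
BLOCKED_MODULES = {"os", "subprocess", "shutil", "socket", "ctypes"}


def find_dangerous_imports(code: str) -> list[str]:
    """Return the names of blocklisted modules imported by the code."""
    findings = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            findings += [alias.name for alias in node.names
                         if alias.name.split(".")[0] in BLOCKED_MODULES]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in BLOCKED_MODULES:
                findings.append(node.module)
    return findings
```

Running this before any execution lets the agent reject a generation outright instead of discovering the problem inside the sandbox.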
What if the agent keeps failing after the maximum iterations?
Return the best attempt along with the remaining errors so a human can intervene. Log each iteration's code and errors for debugging. In production, you would also track failure rates per requirement type to identify systematic weaknesses in your prompts.
Can this approach work for languages other than Python?
Yes, but validation becomes harder. For compiled languages like Go or Rust, you need their toolchains available in the execution environment. For JavaScript, you can use Node.js. The generation and test creation prompts work across languages with minor adjustments.
CallSphere Team