Building an AI Software Engineer: Lessons from SWE-bench
Analysis of the SWE-bench benchmark for AI coding agents, what it reveals about the state of automated software engineering, and practical lessons for building production coding assistants from the top-performing systems.
What Is SWE-bench and Why Does It Matter?
SWE-bench is a benchmark created by researchers at Princeton that tests whether AI systems can solve real software engineering tasks. Unlike benchmarks such as HumanEval, which test isolated function generation, SWE-bench presents the AI with actual GitHub issues from popular open-source Python repositories and asks it to produce a patch that resolves the issue and passes the repository's test suite.
The benchmark includes over 2,000 tasks drawn from repositories like Django, Flask, scikit-learn, matplotlib, sympy, and requests. Each task requires the AI to understand a bug report or feature request, navigate a large codebase (often hundreds of thousands of lines), identify the relevant files, and produce a working code change.
As of early 2026, the best-performing systems achieve roughly 50-60% on the full SWE-bench benchmark and around 40-50% on SWE-bench Verified (a human-validated subset designed to filter out ambiguous tasks). These numbers represent a dramatic improvement from late 2023 when the best systems scored around 4%.
What SWE-bench Tests vs. What It Does Not
What it tests well:
- Bug localization in large codebases
- Understanding natural-language issue descriptions
- Reading and comprehending existing code
- Generating minimal, correct patches
- Test-driven development (patches must pass existing tests)
What it does not test:
- Writing code from scratch for new projects
- System design and architecture decisions
- Multi-file refactoring across deeply connected modules
- Performance optimization
- Long-running tasks requiring hours of human developer time
- Collaboration, communication, and code review
Understanding these boundaries is critical because many teams extrapolate SWE-bench scores into claims about "AI software engineers" that go far beyond what the benchmark measures.
Architecture of Top-Performing Systems
The systems that perform best on SWE-bench share a common architecture with three components: a retrieval layer, a planning layer, and an execution layer.
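Before looking at each layer in turn, it helps to see how the three compose. The sketch below is purely illustrative: the `AgentPipeline` name and the injected-callable design are assumptions, not a description of any specific system.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentPipeline:
    """Wires the three layers together; each layer is an injected callable."""
    retrieve: Callable[[str], list[str]]   # issue -> relevant file paths
    plan: Callable[[str, list[str]], str]  # issue + files -> fix plan
    execute: Callable[[str, str], dict]    # issue + plan -> patch result

    def run(self, issue: str) -> dict:
        files = self.retrieve(issue)
        plan = self.plan(issue, files)
        return self.execute(issue, plan)

# Toy wiring with stub layers, just to show the data flow
pipeline = AgentPipeline(
    retrieve=lambda issue: ["db/query.py"],
    plan=lambda issue, files: f"fix {files[0]}",
    execute=lambda issue, plan: {"success": True, "plan": plan},
)
print(pipeline.run("QuerySet.filter raises TypeError"))
# {'success': True, 'plan': 'fix db/query.py'}
```

The injected-callable shape makes each layer swappable and testable in isolation, which matters when you want to A/B different retrieval strategies.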
The Retrieval Layer
The first challenge is finding the relevant code in a large repository. Top systems use a combination of techniques.
```python
import anthropic

# Assumes ANTHROPIC_API_KEY is set in the environment
client = anthropic.Anthropic()

class CodebaseRetriever:
    """Multi-strategy code retrieval for large repositories."""

    def __init__(self, repo_path: str):
        self.repo_path = repo_path
        self.file_index = self._build_file_index()

    def retrieve_context(self, issue_description: str) -> list[str]:
        """Find relevant files using multiple strategies."""
        candidates = set()

        # Strategy 1: Keyword extraction from issue text
        keywords = self._extract_keywords(issue_description)
        candidates.update(self._grep_search(keywords))

        # Strategy 2: File path mentions in the issue
        mentioned_files = self._extract_file_paths(issue_description)
        candidates.update(mentioned_files)

        # Strategy 3: Stack trace parsing
        stack_files = self._parse_stack_traces(issue_description)
        candidates.update(stack_files)

        # Strategy 4: Semantic search over function/class names
        semantic_matches = self._semantic_search(issue_description)
        candidates.update(semantic_matches)

        # Rank by relevance and return top results
        ranked = self._rank_candidates(candidates, issue_description)
        return ranked[:20]  # Top 20 most relevant files

    def _extract_keywords(self, text: str) -> list[str]:
        """Extract technical keywords from issue description."""
        # Use an LLM to extract the most relevant search terms
        response = client.messages.create(
            model="claude-haiku-4-20250514",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"Extract 5-10 technical keywords or function names "
                           f"from this bug report that would help locate the "
                           f"relevant source code:\n\n{text}"
            }]
        )
        return [kw.strip() for kw in response.content[0].text.split("\n") if kw.strip()]
```

The `_build_file_index`, `_grep_search`, `_extract_file_paths`, `_parse_stack_traces`, `_semantic_search`, and `_rank_candidates` helpers are elided here.
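Of the four strategies, stack-trace parsing is often the highest-precision signal when an issue quotes a traceback: the faulty file is usually one of the frames. A minimal sketch of what a `_parse_stack_traces`-style helper might do for Python tracebacks (the regex and the standalone function name are assumptions):

```python
import re

def parse_stack_traces(issue_text: str) -> set[str]:
    """Extract file paths from Python tracebacks quoted in an issue.

    Traceback frames look like:  File "django/db/models/query.py", line 42
    """
    pattern = r'File "([^"]+\.py)", line \d+'
    return set(re.findall(pattern, issue_text))

issue = '''
Traceback (most recent call last):
  File "django/db/models/query.py", line 1218, in filter
  File "django/db/models/sql/compiler.py", line 504, in execute_sql
TypeError: unsupported operand type(s)
'''
print(sorted(parse_stack_traces(issue)))
# ['django/db/models/query.py', 'django/db/models/sql/compiler.py']
```

In practice the extracted paths still need to be resolved against the repository root, since tracebacks may contain absolute or virtualenv paths.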
The Planning Layer
Once the relevant code is retrieved, the system plans its approach before writing any code.
```python
# Assumes ANTHROPIC_API_KEY is set in the environment
async_client = anthropic.AsyncAnthropic()

async def plan_fix(issue: str, relevant_files: dict[str, str]) -> dict:
    """Generate a fix plan before writing code."""
    file_context = "\n".join(
        f"=== {path} ===\n{content[:3000]}"  # Truncate large files
        for path, content in relevant_files.items()
    )
    response = await async_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=12000,  # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=[{
            "role": "user",
            "content": f"""Analyze this GitHub issue and plan a fix.

## Issue
{issue}

## Relevant Source Code
{file_context}

Create a fix plan:
1. Root cause analysis - what exactly is broken and why
2. Which file(s) need to be changed
3. What specific changes are needed in each file
4. What edge cases should the fix handle
5. How to verify the fix is correct"""
        }]
    )
    # With extended thinking enabled, the response begins with thinking
    # blocks, so take the text block rather than content[0]
    plan_text = next(b.text for b in response.content if b.type == "text")
    return {"plan": plan_text}
```
The Execution Layer
The execution layer generates the actual patch. The best systems iterate: generate a patch, run tests, and if tests fail, analyze the failure and try again.
```python
async def iterative_fix(
    issue: str,
    plan: str,
    repo_path: str,
    max_attempts: int = 3
) -> dict:
    """Generate and iteratively refine a fix."""
    attempts = []
    for attempt in range(max_attempts):
        # Generate or refine the patch; always carry the issue and plan
        # so refinement turns keep the original goal in context
        if attempt == 0:
            prompt = f"Issue:\n{issue}\n\nBased on this plan, generate a git diff:\n{plan}"
        else:
            last_failure = attempts[-1].get("test_output", attempts[-1].get("error", ""))
            prompt = (
                f"Issue:\n{issue}\n\nPlan:\n{plan}\n\n"
                f"The previous fix attempt failed. Test output:\n{last_failure}"
                f"\n\nRevise the fix to address the test failure."
            )
        response = await async_client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=8192,
            messages=[{"role": "user", "content": prompt}]
        )
        patch = extract_diff(response.content[0].text)

        # Apply and test (extract_diff, apply_patch, and run_tests are
        # harness helpers assumed to be defined elsewhere)
        apply_result = apply_patch(repo_path, patch)
        if not apply_result.success:
            attempts.append({"patch": patch, "error": "patch_failed"})
            continue

        test_result = run_tests(repo_path)
        attempts.append({
            "patch": patch,
            "test_passed": test_result.passed,
            "test_output": test_result.output
        })
        if test_result.passed:
            return {"success": True, "patch": patch, "attempts": len(attempts)}
    return {"success": False, "attempts": attempts}
```
Key Lessons from the Top Systems
Lesson 1: Retrieval Quality Trumps Reasoning Quality
The single biggest predictor of success is whether the system finds the right files. If the correct file is in the context, even smaller models can often generate the fix. If the correct file is missing, even the strongest model will hallucinate a plausible but wrong solution.
Top systems spend 40-50% of their compute budget on retrieval and context construction, not on code generation.
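A cheap way to spend part of that budget is a lexical ranking pass over the candidate set. The scoring scheme below is an illustrative assumption of what a `_rank_candidates`-style step might do, not a description of any particular system:

```python
import re

def rank_candidates(candidates: dict[str, str], issue: str) -> list[str]:
    """Rank candidate files by how many issue terms they contain.

    candidates maps file path -> file content.
    """
    # Terms: identifier-like tokens of 3+ chars from the issue text
    terms = set(re.findall(r"[A-Za-z_]\w{2,}", issue.lower()))

    def score(item: tuple[str, str]) -> int:
        path, content = item
        hits = sum(1 for t in terms if t in content.lower())
        # Bonus when an issue term appears in the path itself
        path_hits = sum(1 for t in terms if t in path.lower())
        return hits + 2 * path_hits

    ranked = sorted(candidates.items(), key=score, reverse=True)
    return [path for path, _ in ranked]

files = {
    "utils/strings.py": "def slugify(text): ...",
    "db/query.py": "class QuerySet:\n    def filter(self, **kw): ...",
}
print(rank_candidates(files, "QuerySet.filter raises TypeError on query"))
# ['db/query.py', 'utils/strings.py']
```

Real systems layer embedding-based rerankers on top, but even this kind of term overlap prunes most false candidates before expensive model calls.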
Lesson 2: Iterative Refinement Adds 15-20% Accuracy
Systems that run tests after generating a patch and iterate on failures outperform single-shot systems by 15-20 percentage points. The key insight is that test error messages are highly informative. A failing test tells the model exactly what is still wrong.
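Because raw test logs can run to thousands of lines, harnesses typically compress them before re-prompting. A minimal sketch of such a filter for pytest-style output (the marker list is an assumption, not a standard):

```python
def summarize_test_output(output: str, max_lines: int = 40) -> str:
    """Keep failure-relevant lines from pytest output; drop passing noise."""
    keep_markers = ("FAILED", "ERROR", "E ", "assert")
    lines = [
        line for line in output.splitlines()
        if line.strip().startswith(keep_markers) or "FAILED" in line
    ]
    return "\n".join(lines[:max_lines])

raw = """tests/test_query.py::test_filter PASSED
tests/test_query.py::test_exclude FAILED
E       assert qs.count() == 2
E       TypeError: unsupported operand type(s)
"""
print(summarize_test_output(raw))
```

This keeps the `FAILED` line and the `E `-prefixed detail lines while dropping the `PASSED` entry, so the model's next turn sees only the signal it can act on.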
Lesson 3: Planning Before Coding Matters
Systems that generate an explicit plan before writing code outperform those that go directly from issue to patch. The planning step forces the model to commit to a root cause hypothesis before getting lost in code generation.
Lesson 4: Context Window Management Is Critical
Real repositories have files with thousands of lines. Naively stuffing entire files into the context window wastes tokens and dilutes the model's attention. Top systems carefully select which functions, classes, and code sections to include.
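One way to do this selection, sketched below under the assumption that the files are Python source, is to parse each file and keep only the top-level definitions that mention a term from the issue, collapsing everything else to a stub:

```python
import ast

def trim_file(source: str, terms: set[str]) -> str:
    """Keep only top-level defs/classes that mention an issue term;
    collapse the rest to one-line stubs to save context tokens."""
    tree = ast.parse(source)
    kept = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if any(t in segment.lower() for t in terms):
                kept.append(segment)
            else:
                kept.append(f"# ... {node.name} elided ...")
        else:
            kept.append(ast.get_source_segment(source, node))
    return "\n\n".join(kept)

source = '''def slugify(text):
    return text.lower().replace(" ", "-")

def filter_queryset(qs, **kwargs):
    return qs.filter(**kwargs)
'''
trimmed = trim_file(source, {"filter", "queryset"})
print(trimmed)
```

The unrelated `slugify` collapses to a stub while `filter_queryset` survives in full; over a few thousand-line file, this kind of pruning can cut context by an order of magnitude.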
Lesson 5: The Hardest Tasks Require Architectural Understanding
The tasks where all systems fail typically require understanding how multiple modules interact. Fixing a bug in Django's ORM that manifests in the template rendering layer requires understanding the full request lifecycle. Current systems struggle with this level of architectural reasoning.
From SWE-bench to Production Coding Assistants
SWE-bench optimizes for a specific scenario: given an issue and a test suite, produce a patch. Production coding assistants face additional challenges.
- No test suite: Many real-world bugs do not have corresponding tests. The agent must generate both the fix and the verification.
- Ambiguous requirements: Real feature requests are vague. SWE-bench issues are relatively well-specified.
- Multi-language codebases: SWE-bench is Python-only. Production systems must handle TypeScript, Go, Rust, and mixed environments.
- Long-running context: Developers interact with coding assistants over hours. Context management across long sessions is a different problem than single-task patching.
The lessons from SWE-bench are still valuable for production systems, but they must be adapted with these differences in mind.
Summary
SWE-bench has become the standard benchmark for evaluating AI coding agents, and the rapid progress from 4% to over 50% accuracy in just two years demonstrates the potential of agentic approaches to software engineering. The key architectural patterns are multi-strategy retrieval, explicit planning before coding, and iterative refinement with test feedback. For teams building production coding assistants, the most transferable lesson is that finding the right code to examine matters more than how smart the model is at generating patches.