Test Generation with AI Agents: Automatic Unit and Integration Test Creation
Learn to build an AI agent that analyzes source code, identifies untested paths, and generates high-quality unit and integration tests with proper assertions, fixtures, and edge case coverage.
Why AI-Generated Tests Are Different from Copilot Suggestions
Inline code suggestions can produce individual test functions, but they lack awareness of your overall test strategy. A test generation agent analyzes your entire codebase, identifies what is already covered, finds untested code paths, and produces a coherent test suite that fills the gaps. It reasons about which tests provide the most value rather than blindly generating tests for every function.
Analyzing Code Coverage Gaps
The agent starts by understanding what exists and what is missing. It combines static analysis with coverage data to identify high-value testing targets.
import ast
import json  # used by _run_coverage (not shown)
import os
import subprocess
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()


@dataclass
class TestTarget:
    file_path: str
    function_name: str
    source_code: str
    complexity: int
    has_existing_tests: bool


class TestGenerationAgent:
    def __init__(self, source_dir: str, test_dir: str, model: str = "gpt-4o"):
        self.source_dir = source_dir
        self.test_dir = test_dir
        self.model = model

    def find_untested_functions(self) -> list[TestTarget]:
        # Maps file path -> set of uncovered line numbers
        coverage = self._run_coverage()
        targets = []
        for root, _, files in os.walk(self.source_dir):
            for fname in files:
                if not fname.endswith(".py") or fname.startswith("test_"):
                    continue
                path = os.path.join(root, fname)
                with open(path) as f:
                    source = f.read()
                tree = ast.parse(source)
                for node in ast.walk(tree):
                    if not isinstance(node, ast.FunctionDef):
                        continue
                    func_lines = set(range(node.lineno, node.end_lineno + 1))
                    uncovered = coverage.get(path, set())
                    has_gap = bool(func_lines & uncovered)
                    targets.append(TestTarget(
                        file_path=path,
                        function_name=node.name,
                        source_code=ast.get_source_segment(source, node),
                        complexity=self._calc_complexity(node),
                        has_existing_tests=not has_gap,
                    ))
        targets.sort(key=lambda t: t.complexity, reverse=True)
        return [t for t in targets if not t.has_existing_tests]
The agent prioritizes high-complexity functions without test coverage, ensuring the generated tests deliver maximum value.
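The two helpers referenced above, `_run_coverage` and `_calc_complexity`, are left undefined. Here is one way they might look as standalone functions, assuming coverage.py is installed and using its JSON report format (`missing_lines` per file); the complexity measure is a rough cyclomatic count:

```python
import ast
import json
import subprocess


def calc_complexity(node: ast.AST) -> int:
    """Rough cyclomatic complexity: 1 plus the number of branching constructs."""
    branch_types = (ast.If, ast.For, ast.While, ast.Try,
                    ast.BoolOp, ast.ExceptHandler)
    return 1 + sum(isinstance(n, branch_types) for n in ast.walk(node))


def run_coverage(report_path: str = "coverage.json") -> dict[str, set[int]]:
    """Run pytest under coverage.py and map each file to its uncovered lines."""
    subprocess.run(
        ["python", "-m", "coverage", "run", "-m", "pytest"],
        capture_output=True, text=True,
    )
    subprocess.run(
        ["python", "-m", "coverage", "json", "-o", report_path],
        capture_output=True, text=True,
    )
    with open(report_path) as f:
        report = json.load(f)
    # coverage.py's JSON report lists uncovered lines under "missing_lines"
    return {
        path: set(data["missing_lines"])
        for path, data in report.get("files", {}).items()
    }
```

Inside the class these would become `_calc_complexity` and `_run_coverage` methods; any complexity metric that ranks branchy functions higher works equally well here.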
Generating Tests with Proper Fixtures
Good tests need proper setup and teardown. The agent generates pytest fixtures alongside the test functions.
def generate_tests(self, target: TestTarget) -> str:
    system_prompt = """You are an expert Python test engineer.
Generate pytest tests for the provided function.

REQUIREMENTS:
- Use pytest fixtures for setup and teardown
- Test the happy path first, then edge cases
- Include at least one test for error handling
- Use descriptive test names: test_<function>_<scenario>
- Add brief docstrings explaining what each test verifies
- Use pytest.raises for expected exceptions
- Use parametrize for testing multiple inputs
- Mock external dependencies (database, APIs, filesystem)
- Do NOT import the function being tested. I will add the import.

Output ONLY Python test code. No markdown fences."""
    context = self._gather_context(target)
    response = client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": (
                f"Function to test:\n{target.source_code}\n\n"
                f"Context (related types and imports):\n{context}"
            )},
        ],
        temperature=0.2,
    )
    test_code = response.choices[0].message.content.strip()
    module_path = self._to_import_path(target.file_path)
    import_line = f"from {module_path} import {target.function_name}"
    return f"{import_line}\n\n{test_code}"
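`_gather_context` and `_to_import_path` are likewise not shown. These sketches implement them as module-level functions (the class methods would read the source from `target.file_path` first); both are assumptions about what the article intends:

```python
import ast
import os


def to_import_path(file_path: str, source_dir: str = ".") -> str:
    """Convert a path like ./app/services/billing.py into app.services.billing."""
    rel = os.path.relpath(file_path, source_dir)
    return os.path.splitext(rel)[0].replace(os.sep, ".")


def gather_context(source: str, max_lines: int = 40) -> str:
    """Collect module-level imports and class definitions as prompt context."""
    tree = ast.parse(source)
    lines: list[str] = []
    for node in tree.body:
        # Imports and type definitions are what generated tests usually need
        if isinstance(node, (ast.Import, ast.ImportFrom, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                lines.extend(segment.splitlines())
    return "\n".join(lines[:max_lines])
```

Capping the context at a few dozen lines keeps the prompt focused; for large modules you might instead pass only the types the target function actually references.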
Validating Test Quality
Generated tests must actually run and assert meaningful things. The agent validates each test suite by executing it and analyzing the results.
import tempfile  # add alongside the module-level imports


def validate_tests(self, test_code: str, target: TestTarget) -> dict:
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".py", dir=self.test_dir,
        prefix="test_gen_", delete=False,
    ) as f:
        f.write(test_code)
        test_path = f.name
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", test_path, "-v", "--tb=short"],
            capture_output=True, text=True, timeout=60,
        )
        passed = result.returncode == 0
        test_count = result.stdout.count(" PASSED")
        fail_count = result.stdout.count(" FAILED")
        quality = self._assess_quality(test_code)
        return {
            "passed": passed,
            "tests_passed": test_count,
            "tests_failed": fail_count,
            "output": result.stdout[-2000:],
            "quality_score": quality,
        }
    finally:
        os.unlink(test_path)


def _assess_quality(self, test_code: str) -> dict:
    has_parametrize = "@pytest.mark.parametrize" in test_code
    has_fixtures = "@pytest.fixture" in test_code
    has_edge_cases = any(
        kw in test_code.lower()
        for kw in ["none", "empty", "zero", "negative", "boundary"]
    )
    assertion_count = test_code.count("assert ")
    mock_count = test_code.count("mock") + test_code.count("patch")
    return {
        "uses_parametrize": has_parametrize,
        "uses_fixtures": has_fixtures,
        "tests_edge_cases": has_edge_cases,
        "assertion_count": assertion_count,
        "mock_count": mock_count,
    }
Running the Full Pipeline
agent = TestGenerationAgent("./app", "./tests")
untested = agent.find_untested_functions()
print(f"Found {len(untested)} untested functions")

for target in untested[:5]:
    print(f"\nGenerating tests for {target.function_name}...")
    test_code = agent.generate_tests(target)
    validation = agent.validate_tests(test_code, target)
    if validation["passed"]:
        output_path = f"./tests/test_{target.function_name}.py"
        with open(output_path, "w") as f:
            f.write(test_code)
        print(f"  Saved {validation['tests_passed']} passing tests")
    else:
        print(f"  {validation['tests_failed']} tests failed, skipping")
FAQ
How do I prevent the agent from generating trivial tests that just check if a function exists?
The quality assessment step catches this. Check for meaningful assertions: a test that only calls a function without asserting on the result contributes nothing to the assertion count. Set a minimum assertion count per test as a threshold before accepting a generated suite.
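A minimal gate along those lines, built on the quality report that `_assess_quality` returns (the 1.5 threshold is illustrative, not a recommendation):

```python
def accept_tests(quality: dict, tests_passed: int,
                 min_assertions_per_test: float = 1.5) -> bool:
    """Reject suites whose tests run functions without asserting on the results."""
    if tests_passed == 0:
        return False
    # Average assertions per passing test must clear the threshold
    ratio = quality["assertion_count"] / tests_passed
    return ratio >= min_assertions_per_test
```

In the pipeline above, this check would sit next to `validation["passed"]` before the generated file is written to disk.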
Should AI-generated tests replace hand-written tests?
No. AI-generated tests are best for establishing a baseline of coverage quickly. They catch regressions and document existing behavior. But hand-written tests are still essential for verifying business logic, testing complex interactions, and encoding domain-specific edge cases that the LLM might not infer from code alone.
How do I handle tests that need a real database or external service?
Prompt the agent to use mocks and fixtures for external dependencies by default. For integration tests, provide the agent with your existing conftest.py so it can reuse database fixtures, test clients, and service stubs that are already configured for your project.
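One way to surface those existing fixtures is to append the project's conftest.py to the prompt context. This sketch assumes the conftest lives at the top of `test_dir` and uses a simple truncation limit:

```python
import os


def conftest_context(test_dir: str, limit: int = 2000) -> str:
    """Read the project's conftest.py (if any) so generated tests reuse its fixtures."""
    path = os.path.join(test_dir, "conftest.py")
    if not os.path.exists(path):
        return ""
    with open(path) as f:
        # Truncate so a large conftest does not crowd out the function under test
        return f.read()[:limit]
```

The returned text would be concatenated onto the context string that `generate_tests` already sends, alongside the related types and imports.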
CallSphere Team