Prompt Testing and Iteration: A Scientific Approach to Prompt Development
Apply rigorous testing methodology to prompt engineering — A/B test prompts, define evaluation metrics, version your prompts, and build regression test suites that prevent quality regressions in production.
Stop Guessing, Start Measuring
Most teams develop prompts by trial and error — tweak the wording, eyeball a few outputs, and ship. This works for prototypes but fails in production where prompts handle thousands of diverse inputs. Scientific prompt development means defining success metrics, building test suites, and making data-driven decisions about prompt changes.
The goal is to treat prompt development with the same rigor you apply to code: version control, automated tests, and measurable quality criteria.
Defining Evaluation Metrics
Before testing anything, define what "good" looks like. Different tasks need different metrics:
from dataclasses import dataclass
from enum import Enum

class MetricType(Enum):
    ACCURACY = "accuracy"          # Factual correctness
    RELEVANCE = "relevance"        # Stays on topic
    FORMAT_COMPLIANCE = "format"   # Follows output format
    CONCISENESS = "conciseness"    # Appropriate length
    SAFETY = "safety"              # No harmful content

@dataclass
class EvalCase:
    input_text: str
    expected_output: str | None = None          # For exact match
    expected_contains: list[str] | None = None  # Must include at least one of these
    expected_excludes: list[str] | None = None  # Must not include any of these
    max_length: int | None = None
    metric_type: MetricType = MetricType.ACCURACY
# Build a test suite for a classification task
classification_tests = [
    EvalCase(
        input_text="My order arrived broken and support won't help",
        expected_output="negative",
        metric_type=MetricType.ACCURACY,
    ),
    EvalCase(
        input_text="Works as expected, fast delivery",
        expected_output="positive",
        metric_type=MetricType.ACCURACY,
    ),
    EvalCase(
        input_text="The product is okay I guess",
        expected_output="neutral",
        metric_type=MetricType.ACCURACY,
    ),
    EvalCase(
        input_text="",  # Edge case: empty input
        expected_contains=["unable", "empty", "provide"],
        metric_type=MetricType.SAFETY,
    ),
]
Building a Prompt Test Runner
Automate prompt evaluation with a test runner that scores each prompt version:
from openai import OpenAI

class PromptEvaluator:
    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model

    def evaluate(
        self,
        system_prompt: str,
        test_cases: list[EvalCase],
    ) -> dict:
        results = []
        passed = 0
        for case in test_cases:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": case.input_text},
                ],
            )
            # Guard against a None content field before stripping
            output = (response.choices[0].message.content or "").strip()
            case_passed = self._check_case(case, output)
            if case_passed:
                passed += 1
            results.append({
                "input": case.input_text[:80],
                "expected": case.expected_output,
                "actual": output[:200],
                "passed": case_passed,
                "metric": case.metric_type.value,
            })
        total = len(test_cases)
        return {
            "total": total,
            "passed": passed,
            "failed": total - passed,
            "score": round(passed / total * 100, 1) if total > 0 else 0,
            "results": results,
        }

    def _check_case(self, case: EvalCase, output: str) -> bool:
        output_lower = output.lower()
        if case.expected_output:
            if case.expected_output.lower() != output_lower:
                return False
        if case.expected_contains:
            if not any(kw.lower() in output_lower for kw in case.expected_contains):
                return False
        if case.expected_excludes:
            if any(kw.lower() in output_lower for kw in case.expected_excludes):
                return False
        if case.max_length and len(output) > case.max_length:
            return False
        return True
A/B Testing Prompts
Compare two prompt versions against the same test suite:
def ab_test_prompts(
    prompt_a: str,
    prompt_b: str,
    test_cases: list[EvalCase],
    model: str = "gpt-4o",
) -> dict:
    evaluator = PromptEvaluator(model=model)
    print("Evaluating Prompt A...")
    results_a = evaluator.evaluate(prompt_a, test_cases)
    print("Evaluating Prompt B...")
    results_b = evaluator.evaluate(prompt_b, test_cases)
    # Ties go to A, the incumbent: only switch prompts for a strict improvement
    winner = "A" if results_a["score"] >= results_b["score"] else "B"
    return {
        "prompt_a_score": results_a["score"],
        "prompt_b_score": results_b["score"],
        "winner": winner,
        "improvement": round(
            abs(results_a["score"] - results_b["score"]), 1
        ),
        "details_a": results_a,
        "details_b": results_b,
    }
# Compare two versions of a sentiment classifier prompt
v1 = "Classify the sentiment as positive, neutral, or negative. Return only the label."
v2 = "Classify the customer review sentiment. Return exactly one word: positive, neutral, or negative. No explanation."

results = ab_test_prompts(v1, v2, classification_tests)
print(f"Winner: Prompt {results['winner']} ({results['improvement']} points better)")
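A score gap on a small suite can easily be noise. One lightweight sanity check, sketched below, is a sign test on the discordant cases (cases where exactly one prompt passed); the function name and shape here are assumptions, not part of the evaluator above:

```python
import math

# Rough significance check for an A/B result on the same test cases.
# Counts discordant cases (one prompt passed, the other failed) and applies
# a two-sided sign test; a sketch, not a full statistical treatment.
def sign_test_p_value(results_a: list[bool], results_b: list[bool]) -> float:
    a_only = sum(1 for a, b in zip(results_a, results_b) if a and not b)
    b_only = sum(1 for a, b in zip(results_a, results_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant cases: no evidence of a difference
    k = min(a_only, b_only)
    # Two-sided binomial tail probability with p = 0.5
    tail = sum(math.comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# With only 4 discordant cases, all favoring B, p = 0.125: suggestive, not conclusive
p = sign_test_p_value([True, False, False, False, False],
                      [True, True, True, True, True])
print(round(p, 3))  # 0.125
```

The practical takeaway: a few percentage points of difference on 20 cases is rarely decisive, so either grow the suite or rerun before declaring a winner.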
Prompt Versioning System
Track prompt versions with metadata for full traceability:
import hashlib
from datetime import datetime, timezone

class PromptVersionStore:
    def __init__(self):
        self.versions: list[dict] = []

    def save_version(
        self,
        name: str,
        prompt_text: str,
        test_score: float,
        notes: str = "",
    ) -> str:
        version_hash = hashlib.sha256(
            prompt_text.encode()
        ).hexdigest()[:10]
        version = {
            "name": name,
            "version_hash": version_hash,
            "prompt_text": prompt_text,
            "test_score": test_score,
            "notes": notes,
            # datetime.utcnow() is deprecated; use an aware UTC timestamp
            "created_at": datetime.now(timezone.utc).isoformat(),
            "word_count": len(prompt_text.split()),
        }
        self.versions.append(version)
        return version_hash

    def get_best(self, name: str) -> dict | None:
        matching = [v for v in self.versions if v["name"] == name]
        if not matching:
            return None
        return max(matching, key=lambda v: v["test_score"])

    def compare_versions(self, name: str) -> list[dict]:
        matching = [v for v in self.versions if v["name"] == name]
        return sorted(matching, key=lambda v: v["test_score"], reverse=True)
# Usage
store = PromptVersionStore()
store.save_version(
    name="sentiment_classifier",
    prompt_text=v1,
    test_score=results["details_a"]["score"],
    notes="Baseline version",
)
store.save_version(
    name="sentiment_classifier",
    prompt_text=v2,
    test_score=results["details_b"]["score"],
    notes="Added explicit single-word instruction",
)

best = store.get_best("sentiment_classifier")
print(f"Best version: {best['version_hash']} (score: {best['test_score']}%)")
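The store above lives only in memory. A minimal persistence sketch, assuming a JSON file as the backing format (the file name and helper functions here are illustrative, not part of the class):

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical persistence layer for an in-memory version list:
# dump the versions to JSON so history survives process restarts.
def save_store(versions: list[dict], path: str) -> None:
    Path(path).write_text(json.dumps(versions, indent=2))

def load_store(path: str) -> list[dict]:
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else []

path = os.path.join(tempfile.gettempdir(), "prompt_versions.json")
versions = [{"name": "sentiment_classifier",
             "version_hash": "ab12", "test_score": 75.0}]
save_store(versions, path)
print(load_store(path)[0]["version_hash"])  # ab12
```

In a team setting, the same data fits naturally in version control next to the code that uses the prompt, which also gives you diffs and blame for free.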
Regression Testing
Prevent prompt changes from breaking previously working cases:
class PromptRegressionSuite:
    def __init__(self, golden_cases: list[EvalCase]):
        self.golden_cases = golden_cases
        self.baseline_results: dict | None = None

    def set_baseline(self, prompt: str):
        evaluator = PromptEvaluator()
        self.baseline_results = evaluator.evaluate(prompt, self.golden_cases)
        print(f"Baseline set: {self.baseline_results['score']}% pass rate")

    def check_regression(self, new_prompt: str, threshold: float = 0.0) -> bool:
        if not self.baseline_results:
            raise ValueError("Set a baseline first with set_baseline()")
        evaluator = PromptEvaluator()
        new_results = evaluator.evaluate(new_prompt, self.golden_cases)
        regression = self.baseline_results["score"] - new_results["score"]
        if regression > threshold:
            print(f"REGRESSION DETECTED: {regression}% drop")
            # Show which cases regressed
            for old, new in zip(
                self.baseline_results["results"],
                new_results["results"]
            ):
                if old["passed"] and not new["passed"]:
                    print(f"  Broke: {old['input']}")
            return False
        print(f"No regression. Score: {new_results['score']}% "
              f"(baseline: {self.baseline_results['score']}%)")
        return True
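The pass-to-fail diff inside `check_regression` is itself pure data manipulation, so it can be pulled out and tested offline. A possible standalone helper, assuming the result-dict shape that `evaluate()` returns:

```python
# Pure diff helper: given two result dicts in the shape PromptEvaluator.evaluate
# returns, list the cases that flipped from pass to fail (the actual regressions).
def find_regressions(baseline: dict, candidate: dict) -> list[str]:
    broken = []
    for old, new in zip(baseline["results"], candidate["results"]):
        if old["passed"] and not new["passed"]:
            broken.append(old["input"])
    return broken

baseline = {"results": [{"input": "a", "passed": True},
                        {"input": "b", "passed": True}]}
candidate = {"results": [{"input": "a", "passed": True},
                         {"input": "b", "passed": False}]}
print(find_regressions(baseline, candidate))  # ['b']
```

Separating the diff from the API calls also makes it cheap to run in CI against cached evaluation results.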
The Iteration Workflow
Effective prompt development follows a systematic cycle:
- Define — Write evaluation cases before writing the prompt
- Baseline — Test your first prompt version and record the score
- Hypothesize — Identify the weakest test cases and form a theory about why they fail
- Modify — Change one aspect of the prompt at a time
- Measure — Run the full test suite, not just the cases you are trying to fix
- Decide — Accept the change only if overall score improves without regressions
- Document — Record what you changed and why in the version store
This cycle typically converges on a strong prompt within 5-10 iterations.
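The accept-or-reject decision in step 6 can be sketched as a simple loop. Everything here is illustrative: `evaluate_fn` stands in for a real full-suite run, and the candidate prompts are toy strings:

```python
# The iteration cycle above as a minimal loop, with evaluate_fn standing in
# for a real test-suite run (a sketch with stubbed scoring).
def iterate_prompt(candidates: list[str], evaluate_fn) -> tuple[str, float]:
    best_prompt, best_score = candidates[0], evaluate_fn(candidates[0])
    for prompt in candidates[1:]:
        score = evaluate_fn(prompt)
        # Accept a change only if the full-suite score strictly improves (step 6)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score

# Stub scorer: pretend these are full-suite pass rates for three prompt drafts
scores = {"v1": 60.0, "v2": 80.0, "v3": 75.0}
best, score = iterate_prompt(["v1", "v2", "v3"], lambda p: scores[p])
print(best, score)  # v2 80.0
```

Note that v3 is rejected even though it beats the original baseline: each change must beat the current best, not just the starting point.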
FAQ
How many test cases do I need?
Start with 20-30 cases that cover the full range of expected inputs, including 5-10 edge cases. For critical production prompts, build up to 100+ cases over time by adding each failure case you discover in production. Quality matters more than quantity — 30 well-chosen cases beat 200 random ones.
Should I use LLMs to evaluate LLM outputs?
Yes, for certain metrics. "LLM-as-judge" works well for evaluating relevance, tone, and helpfulness — metrics that are hard to check programmatically. For factual accuracy and format compliance, prefer deterministic checks (string matching, JSON parsing, regex). Combine both approaches for comprehensive evaluation.
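The deterministic side of that split is straightforward to implement. A sketch of two such checks, a JSON-structure validator and a label regex (the function names and key sets are illustrative):

```python
import json
import re

# Deterministic format checks of the kind recommended above: validate JSON
# structure and a label pattern before any (more expensive) LLM-as-judge pass.
def check_json_format(output: str, required_keys: set[str]) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Must be an object containing every required key
    return isinstance(data, dict) and required_keys <= data.keys()

def check_label(output: str) -> bool:
    # Accept exactly one of the three sentiment labels, case-insensitive
    return re.fullmatch(r"(positive|neutral|negative)",
                        output.strip().lower()) is not None

print(check_json_format('{"label": "positive", "score": 0.9}', {"label"}))  # True
print(check_label("Positive"))       # True
print(check_label("Very positive"))  # False
```

Running cheap checks like these first also keeps LLM-as-judge costs down: only outputs that pass the format gate need a judged pass.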
How do I handle non-deterministic outputs?
Run each test case 3-5 times and use the majority result. If a prompt passes 3 out of 5 runs, it has a 60% reliability rate for that case. Set your acceptance threshold (e.g., 80% reliability per case) and iterate until all cases meet it. Lower the temperature to reduce variability when consistency matters more than creativity.
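The majority-vote idea can be sketched in a few lines, with a stubbed `run_fn` standing in for a live model call:

```python
from collections import Counter

# Majority-vote stabilization: run the same case n times and take the modal
# answer; run_fn is a stand-in for a real model call.
def majority_vote(run_fn, n: int = 5) -> tuple[str, float]:
    outputs = [run_fn() for _ in range(n)]
    label, count = Counter(outputs).most_common(1)[0]
    return label, count / n  # the answer plus its reliability rate

# Simulate 5 runs of the same case: 4 "negative", 1 "neutral"
answers = iter(["negative", "negative", "neutral", "negative", "negative"])
label, reliability = majority_vote(lambda: next(answers))
print(label, reliability)  # negative 0.8
```

The reliability rate returned alongside the label is exactly the per-case figure you would compare against your acceptance threshold.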
CallSphere Team