Learn Agentic AI · 13 min read

Building an Agent Playground: Interactive Testing Environment for Prompt and Tool Development

Build a full-featured agent playground with a web UI that lets you test prompts live, tune parameters, compare model outputs side by side, and export working configurations for production deployment.

Why Build a Playground

Developing AI agents in a code editor is like writing CSS without a browser preview. You change a prompt, restart the script, re-type your test input, and wait for the response. A playground gives you a live feedback loop: edit the system prompt on the left, see the output on the right, toggle between models, adjust temperature with a slider, and compare results across configurations — all without leaving the browser.

Commercial playgrounds exist (OpenAI Playground, Anthropic Console), but they know nothing about your custom tool implementations, multi-agent handoffs, or the rest of your pipeline. Building your own gives you a testing environment tailored to your agent architecture.

Backend: The Playground API

The backend exposes endpoints for running agent configurations, managing saved presets, and streaming results.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import json
import litellm

app = FastAPI()

class PlaygroundConfig(BaseModel):
    model: str = "gpt-4o"
    system_prompt: str = "You are a helpful assistant."
    temperature: float = 0.7
    max_tokens: int = 2048
    top_p: float = 1.0
    tools: list[dict] | None = None
    user_message: str = ""

class ComparisonRequest(BaseModel):
    configs: list[PlaygroundConfig]
    user_message: str

@app.post("/api/playground/run")
async def run_config(config: PlaygroundConfig):
    """Execute a single playground configuration."""
    messages = [
        {"role": "system", "content": config.system_prompt},
        {"role": "user", "content": config.user_message},
    ]

    try:
        response = await litellm.acompletion(
            model=config.model,
            messages=messages,
            temperature=config.temperature,
            max_tokens=config.max_tokens,
            top_p=config.top_p,
        )

        return {
            "output": response.choices[0].message.content,
            "usage": {
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens,
            },
            "model": config.model,
            "finish_reason": response.choices[0].finish_reason,
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

import asyncio

@app.post("/api/playground/compare")
async def compare_configs(request: ComparisonRequest):
    """Run the same message through multiple configurations concurrently."""

    async def run_one(config: PlaygroundConfig):
        # Copy so the incoming config objects are not mutated in place.
        cfg = config.model_copy(update={"user_message": request.user_message})
        return await run_config(cfg)

    results = await asyncio.gather(
        *[run_one(c) for c in request.configs],
        return_exceptions=True,
    )

    return {
        "results": [
            r if not isinstance(r, Exception) else {"error": str(r)}
            for r in results
        ]
    }
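The intro also mentioned streaming results. One way to add it is a Server-Sent Events endpoint; the sketch below assumes litellm's `acompletion(..., stream=True)` async-iteration interface, and `sse_format` is a small helper introduced here, not part of any library:

```python
import json

def sse_format(delta: str) -> str:
    """Wrap a token delta as a Server-Sent Events data line."""
    return f"data: {json.dumps({'delta': delta})}\n\n"

# Wiring it into the app would look roughly like this (sketch, not tested
# against a live provider):
#
# from fastapi.responses import StreamingResponse
#
# @app.post("/api/playground/stream")
# async def stream_config(config: PlaygroundConfig):
#     async def token_stream():
#         response = await litellm.acompletion(
#             model=config.model,
#             messages=[
#                 {"role": "system", "content": config.system_prompt},
#                 {"role": "user", "content": config.user_message},
#             ],
#             stream=True,
#         )
#         async for chunk in response:
#             delta = chunk.choices[0].delta.content
#             if delta:
#                 yield sse_format(delta)
#         yield "data: [DONE]\n\n"
#     return StreamingResponse(token_stream(), media_type="text/event-stream")
```

The frontend can then consume the stream with `fetch` plus a reader loop, appending each delta to the output panel as it arrives.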

Preset Management

Save and load configurations so you can iterate on what works.

import sqlite3
from datetime import datetime, timezone

class PresetStore:
    def __init__(self, db_path: str = "playground.db"):
        # check_same_thread=False: FastAPI runs sync endpoints in a thread
        # pool, so this connection is shared across threads.
        self.db = sqlite3.connect(db_path, check_same_thread=False)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS presets (
                id TEXT PRIMARY KEY,
                name TEXT,
                config TEXT,
                created_at TEXT,
                updated_at TEXT
            )
        """)
        self.db.commit()

    def save_preset(self, preset_id: str, name: str, config: dict):
        now = datetime.now(timezone.utc).isoformat()
        self.db.execute(
            """INSERT INTO presets (id, name, config, created_at, updated_at)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                name = excluded.name,
                config = excluded.config,
                updated_at = excluded.updated_at""",
            (preset_id, name, json.dumps(config), now, now),
        )
        self.db.commit()

    def list_presets(self) -> list[dict]:
        rows = self.db.execute(
            "SELECT id, name, config, updated_at FROM presets ORDER BY updated_at DESC"
        ).fetchall()
        return [
            {"id": r[0], "name": r[1], "config": json.loads(r[2]), "updated_at": r[3]}
            for r in rows
        ]

presets = PresetStore()

@app.get("/api/playground/presets")
def list_presets():
    return presets.list_presets()

@app.post("/api/playground/presets/{preset_id}")
def save_preset(preset_id: str, name: str, config: PlaygroundConfig):
    # `name` arrives as a query parameter; `config` is the JSON body.
    presets.save_preset(preset_id, name, config.model_dump())
    return {"status": "saved"}

Frontend: The Playground UI

The UI has three main panels: configuration (left), conversation (center), and results/comparison (right).


// components/PlaygroundEditor.tsx
"use client";
import { useState } from "react";

interface PlaygroundState {
  model: string;
  systemPrompt: string;
  temperature: number;
  maxTokens: number;
  userMessage: string;
}

export default function PlaygroundEditor() {
  const [config, setConfig] = useState<PlaygroundState>({
    model: "gpt-4o",
    systemPrompt: "You are a helpful assistant.",
    temperature: 0.7,
    maxTokens: 2048,
    userMessage: "",
  });
  const [output, setOutput] = useState("");
  const [loading, setLoading] = useState(false);
  const [usage, setUsage] = useState<{ input_tokens: number; output_tokens: number } | null>(null);

  async function runPlayground() {
    setLoading(true);
    try {
      const res = await fetch("/api/playground/run", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: config.model,
          system_prompt: config.systemPrompt,
          temperature: config.temperature,
          max_tokens: config.maxTokens,
          user_message: config.userMessage,
        }),
      });
      if (!res.ok) {
        setOutput(`Request failed (${res.status}): ${await res.text()}`);
        return;
      }
      const data = await res.json();
      setOutput(data.output);
      setUsage(data.usage);
    } finally {
      // Always clear the loading state, even if the request throws.
      setLoading(false);
    }
  }

  return (
    <div className="grid grid-cols-3 gap-4 h-screen p-4">
      {/* Config Panel */}
      <div className="space-y-4 overflow-y-auto">
        <select
          value={config.model}
          onChange={(e) => setConfig({ ...config, model: e.target.value })}
          className="w-full p-2 border rounded"
        >
          <option value="gpt-4o">GPT-4o</option>
          <option value="gpt-4o-mini">GPT-4o Mini</option>
          <option value="claude-sonnet-4-20250514">Claude Sonnet</option>
        </select>

        <label className="block text-sm">
          Temperature: {config.temperature}
          <input
            type="range" min="0" max="2" step="0.1"
            value={config.temperature}
            onChange={(e) => setConfig({ ...config, temperature: parseFloat(e.target.value) })}
            className="w-full"
          />
        </label>

        <textarea
          value={config.systemPrompt}
          onChange={(e) => setConfig({ ...config, systemPrompt: e.target.value })}
          className="w-full h-48 p-2 border rounded font-mono text-sm"
          placeholder="System prompt..."
        />
      </div>

      {/* Input Panel */}
      <div className="flex flex-col">
        <textarea
          value={config.userMessage}
          onChange={(e) => setConfig({ ...config, userMessage: e.target.value })}
          className="flex-1 p-2 border rounded font-mono text-sm"
          placeholder="User message..."
        />
        <button
          onClick={runPlayground}
          disabled={loading}
          className="mt-2 p-2 bg-blue-500 text-white rounded disabled:opacity-50"
        >
          {loading ? "Running..." : "Run"}
        </button>
      </div>

      {/* Output Panel */}
      <div className="overflow-y-auto p-4 border rounded bg-gray-50">
        <pre className="whitespace-pre-wrap text-sm">{output}</pre>
        {usage && (
          <div className="mt-4 text-xs text-gray-500">
            Tokens: {usage.input_tokens} in / {usage.output_tokens} out
          </div>
        )}
      </div>
    </div>
  );
}

Side-by-Side Comparison Mode

The most powerful feature is running the same input through multiple configurations simultaneously.

function ComparisonMode() {
  const [configs, setConfigs] = useState<PlaygroundState[]>([
    { model: "gpt-4o-mini", systemPrompt: "Be concise.", temperature: 0.3, maxTokens: 1024, userMessage: "" },
    { model: "gpt-4o", systemPrompt: "Be thorough.", temperature: 0.7, maxTokens: 2048, userMessage: "" },
  ]);
  const [results, setResults] = useState<string[]>([]);

  async function runComparison(message: string) {
    const res = await fetch("/api/playground/compare", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        configs: configs.map((c) => ({
          model: c.model,
          system_prompt: c.systemPrompt,
          temperature: c.temperature,
          max_tokens: c.maxTokens,
        })),
        user_message: message,
      }),
    });
    const data = await res.json();
    setResults(data.results.map((r: any) => r.output || r.error));
  }

  return (
    <div className="grid" style={{ gridTemplateColumns: `repeat(${configs.length}, 1fr)` }}>
      {configs.map((config, i) => (
        <div key={i} className="p-4 border-r">
          <h3 className="font-bold">{config.model}</h3>
          <pre className="text-sm mt-2">{results[i] || "No output yet"}</pre>
        </div>
      ))}
    </div>
  );
}
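When two configurations produce long outputs, eyeballing the panels only goes so far. A similarity score computed on the backend gives a quick signal of how much they diverge; a minimal sketch using Python's stdlib `difflib` (the `similarity` helper is an assumption here, not part of the API above):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two model outputs."""
    return SequenceMatcher(None, a, b).ratio()

# Near-identical answers score close to 1.0; unrelated answers near 0.0.
score = similarity(
    "The capital of France is Paris.",
    "The capital of France is Paris!",
)
```

The compare endpoint could attach this score to each pair of results so the UI can flag configurations that produce materially different answers.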

Exporting Configurations

Once you find a configuration that works, export it as code ready for production.

@app.post("/api/playground/export")
def export_config(config: PlaygroundConfig):
    """Generate production-ready agent code from a playground config."""
    code = f'''from agents import Agent, ModelSettings

agent = Agent(
    name="Production Agent",
    instructions="""{config.system_prompt}""",
    model="{config.model}",
    model_settings=ModelSettings(
        temperature={config.temperature},
        max_tokens={config.max_tokens},
        top_p={config.top_p},
    ),
)
'''
    return {"code": code, "language": "python"}
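One caveat with the f-string above: a system prompt containing `"""` sequences or backslashes would produce a syntactically broken file. A hedged hardening sketch, using `repr()` to emit a safely escaped Python string literal (`render_instructions` is a hypothetical helper, not part of the endpoint above):

```python
def render_instructions(prompt: str) -> str:
    """Return a Python string literal that safely embeds any prompt text.

    repr() escapes quotes, backslashes, and newlines, so the generated
    code stays valid no matter what the prompt contains.
    """
    return repr(prompt)

# The export endpoint could then emit:
#     instructions={render_instructions(config.system_prompt)}
literal = render_instructions('He said """stop""" and left.')
```

The trade-off is readability: `repr()` puts long prompts on one escaped line, so for human-friendly exports you might instead escape only the quote sequences and keep the triple-quoted form.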

FAQ

How do you handle tool testing in the playground?

Add a tool definition panel where users can write tool schemas (name, description, parameters) and mock return values. When the agent calls a tool during playground execution, the system returns the mocked value instead of executing real code. This lets you test tool-calling behavior without wiring up actual integrations. Once the prompt reliably triggers the right tools, export the configuration and connect real tool implementations.
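A minimal sketch of that mocking idea: map tool names to canned return values and resolve each call from the mock table instead of running real code. The tool names and payloads below are illustrative, not part of the API above:

```python
import json

# Mocked return values keyed by tool name, as entered in the tool panel.
MOCK_RESULTS = {
    "get_weather": {"temp_c": 21, "condition": "sunny"},
    "lookup_order": {"status": "shipped", "eta_days": 2},
}

def execute_tool_call(name: str, arguments: str) -> str:
    """Return the mocked result for a tool call instead of executing code."""
    args = json.loads(arguments)  # parsed so the UI can log what was sent
    if name not in MOCK_RESULTS:
        return json.dumps({"error": f"no mock registered for {name}", "args": args})
    return json.dumps(MOCK_RESULTS[name])
```

During a playground run, whenever the model's response contains tool calls, the loop feeds each one through `execute_tool_call` and appends the result as a `tool` message before requesting the next completion.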

Should the playground support multi-turn conversations?

Yes. Store conversation history in the client state and send the full message array with each request. Add a "reset conversation" button and a "fork from here" feature that lets you branch the conversation at any message to test different follow-ups from the same point. This is essential for testing agents that maintain context across turns.
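The "fork from here" feature amounts to slicing and copying the history; a minimal sketch, with message dicts in the same OpenAI-style shape used elsewhere in this post:

```python
def fork_conversation(history: list[dict], index: int) -> list[dict]:
    """Return a new branch containing messages up to and including `index`."""
    # Shallow-copy each message so edits to one branch don't leak into another.
    return [dict(m) for m in history[: index + 1]]

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name a prime number."},
    {"role": "assistant", "content": "7"},
    {"role": "user", "content": "Double it."},
]

# Branch at the assistant's answer to test a different follow-up.
branch = fork_conversation(history, 2)
branch.append({"role": "user", "content": "Now halve it."})
```

The original `history` keeps its own follow-up, so both branches can be run and compared from the same shared prefix.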

How do you prevent playground abuse in a team setting?

Add API key scoping so each team member uses their own LLM credits. Rate-limit the compare endpoint (which multiplies costs by the number of configs). Log all playground runs with the user, configuration, and cost. Set daily cost caps per user and alert when thresholds are approached.
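A minimal sketch of the per-user daily cap. This version is in-memory; a real deployment would persist spend and compute `cost_usd` from token counts and provider pricing, and the cap value here is a placeholder:

```python
from collections import defaultdict
from datetime import date

DAILY_CAP_USD = 5.00  # placeholder; tune per team

class CostTracker:
    """Track per-user spend per day and gate runs once the cap is hit."""

    def __init__(self):
        # Keyed by (user, day) so counters reset naturally each day.
        self.spend: dict[tuple[str, date], float] = defaultdict(float)

    def record(self, user: str, cost_usd: float) -> None:
        self.spend[(user, date.today())] += cost_usd

    def allowed(self, user: str) -> bool:
        return self.spend[(user, date.today())] < DAILY_CAP_USD

tracker = CostTracker()
tracker.record("alice", 4.50)   # still under the cap
tracker.record("alice", 1.00)   # pushes alice over the cap
```

The run and compare endpoints would check `tracker.allowed(user)` before dispatching, returning a 429 when the cap is exceeded.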


#AgentPlayground #PromptEngineering #DeveloperTools #AITesting #LiveTesting #ModelComparison #AgentDevelopment #DevTools


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
