Building an Agent Playground: Interactive Testing Environment for Prompt and Tool Development
Build a full-featured agent playground with a web UI that lets you test prompts live, tune parameters, compare model outputs side by side, and export working configurations for production deployment.
Why Build a Playground
Developing AI agents in a code editor is like writing CSS without a browser preview. You change a prompt, restart the script, re-type your test input, and wait for the response. A playground gives you a live feedback loop: edit the system prompt on the left, see the output on the right, toggle between models, adjust temperature with a slider, and compare results across configurations — all without leaving the browser.
Commercial playgrounds exist (OpenAI Playground, Anthropic Console), but they offer limited support for custom tools and none for multi-agent handoffs or your specific pipeline. Building your own gives you a testing environment tailored to your agent architecture.
Backend: The Playground API
The backend exposes endpoints for running agent configurations, managing saved presets, and comparing outputs across configurations.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import asyncio
import json

import litellm

app = FastAPI()


class PlaygroundConfig(BaseModel):
    model: str = "gpt-4o"
    system_prompt: str = "You are a helpful assistant."
    temperature: float = 0.7
    max_tokens: int = 2048
    top_p: float = 1.0
    tools: list[dict] | None = None
    user_message: str = ""


class ComparisonRequest(BaseModel):
    configs: list[PlaygroundConfig]
    user_message: str


@app.post("/api/playground/run")
async def run_config(config: PlaygroundConfig):
    """Execute a single playground configuration."""
    messages = [
        {"role": "system", "content": config.system_prompt},
        {"role": "user", "content": config.user_message},
    ]
    try:
        response = await litellm.acompletion(
            model=config.model,
            messages=messages,
            temperature=config.temperature,
            max_tokens=config.max_tokens,
            top_p=config.top_p,
        )
        return {
            "output": response.choices[0].message.content,
            "usage": {
                "input_tokens": response.usage.prompt_tokens,
                "output_tokens": response.usage.completion_tokens,
            },
            "model": config.model,
            "finish_reason": response.choices[0].finish_reason,
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/api/playground/compare")
async def compare_configs(request: ComparisonRequest):
    """Run the same message through multiple configurations."""

    async def run_one(config: PlaygroundConfig):
        config.user_message = request.user_message
        return await run_config(config)

    results = await asyncio.gather(
        *[run_one(c) for c in request.configs],
        return_exceptions=True,
    )
    return {
        "results": [
            r if not isinstance(r, Exception) else {"error": str(r)}
            for r in results
        ]
    }
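The error-isolation pattern in the compare endpoint is worth seeing on its own: `asyncio.gather(..., return_exceptions=True)` turns a failing configuration into a result entry instead of sinking the whole batch. A minimal standalone sketch, with a stand-in coroutine (`fake_model_call` is hypothetical, not a real API) in place of `litellm.acompletion`:

```python
import asyncio


async def fake_model_call(model: str) -> dict:
    # Stand-in for litellm.acompletion; one model fails to show isolation.
    if model == "broken-model":
        raise RuntimeError("provider unavailable")
    return {"output": f"response from {model}"}


async def fan_out(models: list[str]) -> list[dict]:
    # return_exceptions=True keeps one failing config from aborting the rest.
    results = await asyncio.gather(
        *[fake_model_call(m) for m in models],
        return_exceptions=True,
    )
    return [
        r if not isinstance(r, Exception) else {"error": str(r)}
        for r in results
    ]


results = asyncio.run(fan_out(["gpt-4o", "broken-model"]))
print(results)
```

The healthy configuration still returns its output while the broken one is reported as an error entry, which is exactly what the comparison UI needs to render partial results.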
Preset Management
Save and load configurations so you can iterate on what works.
import sqlite3
from datetime import datetime


class PresetStore:
    def __init__(self, db_path: str = "playground.db"):
        # check_same_thread=False: FastAPI runs sync endpoints in a threadpool,
        # so the connection is used from multiple threads.
        self.db = sqlite3.connect(db_path, check_same_thread=False)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS presets (
                id TEXT PRIMARY KEY,
                name TEXT,
                config TEXT,
                created_at TEXT,
                updated_at TEXT
            )
        """)

    def save_preset(self, preset_id: str, name: str, config: dict):
        now = datetime.utcnow().isoformat()
        self.db.execute(
            """INSERT INTO presets (id, name, config, created_at, updated_at)
               VALUES (?, ?, ?, ?, ?)
               ON CONFLICT(id) DO UPDATE SET
                   name = ?, config = ?, updated_at = ?""",
            (preset_id, name, json.dumps(config), now, now,
             name, json.dumps(config), now),
        )
        self.db.commit()

    def list_presets(self) -> list[dict]:
        rows = self.db.execute(
            "SELECT id, name, config, updated_at FROM presets ORDER BY updated_at DESC"
        ).fetchall()
        return [
            {"id": r[0], "name": r[1], "config": json.loads(r[2]), "updated_at": r[3]}
            for r in rows
        ]


presets = PresetStore()


@app.get("/api/playground/presets")
def list_presets():
    return presets.list_presets()


@app.post("/api/playground/presets/{preset_id}")
def save_preset(preset_id: str, name: str, config: PlaygroundConfig):
    presets.save_preset(preset_id, name, config.model_dump())
    return {"status": "saved"}
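The upsert can be sanity-checked without the server. A minimal standalone sketch of the same `ON CONFLICT` pattern against an in-memory database (named placeholders are used here so the same value is not repeated three times; the schema matches the store above):

```python
import json
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE presets (
        id TEXT PRIMARY KEY, name TEXT, config TEXT,
        created_at TEXT, updated_at TEXT
    )
""")


def save_preset(preset_id: str, name: str, config: dict) -> None:
    now = datetime.now(timezone.utc).isoformat()
    # Named placeholders let :now fill both created_at and updated_at on insert.
    db.execute(
        """INSERT INTO presets VALUES (:id, :name, :config, :now, :now)
           ON CONFLICT(id) DO UPDATE SET
               name = :name, config = :config, updated_at = :now""",
        {"id": preset_id, "name": name, "config": json.dumps(config), "now": now},
    )
    db.commit()


save_preset("p1", "Concise", {"temperature": 0.3})
save_preset("p1", "Concise v2", {"temperature": 0.5})  # same id: update, not duplicate
rows = db.execute("SELECT name, config FROM presets").fetchall()
print(rows)
```

Saving twice under the same id leaves a single row with the newer name and config, which is the behavior the preset endpoints rely on.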
Frontend: The Playground UI
The UI has three main panels: configuration (left), conversation (center), and results/comparison (right).
// components/PlaygroundEditor.tsx
"use client";

import { useState } from "react";

interface PlaygroundState {
  model: string;
  systemPrompt: string;
  temperature: number;
  maxTokens: number;
  userMessage: string;
}

export default function PlaygroundEditor() {
  const [config, setConfig] = useState<PlaygroundState>({
    model: "gpt-4o",
    systemPrompt: "You are a helpful assistant.",
    temperature: 0.7,
    maxTokens: 2048,
    userMessage: "",
  });
  const [output, setOutput] = useState("");
  const [loading, setLoading] = useState(false);
  const [usage, setUsage] = useState<{ input_tokens: number; output_tokens: number } | null>(null);

  async function runPlayground() {
    setLoading(true);
    try {
      const res = await fetch("/api/playground/run", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: config.model,
          system_prompt: config.systemPrompt,
          temperature: config.temperature,
          max_tokens: config.maxTokens,
          user_message: config.userMessage,
        }),
      });
      const data = await res.json();
      setOutput(data.output ?? data.detail ?? "Request failed");
      setUsage(data.usage ?? null);
    } finally {
      // Always clear the spinner, even if the request throws.
      setLoading(false);
    }
  }

  return (
    <div className="grid grid-cols-3 gap-4 h-screen p-4">
      {/* Config Panel */}
      <div className="space-y-4 overflow-y-auto">
        <select
          value={config.model}
          onChange={(e) => setConfig({ ...config, model: e.target.value })}
          className="w-full p-2 border rounded"
        >
          <option value="gpt-4o">GPT-4o</option>
          <option value="gpt-4o-mini">GPT-4o Mini</option>
          <option value="claude-sonnet-4-20250514">Claude Sonnet</option>
        </select>
        <label className="block text-sm">
          Temperature: {config.temperature}
          <input
            type="range" min="0" max="2" step="0.1"
            value={config.temperature}
            onChange={(e) => setConfig({ ...config, temperature: parseFloat(e.target.value) })}
            className="w-full"
          />
        </label>
        <textarea
          value={config.systemPrompt}
          onChange={(e) => setConfig({ ...config, systemPrompt: e.target.value })}
          className="w-full h-48 p-2 border rounded font-mono text-sm"
          placeholder="System prompt..."
        />
      </div>
      {/* Input Panel */}
      <div className="flex flex-col">
        <textarea
          value={config.userMessage}
          onChange={(e) => setConfig({ ...config, userMessage: e.target.value })}
          className="flex-1 p-2 border rounded font-mono text-sm"
          placeholder="User message..."
        />
        <button
          onClick={runPlayground}
          disabled={loading}
          className="mt-2 p-2 bg-blue-500 text-white rounded disabled:opacity-50"
        >
          {loading ? "Running..." : "Run"}
        </button>
      </div>
      {/* Output Panel */}
      <div className="overflow-y-auto p-4 border rounded bg-gray-50">
        <pre className="whitespace-pre-wrap text-sm">{output}</pre>
        {usage && (
          <div className="mt-4 text-xs text-gray-500">
            Tokens: {usage.input_tokens} in / {usage.output_tokens} out
          </div>
        )}
      </div>
    </div>
  );
}
Side-by-Side Comparison Mode
The most powerful feature is running the same input through multiple configurations simultaneously.
function ComparisonMode() {
  const [configs, setConfigs] = useState<PlaygroundState[]>([
    { model: "gpt-4o-mini", systemPrompt: "Be concise.", temperature: 0.3, maxTokens: 1024, userMessage: "" },
    { model: "gpt-4o", systemPrompt: "Be thorough.", temperature: 0.7, maxTokens: 2048, userMessage: "" },
  ]);
  const [results, setResults] = useState<string[]>([]);

  async function runComparison(message: string) {
    const res = await fetch("/api/playground/compare", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        configs: configs.map((c) => ({
          model: c.model,
          system_prompt: c.systemPrompt,
          temperature: c.temperature,
          max_tokens: c.maxTokens,
        })),
        user_message: message,
      }),
    });
    const data = await res.json();
    setResults(data.results.map((r: any) => r.output || r.error));
  }

  return (
    <div className="grid" style={{ gridTemplateColumns: `repeat(${configs.length}, 1fr)` }}>
      {configs.map((config, i) => (
        <div key={i} className="p-4 border-r">
          <h3 className="font-bold">{config.model}</h3>
          <pre className="text-sm mt-2">{results[i] || "No output yet"}</pre>
        </div>
      ))}
    </div>
  );
}
Exporting Configurations
Once you find a configuration that works, export it as code ready for production.
@app.post("/api/playground/export")
def export_config(config: PlaygroundConfig):
    """Generate production-ready agent code from a playground config."""
    code = f'''from agents import Agent, ModelSettings

agent = Agent(
    name="Production Agent",
    instructions="""{config.system_prompt}""",
    model="{config.model}",
    model_settings=ModelSettings(
        temperature={config.temperature},
        max_tokens={config.max_tokens},
        top_p={config.top_p},
    ),
)
'''
    return {"code": code, "language": "python"}
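The template itself can be exercised without the server. A sketch of the same f-string rendering as a plain function over a dict (standing in for the Pydantic model), useful for checking that values land in the right slots:

```python
def render_agent_code(config: dict) -> str:
    # Mirrors the export endpoint's template, with dict access instead of
    # Pydantic attributes (sketch for testing the rendering only).
    return f'''from agents import Agent, ModelSettings

agent = Agent(
    name="Production Agent",
    instructions="""{config["system_prompt"]}""",
    model="{config["model"]}",
    model_settings=ModelSettings(
        temperature={config["temperature"]},
        max_tokens={config["max_tokens"]},
        top_p={config["top_p"]},
    ),
)
'''


code = render_agent_code({
    "system_prompt": "You are a helpful assistant.",
    "model": "gpt-4o",
    "temperature": 0.7,
    "max_tokens": 2048,
    "top_p": 1.0,
})
print(code)
```

One caveat worth noting: a system prompt containing `"""` would break the generated source, so a production exporter should escape or validate the prompt before templating.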
FAQ
How do you handle tool testing in the playground?
Add a tool definition panel where users can write tool schemas (name, description, parameters) and mock return values. When the agent calls a tool during playground execution, the system returns the mocked value instead of executing real code. This lets you test tool-calling behavior without wiring up actual integrations. Once the prompt reliably triggers the right tools, export the configuration and connect real tool implementations.
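One way to sketch that mock layer: a registry mapping tool names to a schema plus a canned return value, with an executor that returns the mock instead of running anything. Everything here (`MOCK_TOOLS`, `get_weather`, `execute_tool_call`) is a hypothetical illustration, not part of any library:

```python
import json

# Hypothetical registry: each entry pairs a tool schema with a mocked result.
MOCK_TOOLS = {
    "get_weather": {
        "schema": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
        "mock_return": {"temp_c": 21, "conditions": "sunny"},
    },
}


def execute_tool_call(name: str, arguments: str) -> str:
    """Return the canned value instead of executing real code."""
    if name not in MOCK_TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments)  # parsed to confirm the model sent valid JSON
    return json.dumps({
        "call": {"name": name, "args": args},
        "result": MOCK_TOOLS[name]["mock_return"],
    })


result = json.loads(execute_tool_call("get_weather", '{"city": "Berlin"}'))
print(result)
```

Echoing the parsed call back alongside the mocked result makes it easy to display, in the playground UI, exactly which tool the model invoked and with what arguments.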
Should the playground support multi-turn conversations?
Yes. Store conversation history in the client state and send the full message array with each request. Add a "reset conversation" button and a "fork from here" feature that lets you branch the conversation at any message to test different follow-ups from the same point. This is essential for testing agents that maintain context across turns.
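The history-plus-fork idea fits in a few lines. A minimal sketch (the `Conversation` class is illustrative, not from any framework) that keeps the message array and branches it at a given index:

```python
import copy


class Conversation:
    """Minimal multi-turn history with a 'fork from here' operation (sketch)."""

    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def fork(self, at_index: int) -> "Conversation":
        # Branch: deep-copy history up to and including message at_index,
        # so edits on the branch never touch the original conversation.
        branch = Conversation.__new__(Conversation)
        branch.messages = copy.deepcopy(self.messages[: at_index + 1])
        return branch


convo = Conversation("You are a helpful assistant.")
convo.add("user", "What is FastAPI?")
convo.add("assistant", "A Python web framework.")
convo.add("user", "Show me an example.")

branch = convo.fork(2)  # branch right after the assistant's first reply
branch.add("user", "How does it compare to Flask?")
print(len(convo.messages), len(branch.messages))
```

Each branch is an independent message array, so the client can send either one as the `messages` payload and compare how the agent handles the two follow-ups from the same shared context.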
How do you prevent playground abuse in a team setting?
Add API key scoping so each team member uses their own LLM credits. Rate-limit the compare endpoint (which multiplies costs by the number of configs). Log all playground runs with the user, configuration, and cost. Set daily cost caps per user and alert when thresholds are approached.
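A per-user daily cap can be tracked with a small in-memory accumulator. A sketch with illustrative thresholds (the $1 cap and 80% alert level are assumptions, and a real deployment would persist this rather than keep it in a dict):

```python
from datetime import date


class CostTracker:
    """Per-user daily spend cap with an early-warning threshold (sketch)."""

    def __init__(self, daily_cap_usd: float = 5.0, alert_at: float = 0.8):
        self.daily_cap = daily_cap_usd
        self.alert_at = alert_at  # alert when spend crosses this fraction of the cap
        self.spend: dict[tuple[str, date], float] = {}

    def record(self, user: str, cost_usd: float) -> dict:
        key = (user, date.today())  # spend resets naturally at midnight
        self.spend[key] = self.spend.get(key, 0.0) + cost_usd
        total = self.spend[key]
        return {
            "total": round(total, 4),
            "blocked": total >= self.daily_cap,
            "alert": total >= self.daily_cap * self.alert_at,
        }


tracker = CostTracker(daily_cap_usd=1.0)
status = tracker.record("alice", 0.85)   # crosses the 80% alert threshold
status2 = tracker.record("alice", 0.20)  # pushes past the daily cap
print(status, status2)
```

The run and compare endpoints would call `record` after each request and refuse further runs once `blocked` is true; the `alert` flag is where a Slack or email notification would hook in.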
#AgentPlayground #PromptEngineering #DeveloperTools #AITesting #LiveTesting #ModelComparison #AgentDevelopment #DevTools
CallSphere Team
Expert insights on AI voice agents and customer communication automation.