Parallel LLM Calls: When to Run Multiple Completions Simultaneously
Learn when and how to run multiple LLM completions in parallel, including fan-out patterns, cost-speed tradeoffs, result selection strategies, and timeout handling for production AI agents.
Why Run LLM Calls in Parallel
Sequential LLM calls are the default in most agent frameworks. The agent calls one model, waits for the response, processes it, then calls again. This is simple but slow. If your agent needs to gather information from three different tools and then synthesize the results, sequential execution means the total latency is the sum of all calls.
Parallel execution flips this. When calls are independent — meaning one does not depend on the output of another — you can run them simultaneously. The total latency becomes the duration of the slowest single call, not the sum.
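The sum-versus-max difference is easy to demonstrate with simulated calls, using asyncio.sleep as a stand-in for network-bound LLM latency:

```python
import asyncio
import time

async def fake_llm_call(delay: float) -> str:
    # Stand-in for a network-bound LLM request.
    await asyncio.sleep(delay)
    return f"done after {delay}s"

async def run_sequential(delays: list[float]) -> float:
    start = time.perf_counter()
    for d in delays:
        await fake_llm_call(d)
    return time.perf_counter() - start

async def run_parallel(delays: list[float]) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_llm_call(d) for d in delays))
    return time.perf_counter() - start

async def main() -> tuple[float, float]:
    delays = [0.1, 0.2, 0.3]
    return await run_sequential(delays), await run_parallel(delays)

seq_time, par_time = asyncio.run(main())
# Sequential takes roughly the sum (~0.6s); parallel takes
# roughly the slowest single call (~0.3s).
```

The same shape applies to real API calls: swap fake_llm_call for an actual client request and the latency math is identical.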
The Fan-Out Pattern
The most common parallel pattern in AI agents is fan-out: send the same or different prompts to the LLM simultaneously, then collect all results.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def fan_out_analysis(document: str) -> dict:
    """Analyze a document from three perspectives in parallel."""
    prompts = {
        "summary": f"Summarize this document in 3 sentences:\n{document}",
        "sentiment": f"What is the overall sentiment of this document? "
                     f"Reply with: positive, negative, or neutral.\n{document}",
        "key_entities": f"Extract the top 5 named entities from this document "
                        f"as a JSON list:\n{document}",
    }

    async def call_llm(name: str, prompt: str) -> tuple[str, str]:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
        )
        return name, response.choices[0].message.content

    tasks = [call_llm(name, prompt) for name, prompt in prompts.items()]
    results = await asyncio.gather(*tasks)
    return dict(results)

# Usage
analysis = await fan_out_analysis("Acme Corp reported record Q4 earnings...")
# Returns: {"summary": "...", "sentiment": "positive", "key_entities": "[...]"}
This completes in the time of the slowest single call rather than three times the average.
Best-of-N: Running the Same Prompt Multiple Times
Sometimes you want the best possible response, not just the fastest. The best-of-N pattern sends the same prompt to the LLM multiple times (or to different models) and selects the best result.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def best_of_n(prompt: str, n: int = 3, judge_prompt: str | None = None) -> str:
    """Generate N responses and select the best one."""

    async def generate_one(index: int) -> str:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,  # Higher temp for diversity
        )
        return response.choices[0].message.content

    # Generate N candidates in parallel
    candidates = await asyncio.gather(*[generate_one(i) for i in range(n)])

    # Use a judge to pick the best
    if judge_prompt is None:
        judge_prompt = "You are a quality judge. Pick the best response."
    numbered = "\n\n".join(
        f"--- Response {i+1} ---\n{c}" for i, c in enumerate(candidates)
    )
    judge_response = await client.chat.completions.create(
        model="gpt-4o-mini",  # Cheap model for judging
        messages=[
            {"role": "system", "content": judge_prompt},
            {"role": "user", "content": f"Original query: {prompt}\n\n{numbered}\n\n"
             f"Reply with ONLY the number (1 to {n}) of the best response."},
        ],
        max_tokens=5,
    )
    choice = int(judge_response.choices[0].message.content.strip()) - 1
    # Clamp in case the judge replies out of range
    return candidates[max(0, min(choice, len(candidates) - 1))]
The cost is N generation calls plus one cheap judge call, but the latency overhead versus a single call is only the judge call, since all candidates generate simultaneously.
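When the expected responses are short categorical answers (labels, yes/no, numbers) rather than free-form text, a simple majority vote among the candidates can replace the judge call entirely, trading one LLM call for a cheap local computation. A minimal sketch; majority_vote is an illustrative helper, not part of any SDK:

```python
from collections import Counter

def majority_vote(candidates: list[str]) -> str:
    """Pick the most common answer among N candidates.

    Agreement across independent samples is a cheap proxy for
    quality when outputs are short and categorical; no judge
    model is needed.
    """
    normalized = [c.strip().lower() for c in candidates]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner

majority_vote(["Positive", "positive", "negative"])  # returns "positive"
```

This only works when answers can be normalized to comparable strings; for long-form responses, the judge approach above remains the practical option.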
Timeout Handling for Parallel Calls
In production, you cannot wait indefinitely for every parallel call. Some will be slow or fail. Use asyncio.wait with timeouts to handle this gracefully.
import asyncio

async def parallel_with_timeout(tasks: list, timeout: float = 10.0) -> list:
    """Run tasks in parallel with a global timeout. Return completed results."""
    wrapped = [asyncio.ensure_future(t) for t in tasks]
    done, pending = await asyncio.wait(
        wrapped,
        timeout=timeout,
        return_when=asyncio.ALL_COMPLETED,
    )

    # Cancel any tasks that did not complete in time
    for task in pending:
        task.cancel()

    # Note: asyncio.wait returns done tasks as a set, so results
    # are not guaranteed to be in submission order.
    results = []
    for task in done:
        try:
            results.append(task.result())
        except Exception as e:
            results.append({"error": str(e)})
    return results

# Usage
tasks = [
    call_llm("summarize", doc),
    call_llm("extract_entities", doc),
    call_llm("classify", doc),
]
results = await parallel_with_timeout(tasks, timeout=8.0)
Cost-Speed Tradeoffs
Parallel calls reduce latency but multiply cost. Here is a framework for deciding when the tradeoff is worth it.
from dataclasses import dataclass

@dataclass
class ParallelDecision:
    sequential_latency_ms: float
    parallel_latency_ms: float
    cost_multiplier: float
    user_facing: bool

    @property
    def latency_savings_pct(self) -> float:
        return (1 - self.parallel_latency_ms / self.sequential_latency_ms) * 100

    def should_parallelize(self) -> bool:
        # User-facing: parallelize if saving > 30% latency
        if self.user_facing:
            return self.latency_savings_pct > 30
        # Background: only parallelize if cost multiplier < 1.5x
        return self.cost_multiplier < 1.5

# Example decision
decision = ParallelDecision(
    sequential_latency_ms=4500,
    parallel_latency_ms=1800,
    cost_multiplier=3.0,
    user_facing=True,
)
print(decision.should_parallelize())  # True: 60% latency savings for user-facing
Parallel Tool Calls in Agent Frameworks
Most modern agent frameworks support parallel tool calls natively. When the LLM decides it needs to call multiple tools, the framework runs them simultaneously.
from agents import Agent, Runner, function_tool

@function_tool
async def get_weather(city: str) -> str:
    # Simulated API call
    return f"72F and sunny in {city}"

@function_tool
async def get_news(topic: str) -> str:
    return f"Latest news about {topic}: market up 2%"

@function_tool
async def get_calendar(date: str) -> str:
    return f"3 meetings scheduled for {date}"

agent = Agent(
    name="Assistant",
    instructions="Use tools in parallel when possible.",
    tools=[get_weather, get_news, get_calendar],
)

# The LLM may request all three tools at once;
# the framework executes them in parallel automatically.
result = await Runner.run(agent, "What is the weather in NYC, today's news on AI, and my calendar for today?")
FAQ
When should I NOT parallelize LLM calls?
Do not parallelize when calls are dependent — the output of one call is the input to another. Also avoid it for background batch processing where latency does not matter but cost does, since parallel calls cost N times more. Finally, be cautious with rate limits: sending 10 parallel calls may trigger throttling.
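One way to respect rate limits while still parallelizing is to cap the number of in-flight requests with a semaphore. A sketch under that assumption; gather_limited is an illustrative helper, not a library function:

```python
import asyncio

async def gather_limited(coros, max_concurrent: int = 5):
    """Run coroutines in parallel, but never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(coro):
        # Each coroutine waits for a slot before starting
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

# Usage: 20 calls queued, but only 5 in flight at any moment
# results = await gather_limited([call_llm(p) for p in prompts], max_concurrent=5)
```

Pick max_concurrent from your provider's rate limit: with 60 requests per minute and calls averaging 5 seconds, 5 concurrent slots keeps you safely under the ceiling.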
How do I handle partial failures in parallel execution?
Use asyncio.gather(return_exceptions=True) to collect both successes and failures, then process only the successful results. For critical operations, implement a fallback strategy where you retry failed calls sequentially after the parallel batch completes.
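That fallback strategy can be sketched as follows. gather_with_retry is an illustrative helper, not a standard API; it takes zero-argument callables that return fresh coroutines, so a failed call can be re-created for the retry pass:

```python
import asyncio

async def gather_with_retry(coro_factories: list) -> list:
    """Run all calls in parallel; retry any failures sequentially afterwards."""
    # First pass: collect successes and exceptions together
    first_pass = await asyncio.gather(
        *(factory() for factory in coro_factories),
        return_exceptions=True,
    )

    results = []
    for factory, outcome in zip(coro_factories, first_pass):
        if isinstance(outcome, Exception):
            # Sequential retry for anything that failed in the parallel batch
            try:
                results.append(await factory())
            except Exception as e:
                results.append({"error": str(e)})
        else:
            results.append(outcome)
    return results
```

Because asyncio.gather preserves input order, each result lines up with its factory, which makes mapping failures back to the calls that produced them straightforward.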
Does parallel execution affect rate limits with LLM providers?
Yes. Each parallel call counts against your rate limit independently. If your rate limit is 60 requests per minute and you send 5 parallel calls per user query, you can only handle 12 user queries per minute. Monitor your rate limit headers and implement backpressure when approaching limits.
#ParallelProcessing #Concurrency #Performance #AsyncPython #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.