Parallel LLM Calls: When to Run Multiple Completions Simultaneously
Learn when and how to run multiple LLM completions in parallel, including fan-out patterns, cost-speed tradeoffs, result selection strategies, and timeout handling for production AI agents.
Why Run LLM Calls in Parallel
Sequential LLM calls are the default in most agent frameworks. The agent calls one model, waits for the response, processes it, then calls again. This is simple but slow. If your agent needs to gather information from three different tools and then synthesize the results, sequential execution means the total latency is the sum of all calls.
Parallel execution flips this. When calls are independent — meaning one does not depend on the output of another — you can run them simultaneously. The total latency becomes the duration of the slowest single call, not the sum.
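The sum-versus-max difference is easy to demonstrate with simulated calls, using asyncio.sleep as a stand-in for network-bound LLM latency:

```python
import asyncio
import time

async def fake_llm_call(delay: float) -> str:
    # Stand-in for a network-bound LLM request.
    await asyncio.sleep(delay)
    return f"done after {delay}s"

async def run_sequential(delays: list[float]) -> float:
    start = time.perf_counter()
    for d in delays:
        await fake_llm_call(d)
    return time.perf_counter() - start

async def run_parallel(delays: list[float]) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_llm_call(d) for d in delays))
    return time.perf_counter() - start

async def main() -> tuple[float, float]:
    delays = [0.1, 0.2, 0.3]
    return await run_sequential(delays), await run_parallel(delays)

seq_time, par_time = asyncio.run(main())
# Sequential takes roughly the sum (~0.6s); parallel takes
# roughly the slowest single call (~0.3s).
```

The same shape applies to real API calls: swap fake_llm_call for an actual client request and the latency math is identical.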
The Fan-Out Pattern
The most common parallel pattern in AI agents is fan-out: send the same or different prompts to the LLM simultaneously, then collect all results.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def fan_out_analysis(document: str) -> dict:
    """Analyze a document from three perspectives in parallel."""
    prompts = {
        "summary": f"Summarize this document in 3 sentences:\n{document}",
        "sentiment": f"What is the overall sentiment of this document? "
                     f"Reply with: positive, negative, or neutral.\n{document}",
        "key_entities": f"Extract the top 5 named entities from this document "
                        f"as a JSON list:\n{document}",
    }

    async def call_llm(name: str, prompt: str) -> tuple[str, str]:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
        )
        return name, response.choices[0].message.content

    tasks = [call_llm(name, prompt) for name, prompt in prompts.items()]
    results = await asyncio.gather(*tasks)
    return dict(results)

# Usage
analysis = await fan_out_analysis("Acme Corp reported record Q4 earnings...")
# Returns: {"summary": "...", "sentiment": "positive", "key_entities": "[...]"}
This completes in the time of the slowest single call rather than three times the average.
Best-of-N: Running the Same Prompt Multiple Times
Sometimes you want the best possible response, not just the fastest. The best-of-N pattern sends the same prompt to the LLM multiple times (or to different models) and selects the best result.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def best_of_n(prompt: str, n: int = 3, judge_prompt: str | None = None) -> str:
    """Generate N responses and select the best one."""

    async def generate_one(index: int) -> str:
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,  # Higher temp for diversity
        )
        return response.choices[0].message.content

    # Generate N candidates in parallel
    candidates = await asyncio.gather(*[generate_one(i) for i in range(n)])

    # Use a judge to pick the best
    if judge_prompt is None:
        judge_prompt = "You are a quality judge. Pick the best response."
    numbered = "\n\n".join(
        f"--- Response {i+1} ---\n{c}" for i, c in enumerate(candidates)
    )
    judge_response = await client.chat.completions.create(
        model="gpt-4o-mini",  # Cheap model for judging
        messages=[
            {"role": "system", "content": judge_prompt},
            {"role": "user", "content": f"Original query: {prompt}\n\n{numbered}\n\n"
             f"Reply with ONLY the number (1 to {n}) of the best response."},
        ],
        max_tokens=5,
    )
    choice = int(judge_response.choices[0].message.content.strip()) - 1
    # Clamp in case the judge replies out of range
    return candidates[max(0, min(choice, len(candidates) - 1))]
The cost is N generation calls plus one cheap judge call, but the latency overhead versus a single call is only the judge call, since all candidates generate simultaneously.
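When the expected responses are short categorical answers (labels, yes/no, numbers) rather than free-form text, a simple majority vote among the candidates can replace the judge call entirely, trading one LLM call for a cheap local computation. A minimal sketch; majority_vote is an illustrative helper, not part of any SDK:

```python
from collections import Counter

def majority_vote(candidates: list[str]) -> str:
    """Pick the most common answer among N candidates.

    Agreement across independent samples is a cheap proxy for
    quality when outputs are short and categorical; no judge
    model is needed.
    """
    normalized = [c.strip().lower() for c in candidates]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner

majority_vote(["Positive", "positive", "negative"])  # returns "positive"
```

This only works when answers can be normalized to comparable strings; for long-form responses, the judge approach above remains the practical option.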
Timeout Handling for Parallel Calls
In production, you cannot wait indefinitely for every parallel call. Some will be slow or fail. Use asyncio.wait with timeouts to handle this gracefully.
import asyncio

async def parallel_with_timeout(tasks: list, timeout: float = 10.0) -> list:
    """Run tasks in parallel with a global timeout. Return completed results."""
    wrapped = [asyncio.ensure_future(t) for t in tasks]
    done, pending = await asyncio.wait(
        wrapped,
        timeout=timeout,
        return_when=asyncio.ALL_COMPLETED,
    )

    # Cancel any tasks that did not complete in time
    for task in pending:
        task.cancel()

    # Note: asyncio.wait returns done tasks as a set, so results
    # are not guaranteed to be in submission order.
    results = []
    for task in done:
        try:
            results.append(task.result())
        except Exception as e:
            results.append({"error": str(e)})
    return results

# Usage
tasks = [
    call_llm("summarize", doc),
    call_llm("extract_entities", doc),
    call_llm("classify", doc),
]
results = await parallel_with_timeout(tasks, timeout=8.0)
Cost-Speed Tradeoffs
Parallel calls reduce latency but multiply cost. Here is a framework for deciding when the tradeoff is worth it.
from dataclasses import dataclass

@dataclass
class ParallelDecision:
    sequential_latency_ms: float
    parallel_latency_ms: float
    cost_multiplier: float
    user_facing: bool

    @property
    def latency_savings_pct(self) -> float:
        return (1 - self.parallel_latency_ms / self.sequential_latency_ms) * 100

    def should_parallelize(self) -> bool:
        # User-facing: parallelize if saving > 30% latency
        if self.user_facing:
            return self.latency_savings_pct > 30
        # Background: only parallelize if cost multiplier < 1.5x
        return self.cost_multiplier < 1.5

# Example decision
decision = ParallelDecision(
    sequential_latency_ms=4500,
    parallel_latency_ms=1800,
    cost_multiplier=3.0,
    user_facing=True,
)
print(decision.should_parallelize())  # True: 60% latency savings for user-facing
Parallel Tool Calls in Agent Frameworks
Most modern agent frameworks support parallel tool calls natively. When the LLM decides it needs to call multiple tools, the framework runs them simultaneously.
from agents import Agent, Runner, function_tool

@function_tool
async def get_weather(city: str) -> str:
    # Simulated API call
    return f"72F and sunny in {city}"

@function_tool
async def get_news(topic: str) -> str:
    return f"Latest news about {topic}: market up 2%"

@function_tool
async def get_calendar(date: str) -> str:
    return f"3 meetings scheduled for {date}"

agent = Agent(
    name="Assistant",
    instructions="Use tools in parallel when possible.",
    tools=[get_weather, get_news, get_calendar],
)

# The LLM may request all three tools at once;
# the framework executes them in parallel automatically.
result = await Runner.run(agent, "What is the weather in NYC, today's news on AI, and my calendar for today?")
FAQ
When should I NOT parallelize LLM calls?
Do not parallelize when calls are dependent — the output of one call is the input to another. Also avoid it for background batch processing where latency does not matter but cost does, since parallel calls cost N times more. Finally, be cautious with rate limits: sending 10 parallel calls may trigger throttling.
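One way to respect rate limits while still parallelizing is to cap the number of in-flight requests with a semaphore. A sketch under that assumption; gather_limited is an illustrative helper, not a library function:

```python
import asyncio

async def gather_limited(coros, max_concurrent: int = 5):
    """Run coroutines in parallel, but never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(coro):
        # Each coroutine waits for a slot before starting
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

# Usage: 20 calls queued, but only 5 in flight at any moment
# results = await gather_limited([call_llm(p) for p in prompts], max_concurrent=5)
```

Pick max_concurrent from your provider's rate limit: with 60 requests per minute and calls averaging 5 seconds, 5 concurrent slots keeps you safely under the ceiling.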
How do I handle partial failures in parallel execution?
Use asyncio.gather(return_exceptions=True) to collect both successes and failures, then process only the successful results. For critical operations, implement a fallback strategy where you retry failed calls sequentially after the parallel batch completes.
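That fallback strategy can be sketched as follows. gather_with_retry is an illustrative helper, not a standard API; it takes zero-argument callables that return fresh coroutines, so a failed call can be re-created for the retry pass:

```python
import asyncio

async def gather_with_retry(coro_factories: list) -> list:
    """Run all calls in parallel; retry any failures sequentially afterwards."""
    # First pass: collect successes and exceptions together
    first_pass = await asyncio.gather(
        *(factory() for factory in coro_factories),
        return_exceptions=True,
    )

    results = []
    for factory, outcome in zip(coro_factories, first_pass):
        if isinstance(outcome, Exception):
            # Sequential retry for anything that failed in the parallel batch
            try:
                results.append(await factory())
            except Exception as e:
                results.append({"error": str(e)})
        else:
            results.append(outcome)
    return results
```

Because asyncio.gather preserves input order, each result lines up with its factory, which makes mapping failures back to the calls that produced them straightforward.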
Does parallel execution affect rate limits with LLM providers?
Yes. Each parallel call counts against your rate limit independently. If your rate limit is 60 requests per minute and you send 5 parallel calls per user query, you can only handle 12 user queries per minute. Monitor your rate limit headers and implement backpressure when approaching limits.
#ParallelProcessing #Concurrency #Performance #AsyncPython #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.