Async OpenAI Client: Building High-Throughput AI Applications
Learn how to use AsyncOpenAI with Python's asyncio to make concurrent API calls, implement connection pooling, and build high-throughput AI pipelines.
Why Async Matters for AI Applications
Synchronous OpenAI API calls block your Python thread while waiting for the response — typically 1 to 10 seconds per request. If you need to process 100 items, that means 100 sequential waits. With async programming, you can fire off many requests concurrently and process them as they complete, reducing total wall-clock time dramatically.
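The effect is easy to demonstrate with a toy sketch that replaces API calls with asyncio.sleep as a stand-in for network latency (no OpenAI calls involved; the 0.1 s delay is illustrative):

```python
import asyncio
import time

async def fake_request(i: int) -> int:
    # Stand-in for an API call that spends ~0.1 s waiting on the network
    await asyncio.sleep(0.1)
    return i

async def sequential(n: int) -> float:
    start = time.perf_counter()
    for i in range(n):
        await fake_request(i)  # each wait happens back to back
    return time.perf_counter() - start

async def concurrent(n: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(i) for i in range(n)))  # waits overlap
    return time.perf_counter() - start

seq = asyncio.run(sequential(10))   # ~1.0 s: ten waits in a row
con = asyncio.run(concurrent(10))   # ~0.1 s: all ten waits overlap
print(f"sequential: {seq:.2f}s, concurrent: {con:.2f}s")
```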
The OpenAI Python SDK ships with a fully async client that integrates seamlessly with Python's asyncio event loop.
The AsyncOpenAI Client
The async client mirrors the synchronous API exactly, but every method is a coroutine:
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def main():
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello, async world!"}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())
The AsyncOpenAI client uses httpx.AsyncClient under the hood, which provides connection pooling and HTTP/2 support automatically.
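If you need to tune the pool, the SDK accepts a custom httpx client via its http_client parameter. A minimal sketch; the limit and timeout values here are illustrative, not recommendations:

```python
import httpx
from openai import AsyncOpenAI

# Illustrative pool settings; tune to your rate limits and workload
http_client = httpx.AsyncClient(
    limits=httpx.Limits(max_connections=50, max_keepalive_connections=10),
    timeout=httpx.Timeout(30.0),
)
client = AsyncOpenAI(http_client=http_client)  # reads OPENAI_API_KEY from the environment
```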
Concurrent Requests with asyncio.gather
The biggest win comes from running multiple requests at the same time:
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def summarize(text: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the following text in one sentence."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

async def main():
    articles = [
        "Python 3.13 introduces a new JIT compiler that improves performance...",
        "The European Union's AI Act requires transparency for high-risk systems...",
        "SpaceX successfully launched its 300th Falcon 9 mission this quarter...",
        "OpenAI released GPT-4o with native multimodal capabilities...",
        "Rust adoption in enterprise backends grew by 40% in 2025...",
    ]
    # Run all 5 summaries concurrently
    summaries = await asyncio.gather(*[summarize(article) for article in articles])
    for article, summary in zip(articles, summaries):
        print(f"Original: {article[:50]}...")
        print(f"Summary: {summary}")
        print()

asyncio.run(main())
Run synchronously, these five calls would take roughly five times as long as a single request. With asyncio.gather, all five run concurrently, so the total wall-clock time is roughly that of the slowest single request.
Controlling Concurrency with Semaphores
Firing 1000 concurrent requests will hit rate limits. Use a semaphore to cap concurrency:
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(10)  # max 10 concurrent requests

async def process_item(item: str) -> str:
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Classify this feedback: {item}"}],
        )
    return response.choices[0].message.content

async def main():
    feedback_items = [f"Feedback item {i}" for i in range(100)]
    tasks = [process_item(item) for item in feedback_items]
    results = await asyncio.gather(*tasks)
    print(f"Processed {len(results)} items")

asyncio.run(main())
The semaphore ensures no more than 10 requests are in-flight at any moment, preventing rate limit errors while still processing items much faster than sequential code.
Async Streaming
Combine async with streaming for the best real-time experience:
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_chat(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
    print()

asyncio.run(stream_chat("Explain event loops in Python."))
Processing Results as They Complete
When tasks have variable completion times, asyncio.as_completed lets you handle results as they arrive:
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def analyze(text: str, index: int) -> tuple[int, str]:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the sentiment: positive, negative, or neutral."},
            {"role": "user", "content": text},
        ],
    )
    return index, response.choices[0].message.content

async def main():
    texts = [
        "This product is amazing! Best purchase ever.",
        "Terrible experience. Will never buy again.",
        "It works fine. Nothing special.",
    ]
    tasks = [analyze(text, i) for i, text in enumerate(texts)]
    for coro in asyncio.as_completed(tasks):
        index, sentiment = await coro
        print(f"Item {index}: {sentiment}")

asyncio.run(main())
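The key property of asyncio.as_completed is that results arrive in completion order, not submission order. A small standalone sketch makes this visible (asyncio.sleep standing in for requests of varying latency):

```python
import asyncio

async def task(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stand-in for a variable-latency API call
    return name

async def main() -> list[str]:
    tasks = [task("slow", 0.3), task("fast", 0.05), task("medium", 0.15)]
    finished = []
    for coro in asyncio.as_completed(tasks):
        finished.append(await coro)
    return finished

order = asyncio.run(main())
print(order)  # ['fast', 'medium', 'slow'] — completion order, not submission order
```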
Integration with FastAPI
FastAPI is natively async, making it a natural fit:
from fastapi import FastAPI
from pydantic import BaseModel
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

class AnalyzeRequest(BaseModel):
    text: str  # sent in the JSON request body

@app.post("/analyze")
async def analyze_text(req: AnalyzeRequest):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Analyze the sentiment of this text."},
            {"role": "user", "content": req.text},
        ],
    )
    return {"sentiment": response.choices[0].message.content}
FAQ
Should I create one AsyncOpenAI client or one per request?
Create one client and reuse it across all requests. The client manages an internal connection pool. Creating a new client per request wastes connections and adds overhead.
Can I mix sync and async OpenAI calls in the same application?
Yes, but keep them separate. Use OpenAI() for synchronous code and AsyncOpenAI() for async code. Do not call synchronous methods from within an async function — it blocks the event loop.
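When you truly must call blocking synchronous code from an async context, asyncio.to_thread moves it onto a worker thread so the event loop stays responsive. A sketch with a plain blocking function standing in for a synchronous OpenAI call:

```python
import asyncio
import time

def blocking_call(x: int) -> int:
    time.sleep(0.1)  # stands in for a synchronous API request
    return x * 2

async def main() -> list[int]:
    # Each blocking call runs in a worker thread; the event loop is not blocked
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_call, i) for i in range(3))
    )

results = asyncio.run(main())
print(results)  # [0, 2, 4]
```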
What is the ideal concurrency level for OpenAI API calls?
It depends on your rate limits. Check your plan's requests-per-minute (RPM) limit. A good starting point is a semaphore value of RPM divided by 6 (to account for variable request duration). Monitor 429 errors and adjust.
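As a worked example of that rule of thumb (the RPM figure here is illustrative; check your account's actual limit):

```python
rpm_limit = 500  # illustrative tier limit; use your plan's real RPM
concurrency = max(1, rpm_limit // 6)  # starting semaphore value per the RPM/6 heuristic
print(concurrency)  # 83
```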