# Serverless AI: Running LLM Workloads on AWS Lambda and Cloud Functions

Explore the architecture, limitations, and practical patterns for running LLM inference and AI workloads on serverless platforms like AWS Lambda and Google Cloud Functions.
## Serverless Meets AI: Opportunity and Constraints
Serverless computing promises automatic scaling, zero idle costs, and operational simplicity. AI workloads demand high memory, long execution times, and GPU access. These two worlds seem incompatible -- and for self-hosted model inference, they largely are. But for applications that call external LLM APIs (Anthropic, OpenAI, Google), serverless platforms offer a compelling deployment model.
The key insight is that most production AI applications are not running inference locally. They are orchestrating API calls, processing results, managing conversation state, and integrating with other services. These orchestration workloads are an excellent fit for serverless.
## Architecture Patterns

### Pattern 1: API Gateway + Lambda for LLM Orchestration

The most common pattern uses Lambda functions as the orchestration layer that calls external LLM APIs:
```python
# lambda_function.py
import json
import os
from typing import Any

import anthropic

# Initialized once per execution environment, outside the handler
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def handler(event: dict, context: Any) -> dict:
    """Lambda handler for an LLM-powered API endpoint."""
    # "body" can be None for non-proxy invocations, so guard against it
    body = json.loads(event.get("body") or "{}")
    user_query = body.get("query", "")
    if not user_query:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "query is required"})
        }
    try:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            messages=[{"role": "user", "content": user_query}]
        )
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({
                "answer": response.content[0].text,
                "usage": {
                    "input_tokens": response.usage.input_tokens,
                    "output_tokens": response.usage.output_tokens
                }
            })
        }
    except anthropic.RateLimitError:
        return {"statusCode": 429, "body": json.dumps({"error": "Rate limited"})}
    except anthropic.APIError as e:
        return {"statusCode": 502, "body": json.dumps({"error": str(e)})}
```
### Pattern 2: Step Functions for Multi-Step AI Pipelines

For complex AI workflows that exceed Lambda's 15-minute timeout or require branching logic, AWS Step Functions can orchestrate multiple Lambda functions:
```json
{
  "Comment": "RAG Pipeline with Step Functions",
  "StartAt": "ParseQuery",
  "States": {
    "ParseQuery": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456:function:parse-query",
      "Next": "ParallelRetrieval"
    },
    "ParallelRetrieval": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "VectorSearch",
          "States": {
            "VectorSearch": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456:function:vector-search",
              "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
              "End": true
            }
          }
        },
        {
          "StartAt": "KeywordSearch",
          "States": {
            "KeywordSearch": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456:function:keyword-search",
              "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
              "End": true
            }
          }
        }
      ],
      "Next": "MergeAndSynthesize"
    },
    "MergeAndSynthesize": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456:function:llm-synthesize",
      "TimeoutSeconds": 120,
      "Next": "Done"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}
```
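The Parallel state hands MergeAndSynthesize an array with one element per branch. Before calling the LLM, that function typically deduplicates and ranks the combined hits. A hedged sketch of the merge step, assuming each branch returns a list of `{"id", "score"}` dicts (this shape is illustrative, not prescribed by Step Functions):

```python
def merge_results(branch_outputs: list[list[dict]], top_k: int = 5) -> list[dict]:
    """Merge parallel retrieval branches, keeping the best score per document."""
    best: dict[str, dict] = {}
    for branch in branch_outputs:
        for hit in branch:
            doc_id = hit["id"]
            if doc_id not in best or hit["score"] > best[doc_id]["score"]:
                best[doc_id] = hit
    # Highest-scoring documents first, truncated to top_k for the LLM prompt
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]
```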
### Pattern 3: Event-Driven AI Processing

Use Lambda with SQS or EventBridge for asynchronous AI workloads such as document processing, email analysis, or batch summarization:
```python
# Triggered by SQS messages containing documents to process
def document_processor(event: dict, context: Any) -> dict:
    """Process documents asynchronously via an SQS trigger."""
    results = []
    for record in event["Records"]:
        message = json.loads(record["body"])
        doc_id = message["document_id"]
        doc_text = fetch_document(doc_id)  # application-specific helper
        # Summarize with a small, fast model
        summary = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": f"Summarize this document in 3 sentences:\n\n{doc_text[:10000]}"
            }]
        )
        # Store the result (application-specific helper)
        store_summary(doc_id, summary.content[0].text)
        results.append({"doc_id": doc_id, "status": "processed"})
    return {"processed": len(results)}
```
## Lambda Constraints and Workarounds

### Timeout Limits

AWS Lambda caps execution at 15 minutes. A single LLM API call with a large context can take 30-60 seconds, and complex multi-step pipelines can exceed the limit.
Workarounds:
- Use Step Functions to chain multiple Lambda invocations
- Use Lambda response streaming via function URLs, so clients start receiving output before the function finishes (note that streaming does not extend the 15-minute timeout)
```python
# Streaming LLM output from a Python Lambda.
# Note: native Lambda response streaming is only available in the Node.js
# managed runtime. For Python, a common approach is an ASGI app (FastAPI here)
# behind the AWS Lambda Web Adapter layer, fronted by a function URL
# configured with InvokeMode: RESPONSE_STREAM.
import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = anthropic.Anthropic()

@app.get("/stream")
def stream_answer(query: str):
    def generate():
        with client.messages.stream(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            messages=[{"role": "user", "content": query}]
        ) as stream:
            for text in stream.text_stream:
                yield text
    return StreamingResponse(generate(), media_type="text/plain")
```
### Memory Limits

Lambda supports up to 10 GB of memory. AI workloads that load embeddings, models, or large datasets into memory can hit this ceiling.
Workarounds:
- Use external services for heavy computation (managed vector databases, embedding APIs)
- Stream data from S3 instead of loading it all into memory
- Use Lambda Layers for shared dependencies to reduce package size
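"Stream data from S3" concretely means iterating the object body in fixed-size chunks rather than calling `.read()` once. boto3's `StreamingBody` exposes this directly via `iter_chunks`, but the same pattern works over any file-like object, which is what this self-contained sketch uses:

```python
import io

def iter_chunks(body, chunk_size: int = 1 << 20):
    """Yield fixed-size chunks from a file-like object without loading it all."""
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        yield chunk

# With boto3 the body would come from:
#   body = s3.get_object(Bucket=..., Key=...)["Body"]
# Here an in-memory stand-in demonstrates the same interface:
fake_body = io.BytesIO(b"x" * 2_500_000)
total = sum(len(c) for c in iter_chunks(fake_body))  # bytes seen, 1 MB at a time
```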
### Cold Start Latency

Lambda cold starts add 1-5 seconds of latency. For AI applications where users expect fast responses, this is significant.
Workarounds:
- Use provisioned concurrency to keep functions warm
- Use SnapStart (available for Java, Python, and .NET) to restore from a pre-initialized snapshot
- Initialize API clients outside the handler function
```python
# Initialize the client OUTSIDE the handler for connection reuse
client = anthropic.Anthropic()

def handler(event, context):
    # client is reused across invocations in the same execution environment
    response = client.messages.create(...)
    return response
```
## Cost Comparison: Serverless vs. Containers
| Factor | Lambda | ECS/Fargate | EKS |
|---|---|---|---|
| Idle cost | $0 | $0 (Fargate) | ~$70/mo (control plane) |
| Per-request cost | $0.0000133/GB-s | ~$0.000004/vCPU-s | ~$0.000003/vCPU-s |
| Scale-to-zero | Yes | Yes (Fargate) | With KEDA |
| Cold start | 1-5s | 30-60s | 30-60s (new pods) |
| Max memory | 10 GB | 120 GB | Node-dependent |
| Max timeout | 15 min | Unlimited | Unlimited |
| GPU support | No | Yes | Yes |
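The per-request numbers above compound quickly for LLM workloads, because Lambda bills for every second spent waiting on the external API. A back-of-envelope calculation using the table's Lambda GB-second rate (close to the arm64 price; request fees and the free tier are ignored):

```python
GB_SECOND_RATE = 0.0000133  # Lambda price per GB-second, from the table above

def lambda_monthly_cost(requests: int, avg_duration_s: float, memory_gb: float) -> float:
    """Estimate monthly Lambda compute cost; duration includes API wait time."""
    return requests * avg_duration_s * memory_gb * GB_SECOND_RATE

# 1M requests/month, each spending ~2s waiting on an LLM API, at 1 GB memory:
cost = lambda_monthly_cost(1_000_000, 2.0, 1.0)  # roughly $27/month
```

At low or bursty volume this beats an always-on container, but at sustained high throughput the wait-time billing is exactly where containers pull ahead.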
When to choose serverless for AI:
- Low to moderate request volume (under 10,000 concurrent)
- API-calling workloads (not self-hosted inference)
- Bursty traffic patterns with periods of zero usage
- Teams that want minimal infrastructure management
When to choose containers:
- Self-hosted model inference requiring GPUs
- Sustained high-throughput workloads
- Complex stateful pipelines exceeding 15 minutes
- Applications requiring more than 10 GB memory
## Google Cloud Functions and Azure Functions

The same patterns translate across cloud providers:
```python
# Google Cloud Function (2nd gen)
import functions_framework
from anthropic import Anthropic

client = Anthropic()

@functions_framework.http
def ai_endpoint(request):
    data = request.get_json()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": data["query"]}]
    )
    return {"answer": response.content[0].text}
```
Google Cloud Functions 2nd gen supports up to 60 minutes of execution time for HTTP-triggered functions and up to 32 GB of memory, making it more suitable than Lambda for longer AI workloads.
## Production Checklist for Serverless AI
- Set concurrency limits to avoid hitting LLM API rate limits
- Configure dead-letter queues for failed async processing
- Use structured logging (JSON) for observability
- Set memory to 1-2 GB minimum for Python AI workloads (faster cold starts)
- Enable X-Ray/Cloud Trace for end-to-end request tracing
- Store API keys in Secrets Manager, not environment variables
- Set reserved concurrency to prevent runaway scaling costs
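Concurrency limits and LLM rate limits interact: when the provider returns 429, blind retries only amplify the load. A client-side jittered exponential backoff is a common complement to reserved concurrency. A minimal sketch over a generic callable (the anthropic SDK also retries internally via its `max_retries` option, so treat this as illustrative):

```python
import random
import time

def with_backoff(call, retry_on=Exception, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry `call` with full-jitter exponential backoff on matching errors."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Full jitter: sleep somewhere in [0, base * 2^attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Usage, wrapping a rate-limited API call:
#   response = with_backoff(lambda: client.messages.create(...))
```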
## Conclusion
Serverless is not the right platform for self-hosted model inference, but it is an excellent platform for AI orchestration workloads that call external LLM APIs. The combination of zero idle cost, automatic scaling, and minimal operational overhead makes serverless compelling for AI applications with variable traffic. Design around the constraints -- timeouts, memory limits, and cold starts -- and serverless AI can be both cost-effective and reliable.