
Serverless AI: Running LLM Workloads on AWS Lambda and Cloud Functions

Explore the architecture, limitations, and practical patterns for running LLM inference and AI workloads on serverless platforms like AWS Lambda and Google Cloud Functions.

Serverless Meets AI: Opportunity and Constraints

Serverless computing promises automatic scaling, zero idle costs, and operational simplicity. AI workloads demand high memory, long execution times, and GPU access. These two worlds seem incompatible -- and for self-hosted model inference, they largely are. But for applications that call external LLM APIs (Anthropic, OpenAI, Google), serverless platforms offer a compelling deployment model.

The key insight is that most production AI applications are not running inference locally. They are orchestrating API calls, processing results, managing conversation state, and integrating with other services. These orchestration workloads are an excellent fit for serverless.

Architecture Patterns

Pattern 1: API Gateway + Lambda for LLM Orchestration

The most common pattern uses Lambda functions as the orchestration layer that calls external LLM APIs:

# lambda_function.py
import json
import os
import anthropic
from typing import Any

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def handler(event: dict, context: Any) -> dict:
    """Lambda handler for LLM-powered API endpoint."""
    body = json.loads(event.get("body") or "{}")
    user_query = body.get("query", "")

    if not user_query:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "query is required"})
        }

    try:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            messages=[{"role": "user", "content": user_query}]
        )

        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({
                "answer": response.content[0].text,
                "usage": {
                    "input_tokens": response.usage.input_tokens,
                    "output_tokens": response.usage.output_tokens
                }
            })
        }
    except anthropic.RateLimitError:
        return {"statusCode": 429, "body": json.dumps({"error": "Rate limited"})}
    except anthropic.APIError as e:
        return {"statusCode": 502, "body": json.dumps({"error": str(e)})}

Pattern 2: Step Functions for Multi-Step AI Pipelines

For complex AI workflows that exceed Lambda's 15-minute timeout or require branching logic, AWS Step Functions orchestrate multiple Lambda functions:

{
  "Comment": "RAG Pipeline with Step Functions",
  "StartAt": "ParseQuery",
  "States": {
    "ParseQuery": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456:function:parse-query",
      "Next": "ParallelRetrieval"
    },
    "ParallelRetrieval": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "VectorSearch",
          "States": {
            "VectorSearch": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456:function:vector-search",
              "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
              "End": true
            }
          }
        },
        {
          "StartAt": "KeywordSearch",
          "States": {
            "KeywordSearch": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456:function:keyword-search",
              "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
              "End": true
            }
          }
        }
      ],
      "Next": "MergeAndSynthesize"
    },
    "MergeAndSynthesize": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456:function:llm-synthesize",
      "TimeoutSeconds": 120,
      "Next": "Done"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}
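A state machine like this is usually started from an API-facing Lambda or backend service. A minimal sketch, assuming boto3 is available and using a placeholder state-machine ARN (not from the definition above):

```python
import json

def build_execution_input(query: str) -> str:
    """Serialize the payload handed to the ParseQuery state."""
    return json.dumps({"query": query})

def start_rag_pipeline(query: str) -> str:
    """Start the Step Functions RAG pipeline; returns the execution ARN."""
    import boto3  # imported lazily so the module loads without AWS credentials
    sfn = boto3.client("stepfunctions")
    execution = sfn.start_execution(
        # Placeholder ARN -- substitute the deployed state machine's ARN
        stateMachineArn="arn:aws:states:us-east-1:123456:stateMachine:rag-pipeline",
        input=build_execution_input(query),
    )
    return execution["executionArn"]
```

The caller gets back an execution ARN immediately; the pipeline runs asynchronously, so results are typically delivered via a callback, polling `describe_execution`, or a downstream queue.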

Pattern 3: Event-Driven AI Processing

Use Lambda with SQS or EventBridge for asynchronous AI workloads like document processing, email analysis, or batch summarization:

# Triggered by SQS messages containing documents to process.
# Assumes the same module-level `client` as Pattern 1; fetch_document
# and store_summary are application-specific helpers.
def document_processor(event: dict, context: Any) -> dict:
    """Process documents asynchronously via SQS trigger."""
    results = []

    for record in event["Records"]:
        message = json.loads(record["body"])
        doc_id = message["document_id"]
        doc_text = fetch_document(doc_id)

        # Summarize with LLM
        summary = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": f"Summarize this document in 3 sentences:\n\n{doc_text[:10000]}"
            }]
        )

        # Store result
        store_summary(doc_id, summary.content[0].text)
        results.append({"doc_id": doc_id, "status": "processed"})

    return {"processed": len(results)}
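As written, one bad record fails the whole batch and SQS redelivers every message, including the ones that succeeded. Lambda's partial batch response (enable ReportBatchItemFailures on the event source mapping) lets the handler return only the failed message IDs. A sketch, with process_document standing in for the fetch/summarize/store steps above:

```python
import json

def process_document(doc_id: str) -> None:
    """Stand-in for the fetch/summarize/store steps above."""
    if not doc_id:
        raise ValueError("missing document_id")

def document_processor(event: dict, context=None) -> dict:
    """SQS batch handler that reports per-message failures."""
    failures = []
    for record in event["Records"]:
        try:
            message = json.loads(record["body"])
            process_document(message["document_id"])
        except Exception:
            # Only these messages return to the queue for retry
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Pair this with a redrive policy so messages that keep failing land in a dead-letter queue instead of retrying forever.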

Lambda Constraints and Workarounds

Timeout Limits

AWS Lambda has a 15-minute maximum execution time. LLM API calls with large contexts can take 30-60 seconds, and complex multi-step pipelines may exceed the limit.

Workarounds:

  • Use Step Functions to chain multiple Lambda invocations
  • Implement streaming responses with Lambda response streaming (raises the response payload soft limit to 20 MB; the 15-minute timeout still applies)
  • Use Lambda function URLs with response streaming for real-time applications
# Lambda response streaming for LLM output.
# Note: managed runtimes only support response streaming natively in
# Node.js; in Python this shape needs a custom runtime or the AWS
# Lambda Web Adapter in front of the function.
def handler(event, context):
    """Stream LLM response chunks as they arrive from the API."""

    def generate():
        with client.messages.stream(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            messages=[{"role": "user", "content": event["query"]}]
        ) as stream:
            for text in stream.text_stream:
                yield text.encode("utf-8")

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/plain"},
        "body": generate(),
        "isBase64Encoded": False
    }

Memory Limits

Lambda supports up to 10 GB of memory. For AI workloads that need to load embeddings, models, or large datasets into memory, this can be a constraint.

Workarounds:

  • Use external services for heavy computation (managed vector databases, embedding APIs)
  • Stream data from S3 instead of loading it all into memory
  • Use Lambda Layers for shared dependencies to reduce package size
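The S3 streaming point can be sketched as follows. iter_document_lines reads the object lazily via boto3's StreamingBody, and chunk_lines (a helper invented here, not part of any AWS API) groups lines into pieces small enough for a single LLM call:

```python
def iter_document_lines(bucket: str, key: str):
    """Yield decoded lines from an S3 object without buffering the whole file."""
    import boto3  # imported lazily so chunk_lines below is testable without AWS
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    for line in body.iter_lines():
        yield line.decode("utf-8")

def chunk_lines(lines, max_chars: int = 10_000):
    """Group streamed lines into chunks small enough for one LLM call."""
    buf, size = [], 0
    for line in lines:
        if size + len(line) > max_chars and buf:
            yield "\n".join(buf)
            buf, size = [], 0
        buf.append(line)
        size += len(line) + 1  # +1 for the joining newline
    if buf:
        yield "\n".join(buf)
```

Peak memory stays at roughly one chunk regardless of object size, which keeps even multi-gigabyte documents inside Lambda's memory budget.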

Cold Start Latency

Lambda cold starts add 1-5 seconds of latency. For AI applications where users expect fast responses, this is significant.

Workarounds:

  • Use provisioned concurrency to keep functions warm
  • Use SnapStart (Java) or equivalent initialization optimizations
  • Initialize API clients outside the handler function
# Initialize client OUTSIDE the handler for connection reuse
client = anthropic.Anthropic()

def handler(event, context):
    # client is reused across invocations in the same execution environment
    response = client.messages.create(...)
    return response

Cost Comparison: Serverless vs. Containers

Factor           | Lambda          | ECS/Fargate       | EKS
Idle cost        | $0              | $0 (Fargate)      | ~$70/mo (control plane)
Compute pricing  | $0.0000133/GB-s | ~$0.000004/vCPU-s | ~$0.000003/vCPU-s
Scale-to-zero    | Yes             | Yes (Fargate)     | With KEDA
Cold start       | 1-5 s           | 30-60 s           | 30-60 s (new pods)
Max memory       | 10 GB           | 120 GB            | Node-dependent
Max timeout      | 15 min          | Unlimited         | Unlimited
GPU support      | No              | Yes               | Yes
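To make the Lambda column concrete, here is a back-of-the-envelope estimate using the $0.0000133/GB-s figure above plus Lambda's standard $0.20 per million requests charge. The workload numbers (500k requests/month, 5 s average duration dominated by the LLM API call, 1 GB memory) are illustrative, and the free tier is ignored:

```python
def lambda_monthly_cost(requests: int, avg_seconds: float, memory_gb: float,
                        gb_second_price: float = 0.0000133,
                        per_million_requests: float = 0.20) -> float:
    """Estimate monthly Lambda cost: compute charges plus request charges."""
    compute = requests * avg_seconds * memory_gb * gb_second_price
    request_charge = requests / 1_000_000 * per_million_requests
    return round(compute + request_charge, 2)

monthly = lambda_monthly_cost(500_000, 5, 1.0)  # → 33.35
```

At this volume the platform bill (~$33/month) is usually dwarfed by the LLM API token costs themselves, which is typical for orchestration workloads.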

When to choose serverless for AI:

  • Low to moderate request volume (under 10,000 concurrent)
  • API-calling workloads (not self-hosted inference)
  • Bursty traffic patterns with periods of zero usage
  • Teams that want minimal infrastructure management

When to choose containers:

  • Self-hosted model inference requiring GPUs
  • Sustained high-throughput workloads
  • Complex stateful pipelines exceeding 15 minutes
  • Applications requiring more than 10 GB memory

Google Cloud Functions and Azure Functions

The patterns are similar across cloud providers:

# Google Cloud Function
import functions_framework
from anthropic import Anthropic

client = Anthropic()

@functions_framework.http
def ai_endpoint(request):
    data = request.get_json()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": data["query"]}]
    )
    return {"answer": response.content[0].text}

Google Cloud Functions (2nd gen) supports up to 60 minutes of execution time for HTTP-triggered functions and up to 32 GB of memory, making it more suitable for longer AI workloads than Lambda.

Production Checklist for Serverless AI

  1. Set concurrency limits to avoid hitting LLM API rate limits
  2. Configure dead-letter queues for failed async processing
  3. Use structured logging (JSON) for observability
  4. Set memory to 1-2 GB minimum for Python AI workloads (faster cold starts)
  5. Enable X-Ray/Cloud Trace for end-to-end request tracing
  6. Store API keys in Secrets Manager, not environment variables
  7. Set reserved concurrency to prevent runaway scaling costs
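Item 6 can be sketched as follows. The secret name is a placeholder, and lru_cache keeps the fetched value warm for the lifetime of the execution environment, so repeat invocations skip the Secrets Manager round trip:

```python
import json
from functools import lru_cache

def parse_secret(secret_string: str) -> str:
    """Extract the API key from the secret's JSON payload."""
    return json.loads(secret_string)["ANTHROPIC_API_KEY"]

@lru_cache(maxsize=1)
def get_api_key(secret_id: str = "prod/anthropic/api-key") -> str:
    """Fetch the Anthropic key from Secrets Manager, cached per environment."""
    import boto3  # lazy import so parse_secret stays testable without AWS
    sm = boto3.client("secretsmanager")
    return parse_secret(sm.get_secret_value(SecretId=secret_id)["SecretString"])
```

The Lambda execution role then needs secretsmanager:GetSecretValue on that one secret, which is a tighter blast radius than an API key readable from the function's environment configuration.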

Conclusion

Serverless is not the right platform for self-hosted model inference, but it is an excellent platform for AI orchestration workloads that call external LLM APIs. The combination of zero idle cost, automatic scaling, and minimal operational overhead makes serverless compelling for AI applications with variable traffic. Design around the constraints -- timeouts, memory limits, and cold starts -- and serverless AI can be both cost-effective and reliable.
