Serverless AI Agents: Running Agents on AWS Lambda and Cloud Functions
Deploy AI agents as serverless functions on AWS Lambda and Google Cloud Functions with cold start optimization, timeout handling, stateless architecture, and cost-effective scaling strategies.
Why Serverless for AI Agents
Serverless platforms scale to zero when there is no traffic and scale to thousands of concurrent executions when demand spikes — without you managing a single server. For AI agent workloads with unpredictable traffic patterns, this translates to significant cost savings. You pay only for the milliseconds your agent is actively processing, not for idle pods waiting for requests.
However, serverless introduces constraints that require careful design: cold starts add latency, execution timeouts limit long-running agent tasks, there is no persistent local state, and you cannot maintain WebSocket connections. Understanding these tradeoffs helps you decide which agent workloads belong on Lambda and which need dedicated infrastructure.
When Serverless Works for AI Agents
Serverless is a good fit when your agent: handles simple single-turn queries with response times under 60 seconds, has bursty traffic with quiet periods, does not require persistent in-memory state between requests, and calls external LLM APIs rather than running local models.
Serverless is a poor fit when: you need WebSocket streaming, responses take longer than the platform timeout, the agent requires GPU inference, or you need persistent connections to databases that cannot handle connection surge.
AWS Lambda Agent with Python
Here is a complete Lambda function that runs an AI agent:
# lambda_function.py
import json
import os
import time
import uuid

import boto3
from agents import Agent, Runner

# Initialize outside the handler for reuse across invocations
agent = Agent(
    name="assistant",
    instructions="You are a helpful assistant. Keep responses concise.",
    model=os.environ.get("AGENT_MODEL", "gpt-4o-mini"),
)

# DynamoDB for session persistence
dynamodb = boto3.resource("dynamodb")
sessions_table = dynamodb.Table(os.environ["SESSIONS_TABLE"])


def get_session_history(session_id: str) -> list:
    """Load conversation history from DynamoDB."""
    try:
        response = sessions_table.get_item(Key={"session_id": session_id})
        return response.get("Item", {}).get("history", [])
    except Exception:
        return []


def save_session_history(session_id: str, history: list):
    """Persist conversation history to DynamoDB."""
    sessions_table.put_item(Item={
        "session_id": session_id,
        "history": history,
        "ttl": int(time.time()) + 3600,  # 1 hour TTL
    })


def handler(event, context):
    try:
        body = json.loads(event.get("body") or "{}")
        message = body.get("message", "")
        session_id = body.get("session_id") or str(uuid.uuid4())

        if not message:
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "message is required"}),
            }

        history = get_session_history(session_id)

        # Lambda's Python runtime does not await async handlers, so use the
        # SDK's synchronous entry point; prior turns are passed as input items
        result = Runner.run_sync(
            agent,
            history + [{"role": "user", "content": message}],
        )

        new_history = result.to_input_list()
        save_session_history(session_id, new_history)

        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({
                "session_id": session_id,
                "reply": result.final_output,
                "remaining_time_ms": context.get_remaining_time_in_millis(),
            }),
        }
    except Exception as e:
        return {
            "statusCode": 500,
            "body": json.dumps({"error": str(e)}),
        }
Infrastructure as Code with SAM
Define your Lambda and API Gateway with AWS SAM:
# template.yaml
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Timeout: 90
    MemorySize: 512
    Runtime: python3.12
    Environment:
      Variables:
        AGENT_MODEL: gpt-4o-mini
        SESSIONS_TABLE: !Ref SessionsTable

Resources:
  AgentFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: lambda_function.handler
      CodeUri: src/
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref SessionsTable
      Events:
        AgentApi:
          Type: Api
          Properties:
            Path: /agent/chat
            Method: post

  SessionsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: agent-sessions
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: session_id
          AttributeType: S
      KeySchema:
        - AttributeName: session_id
          KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: ttl
        Enabled: true
Deploy with:
sam build
sam deploy --guided
Cold Start Optimization
Cold starts happen when Lambda creates a new execution environment. For Python-based agents, this adds 1-3 seconds of latency. Minimize it:
# Move all imports and initialization outside the handler
import json  # These run during cold start, then the environment is cached
import os

import boto3
from agents import Agent, Runner

agent = Agent(...)  # Initialized once, reused across warm invocations
dynamodb = boto3.resource("dynamodb")  # Connection reused


def handler(event, context):
    # Only request-specific logic here
    pass
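A complementary technique is lazy initialization: when a dependency is only needed on some code paths, create it on first use rather than at import time, so cold starts that never touch that path skip the cost. A minimal sketch, with `object()` standing in for something expensive like a boto3 client:

```python
_client = None


def get_client():
    """Create the expensive client on first use; warm invocations reuse it."""
    global _client
    if _client is None:
        # In a real function this would be e.g. boto3.resource("dynamodb")
        _client = object()
    return _client


# First call pays the initialization cost; every later call returns the
# same cached object
assert get_client() is get_client()
```

The tradeoff: lazy init moves latency from cold start to the first request that hits the code path, so reserve it for rarely used dependencies.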
Use provisioned concurrency to keep warm instances ready:
# In the SAM template
AgentFunction:
  Type: AWS::Serverless::Function
  Properties:
    AutoPublishAlias: live  # SAM requires a published alias for provisioned concurrency
    ProvisionedConcurrencyConfig:
      ProvisionedConcurrentExecutions: 5
This keeps 5 instances warm at all times, eliminating cold starts for the first 5 concurrent requests.
Handling Timeouts Gracefully
Lambda has a maximum timeout of 15 minutes, but API Gateway REST APIs cap the integration timeout at 29 seconds by default, so that is your effective limit for synchronous HTTP requests. Check remaining time and fail gracefully:
def handler(event, context):
    remaining_ms = context.get_remaining_time_in_millis()

    if remaining_ms < 10000:  # Less than 10 seconds left
        return {
            "statusCode": 503,
            "body": json.dumps({
                "error": "Insufficient time remaining",
                "suggestion": "Use async processing for complex queries",
            }),
        }

    # For long-running tasks, use Step Functions instead
    pass
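One way to act on that check is to route the request to an asynchronous path, for example enqueueing to SQS and letting a worker Lambda finish the job. A sketch of the routing decision with the sync and queue calls injected as callables so it runs without AWS (names and the threshold are illustrative):

```python
def route_request(remaining_ms: int, run_sync, enqueue,
                  threshold_ms: int = 15000) -> dict:
    """Run the agent inline if there is enough time left, otherwise hand
    the job to a queue and return an id the client can poll."""
    if remaining_ms >= threshold_ms:
        return {"mode": "sync", "result": run_sync()}
    return {"mode": "queued", "job_id": enqueue()}


# Plenty of time left: answer inline
print(route_request(60000, lambda: "answer", lambda: "job-1"))
# {'mode': 'sync', 'result': 'answer'}

# Nearly out of time: enqueue for a worker Lambda to finish
print(route_request(5000, lambda: "answer", lambda: "job-1"))
# {'mode': 'queued', 'job_id': 'job-1'}
```

In production, `enqueue` would call `sqs.send_message` and the client would poll a status endpoint (or receive a webhook) for the result.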
Cost Comparison: Serverless vs. Kubernetes
For an agent service handling 10,000 requests per day with an average execution time of 5 seconds:
AWS Lambda: 10,000 requests x 5 seconds x 512 MB = 25,000 GB-seconds/day. At $0.0000166667 per GB-second, that is roughly $12.50/month plus API Gateway costs.
Kubernetes (2 pods, t3.medium): 2 x $30/month = $60/month, running 24/7 regardless of traffic.
Lambda wins for bursty, low-to-moderate traffic. Kubernetes wins for sustained high traffic where pods stay utilized.
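The arithmetic above generalizes to any workload. A small calculator for the Lambda compute charge, using the $0.0000166667 per GB-second rate quoted above (request charges and API Gateway are extra):

```python
GB_SECOND_PRICE = 0.0000166667  # USD per GB-second, x86 Lambda


def lambda_monthly_cost(requests_per_day: int, avg_seconds: float,
                        memory_mb: int, days: int = 30) -> float:
    """Monthly Lambda compute cost in USD for the given workload."""
    gb_seconds = requests_per_day * avg_seconds * (memory_mb / 1024) * days
    return gb_seconds * GB_SECOND_PRICE


# The example from the text: 10,000 requests/day, 5 s each, 512 MB
print(round(lambda_monthly_cost(10_000, 5, 512), 2))  # 12.5
```

Plugging in your own traffic numbers makes the Lambda-versus-Kubernetes crossover point easy to find: raise `requests_per_day` until the result exceeds your fixed cluster cost.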
Stateless Design Pattern
Since Lambda instances are ephemeral, externalize all state:
# Session state -> DynamoDB
# Cache -> ElastiCache/Redis
# File uploads -> S3
# Task queues -> SQS
# Conversation history -> DynamoDB with TTL
Never rely on /tmp storage or global variables persisting between invocations — they might, but Lambda provides no guarantee.
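To make the externalized-state pattern concrete, here is an in-memory stand-in for the DynamoDB session table that mimics its TTL behavior; swapping the dict for a real table keeps the interface identical (the class and method names are illustrative, not part of any SDK):

```python
import time


class SessionStore:
    """In-memory stand-in for a DynamoDB table with a TTL attribute."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl_seconds = ttl_seconds
        self._items: dict[str, tuple[float, list]] = {}

    def save(self, session_id: str, history: list) -> None:
        # Store an expiry timestamp alongside the data, like a TTL attribute
        self._items[session_id] = (time.time() + self.ttl_seconds, history)

    def load(self, session_id: str) -> list:
        expires_at, history = self._items.get(session_id, (0.0, []))
        if time.time() >= expires_at:
            return []  # expired or missing: same as a DynamoDB TTL delete
        return history


store = SessionStore(ttl_seconds=3600)
store.save("s1", [{"role": "user", "content": "hi"}])
print(store.load("s1"))      # [{'role': 'user', 'content': 'hi'}]
print(store.load("missing"))  # []
```

Returning an empty list for expired or missing sessions mirrors the `get_session_history` fallback in the handler: the agent simply starts a fresh conversation.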
FAQ
Can I stream AI agent responses from AWS Lambda?
Lambda itself does not support SSE or WebSocket streaming. However, you can use Lambda Function URLs with response streaming enabled — this allows chunked transfer encoding. Alternatively, use API Gateway WebSocket APIs backed by Lambda for bidirectional streaming, though this adds architectural complexity. For simple streaming, consider keeping a dedicated FastAPI service for the streaming endpoint while using Lambda for batch processing.
How do I handle Lambda's 6 MB response payload limit?
For AI agents, 6 MB is typically more than enough for text responses. If your agent generates large outputs (like code generation or document creation), write the output to S3 and return a pre-signed URL in the Lambda response. Set the URL to expire after a reasonable period, like 15 minutes.
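A sketch of that routing decision, with the S3 upload injected as a callable so the logic is testable without AWS (in production the callable would `put_object` to S3 and return `generate_presigned_url(..., ExpiresIn=900)`):

```python
MAX_INLINE_BYTES = 6 * 1024 * 1024  # Lambda's synchronous response payload limit


def package_response(output: str, upload_and_sign) -> dict:
    """Return the text inline when it fits, otherwise a pre-signed URL."""
    if len(output.encode("utf-8")) <= MAX_INLINE_BYTES:
        return {"reply": output}
    return {"reply_url": upload_and_sign(output)}


# A short answer fits inline
print(package_response("short answer", lambda o: "https://example.com/signed"))
# {'reply': 'short answer'}

# A 7 MB document falls back to the pre-signed URL path
big = "x" * (7 * 1024 * 1024)
print(package_response(big, lambda o: "https://example.com/signed"))
# {'reply_url': 'https://example.com/signed'}
```

Measuring the encoded byte length, not `len(output)`, matters for non-ASCII text, where characters can take multiple bytes.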
Is provisioned concurrency worth the cost for AI agent Lambdas?
It depends on your latency requirements. Provisioned concurrency costs roughly the same as running an equivalent EC2 instance 24/7. If your agents serve user-facing requests where a 2-3 second cold start is unacceptable, provisioned concurrency is worth it. If the agent runs background tasks where latency is not critical, on-demand concurrency is more cost-effective. Start without it and add provisioned concurrency only for latency-sensitive paths.
#Serverless #AWSLambda #AIAgents #CloudFunctions #CostOptimization #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.