
MCP over HTTP: Building Remote Tool Servers with Streamable HTTP Transport

Build and deploy MCP servers accessible over HTTP using the Streamable HTTP transport, with Server-Sent Events for streaming, stateless vs stateful session modes, and strategies for horizontal scaling.

Beyond stdio: Remote MCP Servers

Stdio transport works perfectly when the MCP server runs on the same machine as the agent. But production scenarios often require remote access — a centralized database server accessed by multiple agents, a tool server running in a different cloud region, or a shared service that teams across the organization connect to.

The Streamable HTTP transport solves this. It uses standard HTTP POST for client-to-server messages and Server-Sent Events (SSE) for server-to-client streaming responses. This means MCP servers can run behind load balancers, in Kubernetes clusters, and behind standard HTTP infrastructure.

How Streamable HTTP Works

The transport exposes a single HTTP endpoint (typically /mcp). The client sends JSON-RPC messages via POST requests. The server responds with either a direct JSON response or opens an SSE stream for long-running operations:

# Simplified view of the HTTP message flow

# Client sends a request via POST
# POST /mcp
# Content-Type: application/json
# {
#     "jsonrpc": "2.0",
#     "id": 1,
#     "method": "tools/call",
#     "params": {"name": "query_db", "arguments": {"sql": "SELECT 1"}}
# }

# Server responds with SSE stream
# Content-Type: text/event-stream
#
# event: message
# data: {"jsonrpc":"2.0","id":1,"result":{"content":[...]}}

The SSE format allows the server to send multiple messages in a single response — progress notifications, partial results, and the final response — all over one HTTP connection.
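To make that framing concrete, here is a small illustrative parser (not part of any SDK) that splits a text/event-stream payload into its JSON-RPC messages — a progress notification followed by the final result:

```python
import json


def parse_sse_stream(raw: str) -> list[dict]:
    """Parse a text/event-stream payload into JSON-RPC messages.

    SSE events are separated by a blank line; lines starting with
    'data:' carry the payload.
    """
    messages = []
    for event in raw.strip().split("\n\n"):
        data_lines = [
            line[len("data:"):].strip()
            for line in event.splitlines()
            if line.startswith("data:")
        ]
        if data_lines:
            messages.append(json.loads("\n".join(data_lines)))
    return messages


stream = (
    "event: message\n"
    'data: {"jsonrpc":"2.0","method":"notifications/progress","params":{"progress":50}}\n'
    "\n"
    "event: message\n"
    'data: {"jsonrpc":"2.0","id":1,"result":{"content":[]}}\n'
)
msgs = parse_sse_stream(stream)
print(len(msgs))  # 2: a progress notification, then the final response
```

The key point is that both messages arrive over one HTTP response body, so the client learns about progress without polling.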

Building an HTTP MCP Server in Python

FastMCP makes HTTP deployment straightforward:

# http_server.py
from mcp.server.fastmcp import FastMCP
import json

mcp_server = FastMCP(
    name="RemoteTools",
    instructions="Remote tool server accessible over HTTP.",
)


@mcp_server.tool()
async def analyze_text(text: str, analysis_type: str = "sentiment") -> str:
    """Analyze text using various NLP methods.

    Args:
        text: The text to analyze.
        analysis_type: Type of analysis - sentiment, entities, or summary.
    """
    # Simulated analysis (replace with real NLP logic)
    results = {
        "sentiment": {
            "label": "positive",
            "score": 0.87,
            "text_length": len(text),
        },
        "entities": {
            "entities_found": 3,
            "types": ["PERSON", "ORG", "DATE"],
        },
        "summary": {
            "original_length": len(text),
            "summary": text[:100] + "..." if len(text) > 100 else text,
        },
    }
    result = results.get(analysis_type, {"error": "Unknown analysis type"})
    return json.dumps(result, indent=2)


@mcp_server.tool()
async def health_check() -> str:
    """Return server health status."""
    from datetime import datetime, timezone
    return json.dumps({
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": "1.0.0",
    })


if __name__ == "__main__":
    # In the official SDK, run() takes only the transport name; host and
    # port are FastMCP settings, overridden here to bind 0.0.0.0:8001.
    mcp_server.settings.host = "0.0.0.0"
    mcp_server.settings.port = 8001
    mcp_server.run(transport="streamable-http")

This server listens on port 8001 and accepts MCP connections at /mcp by default.
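A quick way to verify the endpoint is up, in the same spirit as the flow sketch above. Note that the spec requires clients to accept both JSON and SSE responses, and real tool calls only work after the initialize handshake; the payload shown is illustrative:

```
# curl -N -X POST http://localhost:8001/mcp \
#   -H "Content-Type: application/json" \
#   -H "Accept: application/json, text/event-stream" \
#   -d '{"jsonrpc":"2.0","id":1,"method":"initialize",
#        "params":{"protocolVersion":"2025-03-26","capabilities":{},
#                  "clientInfo":{"name":"curl","version":"0.0"}}}'
```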

Stateless vs Stateful Sessions

Streamable HTTP supports two session modes. Understanding when to use each is critical for scaling:


Stateless mode treats each HTTP request independently. There is no session ID, no server-side state between requests. This is ideal for servers that do not need to track conversation history or accumulate context across tool calls:

# Stateless server — each request is independent
# No session management needed
# Scales horizontally behind a load balancer with no sticky sessions

from mcp.server.fastmcp import FastMCP
import json

stateless_server = FastMCP(name="StatelessTools")

@stateless_server.tool()
async def convert_units(
    value: float,
    from_unit: str,
    to_unit: str,
) -> str:
    """Convert between units. Each call is independent."""
    conversions = {
        ("celsius", "fahrenheit"): lambda v: v * 9 / 5 + 32,
        ("fahrenheit", "celsius"): lambda v: (v - 32) * 5 / 9,
        ("kg", "lb"): lambda v: v * 2.20462,
        ("lb", "kg"): lambda v: v / 2.20462,
    }
    key = (from_unit.lower(), to_unit.lower())
    fn = conversions.get(key)
    if not fn:
        return json.dumps({"error": f"Unknown conversion: {key}"})
    return json.dumps({
        "input": f"{value} {from_unit}",
        "output": f"{fn(value):.4f} {to_unit}",
    })

Stateful mode assigns a session ID to each client connection. The server maintains state between requests — useful for servers that manage transactions, accumulate context, or maintain resource subscriptions:

# Stateful server — tracks session context
# Requires sticky sessions or session store (Redis)

from mcp.server.fastmcp import FastMCP
import json

stateful_server = FastMCP(name="StatefulTools")

# Per-session state, keyed by session ID (in production, use Redis or
# similar). For simplicity the tools below take session_id as an
# argument; in a real server the transport's own session ID (the
# Mcp-Session-Id header) is the natural key.
_sessions: dict[str, dict] = {}


@stateful_server.tool()
async def start_transaction(session_id: str) -> str:
    """Begin a database transaction for this session."""
    _sessions[session_id] = {
        "transaction_active": True,
        "operations": [],
    }
    return json.dumps({"transaction": "started", "session": session_id})


@stateful_server.tool()
async def add_operation(session_id: str, operation: str) -> str:
    """Add an operation to the current transaction."""
    session = _sessions.get(session_id)
    if not session or not session["transaction_active"]:
        return json.dumps({"error": "No active transaction"})
    session["operations"].append(operation)
    return json.dumps({
        "queued": operation,
        "total_operations": len(session["operations"]),
    })

Scaling HTTP MCP Servers

For stateless servers, horizontal scaling is straightforward — run multiple instances behind a load balancer:

# docker-compose.yml pattern for scaling
# services:
#   mcp-tools:
#     image: mcp-tools:latest
#     deploy:
#       replicas: 3
#     ports:
#       - "8001-8003:8001"
#
#   nginx:
#     image: nginx:latest
#     ports:
#       - "80:80"
#     depends_on:
#       - mcp-tools

For stateful servers, you need either sticky sessions at the load balancer level or a shared session store like Redis. The session ID from the MCP protocol maps to a session key in Redis, allowing any server instance to resume a session.

Connecting Agents to HTTP Servers

From the agent side, connecting to an HTTP MCP server uses the streamable HTTP client:

from agents.mcp import MCPServerStreamableHTTP

remote_server = MCPServerStreamableHTTP(
    name="RemoteTools",
    params={
        "url": "https://mcp.internal.company.com/mcp",
        "headers": {
            "Authorization": "Bearer <token>",
        },
    },
    cache_tools_list=True,
)

The headers parameter lets you pass authentication tokens with every request. Combined with TLS, this provides a secure channel for remote MCP communication.

FAQ

Can I use an existing web framework like FastAPI or Express with MCP?

Yes. The MCP SDKs provide transport classes that integrate with existing HTTP frameworks. In Python, you can mount the MCP transport handler as a route in a Starlette or FastAPI application. In TypeScript, you can add it as an Express route. This lets you serve MCP alongside regular REST endpoints from the same application.

What is the performance overhead of HTTP transport vs stdio?

HTTP adds network latency (typically 1-10ms on a local network) and TLS handshake overhead for the first connection. For tool calls that take hundreds of milliseconds (database queries, API calls), this overhead is negligible. For extremely latency-sensitive tools, stdio is faster because it avoids the network stack entirely.

How do I handle timeouts for long-running tool calls over HTTP?

Set appropriate timeouts on both client and server. The SSE stream keeps the HTTP connection alive during long operations — the server can send progress notifications to prevent the client from timing out. Configure your load balancer and reverse proxy to allow long-lived connections for the MCP endpoint specifically.
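In the same commented-pattern style as the compose example above, an nginx location block for the MCP endpoint might look like this (the directives are standard nginx; the timeout values are illustrative):

```
# location /mcp {
#     proxy_pass http://mcp_backend;
#     proxy_http_version 1.1;
#     proxy_set_header Connection "";
#     proxy_buffering off;        # flush SSE events immediately
#     proxy_read_timeout 300s;    # tolerate long-running tool calls
#     proxy_send_timeout 300s;
# }
```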


#MCP #HTTP #SSE #Streaming #Deployment #AIAgents #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
