WebAssembly for AI Agents: Running Models in the Browser
Learn how to compile AI models to WebAssembly for browser-based agent inference, covering WASM compilation, model loading strategies, browser constraints, and progressive enhancement patterns.
Why Run AI Models in the Browser
Browser-based AI inference eliminates server costs, removes network round-trips, and keeps data entirely on the user's device. With WebAssembly (WASM), you can run inference engines compiled from C/C++ at near-native speed inside any modern browser.
For AI agents, this means building chat interfaces, form assistants, or document analyzers that work without any backend — the model runs in the browser tab alongside your JavaScript application.
The WASM AI Stack
The typical stack for browser AI consists of three layers:
- Model format: ONNX, TFLite, or custom binary weights
- Runtime: ONNX Runtime Web (WASM backend), TFLite WASM, or custom C++ compiled to WASM
- JavaScript API: A thin wrapper that loads the WASM module and exposes inference functions
Loading ONNX Runtime Web
The fastest path to browser AI is ONNX Runtime Web, which provides both WASM and WebGL backends:
// Install: npm install onnxruntime-web
import * as ort from "onnxruntime-web";

// Configure the WASM backend before creating a session
ort.env.wasm.numThreads = navigator.hardwareConcurrency || 4;
ort.env.wasm.simd = true;

async function loadAgentModel() {
  const session = await ort.InferenceSession.create(
    "/models/intent_classifier.onnx",
    {
      executionProviders: ["wasm"],
      graphOptimizationLevel: "all",
    }
  );
  return session;
}

async function classifyIntent(session, tokenIds) {
  const inputTensor = new ort.Tensor(
    "int64",
    BigInt64Array.from(tokenIds.map(BigInt)),
    [1, tokenIds.length]
  );
  const attentionMask = new ort.Tensor(
    "int64",
    BigInt64Array.from(tokenIds.map(() => BigInt(1))),
    [1, tokenIds.length]
  );
  const results = await session.run({
    input_ids: inputTensor,
    attention_mask: attentionMask,
  });
  const logits = results.logits.data;
  return softmax(Array.from(logits));
}

function softmax(arr) {
  const max = Math.max(...arr);
  const exps = arr.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
Model Loading Strategies
Browser models can be large. A quantized DistilBERT is about 64 MB. Here are strategies to handle loading:
Lazy Loading with Progress
class BrowserAgent {
  constructor() {
    this.session = null;
    this.loading = false;
  }

  async ensureModel(onProgress) {
    if (this.session || this.loading) return;
    this.loading = true;
    try {
      // Check the Cache Storage API first so repeat visits skip the download
      const cache = await caches.open("agent-models-v1");
      let cached = await cache.match("/models/agent.onnx");
      if (!cached) {
        // Download with progress tracking
        const response = await fetch("/models/agent.onnx");
        const reader = response.body.getReader();
        const contentLength = Number(response.headers.get("Content-Length"));
        let received = 0;
        const chunks = [];
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          chunks.push(value);
          received += value.length;
          // Content-Length can be absent (e.g. chunked encoding); skip progress then
          if (contentLength) onProgress?.(received / contentLength);
        }
        await cache.put("/models/agent.onnx", new Response(new Blob(chunks)));
        cached = await cache.match("/models/agent.onnx");
      }
      const buffer = await cached.arrayBuffer();
      this.session = await ort.InferenceSession.create(buffer, {
        executionProviders: ["wasm"],
      });
    } finally {
      this.loading = false;
    }
  }
}
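To surface that progress in the UI, you need to handle the case where the fraction is NaN because the server omitted Content-Length. A small helper, plus hypothetical wiring (the `model-progress` element id is an assumption, not part of the class above):

```javascript
// Format a 0-1 fraction as a percentage string for a progress indicator.
// Guards against a missing Content-Length header, which yields NaN.
function formatProgress(fraction) {
  if (!Number.isFinite(fraction)) return "downloading…";
  const pct = Math.min(100, Math.max(0, Math.round(fraction * 100)));
  return `${pct}%`;
}

// Hypothetical wiring, assuming a <div id="model-progress"> in your page:
// const agent = new BrowserAgent();
// await agent.ensureModel((f) => {
//   document.getElementById("model-progress").textContent = formatProgress(f);
// });
```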
Web Worker Isolation
Run inference in a Web Worker to keep the main thread responsive:
// agent-worker.js (load with: new Worker("agent-worker.js", { type: "module" }))
import * as ort from "onnxruntime-web";

let session = null;

self.onmessage = async (event) => {
  const { type, payload } = event.data;
  if (type === "load") {
    session = await ort.InferenceSession.create(payload.modelBuffer, {
      executionProviders: ["wasm"],
    });
    self.postMessage({ type: "ready" });
  }
  if (type === "infer") {
    const input = new ort.Tensor(
      "int64",
      BigInt64Array.from(payload.tokens.map(BigInt)),
      [1, payload.tokens.length]
    );
    const results = await session.run({ input_ids: input });
    self.postMessage({ type: "result", data: Array.from(results.logits.data) });
  }
};
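On the main thread, it helps to wrap the worker's message protocol in promises. Since the worker above answers messages in order, a FIFO queue of pending resolvers is enough; this wrapper class is a sketch, not part of ONNX Runtime Web:

```javascript
// Main-thread wrapper around the worker shown above. Replies arrive in the
// same order as requests, so a FIFO queue matches each reply to its caller.
class WorkerAgent {
  constructor(worker) {
    this.worker = worker;
    this.pending = [];
    this.worker.onmessage = (event) => {
      const resolve = this.pending.shift();
      if (resolve) resolve(event.data);
    };
  }

  load(modelBuffer) {
    return this.request({ type: "load", payload: { modelBuffer } });
  }

  infer(tokens) {
    return this.request({ type: "infer", payload: { tokens } });
  }

  request(message) {
    return new Promise((resolve) => {
      this.pending.push(resolve);
      this.worker.postMessage(message);
    });
  }
}

// Usage (assumes agent-worker.js is served as a module worker):
// const agent = new WorkerAgent(new Worker("agent-worker.js", { type: "module" }));
// const result = await agent.infer(tokenIds);
```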
Browser Constraints
Running AI in the browser comes with hard limits:
- Memory: Browsers typically cap WASM memory at 2 to 4 GB. Models larger than about 1 GB become impractical.
- Startup time: WASM compilation of large modules takes 1 to 5 seconds on first load.
- No GPU from WASM: WASM itself runs on CPU. For GPU, you need WebGL or WebGPU backends.
- Thread limitations: SharedArrayBuffer (required for multi-threaded WASM) needs cross-origin isolation headers.
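The two headers needed for cross-origin isolation can be set by any server or CDN. A minimal sketch of a header-setting middleware (the `withIsolation` helper is illustrative, not a standard API); in the page itself, `self.crossOriginIsolated === true` confirms the headers took effect:

```javascript
// The two response headers that enable cross-origin isolation, which
// SharedArrayBuffer (and therefore multi-threaded WASM) requires.
const ISOLATION_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Wrap any (req, res) handler so every response carries both headers.
function withIsolation(handler) {
  return (req, res) => {
    for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
      res.setHeader(name, value);
    }
    handler(req, res);
  };
}

// e.g. with Node's http module:
// http.createServer(withIsolation((req, res) => { /* serve your app */ }));
```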
Progressive Enhancement Pattern
Build your agent to work without the local model, then enhance when it loads:
class ProgressiveAgent {
  constructor(apiEndpoint) {
    this.apiEndpoint = apiEndpoint;
    this.localModel = null;
    // Fire-and-forget: the agent serves requests via the API while the model loads
    this.loadLocalModel();
  }

  async loadLocalModel() {
    try {
      const session = await ort.InferenceSession.create("/models/agent.onnx");
      this.localModel = session;
      console.log("Local model loaded — switching to browser inference");
    } catch (err) {
      console.warn("Local model unavailable, using cloud fallback", err);
    }
  }

  async processInput(text) {
    if (this.localModel) {
      return this.inferLocally(text);
    }
    return this.inferViaAPI(text);
  }

  async inferViaAPI(text) {
    const res = await fetch(this.apiEndpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text }),
    });
    return res.json();
  }

  async inferLocally(text) {
    // Tokenize and run through local ONNX model
    const tokens = this.tokenize(text);
    // ... run inference as shown above
  }
}
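The `tokenize` step above is left out of the snippet. To show its shape, here is a toy vocabulary-lookup tokenizer; a real browser agent must replicate the exact tokenizer the model was trained with (e.g. WordPiece for DistilBERT), which libraries such as Transformers.js provide for JavaScript. The vocabulary here is invented for illustration:

```javascript
// Toy vocabulary; a real model ships its own vocab file with thousands of entries.
const TOY_VOCAB = { "[UNK]": 0, "[CLS]": 1, "[SEP]": 2, hello: 3, agent: 4 };

function tokenize(text, vocab = TOY_VOCAB) {
  const words = text.toLowerCase().trim().split(/\s+/);
  // Unknown words map to the [UNK] token id.
  const ids = words.map((w) => vocab[w] ?? vocab["[UNK]"]);
  // BERT-style models expect [CLS] ... [SEP] framing around the input.
  return [vocab["[CLS]"], ...ids, vocab["[SEP]"]];
}
```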
FAQ
How does WASM AI performance compare to native?
WASM inference is typically 2 to 4 times slower than native C++ for the same model on the same hardware. However, with SIMD instructions enabled and multi-threading via SharedArrayBuffer, the gap narrows to 1.5 to 2 times. For many agent tasks like classification or embedding, the absolute latency (20 to 50 milliseconds) is fast enough to feel instant.
Which browsers support WASM AI workloads?
All modern browsers — Chrome 57 and later, Firefox 52 and later, Safari 11 and later, and Edge 16 and later — support WebAssembly. For multi-threaded WASM (which ONNX Runtime Web uses for performance), you need SharedArrayBuffer support, which requires cross-origin isolation headers: Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp.
Can I run large language models in the browser with WASM?
Small language models up to about 1 billion parameters (quantized) can run in the browser, though generation is slow — roughly 2 to 10 tokens per second. For practical browser-based agents, use smaller specialized models for intent routing and tool selection, and reserve cloud LLMs for complex generation tasks.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.