
WebAssembly for AI Agents: Running Models in the Browser

Learn how to compile AI models to WebAssembly for browser-based agent inference, covering WASM compilation, model loading strategies, browser constraints, and progressive enhancement patterns.

Why Run AI Models in the Browser

Browser-based AI inference eliminates server costs, removes network round-trips, and keeps data entirely on the user's device. With WebAssembly (WASM), you can run compiled C/C++ model inference engines at near-native speed inside any modern browser.

For AI agents, this means building chat interfaces, form assistants, or document analyzers that work without any backend — the model runs in the browser tab alongside your JavaScript application.

The WASM AI Stack

The typical stack for browser AI consists of three layers:

  1. Model format: ONNX, TFLite, or custom binary weights
  2. Runtime: ONNX Runtime Web (WASM backend), TFLite WASM, or custom C++ compiled to WASM
  3. JavaScript API: A thin wrapper that loads the WASM module and exposes inference functions

Loading ONNX Runtime Web

The fastest path to browser AI is ONNX Runtime Web, which provides both WASM and WebGL backends:

// Install: npm install onnxruntime-web

import * as ort from "onnxruntime-web";

// Configure WASM backend
ort.env.wasm.numThreads = navigator.hardwareConcurrency || 4;
ort.env.wasm.simd = true;

async function loadAgentModel() {
  const session = await ort.InferenceSession.create(
    "/models/intent_classifier.onnx",
    {
      executionProviders: ["wasm"],
      graphOptimizationLevel: "all",
    }
  );
  return session;
}

async function classifyIntent(session, tokenIds) {
  const inputTensor = new ort.Tensor(
    "int64",
    BigInt64Array.from(tokenIds.map(BigInt)),
    [1, tokenIds.length]
  );

  const attentionMask = new ort.Tensor(
    "int64",
    BigInt64Array.from(tokenIds.map(() => BigInt(1))),
    [1, tokenIds.length]
  );

  const results = await session.run({
    input_ids: inputTensor,
    attention_mask: attentionMask,
  });

  const logits = results.logits.data;
  return softmax(Array.from(logits));
}

function softmax(arr) {
  const max = Math.max(...arr);
  const exps = arr.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
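classifyIntent assumes the text has already been converted to token ids. A minimal sketch of that step — the vocabulary, whitespace splitting, and unknown-token id here are illustrative; a real model ships its own matching tokenizer:

```javascript
// Map words to ids using a vocabulary object; unknown words fall back
// to unkId. Purely illustrative — production tokenizers (WordPiece,
// BPE) are more involved than whitespace splitting.
function tokenize(text, vocab, unkId = 0) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => vocab[word] ?? unkId);
}
```

The resulting array is what you would pass as tokenIds to classifyIntent above.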

Model Loading Strategies

Browser models can be large. A quantized DistilBERT is about 64 MB. Here are strategies to handle loading:

Lazy Loading with Progress

class BrowserAgent {
  constructor() {
    this.session = null;
    this.loading = false;
  }

  async ensureModel(onProgress) {
    if (this.session) return;
    if (this.loading) return;
    this.loading = true;

    try {
      // Check cache first
      const cache = await caches.open("agent-models-v1");
      const cached = await cache.match("/models/agent.onnx");

      if (!cached) {
        // Download with progress tracking
        const response = await fetch("/models/agent.onnx");
        if (!response.ok) {
          throw new Error(`Model download failed: ${response.status}`);
        }
        const reader = response.body.getReader();
        // Content-Length may be absent (e.g. chunked encoding), which
        // would make the progress fraction NaN — guard against it.
        const contentLength = +response.headers.get("Content-Length") || 0;
        let received = 0;
        const chunks = [];

        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          chunks.push(value);
          received += value.length;
          if (contentLength > 0) onProgress?.(received / contentLength);
        }

        const blob = new Blob(chunks);
        await cache.put("/models/agent.onnx", new Response(blob));
      }

      const modelResponse = await cache.match("/models/agent.onnx");
      const buffer = await modelResponse.arrayBuffer();
      this.session = await ort.InferenceSession.create(buffer, {
        executionProviders: ["wasm"],
      });
    } finally {
      this.loading = false;
    }
  }
}
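ensureModel reports progress as a fraction between 0 and 1. A small helper — the name is illustrative — can turn that into a UI label, falling back to an indeterminate label when the total download size is unknown:

```javascript
// Convert a 0..1 progress fraction into a percent string for display.
// When the fraction is not a finite number (e.g. the server sent no
// Content-Length), show an indeterminate label instead.
function formatProgress(fraction) {
  if (!Number.isFinite(fraction)) return "downloading…";
  const pct = Math.min(100, Math.max(0, Math.round(fraction * 100)));
  return `${pct}%`;
}
```

For example: `agent.ensureModel((f) => { statusEl.textContent = formatProgress(f); })`.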

Web Worker Isolation

Run inference in a Web Worker to keep the main thread responsive:

// agent-worker.js
import * as ort from "onnxruntime-web";

let session = null;

self.onmessage = async (event) => {
  const { type, payload } = event.data;

  if (type === "load") {
    session = await ort.InferenceSession.create(payload.modelBuffer, {
      executionProviders: ["wasm"],
    });
    self.postMessage({ type: "ready" });
  }

  if (type === "infer") {
    const input = new ort.Tensor("int64",
      BigInt64Array.from(payload.tokens.map(BigInt)),
      [1, payload.tokens.length]
    );
    const results = await session.run({ input_ids: input });
    self.postMessage({ type: "result", data: Array.from(results.logits.data) });
  }
};
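The worker needs a main-thread counterpart. Since the protocol above carries no message ids, this sketch (names hypothetical) allows one in-flight request at a time and resolves it with the next matching reply; for concurrent requests you would extend each message with an id:

```javascript
// Wrap a Worker in a promise-based client for the load/infer protocol
// used by agent-worker.js. One request may be pending at a time; its
// promise resolves when the worker posts the expected reply type.
function createWorkerClient(worker) {
  let pending = null; // { replyType, resolve } for the in-flight request

  worker.onmessage = (event) => {
    const { type, data } = event.data;
    if (pending && pending.replyType === type) {
      const { resolve } = pending;
      pending = null;
      resolve(data);
    }
  };

  function send(type, replyType, payload) {
    return new Promise((resolve) => {
      pending = { replyType, resolve };
      worker.postMessage({ type, payload });
    });
  }

  return {
    load: (modelBuffer) => send("load", "ready", { modelBuffer }),
    infer: (tokens) => send("infer", "result", { tokens }),
  };
}
```

Usage would look like `const client = createWorkerClient(new Worker("agent-worker.js")); await client.load(buffer); const logits = await client.infer(tokens);`.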

Browser Constraints

Running AI in the browser comes with hard limits:

  • Memory: Browsers typically cap WASM memory at 2 to 4 GB. Models larger than about 1 GB become impractical.
  • Startup time: WASM compilation of large modules takes 1 to 5 seconds on first load.
  • No GPU from WASM: WASM itself runs on CPU. For GPU, you need WebGL or WebGPU backends.
  • Thread limitations: SharedArrayBuffer (required for multi-threaded WASM) needs cross-origin isolation headers.
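The cross-origin isolation requirement in the last bullet comes down to two response headers. A minimal sketch — the helper name is illustrative — that you could wire into whatever server or middleware serves the app:

```javascript
// Headers required for SharedArrayBuffer, and therefore for
// multi-threaded WASM, to be available on the page.
const ISOLATION_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Attach the isolation headers to an outgoing response object.
function applyIsolationHeaders(res) {
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    res.setHeader(name, value);
  }
  return res;
}
```

Note that require-corp also forces every cross-origin subresource on the page to opt in via CORS or a Cross-Origin-Resource-Policy header.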

Progressive Enhancement Pattern

Build your agent to work without the local model, then enhance when it loads:

class ProgressiveAgent {
  constructor(apiEndpoint) {
    this.apiEndpoint = apiEndpoint;
    this.localModel = null;
    this.loadLocalModel();
  }

  async loadLocalModel() {
    try {
      const session = await ort.InferenceSession.create("/models/agent.onnx", {
        executionProviders: ["wasm"],
      });
      this.localModel = session;
      console.log("Local model loaded — switching to browser inference");
    } catch (err) {
      console.warn("Local model unavailable, using cloud fallback", err);
    }
  }

  async processInput(text) {
    if (this.localModel) {
      return this.inferLocally(text);
    }
    return this.inferViaAPI(text);
  }

  async inferViaAPI(text) {
    const res = await fetch(this.apiEndpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text }),
    });
    return res.json();
  }

  async inferLocally(text) {
    // Tokenize and run through local ONNX model
    const tokens = this.tokenize(text);
    // ... run inference as shown above
  }
}

FAQ

How does WASM AI performance compare to native?

WASM inference is typically 2 to 4 times slower than native C++ for the same model on the same hardware. However, with SIMD instructions enabled and multi-threading via SharedArrayBuffer, the gap narrows to 1.5 to 2 times. For many agent tasks like classification or embedding, the absolute latency (20 to 50 milliseconds) is fast enough to feel instant.

Which browsers support WASM AI workloads?

All modern browsers — Chrome 57 and later, Firefox 52 and later, Safari 11 and later, and Edge 16 and later — support WebAssembly. For multi-threaded WASM (which ONNX Runtime Web uses for performance), you need SharedArrayBuffer support, which requires cross-origin isolation headers: Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp.

Can I run large language models in the browser with WASM?

Small language models up to about 1 billion parameters (quantized) can run in the browser, though generation is slow — roughly 2 to 10 tokens per second. For practical browser-based agents, use smaller specialized models for intent routing and tool selection, and reserve cloud LLMs for complex generation tasks.


#WebAssembly #BrowserAI #WASM #JavaScript #ClientSideAI #EdgeAI #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
