Building a Real-Time AI Coding Assistant: Streaming Code Suggestions and Explanations
Build a real-time AI coding assistant that integrates with code editors, extracts context intelligently, debounces user input, and streams code suggestions and explanations with low latency.
Architecture of a Real-Time Coding Assistant
A real-time coding assistant must balance three competing demands: responsiveness (suggestions should appear within hundreds of milliseconds of a keystroke), accuracy (suggestions need enough context to be relevant), and efficiency (you cannot send an LLM request on every single character typed). The architecture solves this with four components: an editor integration layer that captures context, a debouncing mechanism that batches rapid inputs, a context extraction pipeline that selects the most relevant code, and a streaming response handler that renders suggestions progressively.
Editor Integration: Capturing Context
The editor extension captures the cursor position, surrounding code, file path, and language. This context travels to the backend with each suggestion request.
interface EditorContext {
  filePath: string;
  language: string;
  cursorLine: number;
  cursorColumn: number;
  prefix: string; // Code before cursor (up to N lines)
  suffix: string; // Code after cursor (up to N lines)
  selection: string; // Currently selected text, if any
  openFiles: string[]; // Other open file paths for cross-file context
}

function extractContext(
  editor: { document: any; selection: any },
  contextLines: number = 50
): EditorContext {
  const doc = editor.document;
  const pos = editor.selection.active;
  const totalLines = doc.lineCount;
  const prefixStart = Math.max(0, pos.line - contextLines);
  const suffixEnd = Math.min(totalLines, pos.line + contextLines);
  const prefix = doc.getText({
    start: { line: prefixStart, character: 0 },
    end: { line: pos.line, character: pos.character },
  });
  const suffix = doc.getText({
    start: { line: pos.line, character: pos.character },
    end: { line: suffixEnd, character: 0 },
  });
  return {
    filePath: doc.fileName,
    language: doc.languageId,
    cursorLine: pos.line,
    cursorColumn: pos.character,
    prefix,
    suffix,
    selection: doc.getText(editor.selection),
    openFiles: getOpenFilePaths(),
  };
}
The prefix and suffix together form the "fill-in-the-middle" (FIM) context that most code completion models use. Limiting to 50 lines in each direction keeps the request payload manageable while providing enough context for accurate suggestions.
Debouncing: Not Every Keystroke Needs a Request
Sending a request on every keystroke would flood the server and waste tokens. Debouncing waits for a pause in typing before triggering a request. The right debounce interval depends on the interaction type.
class SuggestionDebouncer {
  private timer: NodeJS.Timeout | null = null;
  private abortController: AbortController | null = null;
  private lastRequestTime = 0;

  constructor(
    private completionDelay: number = 300, // ms after typing stops
    private explainDelay: number = 500 // longer for explanation requests
  ) {}

  requestCompletion(
    context: EditorContext,
    callback: (ctx: EditorContext, signal: AbortSignal) => void
  ): void {
    // Cancel any pending request
    if (this.timer) clearTimeout(this.timer);
    if (this.abortController) this.abortController.abort();
    this.abortController = new AbortController();
    const signal = this.abortController.signal;
    this.timer = setTimeout(() => {
      this.lastRequestTime = Date.now();
      callback(context, signal);
    }, this.completionDelay);
  }

  requestExplanation(
    context: EditorContext,
    callback: (ctx: EditorContext, signal: AbortSignal) => void
  ): void {
    if (this.timer) clearTimeout(this.timer);
    if (this.abortController) this.abortController.abort();
    this.abortController = new AbortController();
    const signal = this.abortController.signal;
    this.timer = setTimeout(() => {
      callback(context, signal);
    }, this.explainDelay);
  }

  cancel(): void {
    if (this.timer) clearTimeout(this.timer);
    if (this.abortController) this.abortController.abort();
  }
}
Two key details: the AbortController cancels in-flight HTTP requests when the user continues typing (so stale results never appear), and the delay is shorter for completions (300ms) than explanations (500ms) because completions are expected inline while the user types, whereas explanations are explicit user actions.
Backend: Context-Aware Completion
The backend receives the editor context, enriches it with cross-file context if available, and streams the completion.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import Optional
import json

app = FastAPI()

class CompletionRequest(BaseModel):
    file_path: str
    language: str
    prefix: str
    suffix: str
    cursor_line: int
    cursor_column: int
    selection: Optional[str] = None
    open_files: list[str] = []
    max_tokens: int = 256

def build_fim_prompt(req: CompletionRequest) -> dict:
    """Build a fill-in-the-middle prompt for code completion."""
    system_prompt = (
        f"You are a code completion engine for {req.language}. "
        "Complete the code at the <cursor/> marker. "
        "Output ONLY the completion, no explanation, no markdown."
    )
    # Trim context to fit token budget
    max_context_chars = 8000
    prefix = req.prefix[-max_context_chars:]
    suffix = req.suffix[:max_context_chars // 2]
    user_prompt = f"<prefix>{prefix}</prefix><cursor/><suffix>{suffix}</suffix>"
    return {"system": system_prompt, "user": user_prompt}

async def stream_completion(req: CompletionRequest):
    prompt = build_fim_prompt(req)
    async for chunk in call_llm_streaming(
        system=prompt["system"],
        user=prompt["user"],
        max_tokens=req.max_tokens,
        temperature=0.1,  # Low temperature for deterministic completions
        stop=["\n\n", "\nclass ", "\ndef ", "\n#"],  # Stop at logical boundaries
    ):
        data = json.dumps({"token": chunk, "done": False})
        yield f"data: {data}\n\n"
    yield f"data: {json.dumps({'token': '', 'done': True})}\n\n"

@app.post("/api/complete")
async def complete(req: CompletionRequest):
    return StreamingResponse(
        stream_completion(req),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
The stop sequences are critical for code completion. Without them, the model might generate an entire function when you only wanted one line. Stopping at blank lines, class definitions, and function definitions produces focused completions.
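There is a subtlety if you enforce stop sequences yourself rather than relying on the provider: a stop sequence can arrive split across two streamed chunks. A minimal, provider-agnostic sketch (the `chunks` argument is assumed to be any iterable of text pieces) holds back a short tail of the buffer until enough text has arrived to rule out a partial match:

```python
def stream_with_stops(chunks, stops):
    """Yield text from chunks, truncating at the first stop sequence.

    A stop sequence may straddle a chunk boundary, so we hold back a
    tail of (longest stop - 1) characters until more text arrives.
    """
    hold = max(len(s) for s in stops) - 1
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        hits = [buffer.index(s) for s in stops if s in buffer]
        if hits:
            # Emit only the text before the earliest stop, then end.
            yield buffer[:min(hits)]
            return
        if len(buffer) > hold:
            # Safe to emit everything except the held-back tail.
            yield buffer[:-hold] if hold else buffer
            buffer = buffer[-hold:] if hold else ""
    if buffer:
        yield buffer  # Stream ended with no stop sequence
```

Here `"\ndef "` split as `"\nde"` + `"f f():"` is still caught, because the partial `"\nde"` stays buffered until the next chunk completes the match.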
Client-Side Streaming Renderer
The editor extension reads the SSE stream and renders tokens as inline ghost text that the user can accept with Tab.
async function fetchStreamingCompletion(
  context: EditorContext,
  signal: AbortSignal,
  onToken: (token: string) => void,
  onDone: () => void
): Promise<void> {
  const response = await fetch("/api/complete", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      file_path: context.filePath,
      language: context.language,
      prefix: context.prefix,
      suffix: context.suffix,
      cursor_line: context.cursorLine,
      cursor_column: context.cursorColumn,
      max_tokens: 256,
    }),
    signal,
  });
  if (!response.ok || !response.body) return;
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() || "";
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const payload = JSON.parse(line.slice(6));
      if (payload.done) {
        onDone();
        return;
      }
      onToken(payload.token);
    }
  }
}

// Usage in editor extension
class InlineSuggestionProvider {
  private currentSuggestion = "";
  private debouncer = new SuggestionDebouncer();

  onTextChange(editor: any): void {
    const context = extractContext(editor);
    this.currentSuggestion = "";
    this.debouncer.requestCompletion(context, (ctx, signal) => {
      fetchStreamingCompletion(
        ctx,
        signal,
        (token) => {
          this.currentSuggestion += token;
          this.renderGhostText(editor, this.currentSuggestion);
        },
        () => {
          this.finalizeGhostText(editor, this.currentSuggestion);
        }
      );
    });
  }

  acceptSuggestion(editor: any): void {
    if (this.currentSuggestion) {
      insertText(editor, this.currentSuggestion);
      this.currentSuggestion = "";
    }
  }
}
Ghost text appears as greyed-out inline text at the cursor position. As tokens stream in, the ghost text grows character by character. The user presses Tab to accept or keeps typing to dismiss.
Explanation Endpoint
Beyond completions, a coding assistant should explain selected code. This uses a different prompt and longer generation length.
class ExplanationRequest(BaseModel):
    selected_code: str
    language: str
    file_path: str
    surrounding_context: str = ""

@app.post("/api/explain")
async def explain(req: ExplanationRequest):
    async def generate():
        system = (
            f"You are a {req.language} expert. Explain the selected code "
            "clearly and concisely. Focus on what the code does, why it is "
            "written this way, and any non-obvious behavior."
        )
        user_msg = f"Explain this code:\n\n{req.selected_code}"
        if req.surrounding_context:
            user_msg += f"\n\nSurrounding context:\n{req.surrounding_context}"
        async for chunk in call_llm_streaming(
            system=system,
            user=user_msg,
            max_tokens=1024,
            temperature=0.3,
        ):
            yield f"data: {json.dumps({'token': chunk, 'done': False})}\n\n"
        yield f"data: {json.dumps({'token': '', 'done': True})}\n\n"
    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
FAQ
How do you handle multi-line completions without overwhelming the user?
Limit the initial suggestion to one logical unit — a single statement, one function body, or one block. Use stop sequences at logical boundaries (blank lines, new function definitions) to prevent runaway generation. Show the first line immediately as ghost text, and if the user pauses on it (without accepting or dismissing), expand to show additional lines. This progressive disclosure pattern avoids cluttering the editor while still offering multi-line completions when the user signals interest.
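The "one logical unit first" split can be sketched as a small helper (the name `split_first_unit` is illustrative, not from any library); it returns the first unit for immediate ghost text and the remainder for progressive disclosure, using a blank line as the unit boundary:

```python
def split_first_unit(completion: str) -> tuple[str, str]:
    """Return (first_unit, remainder), splitting at the first blank line."""
    lines = completion.split("\n")
    unit: list[str] = []
    for i, line in enumerate(lines):
        if unit and line.strip() == "":
            # Blank line ends the first unit; the rest is held back.
            return "\n".join(unit), "\n".join(lines[i:]).lstrip("\n")
        unit.append(line)
    return completion, ""  # No blank line: the whole completion is one unit
```

The editor shows the first element as ghost text right away and only expands to the remainder if the user pauses on the suggestion.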
What is the optimal debounce interval for code completions?
In practice, 250-350ms works well for inline completions. Below 250ms, you send too many requests that get cancelled almost immediately as the user continues typing. Above 400ms, the suggestions feel sluggish. Use different intervals for different interaction types: 300ms for inline completions, 150ms for autocomplete dropdowns (where users expect a faster response), and 500ms or more for heavy operations like code explanations or refactoring suggestions.
How do you manage context window limits when the file is very large?
Use a relevance-based context selection strategy instead of naive truncation. Prioritize: (1) the code immediately surrounding the cursor (highest relevance), (2) the function or class the cursor is inside, (3) import statements at the top of the file, (4) type definitions and interfaces used nearby, (5) other open files that import or are imported by the current file. This gives the model maximum relevant context within the token budget. Tools like tree-sitter can parse the AST to identify function boundaries and symbol references efficiently.
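Once the tiers have been extracted (by an AST pass such as tree-sitter, which this sketch assumes has already produced plain strings in priority order), the assembly step is a greedy fill under a budget; `assemble_context` is a hypothetical helper name:

```python
def assemble_context(tiers: list[str], budget: int) -> str:
    """Concatenate context tiers in priority order under a character budget.

    Tiers are ordered highest-relevance first (cursor window, enclosing
    function, imports, nearby type definitions, related open files).
    A tier that would overflow is skipped, but cheaper lower-priority
    tiers are still tried.
    """
    parts: list[str] = []
    used = 0
    for tier in tiers:
        cost = len(tier) + (1 if parts else 0)  # +1 for the joining newline
        if used + cost > budget:
            continue
        parts.append(tier)
        used += cost
    return "\n".join(parts)
```

Skipping an oversized tier rather than stopping matters: a large enclosing class should not crowd out the (much cheaper) import block that follows it in priority order.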
CallSphere Team