Building a Real-Time AI Coding Assistant: Streaming Code Suggestions and Explanations
Build a real-time AI coding assistant that integrates with code editors, extracts context intelligently, debounces user input, and streams code suggestions and explanations with low latency.
Architecture of a Real-Time Coding Assistant
A real-time coding assistant must balance three competing demands: responsiveness (suggestions should appear within hundreds of milliseconds of a keystroke), accuracy (suggestions need enough context to be relevant), and efficiency (you cannot send an LLM request on every single character typed). The architecture solves this with four components: an editor integration layer that captures context, a debouncing mechanism that batches rapid inputs, a context extraction pipeline that selects the most relevant code, and a streaming response handler that renders suggestions progressively.
Editor Integration: Capturing Context
The editor extension captures the cursor position, surrounding code, file path, and language. This context travels to the backend with each suggestion request.
interface EditorContext {
  filePath: string;
  language: string;
  cursorLine: number;
  cursorColumn: number;
  prefix: string; // Code before cursor (up to N lines)
  suffix: string; // Code after cursor (up to N lines)
  selection: string; // Currently selected text, if any
  openFiles: string[]; // Other open file paths for cross-file context
}

function extractContext(
  editor: { document: any; selection: any },
  contextLines: number = 50
): EditorContext {
  const doc = editor.document;
  const pos = editor.selection.active;
  const totalLines = doc.lineCount;
  const prefixStart = Math.max(0, pos.line - contextLines);
  const suffixEnd = Math.min(totalLines, pos.line + contextLines);
  const prefix = doc.getText({
    start: { line: prefixStart, character: 0 },
    end: { line: pos.line, character: pos.character },
  });
  const suffix = doc.getText({
    start: { line: pos.line, character: pos.character },
    end: { line: suffixEnd, character: 0 },
  });
  return {
    filePath: doc.fileName,
    language: doc.languageId,
    cursorLine: pos.line,
    cursorColumn: pos.character,
    prefix,
    suffix,
    selection: doc.getText(editor.selection),
    openFiles: getOpenFilePaths(),
  };
}
The prefix and suffix together form the "fill-in-the-middle" (FIM) context that most code completion models use. Limiting to 50 lines in each direction keeps the request payload manageable while providing enough context for accurate suggestions.
Debouncing: Not Every Keystroke Needs a Request
Sending a request on every keystroke would flood the server and waste tokens. Debouncing waits for a pause in typing before triggering a request. The right debounce interval depends on the interaction type.
class SuggestionDebouncer {
  private timer: NodeJS.Timeout | null = null;
  private abortController: AbortController | null = null;
  private lastRequestTime = 0;

  constructor(
    private completionDelay: number = 300, // ms after typing stops
    private explainDelay: number = 500 // longer for explanation requests
  ) {}

  requestCompletion(
    context: EditorContext,
    callback: (ctx: EditorContext, signal: AbortSignal) => void
  ): void {
    // Cancel any pending request
    if (this.timer) clearTimeout(this.timer);
    if (this.abortController) this.abortController.abort();
    this.abortController = new AbortController();
    const signal = this.abortController.signal;
    this.timer = setTimeout(() => {
      this.lastRequestTime = Date.now();
      callback(context, signal);
    }, this.completionDelay);
  }

  requestExplanation(
    context: EditorContext,
    callback: (ctx: EditorContext, signal: AbortSignal) => void
  ): void {
    if (this.timer) clearTimeout(this.timer);
    if (this.abortController) this.abortController.abort();
    this.abortController = new AbortController();
    const signal = this.abortController.signal;
    this.timer = setTimeout(() => {
      callback(context, signal);
    }, this.explainDelay);
  }

  cancel(): void {
    if (this.timer) clearTimeout(this.timer);
    if (this.abortController) this.abortController.abort();
  }
}
Two key details: the AbortController cancels in-flight HTTP requests when the user continues typing (so stale results never appear), and the delay is shorter for completions (300ms) than explanations (500ms) because completions are expected inline while the user types, whereas explanations are explicit user actions.
Backend: Context-Aware Completion
The backend receives the editor context, enriches it with cross-file context if available, and streams the completion.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import Optional
import json

app = FastAPI()

class CompletionRequest(BaseModel):
    file_path: str
    language: str
    prefix: str
    suffix: str
    cursor_line: int
    cursor_column: int
    selection: Optional[str] = None
    open_files: list[str] = []
    max_tokens: int = 256

def build_fim_prompt(req: CompletionRequest) -> dict:
    """Build a fill-in-the-middle prompt for code completion."""
    system_prompt = (
        f"You are a code completion engine for {req.language}. "
        "Complete the code at the <cursor/> marker. "
        "Output ONLY the completion, no explanation, no markdown."
    )
    # Trim context to fit token budget
    max_context_chars = 8000
    prefix = req.prefix[-max_context_chars:]
    suffix = req.suffix[:max_context_chars // 2]
    user_prompt = f"<prefix>{prefix}</prefix><cursor/><suffix>{suffix}</suffix>"
    return {"system": system_prompt, "user": user_prompt}

async def stream_completion(req: CompletionRequest):
    prompt = build_fim_prompt(req)
    async for chunk in call_llm_streaming(
        system=prompt["system"],
        user=prompt["user"],
        max_tokens=req.max_tokens,
        temperature=0.1,  # Low temperature for deterministic completions
        stop=["\n\n", "\nclass ", "\ndef ", "\n#"],  # Stop at logical boundaries
    ):
        data = json.dumps({"token": chunk, "done": False})
        yield f"data: {data}\n\n"
    yield f"data: {json.dumps({'token': '', 'done': True})}\n\n"

@app.post("/api/complete")
async def complete(req: CompletionRequest):
    return StreamingResponse(
        stream_completion(req),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
The stop sequences are critical for code completion. Without them, the model might generate an entire function when you only wanted one line. Stopping at blank lines, class definitions, and function definitions produces focused completions.
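There is a subtlety if you enforce stop sequences yourself rather than relying on the provider: a stop sequence can arrive split across two streamed chunks. A minimal, provider-agnostic sketch (the `chunks` argument is assumed to be any iterable of text pieces) holds back a short tail of the buffer until enough text has arrived to rule out a partial match:

```python
def stream_with_stops(chunks, stops):
    """Yield text from chunks, truncating at the first stop sequence.

    A stop sequence may straddle a chunk boundary, so we hold back a
    tail of (longest stop - 1) characters until more text arrives.
    """
    hold = max(len(s) for s in stops) - 1
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        hits = [buffer.index(s) for s in stops if s in buffer]
        if hits:
            # Emit only the text before the earliest stop, then end.
            yield buffer[:min(hits)]
            return
        if len(buffer) > hold:
            # Safe to emit everything except the held-back tail.
            yield buffer[:-hold] if hold else buffer
            buffer = buffer[-hold:] if hold else ""
    if buffer:
        yield buffer  # Stream ended with no stop sequence
```

Here `"\ndef "` split as `"\nde"` + `"f f():"` is still caught, because the partial `"\nde"` stays buffered until the next chunk completes the match.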
Client-Side Streaming Renderer
The editor extension reads the SSE stream and renders tokens as inline ghost text that the user can accept with Tab.
async function fetchStreamingCompletion(
  context: EditorContext,
  signal: AbortSignal,
  onToken: (token: string) => void,
  onDone: () => void
): Promise<void> {
  const response = await fetch("/api/complete", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      file_path: context.filePath,
      language: context.language,
      prefix: context.prefix,
      suffix: context.suffix,
      cursor_line: context.cursorLine,
      cursor_column: context.cursorColumn,
      max_tokens: 256,
    }),
    signal,
  });
  if (!response.ok || !response.body) return;
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() || "";
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const payload = JSON.parse(line.slice(6));
      if (payload.done) {
        onDone();
        return;
      }
      onToken(payload.token);
    }
  }
}

// Usage in editor extension
class InlineSuggestionProvider {
  private currentSuggestion = "";
  private debouncer = new SuggestionDebouncer();

  onTextChange(editor: any): void {
    const context = extractContext(editor);
    this.currentSuggestion = "";
    this.debouncer.requestCompletion(context, (ctx, signal) => {
      fetchStreamingCompletion(
        ctx,
        signal,
        (token) => {
          this.currentSuggestion += token;
          this.renderGhostText(editor, this.currentSuggestion);
        },
        () => {
          this.finalizeGhostText(editor, this.currentSuggestion);
        }
      );
    });
  }

  acceptSuggestion(editor: any): void {
    if (this.currentSuggestion) {
      insertText(editor, this.currentSuggestion);
      this.currentSuggestion = "";
    }
  }
}
Ghost text appears as greyed-out inline text at the cursor position. As tokens stream in, the ghost text grows character by character. The user presses Tab to accept or keeps typing to dismiss.
Explanation Endpoint
Beyond completions, a coding assistant should explain selected code. This uses a different prompt and longer generation length.
class ExplanationRequest(BaseModel):
    selected_code: str
    language: str
    file_path: str
    surrounding_context: str = ""

@app.post("/api/explain")
async def explain(req: ExplanationRequest):
    async def generate():
        system = (
            f"You are a {req.language} expert. Explain the selected code "
            "clearly and concisely. Focus on what the code does, why it is "
            "written this way, and any non-obvious behavior."
        )
        user_msg = f"Explain this code:\n\n{req.selected_code}"
        if req.surrounding_context:
            user_msg += f"\n\nSurrounding context:\n{req.surrounding_context}"
        async for chunk in call_llm_streaming(
            system=system,
            user=user_msg,
            max_tokens=1024,
            temperature=0.3,
        ):
            yield f"data: {json.dumps({'token': chunk, 'done': False})}\n\n"
        yield f"data: {json.dumps({'token': '', 'done': True})}\n\n"
    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
FAQ
How do you handle multi-line completions without overwhelming the user?
Limit the initial suggestion to one logical unit — a single statement, one function body, or one block. Use stop sequences at logical boundaries (blank lines, new function definitions) to prevent runaway generation. Show the first line immediately as ghost text, and if the user pauses on it (without accepting or dismissing), expand to show additional lines. This progressive disclosure pattern avoids cluttering the editor while still offering multi-line completions when the user signals interest.
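The "one logical unit first" split can be sketched as a small helper (the name `split_first_unit` is illustrative, not from any library); it returns the first unit for immediate ghost text and the remainder for progressive disclosure, using a blank line as the unit boundary:

```python
def split_first_unit(completion: str) -> tuple[str, str]:
    """Return (first_unit, remainder), splitting at the first blank line."""
    lines = completion.split("\n")
    unit: list[str] = []
    for i, line in enumerate(lines):
        if unit and line.strip() == "":
            # Blank line ends the first unit; the rest is held back.
            return "\n".join(unit), "\n".join(lines[i:]).lstrip("\n")
        unit.append(line)
    return completion, ""  # No blank line: the whole completion is one unit
```

The editor shows the first element as ghost text right away and only expands to the remainder if the user pauses on the suggestion.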
What is the optimal debounce interval for code completions?
In practice, 250-350ms works well for inline completions. Below 250ms, you send too many requests that get cancelled almost immediately as the user continues typing. Above 400ms, the suggestions feel sluggish. Use different intervals for different interaction types: 300ms for inline completions, 150ms for autocomplete dropdowns (where users expect a faster response), and 500ms or more for heavy operations like code explanations or refactoring suggestions.
How do you manage context window limits when the file is very large?
Use a relevance-based context selection strategy instead of naive truncation. Prioritize: (1) the code immediately surrounding the cursor (highest relevance), (2) the function or class the cursor is inside, (3) import statements at the top of the file, (4) type definitions and interfaces used nearby, (5) other open files that import or are imported by the current file. This gives the model maximum relevant context within the token budget. Tools like tree-sitter can parse the AST to identify function boundaries and symbol references efficiently.
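Once the tiers have been extracted (by an AST pass such as tree-sitter, which this sketch assumes has already produced plain strings in priority order), the assembly step is a greedy fill under a budget; `assemble_context` is a hypothetical helper name:

```python
def assemble_context(tiers: list[str], budget: int) -> str:
    """Concatenate context tiers in priority order under a character budget.

    Tiers are ordered highest-relevance first (cursor window, enclosing
    function, imports, nearby type definitions, related open files).
    A tier that would overflow is skipped, but cheaper lower-priority
    tiers are still tried.
    """
    parts: list[str] = []
    used = 0
    for tier in tiers:
        cost = len(tier) + (1 if parts else 0)  # +1 for the joining newline
        if used + cost > budget:
            continue
        parts.append(tier)
        used += cost
    return "\n".join(parts)
```

Skipping an oversized tier rather than stopping matters: a large enclosing class should not crowd out the (much cheaper) import block that follows it in priority order.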
CallSphere Team