Building a Diagram Understanding Agent: Flowcharts, Architecture Diagrams, and Charts
Create an AI agent that classifies diagram types, extracts elements and relationships from flowcharts and architecture diagrams, and converts visual diagrams into structured data and code representations.
Why Diagram Understanding Is Valuable
Technical documentation is full of diagrams — flowcharts describing business processes, architecture diagrams showing system components, sequence diagrams illustrating API interactions, and data flow charts mapping pipelines. An agent that can read and understand these diagrams can answer questions about system architecture, generate code from flowcharts, identify missing components, and convert visual documentation into machine-readable formats.
Diagram Classification
The first step is identifying what type of diagram the agent is looking at, because each type requires a different extraction strategy:
```python
import base64
from enum import Enum

import openai
from pydantic import BaseModel


class DiagramType(str, Enum):
    FLOWCHART = "flowchart"
    ARCHITECTURE = "architecture"
    SEQUENCE = "sequence"
    ER_DIAGRAM = "er_diagram"
    DATA_FLOW = "data_flow"
    ORG_CHART = "org_chart"
    CHART = "chart"  # bar, line, pie
    UNKNOWN = "unknown"


class DiagramClassification(BaseModel):
    diagram_type: DiagramType
    confidence: float
    description: str


async def classify_diagram(
    image_bytes: bytes, client: openai.AsyncOpenAI
) -> DiagramClassification:
    """Classify the type of diagram in an image."""
    b64 = base64.b64encode(image_bytes).decode()
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify this diagram. Identify the type, "
                    "your confidence level (0-1), and a brief "
                    "description of what the diagram shows."
                ),
            },
            {
                "role": "user",
                "content": [{
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64}"
                    },
                }],
            },
        ],
        response_format=DiagramClassification,
    )
    return response.choices[0].message.parsed
```
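The request above hard-codes an `image/png` data URL. If inputs may also arrive as JPEG or GIF screenshots, a small magic-byte sniffing helper keeps the MIME type honest — a minimal sketch, and `data_url` is a hypothetical helper name, not part of any library:

```python
import base64


def data_url(image_bytes: bytes) -> str:
    """Build a data URL, picking the MIME type from the file's magic
    bytes instead of assuming PNG. Falls back to PNG when unknown."""
    if image_bytes[:8] == b"\x89PNG\r\n\x1a\n":
        mime = "image/png"
    elif image_bytes[:3] == b"\xff\xd8\xff":
        mime = "image/jpeg"
    elif image_bytes[:6] in (b"GIF87a", b"GIF89a"):
        mime = "image/gif"
    else:
        mime = "image/png"
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode()
```

Swapping this into the `image_url` payload means a JPEG whiteboard photo is no longer labeled as PNG.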
Extracting Elements and Relationships
Once classified, extract the structural components. For flowcharts, this means nodes and edges. For architecture diagrams, it means components and connections:
```python
class DiagramNode(BaseModel):
    id: str
    label: str
    node_type: str  # process, decision, start, end, component
    properties: dict = {}


class DiagramEdge(BaseModel):
    source_id: str
    target_id: str
    label: str = ""
    edge_type: str = "directed"  # directed, bidirectional


class DiagramStructure(BaseModel):
    nodes: list[DiagramNode]
    edges: list[DiagramEdge]
    title: str = ""
    notes: list[str] = []


async def extract_structure(
    image_bytes: bytes,
    diagram_type: DiagramType,
    client: openai.AsyncOpenAI,
) -> DiagramStructure:
    """Extract nodes and edges from a diagram."""
    b64 = base64.b64encode(image_bytes).decode()
    type_hints = {
        DiagramType.FLOWCHART: (
            "This is a flowchart. Extract all process steps, "
            "decision points, start/end nodes, and the arrows "
            "connecting them. Use node types: process, decision, "
            "start, end, subprocess."
        ),
        DiagramType.ARCHITECTURE: (
            "This is an architecture diagram. Extract all system "
            "components (services, databases, queues, load "
            "balancers, etc.) and their connections. Use node "
            "types: service, database, queue, cache, gateway, "
            "client, external."
        ),
        DiagramType.SEQUENCE: (
            "This is a sequence diagram. Extract all participants "
            "as nodes and messages as edges in chronological order."
        ),
    }
    hint = type_hints.get(
        diagram_type,
        "Extract all elements and their relationships.",
    )
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": hint},
            {
                "role": "user",
                "content": [{
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64}"
                    },
                }],
            },
        ],
        response_format=DiagramStructure,
    )
    return response.choices[0].message.parsed
```
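The introduction mentions generating code from flowcharts. A useful first step over the extracted structure is ordering nodes so each appears after its predecessors; Kahn's topological sort works directly on the node/edge lists. This is a sketch with a hypothetical `execution_order` helper operating on plain IDs and `(source, target)` tuples:

```python
from collections import deque


def execution_order(node_ids: list[str], edges: list[tuple[str, str]]) -> list[str]:
    """Kahn's algorithm: return node IDs ordered so every node comes
    after all of its predecessors. Raises on cycles (flowchart loops)."""
    indegree = {n: 0 for n in node_ids}
    adjacent: dict[str, list[str]] = {n: [] for n in node_ids}
    for src, dst in edges:
        adjacent[src].append(dst)
        indegree[dst] += 1
    queue = deque(n for n in node_ids if indegree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adjacent[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(node_ids):
        raise ValueError("cycle detected: flowchart contains a loop")
    return order
```

Flowcharts with loops deliberately fail here, which is itself a useful signal: loops need explicit handling (e.g. emitting a `while` construct) before straight-line code generation.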
Converting Diagrams to Code
One of the most powerful capabilities is converting a visual diagram into executable code or infrastructure-as-code:
```python
async def diagram_to_mermaid(
    structure: DiagramStructure,
    diagram_type: DiagramType,
) -> str:
    """Convert extracted diagram structure to Mermaid syntax."""
    if diagram_type == DiagramType.FLOWCHART:
        lines = ["flowchart TD"]
        for node in structure.nodes:
            shape = {
                "decision": f"{node.id}{{{node.label}}}",
                "start": f"{node.id}([{node.label}])",
                "end": f"{node.id}([{node.label}])",
                "process": f"{node.id}[{node.label}]",
            }.get(node.node_type, f"{node.id}[{node.label}]")
            lines.append(f"    {shape}")
        for edge in structure.edges:
            if edge.label:
                lines.append(
                    f"    {edge.source_id} -->|{edge.label}| "
                    f"{edge.target_id}"
                )
            else:
                lines.append(
                    f"    {edge.source_id} --> {edge.target_id}"
                )
        return "\n".join(lines)
    elif diagram_type == DiagramType.ARCHITECTURE:
        lines = ["flowchart LR"]
        for node in structure.nodes:
            icon = {
                "database": f"{node.id}[({node.label})]",
                "queue": f"{node.id}>{node.label}]",
                "service": f"{node.id}[{node.label}]",
            }.get(node.node_type, f"{node.id}[{node.label}]")
            lines.append(f"    {icon}")
        for edge in structure.edges:
            arrow = (
                " <--> " if edge.edge_type == "bidirectional"
                else " --> "
            )
            lines.append(
                f"    {edge.source_id}{arrow}{edge.target_id}"
            )
        return "\n".join(lines)
    return "# Unsupported diagram type for Mermaid conversion"
```
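One practical gotcha: extracted labels often contain characters Mermaid reserves for node shapes, such as parentheses, brackets, or pipes, and interpolating them raw breaks the generated syntax. Mermaid accepts double-quoted labels with embedded quotes written as the `#quot;` entity, so a small quoting helper (a sketch; `mermaid_label` is a hypothetical name) hardens the conversion:

```python
def mermaid_label(label: str) -> str:
    """Quote a label so characters like (), [], {} and | are treated
    as plain text by Mermaid; embedded quotes become #quot;."""
    return '"' + label.replace('"', '#quot;') + '"'


# Usage inside the conversion, e.g. for a process node:
#   f"{node.id}[{mermaid_label(node.label)}]"
```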
The Diagram Agent
```python
class DiagramUnderstandingAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def analyze(self, image_bytes: bytes) -> dict:
        classification = await classify_diagram(
            image_bytes, self.client
        )
        structure = await extract_structure(
            image_bytes, classification.diagram_type, self.client
        )
        mermaid = await diagram_to_mermaid(
            structure, classification.diagram_type
        )
        return {
            "type": classification.diagram_type.value,
            "description": classification.description,
            "nodes": len(structure.nodes),
            "edges": len(structure.edges),
            "structure": structure.model_dump(),
            "mermaid_code": mermaid,
        }

    async def ask(
        self, image_bytes: bytes, question: str
    ) -> str:
        b64 = base64.b64encode(image_bytes).decode()
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}"
                        },
                    },
                ],
            }],
        )
        return response.choices[0].message.content
```
FAQ
How accurate is GPT-4o at extracting diagram structures compared to dedicated diagram parsers?
For clean, well-formatted diagrams, GPT-4o extracts nodes and edges with approximately 90% accuracy. It excels at understanding context and labels but can miss precise spatial relationships in dense diagrams. Dedicated parsers like those in draw.io or Lucidchart have access to the underlying XML and achieve near-perfect accuracy on their own formats. Use vision models when you only have a screenshot or image of the diagram.
Can this agent handle hand-drawn diagrams on whiteboards?
Yes, with reduced accuracy. GPT-4o can interpret hand-drawn flowcharts and architecture sketches, identifying boxes, arrows, and labels even when the drawing is rough. For best results, ensure the whiteboard photo has good lighting, minimal glare, and the handwriting is reasonably legible. The classification step still works well because the overall layout patterns — boxes connected by arrows — are recognizable regardless of drawing quality.
How do I validate that the extracted structure is correct?
Convert the extracted structure to Mermaid or Graphviz and render it visually. Compare the rendered output against the original diagram. You can also automate validation by checking that every node has at least one edge (no orphan nodes), decision nodes have exactly two outgoing edges, and start nodes have no incoming edges. These structural constraints catch most extraction errors.
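The structural constraints above are easy to automate. A minimal sketch, using plain dicts rather than the Pydantic models for self-containment (`validate_structure` is a hypothetical helper):

```python
def validate_structure(nodes: list[dict], edges: list[dict]) -> list[str]:
    """Check extraction invariants: no orphan nodes, decision nodes
    have exactly two outgoing edges, start nodes have no incoming."""
    errors = []
    ids = {n["id"] for n in nodes}
    outgoing = {i: 0 for i in ids}
    incoming = {i: 0 for i in ids}
    connected = set()
    for e in edges:
        outgoing[e["source"]] += 1
        incoming[e["target"]] += 1
        connected |= {e["source"], e["target"]}
    for n in nodes:
        if n["id"] not in connected:
            errors.append(f"orphan node: {n['id']}")
        if n["type"] == "decision" and outgoing[n["id"]] != 2:
            errors.append(
                f"decision {n['id']} has {outgoing[n['id']]} outgoing edges"
            )
        if n["type"] == "start" and incoming[n["id"]] != 0:
            errors.append(f"start node {n['id']} has incoming edges")
    return errors
```

An empty list means the structure passed; anything else is a good trigger for re-running extraction or flagging the diagram for human review.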
CallSphere Team