Building a Diagram Understanding Agent: Flowcharts, Architecture Diagrams, and Charts
Create an AI agent that classifies diagram types, extracts elements and relationships from flowcharts and architecture diagrams, and converts visual diagrams into structured data and code representations.
Why Diagram Understanding Is Valuable
Technical documentation is full of diagrams — flowcharts describing business processes, architecture diagrams showing system components, sequence diagrams illustrating API interactions, and data flow charts mapping pipelines. An agent that can read and understand these diagrams can answer questions about system architecture, generate code from flowcharts, identify missing components, and convert visual documentation into machine-readable formats.
Diagram Classification
The first step is identifying what type of diagram the agent is looking at, because each type requires a different extraction strategy:
```python
import base64
from enum import Enum

import openai
from pydantic import BaseModel


class DiagramType(str, Enum):
    FLOWCHART = "flowchart"
    ARCHITECTURE = "architecture"
    SEQUENCE = "sequence"
    ER_DIAGRAM = "er_diagram"
    DATA_FLOW = "data_flow"
    ORG_CHART = "org_chart"
    CHART = "chart"  # bar, line, pie
    UNKNOWN = "unknown"


class DiagramClassification(BaseModel):
    diagram_type: DiagramType
    confidence: float
    description: str


async def classify_diagram(
    image_bytes: bytes, client: openai.AsyncOpenAI
) -> DiagramClassification:
    """Classify the type of diagram in an image."""
    b64 = base64.b64encode(image_bytes).decode()
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify this diagram. Identify the type, "
                    "your confidence level (0-1), and a brief "
                    "description of what the diagram shows."
                ),
            },
            {
                "role": "user",
                "content": [{
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64}"
                    },
                }],
            },
        ],
        response_format=DiagramClassification,
    )
    return response.choices[0].message.parsed
```
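The request above hard-codes an `image/png` data URL. If inputs may also arrive as JPEG or GIF screenshots, a small magic-byte sniffing helper keeps the MIME type honest — a minimal sketch, and `data_url` is a hypothetical helper name, not part of any library:

```python
import base64


def data_url(image_bytes: bytes) -> str:
    """Build a data URL, picking the MIME type from the file's magic
    bytes instead of assuming PNG. Falls back to PNG when unknown."""
    if image_bytes[:8] == b"\x89PNG\r\n\x1a\n":
        mime = "image/png"
    elif image_bytes[:3] == b"\xff\xd8\xff":
        mime = "image/jpeg"
    elif image_bytes[:6] in (b"GIF87a", b"GIF89a"):
        mime = "image/gif"
    else:
        mime = "image/png"
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode()
```

Swapping this into the `image_url` payload means a JPEG whiteboard photo is no longer labeled as PNG.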
Extracting Elements and Relationships
Once classified, extract the structural components. For flowcharts, this means nodes and edges. For architecture diagrams, it means components and connections:
```python
class DiagramNode(BaseModel):
    id: str
    label: str
    node_type: str  # process, decision, start, end, component
    properties: dict = {}


class DiagramEdge(BaseModel):
    source_id: str
    target_id: str
    label: str = ""
    edge_type: str = "directed"  # directed, bidirectional


class DiagramStructure(BaseModel):
    nodes: list[DiagramNode]
    edges: list[DiagramEdge]
    title: str = ""
    notes: list[str] = []


async def extract_structure(
    image_bytes: bytes,
    diagram_type: DiagramType,
    client: openai.AsyncOpenAI,
) -> DiagramStructure:
    """Extract nodes and edges from a diagram."""
    b64 = base64.b64encode(image_bytes).decode()
    type_hints = {
        DiagramType.FLOWCHART: (
            "This is a flowchart. Extract all process steps, "
            "decision points, start/end nodes, and the arrows "
            "connecting them. Use node types: process, decision, "
            "start, end, subprocess."
        ),
        DiagramType.ARCHITECTURE: (
            "This is an architecture diagram. Extract all system "
            "components (services, databases, queues, load "
            "balancers, etc.) and their connections. Use node "
            "types: service, database, queue, cache, gateway, "
            "client, external."
        ),
        DiagramType.SEQUENCE: (
            "This is a sequence diagram. Extract all participants "
            "as nodes and messages as edges in chronological order."
        ),
    }
    hint = type_hints.get(
        diagram_type,
        "Extract all elements and their relationships.",
    )
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": hint},
            {
                "role": "user",
                "content": [{
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64}"
                    },
                }],
            },
        ],
        response_format=DiagramStructure,
    )
    return response.choices[0].message.parsed
```
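The introduction mentions generating code from flowcharts. A useful first step over the extracted structure is ordering nodes so each appears after its predecessors; Kahn's topological sort works directly on the node/edge lists. This is a sketch with a hypothetical `execution_order` helper operating on plain IDs and `(source, target)` tuples:

```python
from collections import deque


def execution_order(node_ids: list[str], edges: list[tuple[str, str]]) -> list[str]:
    """Kahn's algorithm: return node IDs ordered so every node comes
    after all of its predecessors. Raises on cycles (flowchart loops)."""
    indegree = {n: 0 for n in node_ids}
    adjacent: dict[str, list[str]] = {n: [] for n in node_ids}
    for src, dst in edges:
        adjacent[src].append(dst)
        indegree[dst] += 1
    queue = deque(n for n in node_ids if indegree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adjacent[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(node_ids):
        raise ValueError("cycle detected: flowchart contains a loop")
    return order
```

Flowcharts with loops deliberately fail here, which is itself a useful signal: loops need explicit handling (e.g. emitting a `while` construct) before straight-line code generation.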
Converting Diagrams to Code
One of the most powerful capabilities is converting a visual diagram into executable code or infrastructure-as-code:
```python
async def diagram_to_mermaid(
    structure: DiagramStructure,
    diagram_type: DiagramType,
) -> str:
    """Convert extracted diagram structure to Mermaid syntax."""
    if diagram_type == DiagramType.FLOWCHART:
        lines = ["flowchart TD"]
        for node in structure.nodes:
            shape = {
                "decision": f"{node.id}{{{node.label}}}",
                "start": f"{node.id}([{node.label}])",
                "end": f"{node.id}([{node.label}])",
                "process": f"{node.id}[{node.label}]",
            }.get(node.node_type, f"{node.id}[{node.label}]")
            lines.append(f"    {shape}")
        for edge in structure.edges:
            if edge.label:
                lines.append(
                    f"    {edge.source_id} -->|{edge.label}| "
                    f"{edge.target_id}"
                )
            else:
                lines.append(
                    f"    {edge.source_id} --> {edge.target_id}"
                )
        return "\n".join(lines)
    elif diagram_type == DiagramType.ARCHITECTURE:
        lines = ["flowchart LR"]
        for node in structure.nodes:
            icon = {
                "database": f"{node.id}[({node.label})]",
                "queue": f"{node.id}>{node.label}]",
                "service": f"{node.id}[{node.label}]",
            }.get(node.node_type, f"{node.id}[{node.label}]")
            lines.append(f"    {icon}")
        for edge in structure.edges:
            arrow = (
                " <--> " if edge.edge_type == "bidirectional"
                else " --> "
            )
            lines.append(
                f"    {edge.source_id}{arrow}{edge.target_id}"
            )
        return "\n".join(lines)
    return "# Unsupported diagram type for Mermaid conversion"
```
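One practical gotcha: extracted labels often contain characters Mermaid reserves for node shapes, such as parentheses, brackets, or pipes, and interpolating them raw breaks the generated syntax. Mermaid accepts double-quoted labels with embedded quotes written as the `#quot;` entity, so a small quoting helper (a sketch; `mermaid_label` is a hypothetical name) hardens the conversion:

```python
def mermaid_label(label: str) -> str:
    """Quote a label so characters like (), [], {} and | are treated
    as plain text by Mermaid; embedded quotes become #quot;."""
    return '"' + label.replace('"', '#quot;') + '"'


# Usage inside the conversion, e.g. for a process node:
#   f"{node.id}[{mermaid_label(node.label)}]"
```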
The Diagram Agent
```python
class DiagramUnderstandingAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def analyze(self, image_bytes: bytes) -> dict:
        classification = await classify_diagram(
            image_bytes, self.client
        )
        structure = await extract_structure(
            image_bytes, classification.diagram_type, self.client
        )
        mermaid = await diagram_to_mermaid(
            structure, classification.diagram_type
        )
        return {
            "type": classification.diagram_type.value,
            "description": classification.description,
            "nodes": len(structure.nodes),
            "edges": len(structure.edges),
            "structure": structure.model_dump(),
            "mermaid_code": mermaid,
        }

    async def ask(
        self, image_bytes: bytes, question: str
    ) -> str:
        b64 = base64.b64encode(image_bytes).decode()
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}"
                        },
                    },
                ],
            }],
        )
        return response.choices[0].message.content
```
FAQ
How accurate is GPT-4o at extracting diagram structures compared to dedicated diagram parsers?
For clean, well-formatted diagrams, GPT-4o extracts nodes and edges with approximately 90% accuracy. It excels at understanding context and labels but can miss precise spatial relationships in dense diagrams. Dedicated parsers like those in draw.io or Lucidchart have access to the underlying XML and achieve near-perfect accuracy on their own formats. Use vision models when you only have a screenshot or image of the diagram.
Can this agent handle hand-drawn diagrams on whiteboards?
Yes, with reduced accuracy. GPT-4o can interpret hand-drawn flowcharts and architecture sketches, identifying boxes, arrows, and labels even when the drawing is rough. For best results, ensure the whiteboard photo has good lighting, minimal glare, and the handwriting is reasonably legible. The classification step still works well because the overall layout patterns — boxes connected by arrows — are recognizable regardless of drawing quality.
How do I validate that the extracted structure is correct?
Convert the extracted structure to Mermaid or Graphviz and render it visually. Compare the rendered output against the original diagram. You can also automate validation by checking that every node has at least one edge (no orphan nodes), decision nodes have exactly two outgoing edges, and start nodes have no incoming edges. These structural constraints catch most extraction errors.
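The structural constraints above are easy to automate. A minimal sketch, using plain dicts rather than the Pydantic models for self-containment (`validate_structure` is a hypothetical helper):

```python
def validate_structure(nodes: list[dict], edges: list[dict]) -> list[str]:
    """Check extraction invariants: no orphan nodes, decision nodes
    have exactly two outgoing edges, start nodes have no incoming."""
    errors = []
    ids = {n["id"] for n in nodes}
    outgoing = {i: 0 for i in ids}
    incoming = {i: 0 for i in ids}
    connected = set()
    for e in edges:
        outgoing[e["source"]] += 1
        incoming[e["target"]] += 1
        connected |= {e["source"], e["target"]}
    for n in nodes:
        if n["id"] not in connected:
            errors.append(f"orphan node: {n['id']}")
        if n["type"] == "decision" and outgoing[n["id"]] != 2:
            errors.append(
                f"decision {n['id']} has {outgoing[n['id']]} outgoing edges"
            )
        if n["type"] == "start" and incoming[n["id"]] != 0:
            errors.append(f"start node {n['id']} has incoming edges")
    return errors
```

An empty list means the structure passed; anything else is a good trigger for re-running extraction or flagging the diagram for human review.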
CallSphere Team