
Building a Resume Parser with Structured Outputs: End-to-End Tutorial

Build a complete resume parsing pipeline from PDF to structured data. Covers PDF text extraction, schema design for work experience and education, LLM extraction, validation, and output formatting.

Why Build a Resume Parser?

Resume parsing is a classic structured extraction problem. Resumes contain predictable data types (names, dates, companies, skills) but wildly inconsistent formatting. Traditional regex-based parsers break on every new resume template. LLM-based parsers handle any format because they understand the content semantically, not syntactically.

In this tutorial, you will build a complete pipeline: PDF input, text extraction, LLM-powered structured extraction, validation, and clean JSON output.

Step 1: Define the Schema

Start by modeling what a parsed resume looks like:

from pydantic import BaseModel, Field, field_validator
from typing import List, Optional
from datetime import date

class ContactInfo(BaseModel):
    full_name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = Field(default=None, description="City, State or City, Country")
    linkedin_url: Optional[str] = None
    portfolio_url: Optional[str] = None

class WorkExperience(BaseModel):
    company: str
    title: str
    start_date: Optional[str] = Field(default=None, description="YYYY-MM format")
    end_date: Optional[str] = Field(default=None, description="YYYY-MM or 'Present'")
    location: Optional[str] = None
    description: Optional[str] = None
    achievements: List[str] = Field(default_factory=list)

class Education(BaseModel):
    institution: str
    degree: Optional[str] = None
    field_of_study: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    gpa: Optional[float] = Field(default=None, ge=0.0, le=4.0)

class ParsedResume(BaseModel):
    contact: ContactInfo
    summary: Optional[str] = Field(default=None, description="Professional summary or objective")
    work_experience: List[WorkExperience]
    education: List[Education]
    skills: List[str]
    certifications: List[str] = Field(default_factory=list)
    languages: List[str] = Field(default_factory=list)

Design choices matter here. Using Optional with None defaults means the model will not hallucinate values for missing fields. The YYYY-MM format for dates handles the common resume pattern where exact days are not listed.
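The Optional-with-None pattern is worth a quick standalone demonstration. A minimal sketch with a tiny model defined inline purely for illustration (not part of the resume schema above):

```python
from typing import Optional
from pydantic import BaseModel

class Mini(BaseModel):
    name: str
    email: Optional[str] = None  # missing fields stay None instead of raising

# Only "name" is provided; "email" defaults cleanly to None
m = Mini.model_validate({"name": "Jane Doe"})
print(m.email)                                 # None
print(m.model_dump_json(exclude_none=True))    # {"name":"Jane Doe"}
```

The same behavior carries over to every Optional field in ContactInfo, WorkExperience, and Education: absent data stays absent rather than being filled with guesses.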

Step 2: Extract Text from PDF

Use PyMuPDF (fitz) for reliable text extraction:

pip install pymupdf

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF file, preserving basic structure."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        text = page.get_text("text")
        pages.append(text)
    doc.close()
    return "\n\n".join(pages)

# Usage
resume_text = extract_text_from_pdf("resume.pdf")
print(f"Extracted {len(resume_text)} characters")

PyMuPDF handles most PDF formats, including those with columns, tables, and embedded fonts. For scanned PDFs (images), you would need OCR — add pytesseract as a preprocessing step.
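Whether OCR is needed can be decided with a simple heuristic: scanned pages yield almost no selectable text. A minimal sketch (the per-page character threshold is an assumption you should tune for your corpus):

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 100) -> bool:
    """Heuristic: if the average extracted text per page is tiny,
    the PDF is probably a scan and needs OCR."""
    if not page_texts:
        return True
    avg = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg < min_chars_per_page

# A scanned page typically extracts as an empty (or nearly empty) string
print(needs_ocr(["", "  "]))                      # True
print(needs_ocr(["Jane Doe\nSenior Engineer, Acme Corp\n" * 20]))  # False
```

If this returns True, render each page to an image and run it through pytesseract before continuing with the pipeline.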


Step 3: LLM Extraction

Send the extracted text to the LLM with your schema:

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

def parse_resume(resume_text: str) -> ParsedResume:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=ParsedResume,
        max_retries=3,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert resume parser. Extract structured data "
                    "from the resume text. Rules:\n"
                    "- Only extract information explicitly stated in the resume\n"
                    "- Use null for fields not present in the text\n"
                    "- List achievements as separate bullet points\n"
                    "- Normalize dates to YYYY-MM format when possible\n"
                    "- List skills as individual items, not comma-separated strings"
                )
            },
            {"role": "user", "content": resume_text}
        ],
    )

Step 4: Add Validation

Add validators that catch common LLM extraction errors:

from pydantic import model_validator
import re

class ValidatedResume(ParsedResume):

    @model_validator(mode="after")
    def validate_work_dates(self) -> "ValidatedResume":
        """Ensure work experience dates are chronologically valid."""
        date_pattern = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")

        for job in self.work_experience:
            if job.start_date and not date_pattern.match(job.start_date):
                if job.start_date.lower() != "present":
                    raise ValueError(
                        f"Invalid start_date format: '{job.start_date}' for {job.company}"
                    )
            if job.end_date and job.end_date.lower() != "present":
                if not date_pattern.match(job.end_date):
                    raise ValueError(
                        f"Invalid end_date format: '{job.end_date}' for {job.company}"
                    )
        return self

    @field_validator("skills")
    @classmethod
    def deduplicate_skills(cls, v: List[str]) -> List[str]:
        """Remove duplicate skills (case-insensitive)."""
        seen = set()
        unique = []
        for skill in v:
            normalized = skill.lower().strip()
            if normalized not in seen:
                seen.add(normalized)
                unique.append(skill.strip())
        return unique

When Instructor detects a validation error, it automatically retries the LLM call with the error message appended. (For these validators to run, pass ValidatedResume rather than ParsedResume as the response_model in parse_resume.) The model sees "Invalid start_date format: 'March 2022'" and corrects it to "2022-03" on the next attempt.
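If you would rather normalize common date strings yourself instead of spending a retry on them, a small pre-processing helper works. A sketch using only the stdlib (the list of accepted formats is an assumption; anything unrecognized is passed through for the validator to catch):

```python
from datetime import datetime
from typing import Optional

def normalize_date(raw: Optional[str]) -> Optional[str]:
    """Try common resume date formats and return YYYY-MM; pass through otherwise."""
    if raw is None or raw.strip().lower() == "present":
        return raw
    for fmt in ("%B %Y", "%b %Y", "%m/%Y", "%Y-%m"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m")
        except ValueError:
            continue
    return raw  # unrecognized -- let the validator/retry loop handle it

print(normalize_date("March 2022"))  # 2022-03
print(normalize_date("06/2019"))     # 2019-06
print(normalize_date("Present"))     # Present
```

Running this over start_date and end_date before validation cuts down on retries, which saves both latency and tokens.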

Step 5: Output Formatting

Convert the parsed resume to your target format:

import json

def resume_to_json(parsed: ParsedResume) -> str:
    """Export parsed resume as formatted JSON."""
    return parsed.model_dump_json(indent=2, exclude_none=True)

def resume_to_csv_row(parsed: ParsedResume) -> dict:
    """Flatten resume for CSV/spreadsheet export."""
    return {
        "name": parsed.contact.full_name,
        "email": parsed.contact.email,
        "phone": parsed.contact.phone,
        "location": parsed.contact.location,
        "latest_company": parsed.work_experience[0].company if parsed.work_experience else None,
        "latest_title": parsed.work_experience[0].title if parsed.work_experience else None,
        "num_positions": len(parsed.work_experience),  # count of roles, not years
        "highest_degree": parsed.education[0].degree if parsed.education else None,  # assumes highest degree listed first
        "skills": ", ".join(parsed.skills),
        "num_certifications": len(parsed.certifications),
    }

Complete Pipeline

def process_resume(pdf_path: str) -> dict:
    """End-to-end resume processing pipeline."""
    # Extract text
    text = extract_text_from_pdf(pdf_path)

    if len(text.strip()) < 50:
        raise ValueError("PDF appears empty or unreadable. Try OCR.")

    # Parse with LLM
    parsed = parse_resume(text)

    # Return structured output
    return {
        "parsed": parsed.model_dump(exclude_none=True),
        "json": resume_to_json(parsed),
        "csv_row": resume_to_csv_row(parsed),
    }

result = process_resume("candidate_resume.pdf")
print(json.dumps(result["parsed"], indent=2))

FAQ

How accurate is LLM-based resume parsing compared to commercial parsers?

In tests on diverse resume formats, GPT-4o achieves 90-95% field-level accuracy on standard fields like name, email, and company names. Commercial parsers like Sovren or Textkernel achieve similar accuracy on standard formats but struggle more with creative or non-standard layouts where LLMs excel.

How do I handle multi-page resumes?

PyMuPDF concatenates all pages automatically. For resumes over 4 pages, the full text may exceed the model's optimal extraction context. In that case, extract contact info and summary from page 1, work experience from middle pages, and education/skills from the final section — then merge the results.
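Merging per-section results can be sketched as plain dict merging, assuming you run a separate extraction call per page range and dump each result to a dict (the section names here mirror the schema fields above):

```python
def merge_sections(*parts: dict) -> dict:
    """Combine partial extractions from different page ranges into one resume dict.
    Later parts only fill keys the earlier parts left empty; lists are concatenated."""
    merged: dict = {}
    for part in parts:
        for key, value in part.items():
            if key not in merged or merged[key] in (None, [], ""):
                merged[key] = value
            elif isinstance(merged[key], list) and isinstance(value, list):
                merged[key] = merged[key] + value  # e.g. work_experience across pages
    return merged

combined = merge_sections(
    {"contact": {"full_name": "Jane Doe"}, "skills": []},
    {"work_experience": [{"company": "Acme"}]},
    {"skills": ["Python"], "education": []},
)
print(sorted(combined))  # ['contact', 'education', 'skills', 'work_experience']
```

The merged dict can then be re-validated with ParsedResume.model_validate to confirm the combined result still satisfies the schema.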

What about data privacy when sending resumes to OpenAI?

Resumes contain sensitive personal information. Use OpenAI's API data usage policy (API data is not used for training by default). For strict privacy requirements, run a local model via Ollama or vLLM with Instructor's OpenAI-compatible mode. This keeps all data on your infrastructure.
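Because Instructor wraps any OpenAI-compatible client, pointing the pipeline at a local endpoint is a configuration change rather than a code change. A sketch, assuming an Ollama server on its default port and a model you have already pulled (both the base_url and the model name must match your local setup):

```python
import instructor
from openai import OpenAI

# Point the OpenAI-compatible client at a local Ollama server.
local_client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,  # JSON mode is the safer default for local models
)

# Then reuse the same parse_resume logic, swapping in the local client and model:
# local_client.chat.completions.create(
#     model="llama3.1",
#     response_model=ParsedResume,
#     messages=[...],
# )
```

Extraction quality with small local models will generally trail GPT-4o, so re-run your validation suite before relying on local parsing in production.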


#ResumeParser #DataExtraction #PDF #StructuredOutputs #Tutorial #AgenticAI #LearnAI #AIEngineering
