Building a Resume Parser with Structured Outputs: End-to-End Tutorial
Build a complete resume parsing pipeline from PDF to structured data. Covers PDF text extraction, schema design for work experience and education, LLM extraction, validation, and output formatting.
Why Build a Resume Parser?
Resume parsing is a classic structured extraction problem. Resumes contain predictable data types (names, dates, companies, skills) but wildly inconsistent formatting. Traditional regex-based parsers break on every new resume template. LLM-based parsers handle any format because they understand the content semantically, not syntactically.
In this tutorial, you will build a complete pipeline: PDF input, text extraction, LLM-powered structured extraction, validation, and clean JSON output.
Step 1: Define the Schema
Start by modeling what a parsed resume looks like:
```python
from pydantic import BaseModel, Field
from typing import List, Optional

class ContactInfo(BaseModel):
    full_name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = Field(default=None, description="City, State or City, Country")
    linkedin_url: Optional[str] = None
    portfolio_url: Optional[str] = None

class WorkExperience(BaseModel):
    company: str
    title: str
    start_date: Optional[str] = Field(default=None, description="YYYY-MM format")
    end_date: Optional[str] = Field(default=None, description="YYYY-MM or 'Present'")
    location: Optional[str] = None
    description: Optional[str] = None
    achievements: List[str] = Field(default_factory=list)

class Education(BaseModel):
    institution: str
    degree: Optional[str] = None
    field_of_study: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    gpa: Optional[float] = Field(default=None, ge=0.0, le=4.0)

class ParsedResume(BaseModel):
    contact: ContactInfo
    summary: Optional[str] = Field(default=None, description="Professional summary or objective")
    work_experience: List[WorkExperience]
    education: List[Education]
    skills: List[str]
    certifications: List[str] = Field(default_factory=list)
    languages: List[str] = Field(default_factory=list)
```
Design choices matter here. Making fields Optional with None defaults gives the model a legitimate way to leave missing data blank instead of hallucinating values. The YYYY-MM format for dates handles the common resume pattern where exact days are not listed.
Step 2: Extract Text from PDF
Use PyMuPDF (fitz) for reliable text extraction:
```shell
pip install pymupdf
```
```python
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF file, preserving basic structure."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        text = page.get_text("text")
        pages.append(text)
    doc.close()
    return "\n\n".join(pages)

# Usage
resume_text = extract_text_from_pdf("resume.pdf")
print(f"Extracted {len(resume_text)} characters")
```
PyMuPDF handles most PDF formats, including those with columns, tables, and embedded fonts. For scanned PDFs (images), you will need OCR; add pytesseract as a preprocessing step.
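If a resume is a scanned image, the extractor above returns little or no text. A minimal OCR fallback sketch, assuming pytesseract, Pillow, and the Tesseract binary are installed (the function name and 300 DPI setting are illustrative choices, not fixed requirements):

```python
import io
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_text_with_ocr(pdf_path: str) -> str:
    """Render each page to an image and OCR it with Tesseract."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)  # rasterize at an OCR-friendly resolution
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        pages.append(pytesseract.image_to_string(img))
    doc.close()
    return "\n\n".join(pages)
```

A simple heuristic is to try `extract_text_from_pdf` first and fall back to OCR only when the extracted text is nearly empty, since OCR is far slower.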
Step 3: LLM Extraction
Send the extracted text to the LLM with your schema:
```python
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

def parse_resume(resume_text: str) -> ParsedResume:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=ParsedResume,
        max_retries=3,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert resume parser. Extract structured data "
                    "from the resume text. Rules:\n"
                    "- Only extract information explicitly stated in the resume\n"
                    "- Use null for fields not present in the text\n"
                    "- List achievements as separate bullet points\n"
                    "- Normalize dates to YYYY-MM format when possible\n"
                    "- List skills as individual items, not comma-separated strings"
                ),
            },
            {"role": "user", "content": resume_text},
        ],
    )
```
Step 4: Add Validation
Add validators that catch common LLM extraction errors:
```python
import re
from typing import List
from pydantic import field_validator, model_validator

class ValidatedResume(ParsedResume):
    @model_validator(mode="after")
    def validate_work_dates(self) -> "ValidatedResume":
        """Ensure work experience dates use YYYY-MM format (or 'Present')."""
        date_pattern = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")
        for job in self.work_experience:
            if job.start_date and not date_pattern.match(job.start_date):
                if job.start_date.lower() != "present":
                    raise ValueError(
                        f"Invalid start_date format: '{job.start_date}' for {job.company}"
                    )
            if job.end_date and job.end_date.lower() != "present":
                if not date_pattern.match(job.end_date):
                    raise ValueError(
                        f"Invalid end_date format: '{job.end_date}' for {job.company}"
                    )
        return self

    @field_validator("skills")
    @classmethod
    def deduplicate_skills(cls, v: List[str]) -> List[str]:
        """Remove duplicate skills (case-insensitive)."""
        seen = set()
        unique = []
        for skill in v:
            normalized = skill.lower().strip()
            if normalized not in seen:
                seen.add(normalized)
                unique.append(skill.strip())
        return unique
```
When Instructor detects a validation error, it automatically retries the LLM call with the error message appended. The model sees "Invalid start_date format: 'March 2022'" and corrects it to "2022-03" on the next attempt.
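The date rule the validator enforces can be sanity-checked outside of Pydantic. This standalone sketch reuses the same regex (the `is_valid_resume_date` helper is introduced here for illustration):

```python
import re

# Same pattern as the validator: four-digit year, dash, month 01-12.
DATE_PATTERN = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")

def is_valid_resume_date(value: str) -> bool:
    """Accept YYYY-MM strings or the literal 'Present' (any case)."""
    return value.lower() == "present" or bool(DATE_PATTERN.match(value))

print(is_valid_resume_date("2022-03"))     # True
print(is_valid_resume_date("March 2022"))  # False
```

Testing the pattern in isolation like this helps you tune the regex before wiring it into a retry loop, where each failed attempt costs an LLM call.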
Step 5: Output Formatting
Convert the parsed resume to your target format:
```python
import json

def resume_to_json(parsed: ParsedResume) -> str:
    """Export parsed resume as formatted JSON."""
    return parsed.model_dump_json(indent=2, exclude_none=True)

def resume_to_csv_row(parsed: ParsedResume) -> dict:
    """Flatten resume for CSV/spreadsheet export."""
    return {
        "name": parsed.contact.full_name,
        "email": parsed.contact.email,
        "phone": parsed.contact.phone,
        "location": parsed.contact.location,
        "latest_company": parsed.work_experience[0].company if parsed.work_experience else None,
        "latest_title": parsed.work_experience[0].title if parsed.work_experience else None,
        # Number of listed positions, not literal years of experience
        "num_positions": len(parsed.work_experience),
        # Assumes education entries are listed most recent / highest first
        "highest_degree": parsed.education[0].degree if parsed.education else None,
        "skills": ", ".join(parsed.skills),
        "num_certifications": len(parsed.certifications),
    }
```
Complete Pipeline
```python
def process_resume(pdf_path: str) -> dict:
    """End-to-end resume processing pipeline."""
    # Extract text
    text = extract_text_from_pdf(pdf_path)
    if len(text.strip()) < 50:
        raise ValueError("PDF appears empty or unreadable. Try OCR.")

    # Parse with LLM
    parsed = parse_resume(text)

    # Return structured output
    return {
        "parsed": parsed.model_dump(exclude_none=True),
        "json": resume_to_json(parsed),
        "csv_row": resume_to_csv_row(parsed),
    }

result = process_resume("candidate_resume.pdf")
print(json.dumps(result["parsed"], indent=2))
```
FAQ
How accurate is LLM-based resume parsing compared to commercial parsers?
In informal tests on diverse resume formats, GPT-4o achieves roughly 90-95% field-level accuracy on standard fields like name, email, and company names. Commercial parsers like Sovren or Textkernel reach similar accuracy on standard formats but struggle more with creative or non-standard layouts, where LLMs excel.
How do I handle multi-page resumes?
PyMuPDF concatenates all pages automatically. For resumes over 4 pages, the full text may exceed the model's optimal extraction context. In that case, extract contact info and summary from page 1, work experience from middle pages, and education/skills from the final section — then merge the results.
What about data privacy when sending resumes to OpenAI?
Resumes contain sensitive personal information. OpenAI's API data usage policy states that API data is not used for training by default. For strict privacy requirements, run a local model via Ollama or vLLM through Instructor's OpenAI-compatible mode; this keeps all data on your infrastructure.
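As a sketch of the local-model route, assuming an Ollama server running on its default port (the base_url, api_key placeholder, and model name below are assumptions to adjust for your setup):

```python
import instructor
from openai import OpenAI

# Point the OpenAI client at a local Ollama server instead of api.openai.com.
# Ollama ignores the API key, but the client requires a non-empty value.
client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,
)

# parse_resume() then works unchanged once model= is set to a locally
# available model (e.g. "llama3.1"); no resume text leaves your machine.
```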
CallSphere Team
Expert insights on AI voice agents and customer communication automation.