Classification with Structured Outputs: Sentiment, Intent, and Category Detection
Implement text classification systems using structured outputs. Build sentiment analysis, intent detection, and hierarchical category classification with enum constraints, confidence scores, and multi-label support.
Classification as Structured Extraction
Text classification is a special case of structured output: instead of extracting free-form entities, you are constraining the model to pick from a fixed set of labels. Structured outputs turn this into a reliable, repeatable process by using enums to restrict possible values and numeric fields for confidence scores.
This approach gives you three things that prompt-only classification cannot: guaranteed valid labels, calibrated confidence scores, and multi-label support — all in a single API call.
Simple Sentiment Analysis
Start with the most basic classification task:
```python
from pydantic import BaseModel, Field
from typing import Literal

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(description="Brief explanation for the classification")

def classify_sentiment(text: str) -> SentimentResult:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=SentimentResult,
        messages=[
            {
                "role": "system",
                "content": "Classify the sentiment of the given text. Be precise with confidence scores."
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_sentiment("The product works great but shipping took forever.")
print(result)
# SentimentResult(sentiment='neutral', confidence=0.65, reasoning='Mixed sentiment...')
```
The Literal type ensures the model can only return one of the three valid labels. No post-processing needed to handle misspellings or creative label names.
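Because the schema is enforced by Pydantic on your side, you can see the guarantee without any API call. A minimal sketch that redeclares the same model locally: a misspelled label fails validation, and a `ValidationError` like this is exactly what instructor feeds back to the model to trigger a retry.

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

# A well-formed payload parses cleanly.
ok = SentimentResult(sentiment="positive", confidence=0.9, reasoning="Clear praise.")

# A misspelled label ("postive") fails validation before your code ever sees it.
try:
    SentimentResult(sentiment="postive", confidence=0.9, reasoning="typo")
    rejected = False
except ValidationError:
    rejected = True

print("invalid label rejected:", rejected)  # invalid label rejected: True
```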
Intent Detection for Chatbots
Customer support systems need to route messages by intent. Define a comprehensive intent taxonomy:
```python
from enum import Enum
from typing import List

class CustomerIntent(str, Enum):
    billing_inquiry = "billing_inquiry"
    technical_support = "technical_support"
    account_management = "account_management"
    product_information = "product_information"
    complaint = "complaint"
    cancellation = "cancellation"
    feedback = "feedback"
    general_question = "general_question"

class IntentClassification(BaseModel):
    primary_intent: CustomerIntent
    secondary_intent: CustomerIntent | None = None
    confidence: float = Field(ge=0.0, le=1.0)
    urgency: Literal["low", "medium", "high", "critical"]
    suggested_department: str

def classify_intent(message: str) -> IntentClassification:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=IntentClassification,
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the customer message intent. "
                    "Assign urgency based on customer frustration level "
                    "and business impact. Suggest the right department to handle it."
                )
            },
            {"role": "user", "content": message}
        ],
    )

result = classify_intent("I've been charged twice for my subscription and nobody is responding!")
print(result.primary_intent)        # CustomerIntent.billing_inquiry
print(result.urgency)               # "high"
print(result.suggested_department)  # "Billing Support"
```
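The classification result maps naturally onto a routing table in application code. A hypothetical sketch: the department names and the escalation rule below are illustrative assumptions, not part of instructor or the model's output contract.

```python
from enum import Enum

class CustomerIntent(str, Enum):
    billing_inquiry = "billing_inquiry"
    technical_support = "technical_support"
    complaint = "complaint"
    general_question = "general_question"

# Hypothetical intent-to-queue map; adapt to your own org chart.
DEPARTMENTS = {
    CustomerIntent.billing_inquiry: "Billing Support",
    CustomerIntent.technical_support: "Tech Support",
    CustomerIntent.complaint: "Customer Relations",
}

def route(intent: CustomerIntent, urgency: str) -> str:
    queue = DEPARTMENTS.get(intent, "General Support")
    # Critical tickets skip the normal queue regardless of intent.
    return f"escalations/{queue}" if urgency == "critical" else queue

print(route(CustomerIntent.billing_inquiry, "high"))  # Billing Support
print(route(CustomerIntent.complaint, "critical"))    # escalations/Customer Relations
```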
Multi-Label Classification
Some texts belong to multiple categories. Use a list of labels with confidence per label:
```python
class CategoryScore(BaseModel):
    category: str
    confidence: float = Field(ge=0.0, le=1.0)

class MultiLabelResult(BaseModel):
    labels: List[CategoryScore]
    primary_category: str

    @property
    def above_threshold(self) -> List[CategoryScore]:
        return [label for label in self.labels if label.confidence >= 0.5]

CATEGORIES = [
    "technology", "business", "health", "sports",
    "politics", "entertainment", "science", "education"
]

def classify_article(text: str) -> MultiLabelResult:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=MultiLabelResult,
        messages=[
            {
                "role": "system",
                "content": (
                    f"Classify the article into these categories: {CATEGORIES}. "
                    "An article can belong to multiple categories. "
                    "Assign a confidence score to each relevant category. "
                    "Only include categories with confidence > 0.2."
                )
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_article("AI startup raises $50M to revolutionize medical imaging")
for label in result.above_threshold:
    print(f"{label.category}: {label.confidence:.2f}")
# technology: 0.90
# health: 0.85
# business: 0.75
```
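The 0.5 cutoff in above_threshold is a starting point, not a law. If you have even a small labeled set, you can pick the cutoff that maximizes F1. A sketch on a single toy document (the scores and gold labels are made up for illustration; in practice, sweep the threshold over a whole dev set):

```python
def f1_at_threshold(predictions, gold, threshold):
    """F1 for one document: predictions are (category, confidence) pairs,
    gold is the set of true categories."""
    predicted = {c for c, s in predictions if s >= threshold}
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: the model over-predicts "business" at low cutoffs.
preds = [("technology", 0.90), ("health", 0.85), ("business", 0.45)]
gold = {"technology", "health"}

best = max((0.3, 0.5, 0.7), key=lambda t: f1_at_threshold(preds, gold, t))
print(best)  # 0.5
```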
Hierarchical Classification
Some taxonomies have parent-child relationships. Model them explicitly:
```python
import json

class HierarchicalCategory(BaseModel):
    level_1: str = Field(description="Top-level category")
    level_2: str = Field(description="Sub-category")
    level_3: str | None = Field(default=None, description="Specific topic")
    confidence: float = Field(ge=0.0, le=1.0)

TAXONOMY = {
    "Technology": {
        "Software": ["Web Development", "Mobile Apps", "DevOps", "AI/ML"],
        "Hardware": ["Processors", "Storage", "Networking"],
    },
    "Business": {
        "Finance": ["Investing", "Banking", "Cryptocurrency"],
        "Management": ["Leadership", "Strategy", "Operations"],
    },
}

def classify_hierarchical(text: str) -> HierarchicalCategory:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=HierarchicalCategory,
        messages=[
            {
                "role": "system",
                "content": (
                    f"Classify the text using this taxonomy:\n"
                    f"{json.dumps(TAXONOMY, indent=2)}\n"
                    "Each level must be a valid entry from the taxonomy."
                )
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_hierarchical("How to deploy FastAPI with Kubernetes")
print(f"{result.level_1} > {result.level_2} > {result.level_3}")
# Technology > Software > DevOps
```
Batch Classification for Efficiency
When classifying hundreds of items, batch them to reduce overhead:
```python
import asyncio
from openai import AsyncOpenAI

async_client = instructor.from_openai(AsyncOpenAI())

async def classify_batch(texts: List[str]) -> List[SentimentResult]:
    tasks = [
        async_client.chat.completions.create(
            model="gpt-4o-mini",  # use mini for high-volume classification
            response_model=SentimentResult,
            messages=[
                {"role": "system", "content": "Classify sentiment precisely."},
                {"role": "user", "content": text}
            ],
        )
        for text in texts
    ]
    return await asyncio.gather(*tasks)

# Process 100 reviews
reviews = ["Great product!", "Terrible experience.", "It was okay."] * 33 + ["Meh."]
results = asyncio.run(classify_batch(reviews))
```
Use gpt-4o-mini for high-volume classification: it is roughly 10-20x cheaper than gpt-4o and performs comparably here because the schema constrains the output to a small set of valid labels.
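One caveat with asyncio.gather: it fires every request at once, which can trip provider rate limits on large batches. A semaphore caps the number of in-flight calls. A generic sketch with a stand-in coroutine in place of the real API call (swap in the async_client call from classify_batch):

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

async def bounded_map(
    fn: Callable[[T], Awaitable[R]], items: List[T], limit: int = 10
) -> List[R]:
    # Cap in-flight calls so a large batch does not exceed rate limits.
    sem = asyncio.Semaphore(limit)

    async def run(item: T) -> R:
        async with sem:
            return await fn(item)

    return await asyncio.gather(*(run(i) for i in items))

# Stand-in coroutine; in the pipeline above you would await
# async_client.chat.completions.create(...) here instead.
async def fake_classify(text: str) -> str:
    await asyncio.sleep(0)
    return "positive" if "great" in text.lower() else "neutral"

results = asyncio.run(bounded_map(fake_classify, ["Great product!", "It was okay."], limit=5))
print(results)  # ['positive', 'neutral']
```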
FAQ
How do confidence scores from LLMs compare to traditional ML classifiers?
LLM confidence scores are self-reported and not calibrated the same way as logistic regression or softmax probabilities. They tend to be overconfident. Treat them as relative rankings rather than absolute probabilities. If you need calibrated scores, collect labeled data and fit a calibration curve on top of the LLM's raw scores.
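One calibration recipe that needs no ML library is histogram binning: bucket the raw confidences, then replace each raw score with the empirical accuracy observed in its bucket. A sketch on toy data (the scores and correctness labels below are made up for illustration; real calibration needs a much larger labeled set):

```python
from collections import defaultdict

def fit_bin_calibrator(scores, correct, n_bins=5):
    # Map each raw-confidence bin to the empirical accuracy in that bin.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s, c in zip(scores, correct):
        b = min(int(s * n_bins), n_bins - 1)
        sums[b] += c
        counts[b] += 1
    table = {b: sums[b] / counts[b] for b in counts}

    def calibrate(s):
        b = min(int(s * n_bins), n_bins - 1)
        return table.get(b, s)  # fall back to the raw score for empty bins

    return calibrate

# Toy labeled data: raw LLM confidences vs. whether the label was right.
raw = [0.95, 0.90, 0.92, 0.65, 0.55]
hit = [1, 1, 0, 1, 0]
cal = fit_bin_calibrator(raw, hit)
print(round(cal(0.93), 2))  # 0.67 -- the top bin was right only 2 times out of 3
```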
Should I fine-tune a model for classification tasks?
For fewer than 20 categories with clear boundaries, prompting with structured outputs works well. For 50+ categories, domain-specific labels, or very high throughput needs, fine-tuning a smaller model is more cost-effective. The structured output approach is ideal for rapid prototyping and medium-volume production use.
How do I handle classification edge cases where the text does not fit any category?
Add an "other" or "unknown" option to your enum/Literal type and instruct the model to use it when confidence in all specific categories is below a threshold. Check the confidence score in your application code and route uncertain classifications to human review.
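Putting that advice together, a minimal sketch (the Topic labels and the 0.6 threshold are illustrative assumptions to tune against your own data):

```python
from enum import Enum
from pydantic import BaseModel, Field

class Topic(str, Enum):
    technology = "technology"
    health = "health"
    other = "other"  # escape hatch for texts that fit no category

class TopicResult(BaseModel):
    topic: Topic
    confidence: float = Field(ge=0.0, le=1.0)

REVIEW_THRESHOLD = 0.6  # assumed cutoff; tune on labeled data

def needs_human_review(result: TopicResult) -> bool:
    # Route "other" and low-confidence calls to a human queue.
    return result.topic is Topic.other or result.confidence < REVIEW_THRESHOLD

print(needs_human_review(TopicResult(topic=Topic.other, confidence=0.9)))        # True
print(needs_human_review(TopicResult(topic=Topic.technology, confidence=0.95)))  # False
```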