Classification with Structured Outputs: Sentiment, Intent, and Category Detection
Implement text classification systems using structured outputs. Build sentiment analysis, intent detection, and hierarchical category classification with enum constraints, confidence scores, and multi-label support.
Classification as Structured Extraction
Text classification is a special case of structured output: instead of extracting free-form entities, you are constraining the model to pick from a fixed set of labels. Structured outputs turn this into a reliable, repeatable process by using enums to restrict possible values and numeric fields for confidence scores.
This approach gives you three things that prompt-only classification cannot: guaranteed valid labels, calibrated confidence scores, and multi-label support — all in a single API call.
Simple Sentiment Analysis
Start with the most basic classification task:
```python
from pydantic import BaseModel, Field
from typing import Literal

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(description="Brief explanation for the classification")

def classify_sentiment(text: str) -> SentimentResult:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=SentimentResult,
        messages=[
            {
                "role": "system",
                "content": "Classify the sentiment of the given text. Be precise with confidence scores."
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_sentiment("The product works great but shipping took forever.")
print(result)
# SentimentResult(sentiment='neutral', confidence=0.65, reasoning='Mixed sentiment...')
```
The Literal type ensures the model can only return one of the three valid labels. No post-processing needed to handle misspellings or creative label names.
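Because the schema is enforced by Pydantic on your side, you can see the guarantee without any API call. A minimal sketch that redeclares the same model locally: a misspelled label fails validation, and a `ValidationError` like this is exactly what instructor feeds back to the model to trigger a retry.

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

# A well-formed payload parses cleanly.
ok = SentimentResult(sentiment="positive", confidence=0.9, reasoning="Clear praise.")

# A misspelled label ("postive") fails validation before your code ever sees it.
try:
    SentimentResult(sentiment="postive", confidence=0.9, reasoning="typo")
    rejected = False
except ValidationError:
    rejected = True

print("invalid label rejected:", rejected)  # invalid label rejected: True
```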
Intent Detection for Chatbots
Customer support systems need to route messages by intent. Define a comprehensive intent taxonomy:
```python
from enum import Enum
from typing import List

class CustomerIntent(str, Enum):
    billing_inquiry = "billing_inquiry"
    technical_support = "technical_support"
    account_management = "account_management"
    product_information = "product_information"
    complaint = "complaint"
    cancellation = "cancellation"
    feedback = "feedback"
    general_question = "general_question"

class IntentClassification(BaseModel):
    primary_intent: CustomerIntent
    secondary_intent: CustomerIntent | None = None
    confidence: float = Field(ge=0.0, le=1.0)
    urgency: Literal["low", "medium", "high", "critical"]
    suggested_department: str

def classify_intent(message: str) -> IntentClassification:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=IntentClassification,
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the customer message intent. "
                    "Assign urgency based on customer frustration level "
                    "and business impact. Suggest the right department to handle it."
                )
            },
            {"role": "user", "content": message}
        ],
    )

result = classify_intent("I've been charged twice for my subscription and nobody is responding!")
print(result.primary_intent)        # CustomerIntent.billing_inquiry
print(result.urgency)               # "high"
print(result.suggested_department)  # "Billing Support"
```
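The classification result maps naturally onto a routing table in application code. A hypothetical sketch: the department names and the escalation rule below are illustrative assumptions, not part of instructor or the model's output contract.

```python
from enum import Enum

class CustomerIntent(str, Enum):
    billing_inquiry = "billing_inquiry"
    technical_support = "technical_support"
    complaint = "complaint"
    general_question = "general_question"

# Hypothetical intent-to-queue map; adapt to your own org chart.
DEPARTMENTS = {
    CustomerIntent.billing_inquiry: "Billing Support",
    CustomerIntent.technical_support: "Tech Support",
    CustomerIntent.complaint: "Customer Relations",
}

def route(intent: CustomerIntent, urgency: str) -> str:
    queue = DEPARTMENTS.get(intent, "General Support")
    # Critical tickets skip the normal queue regardless of intent.
    return f"escalations/{queue}" if urgency == "critical" else queue

print(route(CustomerIntent.billing_inquiry, "high"))  # Billing Support
print(route(CustomerIntent.complaint, "critical"))    # escalations/Customer Relations
```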
Multi-Label Classification
Some texts belong to multiple categories. Use a list of labels with confidence per label:
```python
class CategoryScore(BaseModel):
    category: str
    confidence: float = Field(ge=0.0, le=1.0)

class MultiLabelResult(BaseModel):
    labels: List[CategoryScore]
    primary_category: str

    @property
    def above_threshold(self) -> List[CategoryScore]:
        return [label for label in self.labels if label.confidence >= 0.5]

CATEGORIES = [
    "technology", "business", "health", "sports",
    "politics", "entertainment", "science", "education"
]

def classify_article(text: str) -> MultiLabelResult:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=MultiLabelResult,
        messages=[
            {
                "role": "system",
                "content": (
                    f"Classify the article into these categories: {CATEGORIES}. "
                    "An article can belong to multiple categories. "
                    "Assign a confidence score to each relevant category. "
                    "Only include categories with confidence > 0.2."
                )
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_article("AI startup raises $50M to revolutionize medical imaging")
for label in result.above_threshold:
    print(f"{label.category}: {label.confidence:.2f}")
# technology: 0.90
# health: 0.85
# business: 0.75
```
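The 0.5 cutoff in above_threshold is a starting point, not a law. If you have even a small labeled set, you can pick the cutoff that maximizes F1. A sketch on a single toy document (the scores and gold labels are made up for illustration; in practice, sweep the threshold over a whole dev set):

```python
def f1_at_threshold(predictions, gold, threshold):
    """F1 for one document: predictions are (category, confidence) pairs,
    gold is the set of true categories."""
    predicted = {c for c, s in predictions if s >= threshold}
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: the model over-predicts "business" at low cutoffs.
preds = [("technology", 0.90), ("health", 0.85), ("business", 0.45)]
gold = {"technology", "health"}

best = max((0.3, 0.5, 0.7), key=lambda t: f1_at_threshold(preds, gold, t))
print(best)  # 0.5
```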
Hierarchical Classification
Some taxonomies have parent-child relationships. Model them explicitly:
```python
import json

class HierarchicalCategory(BaseModel):
    level_1: str = Field(description="Top-level category")
    level_2: str = Field(description="Sub-category")
    level_3: str | None = Field(default=None, description="Specific topic")
    confidence: float = Field(ge=0.0, le=1.0)

TAXONOMY = {
    "Technology": {
        "Software": ["Web Development", "Mobile Apps", "DevOps", "AI/ML"],
        "Hardware": ["Processors", "Storage", "Networking"],
    },
    "Business": {
        "Finance": ["Investing", "Banking", "Cryptocurrency"],
        "Management": ["Leadership", "Strategy", "Operations"],
    },
}

def classify_hierarchical(text: str) -> HierarchicalCategory:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=HierarchicalCategory,
        messages=[
            {
                "role": "system",
                "content": (
                    f"Classify the text using this taxonomy:\n"
                    f"{json.dumps(TAXONOMY, indent=2)}\n"
                    "Each level must be a valid entry from the taxonomy."
                )
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_hierarchical("How to deploy FastAPI with Kubernetes")
print(f"{result.level_1} > {result.level_2} > {result.level_3}")
# Technology > Software > DevOps
```
Batch Classification for Efficiency
When classifying hundreds of items, batch them to reduce overhead:
```python
import asyncio
from openai import AsyncOpenAI

async_client = instructor.from_openai(AsyncOpenAI())

async def classify_batch(texts: List[str]) -> List[SentimentResult]:
    tasks = [
        async_client.chat.completions.create(
            model="gpt-4o-mini",  # use mini for high-volume classification
            response_model=SentimentResult,
            messages=[
                {"role": "system", "content": "Classify sentiment precisely."},
                {"role": "user", "content": text}
            ],
        )
        for text in texts
    ]
    return await asyncio.gather(*tasks)

# Process 100 reviews
reviews = ["Great product!", "Terrible experience.", "It was okay."] * 33 + ["Meh."]
results = asyncio.run(classify_batch(reviews))
```
Use gpt-4o-mini for high-volume classification: it is roughly 10-20x cheaper than gpt-4o and performs comparably here because the schema constrains the output to a small set of valid labels.
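One caveat with asyncio.gather: it fires every request at once, which can trip provider rate limits on large batches. A semaphore caps the number of in-flight calls. A generic sketch with a stand-in coroutine in place of the real API call (swap in the async_client call from classify_batch):

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

async def bounded_map(
    fn: Callable[[T], Awaitable[R]], items: List[T], limit: int = 10
) -> List[R]:
    # Cap in-flight calls so a large batch does not exceed rate limits.
    sem = asyncio.Semaphore(limit)

    async def run(item: T) -> R:
        async with sem:
            return await fn(item)

    return await asyncio.gather(*(run(i) for i in items))

# Stand-in coroutine; in the pipeline above you would await
# async_client.chat.completions.create(...) here instead.
async def fake_classify(text: str) -> str:
    await asyncio.sleep(0)
    return "positive" if "great" in text.lower() else "neutral"

results = asyncio.run(bounded_map(fake_classify, ["Great product!", "It was okay."], limit=5))
print(results)  # ['positive', 'neutral']
```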
FAQ
How do confidence scores from LLMs compare to traditional ML classifiers?
LLM confidence scores are self-reported and not calibrated the same way as logistic regression or softmax probabilities. They tend to be overconfident. Treat them as relative rankings rather than absolute probabilities. If you need calibrated scores, collect labeled data and fit a calibration curve on top of the LLM's raw scores.
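One calibration recipe that needs no ML library is histogram binning: bucket the raw confidences, then replace each raw score with the empirical accuracy observed in its bucket. A sketch on toy data (the scores and correctness labels below are made up for illustration; real calibration needs a much larger labeled set):

```python
from collections import defaultdict

def fit_bin_calibrator(scores, correct, n_bins=5):
    # Map each raw-confidence bin to the empirical accuracy in that bin.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for s, c in zip(scores, correct):
        b = min(int(s * n_bins), n_bins - 1)
        sums[b] += c
        counts[b] += 1
    table = {b: sums[b] / counts[b] for b in counts}

    def calibrate(s):
        b = min(int(s * n_bins), n_bins - 1)
        return table.get(b, s)  # fall back to the raw score for empty bins

    return calibrate

# Toy labeled data: raw LLM confidences vs. whether the label was right.
raw = [0.95, 0.90, 0.92, 0.65, 0.55]
hit = [1, 1, 0, 1, 0]
cal = fit_bin_calibrator(raw, hit)
print(round(cal(0.93), 2))  # 0.67 -- the top bin was right only 2 times out of 3
```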
Should I fine-tune a model for classification tasks?
For fewer than 20 categories with clear boundaries, prompting with structured outputs works well. For 50+ categories, domain-specific labels, or very high throughput needs, fine-tuning a smaller model is more cost-effective. The structured output approach is ideal for rapid prototyping and medium-volume production use.
How do I handle classification edge cases where the text does not fit any category?
Add an "other" or "unknown" option to your enum/Literal type and instruct the model to use it when confidence in all specific categories is below a threshold. Check the confidence score in your application code and route uncertain classifications to human review.
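Putting that advice together, a minimal sketch (the Topic labels and the 0.6 threshold are illustrative assumptions to tune against your own data):

```python
from enum import Enum
from pydantic import BaseModel, Field

class Topic(str, Enum):
    technology = "technology"
    health = "health"
    other = "other"  # escape hatch for texts that fit no category

class TopicResult(BaseModel):
    topic: Topic
    confidence: float = Field(ge=0.0, le=1.0)

REVIEW_THRESHOLD = 0.6  # assumed cutoff; tune on labeled data

def needs_human_review(result: TopicResult) -> bool:
    # Route "other" and low-confidence calls to a human queue.
    return result.topic is Topic.other or result.confidence < REVIEW_THRESHOLD

print(needs_human_review(TopicResult(topic=Topic.other, confidence=0.9)))        # True
print(needs_human_review(TopicResult(topic=Topic.technology, confidence=0.95)))  # False
```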