AI Quiz Generator Agent: Creating Assessments from Any Content Source

Build an AI agent that analyzes text, lectures, or documents and automatically generates multiple-choice, short-answer, and true/false questions with calibrated difficulty levels.

The Problem with Manual Quiz Creation

Instructors spend hours crafting quiz questions that test the right concepts at the right difficulty level. A single well-written multiple-choice question requires identifying the key concept, writing a clear stem, creating one correct answer, and generating plausible distractors — wrong answers that would tempt a student with a specific misconception. Scaling this process across an entire course is time-consuming and error-prone.

An AI quiz generator agent automates this by analyzing source content, identifying testable concepts, and producing questions across multiple formats with calibrated difficulty. The agent does not just rephrase sentences as questions — it understands the underlying knowledge structure and generates assessments that probe genuine understanding.

Question Type Definitions

Start by defining a structured output format for different question types:

from pydantic import BaseModel, Field
from enum import Enum

class QuestionType(str, Enum):
    MULTIPLE_CHOICE = "multiple_choice"
    TRUE_FALSE = "true_false"
    SHORT_ANSWER = "short_answer"
    FILL_IN_BLANK = "fill_in_blank"

class Difficulty(str, Enum):
    RECALL = "recall"           # Remember facts
    UNDERSTANDING = "understanding"  # Explain concepts
    APPLICATION = "application"      # Apply to new situations
    ANALYSIS = "analysis"            # Break down and evaluate

class Distractor(BaseModel):
    text: str
    misconception: str = Field(
        description="The specific misconception this wrong answer targets"
    )

class QuizQuestion(BaseModel):
    question: str
    question_type: QuestionType
    difficulty: Difficulty
    correct_answer: str
    distractors: list[Distractor] = []
    explanation: str = Field(
        description="Why the correct answer is right"
    )
    source_concept: str = Field(
        description="The concept from the source material being tested"
    )
    bloom_level: str = Field(
        description="Bloom's taxonomy level: remember, understand, apply, "
                    "analyze, evaluate, create"
    )

class QuizOutput(BaseModel):
    title: str
    questions: list[QuizQuestion]
    coverage_summary: str
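
Before wiring up any agents, it helps to see what one question looks like as data. The example below is hand-written to match the schema above (it is illustrative, not model output), shown as a plain dict, with a small check that mirrors the required-field validation Pydantic performs:

```python
# A hypothetical question matching the QuizQuestion schema. The content is
# made up for illustration; an agent run would produce something similar.
sample_question = {
    "question": "Which property makes a good multiple-choice distractor?",
    "question_type": "multiple_choice",
    "difficulty": "understanding",
    "correct_answer": "It is plausible only under a specific misconception.",
    "distractors": [
        {
            "text": "It is always longer than the correct answer.",
            "misconception": "Confusing answer length with plausibility.",
        }
    ],
    "explanation": "Good distractors each target exactly one misconception.",
    "source_concept": "Distractor design",
    "bloom_level": "understand",
}

# The required fields from the schema (distractors may be empty, so it is
# excluded). This hand-rolled check approximates what Pydantic enforces.
REQUIRED_FIELDS = {
    "question", "question_type", "difficulty", "correct_answer",
    "explanation", "source_concept", "bloom_level",
}

def has_required_fields(data: dict) -> bool:
    return REQUIRED_FIELDS.issubset(data)

print(has_required_fields(sample_question))  # prints True
```

In practice you would call `QuizQuestion.model_validate(sample_question)` and let Pydantic raise on missing or mistyped fields; the dict form just makes the shape concrete.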

Content Analysis Pipeline

Before generating questions, the agent needs to extract key concepts from the source material. This two-stage approach produces much better questions than generating directly from raw text:

from agents import Agent, Runner
import json

concept_extractor = Agent(
    name="Concept Extractor",
    instructions="""Analyze the provided educational content and extract
a structured list of key concepts. For each concept, identify:

1. The concept name
2. A one-sentence definition
3. Prerequisites (other concepts it depends on)
4. Common misconceptions students have about it
5. The cognitive level required to understand it (remember/understand/
   apply/analyze)

Return a JSON array of concept objects. Focus on concepts that are
testable — skip transitional phrases and meta-commentary.""",
)

async def extract_concepts(content: str) -> list[dict]:
    result = await Runner.run(
        concept_extractor,
        f"Extract testable concepts from this content:\n\n{content}",
    )
    return json.loads(result.final_output)
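
One practical wrinkle: `json.loads(result.final_output)` raises if the model wraps its answer in markdown code fences, which happens regularly when instructions ask for "a JSON array". A small defensive parser (the helper name `parse_json_output` is my own) handles both cases:

```python
import json

def parse_json_output(raw: str):
    """Parse model output as JSON, tolerating markdown code fences.

    Models sometimes return ```json ... ``` around otherwise valid JSON;
    stripping the fence lines before json.loads avoids a spurious failure.
    """
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Drop the opening fence (``` or ```json) and, if present, the
        # closing fence on the last line.
        if lines[-1].strip().startswith("```"):
            lines = lines[1:-1]
        else:
            lines = lines[1:]
        text = "\n".join(lines)
    return json.loads(text)
```

Swapping this in for the bare `json.loads` call in `extract_concepts` makes the pipeline noticeably less brittle.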

Distractor Generation Strategy

The quality of a multiple-choice question lives or dies on its distractors. Good distractors are plausible to a student with a specific misunderstanding but clearly wrong to a student who understands the concept:

distractor_agent = Agent(
    name="Distractor Generator",
    instructions="""You generate plausible wrong answers for
multiple-choice questions. Each distractor must:

1. Be grammatically consistent with the question stem
2. Be approximately the same length as the correct answer
3. Target a SPECIFIC misconception (document which one)
4. Never be partially correct or debatable
5. Never use absolute words like 'always' or 'never' that
   test-wise students would eliminate

For each distractor, explain the misconception it targets so
instructors can review the pedagogical reasoning.""",
)

async def generate_distractors(
    question: str, correct_answer: str, concept: dict, count: int = 3
) -> list[dict]:
    prompt = f"""Question: {question}
Correct answer: {correct_answer}
Concept: {concept['name']} — {concept['definition']}
Common misconceptions: {concept.get('misconceptions', [])}

Generate {count} distractors as a JSON array with 'text' and
'misconception' fields."""

    result = await Runner.run(distractor_agent, prompt)
    return json.loads(result.final_output)
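
Two of the rules in those instructions (length parity and absolute words) are mechanically checkable, so you can lint generated distractors before an instructor ever sees them. A sketch, where the 50% length tolerance and the word list are arbitrary choices of mine, not part of the agent above:

```python
# Words that test-wise students use to eliminate options (rule 5).
ABSOLUTE_WORDS = {"always", "never", "all", "none", "only"}

def lint_distractor(distractor_text: str, correct_answer: str) -> list[str]:
    """Flag distractors that break the length and absolute-word rules.

    Mirrors rules 2 and 5 from the distractor agent's instructions; the
    0.5-1.5x length tolerance is an illustrative threshold, not a standard.
    """
    problems = []
    ratio = len(distractor_text) / max(len(correct_answer), 1)
    if not 0.5 <= ratio <= 1.5:
        problems.append("length differs noticeably from the correct answer")
    words = {w.strip(".,").lower() for w in distractor_text.split()}
    if words & ABSOLUTE_WORDS:
        problems.append("contains an absolute word test-wise students eliminate")
    return problems
```

Running this over the output of `generate_distractors` and regenerating any flagged option is a cheap quality gate.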

The Quiz Generator Agent

Now combine concept extraction, question generation, and distractor creation into a single orchestrating agent:

quiz_generator = Agent(
    name="Quiz Generator",
    instructions="""You are an expert assessment designer. Given a list
of extracted concepts, generate quiz questions that:

1. Cover all major concepts from the source material
2. Mix question types (multiple choice, true/false, short answer,
   fill-in-blank)
3. Distribute difficulty across Bloom's taxonomy levels
4. Include clear explanations for correct answers
5. For multiple-choice questions, generate 3 distractors that each
   target a specific student misconception

Difficulty calibration rules:
- 40% recall/understanding questions (foundational)
- 40% application questions (intermediate)
- 20% analysis questions (challenging)

Return the quiz in the specified JSON schema.""",
    output_type=QuizOutput,
)

async def generate_quiz(
    content: str, num_questions: int = 10
) -> QuizOutput:
    # Stage 1: Extract concepts
    concepts = await extract_concepts(content)

    # Stage 2: Generate calibrated quiz
    prompt = f"""Source concepts:
{json.dumps(concepts, indent=2)}

Generate a quiz with {num_questions} questions covering these concepts.
Ensure balanced difficulty distribution and question type variety."""

    result = await Runner.run(quiz_generator, prompt)
    return result.final_output_as(QuizOutput)
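
The 40/40/20 calibration split eventually has to become whole question counts, and naive rounding can leave the total off by one. A sketch using largest-remainder rounding (the helper name and the per-level split of the foundational 40% into recall and understanding are my own choices, matching the validation targets used below):

```python
def allocate_question_counts(num_questions: int) -> dict[str, int]:
    """Turn the 40/40/20 calibration targets into integer question counts.

    Uses largest-remainder rounding so the counts always sum exactly to
    num_questions, even when the percentages do not divide evenly.
    """
    targets = {"recall": 0.2, "understanding": 0.2,
               "application": 0.4, "analysis": 0.2}
    raw = {level: num_questions * share for level, share in targets.items()}
    counts = {level: int(value) for level, value in raw.items()}
    leftover = num_questions - sum(counts.values())
    # Hand leftover questions to the levels with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda lv: raw[lv] - counts[lv], reverse=True)
    for level in by_remainder[:leftover]:
        counts[level] += 1
    return counts
```

Feeding the resulting per-level counts into the generation prompt ("generate exactly 4 application questions, ...") gives the agent a firmer target than percentages alone.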

Difficulty Calibration

A common failure mode is generating questions that are all the same difficulty. The agent uses Bloom's taxonomy levels as a calibration framework and validates the distribution after generation:

def validate_difficulty_distribution(
    quiz: QuizOutput,
) -> dict[str, object]:
    counts: dict[str, int] = {}
    for q in quiz.questions:
        level = q.difficulty.value
        counts[level] = counts.get(level, 0) + 1

    total = len(quiz.questions)
    distribution = {k: v / total for k, v in counts.items()}

    # Check against target distribution
    targets = {"recall": 0.2, "understanding": 0.2,
               "application": 0.4, "analysis": 0.2}
    warnings = []
    for level, target in targets.items():
        actual = distribution.get(level, 0)
        if abs(actual - target) > 0.15:
            warnings.append(
                f"{level}: target {target:.0%}, actual {actual:.0%}"
            )

    return {"distribution": distribution, "warnings": warnings}
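
For quick checks in unit tests, the same logic can run on plain difficulty labels without constructing full Pydantic objects. The helper name is my own; the targets mirror the 40/40/20 calibration split:

```python
def difficulty_warnings(levels: list[str], tolerance: float = 0.15) -> list[str]:
    """Distribution check on a plain list of difficulty labels.

    Same targets and tolerance as validate_difficulty_distribution, but
    without needing QuizOutput objects -- convenient in tests.
    """
    targets = {"recall": 0.2, "understanding": 0.2,
               "application": 0.4, "analysis": 0.2}
    total = len(levels)
    warnings = []
    for level, target in targets.items():
        actual = levels.count(level) / total
        if abs(actual - target) > tolerance:
            warnings.append(
                f"{level}: target {target:.0%}, actual {actual:.0%}"
            )
    return warnings
```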

FAQ

How do you ensure questions test understanding rather than just rephrasing the text?

The two-stage pipeline is key. By first extracting abstract concepts and their relationships, the question generation stage works from conceptual understanding rather than surface-level text. The Bloom's taxonomy classification forces the agent to create questions at the application and analysis levels, which inherently require deeper understanding than simple recall.

Can the agent generate questions from non-text sources like videos or slides?

Yes, with a preprocessing step. For videos, pass a transcript through the concept extractor. For slides, concatenate the text content with slide context. The concept extraction stage normalizes all source formats into the same structured representation, so the question generator works identically regardless of input format.
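
A minimal sketch of the slide-preprocessing step, assuming you have already parsed the deck into dicts with `title` and `bullets` keys (that input format, and the function name, are my own conventions, not tied to any slide library):

```python
def slides_to_text(slides: list[dict]) -> str:
    """Flatten a parsed slide deck into plain text for the concept extractor.

    Keeps slide boundaries and titles as context so the extractor can tell
    which concepts belong together.
    """
    parts = []
    for i, slide in enumerate(slides, start=1):
        lines = [f"Slide {i}: {slide.get('title', '')}"]
        lines += [f"- {bullet}" for bullet in slide.get("bullets", [])]
        parts.append("\n".join(lines))
    return "\n\n".join(parts)
```

The returned string goes straight into `extract_concepts`, after which the rest of the pipeline is unchanged.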

How do you prevent duplicate or near-duplicate questions?

Add a deduplication pass after generation that computes semantic similarity between question stems using embeddings. Questions with cosine similarity above 0.85 should be flagged, and the agent can be prompted to regenerate replacements that test the same concept from a different angle.
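
The pairwise comparison in that deduplication pass can be sketched in a few lines. This assumes the stem embeddings were computed elsewhere (an embeddings API or a local sentence encoder); the function names are my own:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def flag_near_duplicates(
    embeddings: list[list[float]], threshold: float = 0.85
) -> list[tuple[int, int]]:
    """Return index pairs of question stems whose embeddings are too similar.

    O(n^2) pairwise comparison -- fine at quiz scale (tens of questions).
    """
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) > threshold:
                pairs.append((i, j))
    return pairs
```

Flagged pairs map back to question indices, so the regeneration prompt can name the concept being re-tested.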


#QuizGeneration #AssessmentAI #EducationTechnology #Python #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.