Agentic AI · 6 min read

Prompt Task Classification and Complexity Evaluation: NVIDIA's DeBERTa-Based Framework Explained

NVIDIA's prompt-task-and-complexity-classifier categorizes prompts across 11 task types and 6 complexity dimensions using DeBERTa. Learn how it works and when to use it.

What Is Prompt Task Classification?

Prompt task classification is the process of automatically categorizing user prompts by their intended task type and evaluating their complexity. This capability is essential for LLM routing, synthetic data curation, and understanding how users interact with AI systems.

NVIDIA released the prompt-task-and-complexity-classifier, a multi-headed DeBERTa-based model that classifies English text prompts across 11 task types and scores them on 6 complexity dimensions. The model is available on Hugging Face under NVIDIA's Open Model License and is ready for commercial use.

The 11 Task Types

The classifier identifies which of the following task categories a prompt belongs to:

1. Open QA

General knowledge questions where the answer is not constrained by a provided context. Example: "What causes ocean tides?"

2. Closed QA

Questions that must be answered based on specific provided text or data. Example: "Based on the passage above, what year was the company founded?"

3. Summarization

Prompts requesting condensation of information into shorter form. Example: "Summarize the key findings of this research paper."

4. Text Generation

Creative or structured writing tasks. Example: "Write a product description for a wireless keyboard."

5. Code Generation

Requests to produce code in any programming language. Example: "Write a Python function that validates email addresses."

6. Chatbot

Conversational interactions requiring dialogue management. Example: "You are a helpful travel assistant. Help me plan a trip to Japan."

7. Classification

Prompts asking the model to categorize content. Example: "Is this customer review positive, negative, or neutral?"

8. Rewrite

Requests to rephrase or restructure existing text. Example: "Rewrite this paragraph in simpler language."

9. Brainstorming

Prompts requesting idea generation. Example: "Give me 10 marketing campaign ideas for a fitness app."

10. Extraction

Pulling specific information from text. Example: "Extract all dates and monetary amounts from this contract."

11. Other

Uncategorized prompts that do not fit the above categories.

The 6 Complexity Dimensions

Beyond task type, the classifier evaluates prompt complexity across six dimensions, each scored between 0 and 1:

Creativity Score

Measures the level of creative thinking required. A factual lookup scores near 0; writing a mystery novel with constraints scores near 0.9.

Reasoning Score

Evaluates the logical and cognitive effort required. Simple recall tasks score low; multi-step math problems or logical deduction tasks score high.

Contextual Knowledge

Assesses how much background information is needed beyond what the prompt provides. Self-contained prompts score low; prompts requiring world knowledge score higher.

Domain Knowledge

Measures the level of specialized expertise required. General prompts score low; medical diagnosis or legal analysis prompts score high.

Constraints

Quantifies the number of conditions or requirements in the prompt. "Write a story" has few constraints; "Write a 500-word story in first person, set in Victorian London, with a twist ending" has many.

Number of Few Shots

Counts the number of examples provided in the prompt. Zero-shot prompts score 0; prompts with multiple examples score proportionally higher.

Overall Complexity Score

The model computes a weighted overall complexity score using this formula:

Score = 0.35 × Creativity + 0.25 × Reasoning + 0.15 × Constraints + 0.15 × Domain Knowledge + 0.05 × Contextual Knowledge + 0.05 × Few Shots

The weighting prioritizes creativity and reasoning as the strongest indicators of prompt difficulty, followed by constraints and domain expertise.
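To make the weighting concrete, the formula above can be computed in a few lines of Python. This is an illustrative sketch, not part of the model's API; the function name and keyword arguments are invented for clarity, and each input is the 0-to-1 score from the corresponding classification head:

```python
def overall_complexity(creativity, reasoning, constraints,
                       domain_knowledge, contextual_knowledge, few_shots):
    """Weighted overall complexity score; each input is in [0, 1]."""
    return (0.35 * creativity
            + 0.25 * reasoning
            + 0.15 * constraints
            + 0.15 * domain_knowledge
            + 0.05 * contextual_knowledge
            + 0.05 * few_shots)

# A moderately creative prompt with some constraints lands mid-scale:
score = overall_complexity(creativity=0.8, reasoning=0.6, constraints=0.4,
                           domain_knowledge=0.2, contextual_knowledge=0.5,
                           few_shots=0.0)
print(round(score, 3))  # 0.545
```

Note that the six weights sum to 1.0, so the overall score stays on the same 0-to-1 scale as the individual dimensions.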

Model Architecture

The classifier uses DeBERTa-v3-base as its backbone with multiple classification heads, one dedicated to each task type and complexity dimension. The architecture applies mean pooling over token embeddings before passing representations to each head.

Key specifications:

  • Token Limit: 512 tokens (prompts longer than this are truncated)
  • Output: Simultaneous predictions across all heads in a single forward pass
  • Inference Hardware: NVIDIA GPU with compute capability 7.0+ (Volta or higher)
  • Framework: PyTorch with Hugging Face Transformers
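The mean-pooling step described above can be sketched in plain Python. This is illustrative only: the real model pools DeBERTa's hidden states using the tokenizer's attention mask, whereas here the token vectors and mask are toy inputs:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of non-padding tokens.

    token_embeddings: list of per-token vectors (lists of floats)
    attention_mask:   list of 1s (real tokens) and 0s (padding)
    """
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Three token vectors, the last one masked out as padding:
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1, 1, 0])
print(pooled)  # [2.0, 3.0]
```

The pooled vector is a single fixed-size representation of the whole prompt, which is what lets every classification head make its prediction from the same forward pass.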

Training Data and Performance

The model was trained on 4,024 human-annotated English prompts distributed across all 11 task types. Open QA prompts (1,214 samples) are the most represented category, while Extraction prompts (60 samples) are the least.

Cross-validation results demonstrate strong performance:

  • Task Type Accuracy: 98.1%
  • Creativity Accuracy: 99.6%
  • Reasoning Accuracy: 99.7%
  • Contextual Knowledge Accuracy: 98.1%
  • Domain Knowledge Accuracy: 93.7%
  • Constraints Accuracy: 99.1%

Practical Applications

LLM Routing

Use the classifier to route prompts to the most appropriate model. Simple factual queries go to smaller, faster models. Complex creative or reasoning tasks go to larger, more capable models. This reduces inference costs while maintaining output quality.
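A routing policy built on the classifier's outputs might look like the following sketch. The model names, the 0.5 threshold, and the special-casing of code prompts are illustrative design choices, not part of NVIDIA's framework:

```python
def route(task_type, complexity_score, threshold=0.5):
    """Pick a model tier from the classifier's predictions."""
    # Creativity- and reasoning-heavy prompts go to the large model.
    if complexity_score >= threshold:
        return "large-frontier-model"
    # Code generation often benefits from a specialized model.
    if task_type == "Code Generation":
        return "code-specialized-model"
    return "small-fast-model"

print(route("Open QA", 0.12))          # small-fast-model
print(route("Text Generation", 0.78))  # large-frontier-model
print(route("Code Generation", 0.30))  # code-specialized-model
```

In practice the threshold would be tuned against quality and latency metrics for the specific model pool, and individual dimension scores (such as reasoning alone) can drive finer-grained policies.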

Synthetic Data Curation

When generating synthetic training data, the classifier ensures balanced representation across task types and complexity levels. Without this balance, models trained on synthetic data may excel at simple tasks but fail on complex ones.
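One simple way to enforce that balance is to cap each predicted task type at a fixed quota while streaming through the synthetic dataset. This is a sketch; the quota value and the (prompt, task) tuple format are assumptions, and the task labels would come from the classifier:

```python
from collections import Counter

def balance_by_task(labeled_prompts, per_task_cap):
    """Keep at most `per_task_cap` prompts per predicted task type."""
    counts = Counter()
    kept = []
    for prompt, task in labeled_prompts:
        if counts[task] < per_task_cap:
            counts[task] += 1
            kept.append((prompt, task))
    return kept

data = [("p1", "Open QA"), ("p2", "Open QA"),
        ("p3", "Open QA"), ("p4", "Code Generation")]
balanced = balance_by_task(data, per_task_cap=2)
print([p for p, _ in balanced])  # ['p1', 'p2', 'p4']
```

The same capping idea extends to complexity: bucket prompts by overall complexity score and apply a quota per bucket so the curated set spans easy and hard examples.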

Prompt Quality Analysis

Evaluate prompt datasets to understand their composition. If 80% of your prompts are Open QA and only 2% are Code Generation, your model may underperform on coding tasks.
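Auditing composition reduces to counting predicted labels across the dataset. A minimal sketch, assuming the label strings come from the classifier:

```python
from collections import Counter

# Predicted task types for a (toy) prompt dataset:
labels = ["Open QA", "Open QA", "Code Generation", "Open QA", "Summarization"]

dist = Counter(labels)
total = len(labels)
for task, n in dist.most_common():
    print(f"{task}: {n / total:.0%}")
```

Running the same tally on the six complexity scores (e.g., histogramming each dimension) reveals whether a dataset skews toward trivially simple prompts.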

User Behavior Analytics

Track how users interact with your AI system. Understanding the distribution of task types and complexity levels helps prioritize model improvements and identify capability gaps.

Integration with NeMo Curator

The classifier integrates directly with NVIDIA NeMo Curator for large-scale, GPU-accelerated prompt classification. NeMo Curator handles distributed processing, enabling classification of millions of prompts across multiple GPUs. A tutorial notebook is available in the NeMo Curator GitHub repository.

Frequently Asked Questions

What is prompt task classification?

Prompt task classification is the automated process of categorizing user prompts by their intended task type (such as question answering, code generation, or summarization) and evaluating their complexity across multiple dimensions. NVIDIA's DeBERTa-based classifier handles both classification and complexity scoring in a single forward pass, making it efficient for large-scale analysis.

How accurate is NVIDIA's prompt complexity classifier?

The model achieves 98.1% accuracy on task type classification and 93.7-99.7% accuracy across the six complexity dimensions, based on 10-fold cross-validation on 4,024 human-annotated samples. Task type and creativity classification are the strongest, while domain knowledge classification has slightly lower accuracy.

Can the prompt classifier be used for LLM routing?

Yes. The classifier's task type and complexity predictions can drive routing decisions, sending simple prompts to smaller models and complex prompts to larger ones. Because simple prompts do not need the full capabilities of frontier models, this approach can cut inference costs substantially while maintaining output quality.

What hardware is required to run the prompt classifier?

The model requires an NVIDIA GPU with compute capability 7.0 or higher (Volta architecture or newer), CUDA 12.0+, and Python 3.10. It runs on PyTorch and uses the Hugging Face Transformers library. For production deployment, an A10G or similar GPU is recommended.

How does prompt complexity scoring work?

The model evaluates six dimensions — creativity, reasoning, contextual knowledge, domain knowledge, constraints, and few-shot examples — each scored 0 to 1. An overall complexity score is computed as a weighted average, with creativity (0.35) and reasoning (0.25) carrying the most weight. This multi-dimensional approach captures nuances that a single complexity score would miss.
