
Chain-of-Thought Prompting: Making LLMs Reason Step by Step

Learn how chain-of-thought (CoT) prompting dramatically improves LLM accuracy on reasoning tasks. Covers manual CoT, auto-CoT, tree-of-thought, and when step-by-step reasoning helps most.

Why LLMs Need to Show Their Work

Large language models are powerful pattern matchers, but they struggle with tasks that require multi-step reasoning — math problems, logical deductions, and complex analysis. Chain-of-thought (CoT) prompting addresses this by instructing the model to break its reasoning into explicit intermediate steps before arriving at a final answer.

The technique was formalized by Wei et al. in 2022 and has since become standard practice. On the MultiArith arithmetic benchmark, adding the zero-shot trigger "Let's think step by step" improved GPT-3's accuracy from 17.7% to 78.7% (Kojima et al., 2022), without changing the model at all.

Manual Chain-of-Thought

The simplest form of CoT is adding "Let's think step by step" to your prompt, or providing an example that demonstrates the reasoning process:

from openai import OpenAI

client = OpenAI()

# Without CoT — the model often gets this wrong
naive_prompt = "If a store has 45 apples, sells 60% of them, then receives a shipment of 30 more, how many apples does it have?"

# With CoT — explicit reasoning steps
cot_prompt = """If a store has 45 apples, sells 60% of them, then receives a shipment of 30 more, how many apples does it have?

Let's solve this step by step:
1. Start with the initial count
2. Calculate how many were sold
3. Subtract the sold amount
4. Add the new shipment
5. State the final answer"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": cot_prompt}]
)

print(response.choices[0].message.content)

The model will now walk through each step: 45 * 0.6 = 27 sold, 45 - 27 = 18 remaining, 18 + 30 = 48 total. Without CoT, models frequently jump to incorrect answers on multi-step problems.
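The chain of arithmetic the model is expected to reproduce can be verified in plain Python:

```python
# Verify the expected reasoning chain for the apple problem.
start = 45
sold = int(start * 0.6)    # 60% of 45 = 27 sold
remaining = start - sold   # 45 - 27 = 18 remaining
total = remaining + 30     # 18 + 30 = 48 after the shipment
print(total)  # 48
```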

Few-Shot CoT with Examples

For domain-specific reasoning, provide examples that demonstrate the full chain of thought:

def build_cot_diagnosis_prompt(symptoms: str) -> list[dict]:
    return [
        {
            "role": "system",
            "content": (
                "You are a network diagnostics assistant. "
                "For each problem, reason through possible causes "
                "step by step before giving your diagnosis."
            )
        },
        {
            "role": "user",
            "content": "Users report the website loads slowly but only during business hours."
        },
        {
            "role": "assistant",
            "content": (
                "Step 1: The issue is time-dependent, occurring during business hours. "
                "This suggests a load-related problem rather than a configuration issue.\n"
                "Step 2: During business hours, traffic volume is highest. "
                "This could overwhelm server capacity or saturate bandwidth.\n"
                "Step 3: If the server CPU/memory is fine, the bottleneck may be "
                "at the database level — slow queries under concurrent load.\n"
                "Step 4: Another possibility is connection pool exhaustion if the "
                "application has a fixed pool size.\n\n"
                "Diagnosis: Most likely cause is database connection pool exhaustion "
                "or slow queries under concurrent load. Check connection pool metrics "
                "and slow query logs during peak hours."
            )
        },
        {
            "role": "user",
            "content": symptoms
        }
    ]

The example teaches the model the format (numbered steps followed by a diagnosis) and the depth of reasoning expected.

Auto-CoT: Zero-Shot Chain-of-Thought

You do not always need hand-crafted examples. Simply appending a trigger phrase activates reasoning in capable models:


def auto_cot_prompt(question: str) -> str:
    return f"""{question}

Think through this step by step. After your reasoning, provide your final answer on a new line starting with 'ANSWER:'"""


def extract_answer(response_text: str) -> str:
    """Pull the final answer from a CoT response."""
    for line in response_text.split("\n"):
        if line.strip().startswith("ANSWER:"):
            return line.strip().replace("ANSWER:", "").strip()
    return response_text.strip()

This pattern is called "zero-shot CoT" because it requires no examples. It works well with GPT-4-class models but less reliably with smaller models.
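Because extract_answer is pure string handling, it can be exercised without any API call. A self-contained check (duplicating the helper from above so the snippet runs on its own):

```python
def extract_answer(response_text: str) -> str:
    """Pull the final answer from a CoT response."""
    for line in response_text.split("\n"):
        if line.strip().startswith("ANSWER:"):
            return line.strip().replace("ANSWER:", "").strip()
    return response_text.strip()

# A response shaped the way auto_cot_prompt asks for.
sample = (
    "Step 1: 60% of 45 is 27, so 27 apples were sold.\n"
    "Step 2: 45 - 27 = 18 remain.\n"
    "Step 3: 18 + 30 = 48 after the shipment.\n"
    "ANSWER: 48"
)
print(extract_answer(sample))  # 48
```

Note the fallback: if the model ignores the marker, the whole response comes back, so downstream code should be prepared for either shape.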

Tree-of-Thought: Exploring Multiple Paths

Tree-of-thought (ToT) extends CoT by having the model explore multiple reasoning paths and evaluate which one is most promising. The full technique (Yao et al., 2023) searches over intermediate thoughts with backtracking; a lightweight single-prompt approximation asks the model to generate and rate alternatives itself:

tot_prompt = """Problem: Design a database schema for a multi-tenant SaaS application that needs per-tenant data isolation.

Generate 3 different approaches. For each approach:
1. Describe the schema design
2. List the pros
3. List the cons
4. Rate its suitability (1-10) for our requirements: 1000+ tenants, strict data isolation, moderate query complexity

After evaluating all approaches, recommend the best one with justification."""

ToT is particularly effective for design decisions, planning tasks, and problems where the optimal solution is not immediately obvious.
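The same generate-then-evaluate loop can also be driven programmatically. A minimal sketch, with a stubbed generate function standing in for the model calls (in practice each candidate and rating would come from the chat API):

```python
from typing import Callable

def tree_of_thought(
    problem: str,
    generate: Callable[[str], list[tuple[str, int]]],
    min_score: int = 7,
) -> str:
    """Generate candidate approaches, keep those above a score
    threshold, and return the highest-rated one."""
    candidates = generate(problem)  # [(approach, 1-10 rating), ...]
    viable = [c for c in candidates if c[1] >= min_score]
    if not viable:
        viable = candidates  # fall back to the best available
    return max(viable, key=lambda c: c[1])[0]

# Stubbed generator for illustration; a real one would prompt the
# model for three schema designs and ask it to self-rate each.
def fake_generate(problem: str) -> list[tuple[str, int]]:
    return [
        ("shared schema with tenant_id column", 6),
        ("schema per tenant", 8),
        ("database per tenant", 7),
    ]

print(tree_of_thought("multi-tenant schema design", fake_generate))
# schema per tenant
```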

When CoT Helps (and When It Doesn't)

CoT helps most with:

  • Multi-step arithmetic and logic problems
  • Tasks requiring comparison or ranking
  • Code debugging (reasoning through execution flow)
  • Complex classification with multiple factors

CoT adds little value for:

  • Simple retrieval tasks ("What is the capital of France?")
  • Creative writing and generation
  • Single-step transformations (translation, formatting)
  • Tasks where speed and token efficiency matter more than accuracy

# CoT is overkill here — simple classification
simple_task = "Is this email spam? Subject: 'Meeting at 3pm tomorrow'"

# CoT genuinely helps here — multi-factor decision
complex_task = """Given these server metrics, determine if we should scale up:
- CPU: 78% avg, 95% peak
- Memory: 62% avg, 71% peak
- Request latency: p50=120ms, p99=2.3s
- Error rate: 0.8%
- Current instance count: 3
- Time: Tuesday 2pm (peak hours start at 3pm)"""
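One way to operationalize this split is a small router that only adds the CoT trigger when a task looks multi-step. The keyword heuristic below is purely illustrative; in practice you would tune it against your own task mix:

```python
def needs_cot(task: str) -> bool:
    """Crude heuristic: multi-line inputs or reasoning keywords
    suggest a multi-step problem worth the extra tokens."""
    reasoning_markers = ("determine", "compare", "calculate", "debug", "why")
    return "\n" in task or any(m in task.lower() for m in reasoning_markers)

def build_prompt(task: str) -> str:
    # Only pay the CoT token cost when the heuristic says it may help.
    if needs_cot(task):
        return f"{task}\n\nThink through this step by step."
    return task

print(needs_cot("Is this email spam? Subject: 'Meeting at 3pm tomorrow'"))  # False
print(needs_cot("Given these metrics, determine if we should scale up:\n- CPU: 78%"))  # True
```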

FAQ

Does chain-of-thought prompting increase token usage?

Yes, substantially. CoT responses are typically 3-10x longer than direct answers because the model generates all intermediate reasoning steps. For high-volume applications, weigh the accuracy improvement against the increased cost and latency.
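The cost impact is easy to estimate. A quick back-of-the-envelope sketch, assuming a hypothetical output-token price (substitute your provider's current rate):

```python
PRICE_PER_1K_OUTPUT_TOKENS = 0.01  # USD, hypothetical rate for illustration

def output_cost(tokens: int) -> float:
    """Cost of a response's output tokens at the assumed rate."""
    return tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS

direct_tokens = 50
cot_tokens = direct_tokens * 6  # mid-range of the 3-10x expansion above

print(f"direct: ${output_cost(direct_tokens):.4f}, "
      f"CoT: ${output_cost(cot_tokens):.4f}")
```

At scale, that multiplier applies to every request, which is why routing (CoT only where it helps) matters for high-volume workloads.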

Can I hide the reasoning and show only the final answer?

Yes. Use the extraction pattern shown above: instruct the model to put its final answer after a marker like "ANSWER:" or "FINAL:" and parse it from the response. Some reasoning-focused APIs also hide intermediate reasoning from the visible output entirely, though those hidden tokens are typically still billed.

Should I use CoT with every prompt?

No. Reserve CoT for tasks where the model makes errors without it. Adding "think step by step" to simple tasks wastes tokens without improving quality. Test with and without CoT on a representative sample to measure the actual impact.
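That with/without comparison can be scripted. A minimal harness sketch, with a stubbed ask function standing in for real API calls (a real run would call the model and use a larger sample):

```python
from typing import Callable

def cot_lift(
    samples: list[tuple[str, str]],   # (question, expected answer) pairs
    ask: Callable[[str], str],        # model call; stubbed below
) -> tuple[float, float]:
    """Return (accuracy without CoT, accuracy with CoT) on the samples."""
    def accuracy(make_prompt):
        hits = sum(ask(make_prompt(q)).strip() == a for q, a in samples)
        return hits / len(samples)

    plain = accuracy(lambda q: q)
    cot = accuracy(lambda q: f"{q}\n\nThink step by step. End with just the answer.")
    return plain, cot

# Stub that "improves" when the CoT trigger is present, for illustration only.
def fake_ask(prompt: str) -> str:
    return "48" if "step by step" in prompt else "35"

print(cot_lift([("45 apples, sell 60%, receive 30 more. Total?", "48")], fake_ask))
# (0.0, 1.0)
```

If the two accuracies are close on your sample, skip CoT for that task and keep the tokens.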

