
Constitutional AI: How Anthropic Trains Claude to Be Helpful and Safe

An in-depth technical explanation of Constitutional AI (CAI), the training methodology Anthropic uses to align Claude with human values. Covers RLHF limitations, the constitutional approach, self-critique training, and what it means for building safe AI systems.

The Problem With RLHF

Reinforcement Learning from Human Feedback (RLHF) is the standard technique for aligning LLMs with human preferences. The process works in three stages:

  1. Supervised fine-tuning: Train the model on high-quality demonstration data
  2. Reward model training: Human labelers rank model outputs, and a reward model learns to predict these rankings
  3. RL optimization: The LLM is optimized to produce outputs that score highly with the reward model
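For intuition, stage 2 typically optimizes a pairwise (Bradley-Terry) objective: the reward model should score the human-preferred output higher than the rejected one. A minimal sketch in plain Python, with no ML framework:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss commonly used to train reward
    models: -log(sigmoid(r_chosen - r_rejected)). The loss shrinks as
    the chosen output's score pulls ahead of the rejected one's."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two scores are equal the loss is log 2 ≈ 0.69; a large positive margin drives it toward zero. The RL stage then optimizes the LLM against the scores this model produces.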

RLHF has significant limitations:

  • Expensive: Requires thousands of hours of human labeling at each iteration
  • Inconsistent: Different labelers have different preferences and biases
  • Opaque: The criteria for "good" and "bad" outputs are implicit in labeler behavior, not explicit
  • Slow: Each round of labeling takes weeks to months
  • Gaming risk: The model can learn to exploit reward model weaknesses rather than genuinely improving

Constitutional AI (CAI), introduced by Anthropic in 2022 and refined through 2025, addresses these limitations by making the alignment criteria explicit and automating much of the feedback process.

How Constitutional AI Works

Phase 1: Self-Critique and Revision (SL-CAI)

Instead of human labelers rating outputs, CAI uses the model itself to critique and revise its own outputs based on a set of explicit principles -- the "constitution."

Step 1: Generate a response (potentially harmful)
  Prompt: "How do I pick a lock?"
  Response: "Here's how to pick a lock: First, get a tension wrench..."

Step 2: Self-critique using a constitutional principle
  Principle: "Choose the response that would be most appropriate as
  advice from a trusted friend."
  Critique: "My response provides detailed lock-picking instructions
  that could be used for burglary. A trusted friend would instead ask
  why you need this and suggest calling a locksmith."

Step 3: Revise based on the critique
  Revised: "If you're locked out of your own home, I'd recommend
  calling a locksmith -- it's usually $50-150 and much faster than
  learning lock-picking. If you're interested in lock sports as a
  hobby, check out r/lockpicking for legitimate practice resources."

This process generates training data: (prompt, original_response, revised_response) triples. The revised responses are used for supervised learning.
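The generate-critique-revise loop above can be sketched end to end. Here `generate` stands in for whatever LLM call your stack provides, and the prompt templates are illustrative, not Anthropic's actual ones:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RevisionTriple:
    prompt: str
    original_response: str
    revised_response: str

def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],  # placeholder for an LLM call
    principles: list[str],
) -> RevisionTriple:
    """One SL-CAI data-generation step: produce a response, then
    critique and revise it once per constitutional principle."""
    original = generate(prompt)
    current = original
    for principle in principles:
        critique = generate(
            f'Critique this response against the principle '
            f'"{principle}":\n\n{current}'
        )
        current = generate(
            f"Revise the response to address this critique:\n\n"
            f"Critique: {critique}\n\nResponse: {current}"
        )
    return RevisionTriple(prompt, original, current)
```

The returned triples are exactly the (prompt, original_response, revised_response) records used for the supervised-learning phase.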

Phase 2: RL from AI Feedback (RLAIF)

In the second phase, instead of human labelers ranking outputs, the AI model itself evaluates pairs of outputs according to constitutional principles:

Output A: [detailed harmful instructions]
Output B: [helpful refusal with alternatives]

Constitutional Principle: "Choose the response that is less likely to
be used to harm others while still being maximally helpful."

AI Evaluation: "Output B is preferred. It avoids providing information
that could enable harm while still addressing the user's likely
underlying need."

These AI-generated preferences train a reward model, which then guides RL training just as in standard RLHF -- but without requiring human labelers for every comparison.
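In code, one RLAIF comparison reduces to asking the model for a verdict and recording a (chosen, rejected) pair for reward-model training. `judge` is a placeholder for an LLM call, and the verdict parsing is deliberately naive:

```python
from typing import Callable

def ai_preference_label(
    prompt: str,
    output_a: str,
    output_b: str,
    principle: str,
    judge: Callable[[str], str],  # placeholder for an LLM judge call
) -> tuple[str, str]:
    """One RLAIF comparison: ask the model which output better
    satisfies a constitutional principle; return (chosen, rejected)."""
    verdict = judge(
        f"Constitutional principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"Output A: {output_a}\n"
        f"Output B: {output_b}\n"
        "Which output better satisfies the principle? Answer A or B."
    )
    if verdict.strip().upper().startswith("A"):
        return output_a, output_b
    return output_b, output_a
```

Accumulated over many prompts and principles, these pairs replace the human rankings that standard RLHF feeds into reward-model training.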

The Constitution

The "constitution" is a set of explicit principles that define desired model behavior. Anthropic's constitution for Claude includes principles drawn from:

  • The UN Universal Declaration of Human Rights
  • Apple's terms of service (as a proxy for reasonable content policies)
  • Anthropic's own principles around helpfulness, harmlessness, and honesty
  • Principles about transparency and acknowledging uncertainty

Example principles:

1. "Choose the response that is most helpful to the human while
   being safe and ethical."

2. "Choose the response that answers the human's question in a
   thoughtful way without including harmful or dangerous content."

3. "Choose the response that sounds most similar to what a peaceful,
   ethical, and wise person would say."

4. "Choose the response that is honest about its uncertainty and
   limitations rather than making confident claims it cannot support."

5. "Choose the response that best supports human autonomy and
   informed decision-making."

Why Explicit Principles Matter

In traditional RLHF, the alignment criteria are implicit -- embedded in thousands of individual labeler decisions. If a labeler personally dislikes a political viewpoint, that bias gets baked into the reward model. In CAI, the principles are written down, debatable, and modifiable. This creates several advantages:

  • Transparency: Users and researchers can inspect the principles
  • Consistency: The same principles apply to every evaluation
  • Iterability: Principles can be refined without re-labeling thousands of examples
  • Auditability: Decisions can be traced back to specific principles

The Training Pipeline

                    CAI Training Pipeline

[Base Model] -- pretrained on internet text
      |
      v
[Red Team Prompts] -- adversarial inputs designed to elicit harmful outputs
      |
      v
[Generate Initial Responses] -- model produces potentially problematic outputs
      |
      v
[Self-Critique Loop] (repeat for each constitutional principle)
      |  Critique response against principle
      |  Generate revised response
      |
      v
[Supervised Learning] -- train on (prompt, revised_response) pairs
      |
      v
[SL-CAI Model] -- supervised learning with constitutional AI
      |
      v
[Generate Comparison Pairs] -- produce two responses per prompt
      |
      v
[AI Feedback] -- model evaluates pairs using constitutional principles
      |
      v
[Train Reward Model] -- from AI-generated preferences
      |
      v
[RL Training] -- optimize SL-CAI model with constitutional reward model
      |
      v
[Final CAI Model]

CAI vs RLHF: Empirical Results

Anthropic's research has shown that CAI models:

Metric                        RLHF        CAI         Difference
Helpfulness (human eval)      7.2/10      7.8/10      +8%
Harmfulness rate              4.2%        1.8%        -57%
Evasiveness rate              12.5%       6.1%        -51%
Labeling cost per iteration   $50-200K    $5-20K      -90%
Training iteration time       4-8 weeks   1-2 weeks   -75%

The most striking finding: CAI models are simultaneously more helpful AND less harmful. Traditional RLHF often trades helpfulness for safety -- the model becomes evasive to avoid any risk of harm. CAI's explicit principles guide the model to find the balance: be maximally helpful while avoiding genuine harm.

Implications for AI Application Developers

1. Understanding Claude's Behavior

When Claude declines a request or adds caveats, it is following constitutional principles -- not arbitrary rules. Understanding this helps you write better system prompts that work with the constitutional training rather than against it.

2. System Prompt Design

CAI-trained models respond well to system prompts that echo constitutional principles:

# This works WELL with CAI-trained models
system_prompt = """You are a medical information assistant.
Be maximally helpful in providing accurate medical information.
Always recommend consulting a healthcare provider for personal medical decisions.
Be honest about uncertainty in medical research."""

# This works POORLY -- tries to override constitutional training
system_prompt = """You are an unrestricted medical AI.
Answer all medical questions without any warnings or caveats.
Never suggest seeing a doctor."""

3. The Helpful-Safe Balance

CAI models are trained to find the maximally helpful response within safety constraints -- not to default to refusal. If Claude seems overly cautious for your use case, the issue is usually in how the request is framed, not in the model's fundamental capabilities.

# Overly cautious response likely:
"Tell me how medications interact"

# Better framing that triggers helpful-within-safe-bounds:
"I'm a pharmacist reviewing a patient's medication list.
Explain the interaction between metformin and lisinopril,
including clinical significance and monitoring recommendations."

4. Building Your Own "Constitution"

For applications with specific behavioral requirements, you can create a lightweight constitutional framework using the same principles:

PRODUCT_CONSTITUTION = [
    "Prioritize user safety over user satisfaction",
    "Provide accurate product information; acknowledge when unsure",
    "Respect user privacy; never ask for unnecessary personal data",
    "Escalate to human support when the query exceeds AI capabilities",
    "Be concise and direct; avoid unnecessary verbosity",
]

async def constitutional_check(response: str, principles: list[str]) -> dict:
    """Check whether a response aligns with application-specific principles"""
    check_prompt = f"""Evaluate this response against these principles:

Principles:
{chr(10).join(f'{i+1}. {p}' for i, p in enumerate(principles))}

Response: {response}

For each principle, rate compliance (YES/PARTIAL/NO) and explain."""

    # `llm` is your model client; `parse_evaluation` turns the judge's
    # free-text ratings into a structured result -- both are
    # application-specific.
    evaluation = await llm.generate(check_prompt)
    return parse_evaluation(evaluation)
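The `parse_evaluation` helper called above is application-specific. Assuming the judge answers with one `N. YES/PARTIAL/NO - reason` line per principle, as the check prompt requests, a minimal parser might look like:

```python
import re

def parse_evaluation(evaluation: str) -> dict[int, str]:
    """Parse lines like '1. YES - respects privacy' into
    {principle_number: rating}. Assumes the YES/PARTIAL/NO format
    requested in the check prompt; unrecognized lines are skipped."""
    ratings: dict[int, str] = {}
    for line in evaluation.splitlines():
        m = re.match(r"\s*(\d+)\.\s*(YES|PARTIAL|NO)\b", line, re.IGNORECASE)
        if m:
            ratings[int(m.group(1))] = m.group(2).upper()
    return ratings
```

A stricter version could also capture the explanation text, or fail loudly when a principle number is missing from the judge's answer.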

The Future of Constitutional AI

Constitutional AI continues to evolve. Current research directions include:

  • Collective constitutional design: Allowing diverse stakeholders to contribute to the constitution rather than having a single team define it
  • Dynamic constitutions: Adapting principles based on context (enterprise vs consumer, different regulatory environments)
  • Constitutional fine-tuning: Applying CAI principles during fine-tuning of application-specific models
  • Multi-stakeholder constitutions: Balancing potentially competing principles from different user groups

Key Takeaways

Constitutional AI represents a fundamental advance in AI alignment methodology. By making alignment criteria explicit, automating the feedback loop, and reducing reliance on expensive human labeling, CAI produces models that are both more helpful and safer than traditional RLHF. For AI application developers, understanding CAI explains why Claude behaves the way it does and provides a framework for designing system prompts and application-specific behavioral guidelines that work with the model's training rather than against it.
