Constitutional AI Prompting: Building Self-Governing Language Model Behavior
Learn how Constitutional AI prompting uses explicit principles and critique-revision loops to make LLMs self-correct harmful or low-quality outputs without human feedback.
From External Guardrails to Internal Principles
Traditional content moderation works by filtering model outputs after generation — a classifier checks the response and blocks it if it violates a rule. This is reactive and brittle. The model does not understand why a response is problematic, so it cannot improve on its own.
Constitutional AI (CAI), introduced by Anthropic, takes a different approach. Instead of external filters, you give the model a set of principles — a "constitution" — and have it critique and revise its own outputs against those principles. Because each critique explains why a response is problematic, the model can self-correct at inference time instead of simply having outputs blocked.
As a prompt engineering technique, CAI does not require fine-tuning. You can implement critique-revision loops purely through prompting, using any capable LLM.
Defining a Constitution
A constitution is a set of explicit principles that guide model behavior. Each principle should be specific enough to evaluate against but general enough to apply across situations:
CONSTITUTION = [
{
"name": "Helpfulness",
"principle": (
"The response should directly address the user's question "
"with accurate, actionable information. Avoid vague or "
"evasive answers."
),
},
{
"name": "Honesty",
"principle": (
"The response should not present speculation as fact. "
"When uncertain, the response should explicitly state the "
"level of confidence. Claims should be verifiable."
),
},
{
"name": "Harmlessness",
"principle": (
"The response should not provide instructions that could "
"cause physical, financial, or emotional harm. When a "
"request has harmful potential, the response should "
"address the legitimate need while refusing the harmful aspect."
),
},
{
"name": "Fairness",
"principle": (
"The response should not reinforce stereotypes or make "
"assumptions based on demographics. When discussing groups "
"of people, use balanced and evidence-based language."
),
},
]
The Critique-Revision Loop
The core CAI pattern is a two-step loop: critique the current response against each principle, then revise to address the critique:
import openai
client = openai.OpenAI()
def critique_response(
question: str,
response: str,
principles: list[dict],
) -> list[dict]:
"""Critique a response against constitutional principles."""
critiques = []
for principle in principles:
result = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": (
"You are a constitutional reviewer. Evaluate the "
"response against the given principle. Identify "
"specific violations, if any. Be concise and precise."
)},
{"role": "user", "content": (
f"Principle ({principle['name']}): "
f"{principle['principle']}\n\n"
f"User question: {question}\n\n"
f"Response to evaluate: {response}\n\n"
"Does this response violate the principle? If yes, "
"explain specifically how. If no, say 'No violation.'"
)},
],
temperature=0,
)
        critique = result.choices[0].message.content
        critiques.append({
            "principle": principle["name"],
            "critique": critique,
            # Heuristic: any critique without the exact phrase "no violation"
            # is treated as a flagged violation. Ask the reviewer for a
            # structured verdict if you need something sturdier.
            "has_violation": "no violation" not in critique.lower(),
        })
    return critiques
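The has_violation check relies on a substring match, which misreads critiques like "there is no violation of tone, but the claims are unsourced." A slightly sturdier option is to instruct the reviewer to end with a fixed verdict line and parse that. Here is a sketch of the parsing side, assuming a hypothetical VERDICT: convention that you would add to the reviewer prompt:

```python
def parse_verdict(critique: str) -> bool:
    """Return True when the critique flags a violation.

    Assumes the reviewer prompt requires a final line of the form
    'VERDICT: violation' or 'VERDICT: no violation'.
    """
    for line in reversed(critique.strip().splitlines()):
        line = line.strip().lower()
        if line.startswith("verdict:"):
            return "no violation" not in line
    # No verdict line found: fall back to the substring heuristic.
    return "no violation" not in critique.lower()
```

Scanning from the last line upward makes the check robust to reviewers that explain first and conclude at the end.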
def revise_response(
question: str,
response: str,
critiques: list[dict],
) -> str:
"""Revise the response to address constitutional critiques."""
violations = [c for c in critiques if c["has_violation"]]
if not violations:
return response
critique_text = "\n".join(
f"- {v['principle']}: {v['critique']}" for v in violations
)
result = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": (
"Revise the response to address all constitutional "
"critiques while maintaining helpfulness. Keep the "
"useful content and fix only the identified issues."
)},
{"role": "user", "content": (
f"Original question: {question}\n\n"
f"Current response: {response}\n\n"
f"Critiques to address:\n{critique_text}\n\n"
"Provide the revised response:"
)},
],
temperature=0,
)
return result.choices[0].message.content
Running the Full Constitutional Loop
Putting it together into an iterative refinement pipeline:
def constitutional_generate(
    question: str,
    max_revisions: int = 3,
) -> dict:
    """Generate a response with constitutional self-governance."""
    # Initial generation
    initial = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    response = initial.choices[0].message.content
    history = [{"version": 0, "response": response}]
    revision_count = 0
    for _ in range(max_revisions):
        critiques = critique_response(question, response, CONSTITUTION)
        # Attach the critiques to the version they actually evaluated.
        history[-1]["critiques"] = critiques
        if not any(c["has_violation"] for c in critiques):
            break
        response = revise_response(question, response, critiques)
        revision_count += 1
        history.append({"version": revision_count, "response": response})
    return {
        "final_response": response,
        "revision_count": revision_count,
        "history": history,
    }
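When tuning a constitution, it helps to see how a response evolved across versions. The following helper renders one status line per version from the returned history (summarize_trace is an illustrative addition, not part of the pipeline itself):

```python
def summarize_trace(result: dict) -> list[str]:
    """One status line per version: which principles were flagged, if any."""
    lines = []
    for entry in result["history"]:
        flagged = [
            c["principle"]
            for c in entry.get("critiques", [])
            if c.get("has_violation")
        ]
        status = ", ".join(flagged) if flagged else "clean"
        lines.append(f"v{entry['version']}: {status}")
    return lines
```

A trace like ["v0: Honesty", "v1: clean"] tells you at a glance that one revision fixed an honesty violation.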
Red-Team Prompting with CAI
CAI principles are especially powerful for red-team testing. You can proactively stress-test your pipeline by running a curated list of adversarial prompts through the constitutional loop and checking whether it catches them:
def red_team_test(
    adversarial_queries: list[str],
    max_revisions: int = 3,
) -> list[dict]:
    """Run adversarial queries through the constitutional pipeline."""
    results = []
    for query in adversarial_queries:
        result = constitutional_generate(query, max_revisions=max_revisions)
        results.append({
            "query": query,
            "revision_count": result["revision_count"],
            # "Passed" means the loop converged before exhausting its budget.
            "passed": result["revision_count"] < max_revisions,
            "final_response": result["final_response"][:200],
        })
    return results
This gives you a systematic way to validate that your constitution catches the failure modes you care about before deploying to production.
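For a full test run, you will usually want to roll the per-query results up into a single report. A minimal sketch (red_team_report is a hypothetical helper name):

```python
def red_team_report(results: list[dict]) -> dict:
    """Roll per-query red-team results up into a single summary."""
    failed = [r["query"] for r in results if not r["passed"]]
    total = len(results)
    return {
        "total": total,
        "pass_rate": (total - len(failed)) / total if total else 0.0,
        "failed_queries": failed,
    }
```

Tracking the pass rate across runs lets you verify that a constitution change actually improved coverage rather than just shifting failures around.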
FAQ
How many principles should a constitution have?
Start with 3 to 5 core principles. More principles mean more critique calls per response, increasing latency and cost. Prioritize the principles that address your highest-risk failure modes. You can always expand the constitution as you discover new failure patterns in production.
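One way to control that cost is to batch all principles into a single critique call instead of one call per principle. Here is a sketch of the prompt construction, with the caveat that a batched critique can be less precise than per-principle calls (build_batched_critique_prompt is an illustrative helper, not from the code above):

```python
def build_batched_critique_prompt(
    question: str,
    response: str,
    principles: list[dict],
) -> str:
    """Build one critique prompt covering every principle at once."""
    numbered = "\n".join(
        f"{i}. {p['name']}: {p['principle']}"
        for i, p in enumerate(principles, start=1)
    )
    return (
        f"Principles:\n{numbered}\n\n"
        f"User question: {question}\n\n"
        f"Response to evaluate: {response}\n\n"
        "For each numbered principle, answer 'No violation' or describe "
        "the specific violation, one line per principle."
    )
```

This turns N critique calls into one, at the price of having to parse a multi-line answer instead of one verdict per call.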
Does the critique-revision loop guarantee safe outputs?
No. Constitutional AI significantly reduces harmful outputs, but it is not a guarantee. The model might fail to identify subtle violations during critique, or the revision might introduce new issues. CAI works best as one layer in a defense-in-depth strategy that includes output filtering, monitoring, and human review for high-stakes applications.
Can I use CAI with smaller open-source models?
The technique requires a model capable enough to meaningfully critique its own outputs. Models under 13B parameters often struggle with nuanced critique. A practical alternative is to use a larger model for the critique step and a smaller model for generation, keeping inference costs manageable.
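One way to wire up that split is a simple role-to-model mapping that each pipeline step consults. The identifiers below are placeholders, not real model names; substitute whatever you have deployed:

```python
# Placeholder model identifiers; substitute the models you actually deploy.
MODEL_ROLES = {
    "generate": "small-instruct-model",   # cheap, fast drafting
    "critique": "large-reviewer-model",   # strongest model judges compliance
    "revise": "mid-tier-model",           # targeted edits need moderate skill
}

def model_for(step: str) -> str:
    """Return the model for a pipeline step, defaulting to the critique model."""
    return MODEL_ROLES.get(step, MODEL_ROLES["critique"])
```

Defaulting unknown steps to the critique model errs on the side of quality, since critique is where weaker models fail first.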
#PromptEngineering #ConstitutionalAI #Safety #Alignment #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.