LangSmith: Tracing, Debugging, and Evaluating LangChain Applications
Set up LangSmith for tracing LangChain runs, analyzing run trees, building evaluation datasets, running automated evaluations, and collecting feedback on LLM outputs.
Why You Need Observability for LLM Applications
LLM applications are notoriously difficult to debug. A chain might call three models, two tools, and a retriever — and when the final answer is wrong, you need to trace exactly which step failed and why. Logging raw inputs and outputs is not enough when calls are nested and asynchronous.
LangSmith is the observability and evaluation platform built specifically for LangChain (and any LLM application). It captures detailed traces of every run, lets you visualize the execution tree, and provides tools for systematic evaluation.
Setting Up Tracing
LangSmith tracing requires an API key and a few environment variables: LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY are required, while LANGCHAIN_PROJECT is optional (runs go to a project named "default" if it is unset). Once set, all LangChain operations are traced automatically.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_your_key_here"
os.environ["LANGCHAIN_PROJECT"] = "my-project"

# That is it. All LangChain operations are now traced.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

chain = (
    ChatPromptTemplate.from_template("Explain {topic}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

# This call is automatically traced in LangSmith
result = chain.invoke({"topic": "quantum computing"})
Every invocation creates a run in the LangSmith dashboard. You can see the input, output, latency, token usage, and cost for each component in the chain.
Understanding Run Trees
A run tree is a hierarchical view of a single invocation. For a RAG chain, the tree might look like:
- Chain Run (root)
  - Retriever Run — query, returned documents, latency
  - Prompt Run — template variables, formatted prompt
  - LLM Run — model, temperature, prompt tokens, completion tokens, response
  - Parser Run — raw input, parsed output
Each node shows timing, inputs, outputs, and any errors. This lets you identify bottlenecks (slow retrievals) and failures (parsing errors) instantly.
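The same tree can be inspected programmatically. Here is a sketch of a recursive walk over a fetched run; the `read_run(..., load_child_runs=True)` call in the comment is an assumption about the SDK to verify against the docs:

```python
def tree_lines(run, depth: int = 0) -> list[str]:
    """Render a run tree as indented lines, one node per line."""
    lines = [f"{'  ' * depth}{run.name} ({run.run_type})"]
    for child in (run.child_runs or []):
        lines.extend(tree_lines(child, depth + 1))
    return lines

# Fetching a real tree would look like this (requires an API key):
# from langsmith import Client
# root = Client().read_run("run-uuid-here", load_child_runs=True)
# print("\n".join(tree_lines(root)))
```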
Custom Tracing with the @traceable Decorator
Trace any Python function, not just LangChain components.
from langsmith import traceable

@traceable(name="process_order")
def process_order(order_id: str, items: list[str]) -> dict:
    # Your business logic (calculate_total and submit_order are your own helpers)
    total = calculate_total(items)
    result = submit_order(order_id, items, total)
    return {"order_id": order_id, "total": total, "status": result}

# This function call appears as a traced run in LangSmith
process_order("ORD-123", ["widget", "gadget"])
@traceable functions nest correctly inside LangChain traces. If a traced function calls a LangChain chain, both appear in the same run tree.
Building Evaluation Datasets
LangSmith lets you create datasets of input-output examples for systematic evaluation.
from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    "customer-support-qa",
    description="Questions and expected answers for customer support bot",
)

# Add examples
client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "What is your refund policy?"},
        {"question": "How do I cancel my subscription?"},
    ],
    outputs=[
        {"answer": "Go to Settings > Security > Reset Password."},
        {"answer": "Full refund within 30 days of purchase."},
        {"answer": "Go to Settings > Subscription > Cancel."},
    ],
    dataset_id=dataset.id,
)
You can also create datasets from traced runs — select successful or failed runs from the dashboard and convert them into evaluation examples.
Running Evaluations
Evaluators score your chain's outputs against expected results.
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Define evaluators
correctness = LangChainStringEvaluator("qa")  # LLM-based QA evaluator
relevance = LangChainStringEvaluator(
    "criteria",
    config={"criteria": "relevance"},
)

# Run evaluation
def predict(inputs: dict) -> dict:
    result = chain.invoke(inputs)
    return {"answer": result}

results = evaluate(
    predict,
    data="customer-support-qa",
    evaluators=[correctness, relevance],
    experiment_prefix="v1-gpt4o-mini",
)
print(results.to_pandas())
Each evaluation run creates an experiment in LangSmith. You can compare experiments side by side to measure the impact of prompt changes, model upgrades, or chain modifications.
Custom Evaluators
Write evaluators for domain-specific quality criteria.
from langsmith.schemas import Run, Example

def check_citation(run: Run, example: Example) -> dict:
    """Check if the response cites a source."""
    output = (run.outputs or {}).get("answer", "")
    has_citation = "source:" in output.lower() or "[" in output
    return {
        "key": "has_citation",
        "score": 1 if has_citation else 0,
    }

results = evaluate(
    predict,
    data="customer-support-qa",
    evaluators=[check_citation],
)
Collecting Human Feedback
LangSmith supports feedback collection on individual runs via the API or dashboard.
from langsmith import Client

client = Client()

# After a user rates a response
client.create_feedback(
    run_id="run-uuid-here",
    key="user_rating",
    score=1,  # thumbs up
    comment="Accurate and helpful",
)
Feedback data powers fine-tuning datasets and helps identify where your application needs improvement.
FAQ
Is LangSmith required to use LangChain?
No. LangSmith is optional. LangChain works without it. But for any production application, the observability and evaluation capabilities are essential for debugging issues, measuring quality, and iterating on prompts and chains.
What does LangSmith cost?
LangSmith has a free tier with limited trace retention. Paid tiers offer longer retention, more traces, and team collaboration features. Check the current pricing at smith.langchain.com.
Can I use LangSmith with non-LangChain code?
Yes. The @traceable decorator and the LangSmith SDK work with any Python code. You can trace raw OpenAI API calls, custom HTTP requests, or any function. LangSmith is not limited to LangChain applications.
#LangSmith #Observability #LLMEvaluation #Debugging #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.