
LangSmith: Tracing, Debugging, and Evaluating LangChain Applications

Set up LangSmith for tracing LangChain runs, analyzing run trees, building evaluation datasets, running automated evaluations, and collecting feedback on LLM outputs.

Why You Need Observability for LLM Applications

LLM applications are notoriously difficult to debug. A chain might call three models, two tools, and a retriever — and when the final answer is wrong, you need to trace exactly which step failed and why. Logging raw inputs and outputs is not enough when calls are nested and asynchronous.

LangSmith is the observability and evaluation platform built specifically for LangChain (and any LLM application). It captures detailed traces of every run, lets you visualize the execution tree, and provides tools for systematic evaluation.

Setting Up Tracing

LangSmith tracing requires an API key and a couple of environment variables: `LANGCHAIN_TRACING_V2` turns tracing on, `LANGCHAIN_API_KEY` authenticates, and `LANGCHAIN_PROJECT` (optional, defaulting to "default") groups runs into a named project. Once set, all LangChain operations are traced automatically.

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_your_key_here"
os.environ["LANGCHAIN_PROJECT"] = "my-project"

# That is it. All LangChain operations are now traced.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

chain = (
    ChatPromptTemplate.from_template("Explain {topic}")
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

# This call is automatically traced in LangSmith
result = chain.invoke({"topic": "quantum computing"})

Every invocation creates a run in the LangSmith dashboard. You can see the input, output, latency, token usage, and cost for each component in the chain.
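Runs are easier to filter in the dashboard if you label them at invocation time. LangChain's `config` argument accepts `run_name`, `tags`, and `metadata` fields; the helper below is a minimal sketch (the function name and key values are illustrative):

```python
def tracing_config(user_id: str, experiment: str) -> dict:
    """Build a LangChain RunnableConfig dict that labels the traced run."""
    return {
        "run_name": f"explain-{experiment}",  # shown as the root run's name
        "tags": ["docs-demo", experiment],    # filterable in the dashboard
        "metadata": {"user_id": user_id},     # arbitrary key-value pairs
    }

# chain.invoke({"topic": "quantum computing"}, config=tracing_config("u-42", "baseline"))
```

Tags and metadata attach to the root run and can be used as filters when browsing traces or building datasets later.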

Understanding Run Trees

A run tree is a hierarchical view of a single invocation. For a RAG chain, the tree might look like:

  • Chain Run (root)
    • Retriever Run — query, returned documents, latency
    • Prompt Run — template variables, formatted prompt
    • LLM Run — model, temperature, prompt tokens, completion tokens, response
    • Parser Run — raw input, parsed output

Each node shows timing, inputs, outputs, and any errors. This lets you identify bottlenecks (slow retrievals) and failures (parsing errors) instantly.
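You can also fetch runs over the API (e.g. `client.read_run(run_id, load_child_runs=True)`) and walk the tree yourself. The sketch below renders a nested run as an indented outline; it assumes runs have been converted to plain dicts with `name`, `run_type`, and `child_runs` fields, mirroring the LangSmith Run schema:

```python
def render_run_tree(run: dict, depth: int = 0) -> str:
    """Render a run and its children as an indented text outline."""
    line = "  " * depth + f"{run['name']} ({run.get('run_type', 'chain')})"
    children = run.get("child_runs") or []
    return "\n".join([line] + [render_run_tree(child, depth + 1) for child in children])
```

Printing the result for a RAG trace gives a terminal-friendly version of the tree shown above.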

Custom Tracing with the @traceable Decorator

Trace any Python function, not just LangChain components.

from langsmith import traceable

def calculate_total(items: list[str]) -> float:
    # Stub pricing logic -- replace with your own.
    return 9.99 * len(items)

def submit_order(order_id: str, items: list[str], total: float) -> str:
    # Stub submission logic -- replace with your own.
    return "confirmed"

@traceable(name="process_order")
def process_order(order_id: str, items: list[str]) -> dict:
    total = calculate_total(items)
    status = submit_order(order_id, items, total)
    return {"order_id": order_id, "total": total, "status": status}

# This function call appears as a traced run in LangSmith
process_order("ORD-123", ["widget", "gadget"])

@traceable functions nest correctly inside LangChain traces. If a traced function calls a LangChain chain, both appear in the same run tree.


Building Evaluation Datasets

LangSmith lets you create datasets of input-output examples for systematic evaluation.

from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    "customer-support-qa",
    description="Questions and expected answers for customer support bot",
)

# Add examples
client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "What is your refund policy?"},
        {"question": "How do I cancel my subscription?"},
    ],
    outputs=[
        {"answer": "Go to Settings > Security > Reset Password."},
        {"answer": "Full refund within 30 days of purchase."},
        {"answer": "Go to Settings > Subscription > Cancel."},
    ],
    dataset_id=dataset.id,
)

You can also create datasets from traced runs — select successful or failed runs from the dashboard and convert them into evaluation examples.
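Programmatically, the same idea looks roughly like the sketch below. The function name `dataset_from_runs` is illustrative, and the client is passed in as a parameter; `list_runs`, `create_dataset`, and `create_examples` are real `langsmith.Client` methods:

```python
def dataset_from_runs(client, project_name: str, dataset_name: str):
    """Copy successful root runs from a project into a new evaluation dataset."""
    runs = list(client.list_runs(project_name=project_name, is_root=True, error=False))
    dataset = client.create_dataset(dataset_name)
    client.create_examples(
        inputs=[run.inputs for run in runs],
        outputs=[run.outputs for run in runs],
        dataset_id=dataset.id,
    )
    return dataset
```

Called as `dataset_from_runs(Client(), "my-project", "prod-regressions")`, this turns real production traffic into regression examples.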

Running Evaluations

Evaluators score your chain's outputs against expected results.

from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Define evaluators
correctness = LangChainStringEvaluator("qa")  # LLM-based QA evaluator
relevance = LangChainStringEvaluator("criteria", config={
    "criteria": "relevance",
})

# Run evaluation
def predict(inputs: dict) -> dict:
    result = chain.invoke(inputs)
    return {"answer": result}

results = evaluate(
    predict,
    data="customer-support-qa",
    evaluators=[correctness, relevance],
    experiment_prefix="v1-gpt4o-mini",
)

print(results.to_pandas())

Each evaluation run creates an experiment in LangSmith. You can compare experiments side by side to measure the impact of prompt changes, model upgrades, or chain modifications.
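For a quick offline comparison, you can export each experiment's mean score per evaluator (for example by aggregating `results.to_pandas()`) and diff the two. A hypothetical helper, assuming each experiment has been reduced to a dict of evaluator key to mean score:

```python
def compare_experiments(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Score delta per evaluator key; positive means the candidate improved."""
    keys = set(baseline) | set(candidate)
    return {k: round(candidate.get(k, 0.0) - baseline.get(k, 0.0), 3) for k in sorted(keys)}
```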

Custom Evaluators

Write evaluators for domain-specific quality criteria.

from langsmith.schemas import Run, Example

def check_citation(run: Run, example: Example) -> dict:
    """Check if the response cites a source."""
    output = (run.outputs or {}).get("answer", "")
    has_citation = "source:" in output.lower() or "[" in output
    return {
        "key": "has_citation",
        "score": 1 if has_citation else 0,
    }

results = evaluate(
    predict,
    data="customer-support-qa",
    evaluators=[check_citation],
)

Collecting Human Feedback

LangSmith supports feedback collection on individual runs via the API or dashboard.

from langsmith import Client

client = Client()

# After a user rates a response
client.create_feedback(
    run_id="run-uuid-here",
    key="user_rating",
    score=1,  # thumbs up
    comment="Accurate and helpful",
)

Feedback data powers fine-tuning datasets and helps identify where your application needs improvement.
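Once feedback accumulates, you can pull it back with `client.list_feedback(...)` and aggregate it. The sketch below averages scores per feedback key; it assumes feedback records have been converted to plain dicts with `key` and `score` fields:

```python
from collections import defaultdict

def summarize_feedback(records: list[dict]) -> dict[str, float]:
    """Average score per feedback key, skipping records without a numeric score."""
    scores: dict[str, list[float]] = defaultdict(list)
    for record in records:
        if record.get("score") is not None:
            scores[record["key"]].append(record["score"])
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}
```

A falling average on a key like `user_rating` is an early signal to pull the low-scored runs into an evaluation dataset.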

FAQ

Is LangSmith required to use LangChain?

No. LangSmith is optional. LangChain works without it. But for any production application, the observability and evaluation capabilities are essential for debugging issues, measuring quality, and iterating on prompts and chains.

What does LangSmith cost?

LangSmith has a free tier with limited trace retention. Paid tiers offer longer retention, more traces, and team collaboration features. Check the current pricing at smith.langchain.com.

Can I use LangSmith with non-LangChain code?

Yes. The @traceable decorator and the LangSmith SDK work with any Python code. You can trace raw OpenAI API calls, custom HTTP requests, or any function. LangSmith is not limited to LangChain applications.


#LangSmith #Observability #LLMEvaluation #Debugging #Python #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
