Hugging Face Transformers for Agent Development: Loading and Running Models
Master the Hugging Face Transformers library for agent development. Learn model loading, pipeline APIs, chat templates, generation parameters, and how to integrate local models into agent workflows.
Hugging Face Transformers: The Foundation Layer
The transformers library from Hugging Face is the most widely used interface for loading and running open-source language models. While higher-level serving tools such as vLLM and Ollama handle inference for you (vLLM reuses Hub model configs and tokenizers; Ollama is built on llama.cpp), understanding Transformers directly gives you full control over model behavior, which is essential when debugging agent issues or customizing inference.
For agent developers, Transformers provides the building blocks: loading any model from the Hub, applying chat templates for instruction-tuned models, controlling generation parameters precisely, and integrating with quantization libraries.
Loading a Model and Tokenizer
Every model interaction starts with loading the model weights and its tokenizer:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # Use FP16 to save memory
    device_map="auto",          # Automatically distribute across GPUs
)
The device_map="auto" parameter uses Hugging Face Accelerate to distribute model layers across available GPUs and CPU RAM. For a model that fits in a single GPU, it places everything on cuda:0. For larger models, it splits layers across devices.
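To build intuition for what automatic placement does, here is a simplified, hypothetical sketch of greedy layer assignment. This is not Accelerate's actual implementation (which also accounts for tied weights, buffer overhead, and memory headroom); it only illustrates the spill-to-next-device idea:

```python
def assign_layers(layer_sizes_gb, device_budgets_gb):
    """Greedily place layers onto devices in order, spilling to the next
    device (e.g. a second GPU, then "cpu") when one fills up.
    A toy illustration of device_map="auto", not Accelerate's real code."""
    device_map = {}
    devices = list(device_budgets_gb.items())  # e.g. [("cuda:0", 16), ("cpu", 64)]
    idx, free = 0, devices[0][1]
    for layer, size in layer_sizes_gb.items():
        while size > free:
            idx += 1
            if idx >= len(devices):
                raise MemoryError("model does not fit on available devices")
            free = devices[idx][1]
        device_map[layer] = devices[idx][0]
        free -= size
    return device_map

# The lm_head no longer fits on the 2 GB GPU, so it spills to CPU RAM
print(assign_layers(
    {"embed": 1.0, "layer.0": 0.5, "layer.1": 0.5, "lm_head": 1.0},
    {"cuda:0": 2.0, "cpu": 8.0},
))
```

After loading a real model with device_map="auto", you can inspect the resulting placement via model.hf_device_map.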
The Pipeline API: Quick Start for Inference
The pipeline API provides a high-level interface that handles tokenization, generation, and decoding in one call:
from transformers import pipeline
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a linked list."},
]

output = generator(
    messages,
    max_new_tokens=512,
    temperature=0.3,
    do_sample=True,
)
print(output[0]["generated_text"][-1]["content"])
The pipeline automatically applies the model's chat template to format the messages correctly, which is critical — different models use different special tokens and formatting conventions.
Chat Templates: Getting the Format Right
Instruction-tuned models are trained with specific prompt formats. Using the wrong format dramatically reduces model quality. Transformers handles this through chat templates stored in the tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [
    {"role": "system", "content": "You are an agent that answers questions concisely."},
    {"role": "user", "content": "What is PagedAttention?"},
]

# Apply the chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(formatted)
# Shows the exact format the model expects, including special tokens
The add_generation_prompt=True parameter appends the assistant turn prefix, telling the model to start generating a response. Omitting this is a common bug that causes models to continue the user's message instead of responding to it.
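To make the effect concrete, here is a hand-rolled, simplified version of a Llama-3-style format. The special token names follow Llama 3.1's published format, but this is an illustration only; in real code, always use apply_chat_template instead of formatting by hand:

```python
def format_llama3_style(messages, add_generation_prompt=True):
    """Simplified illustration of a Llama-3.1-style chat format.
    Real templates live in the tokenizer; never hand-roll in production."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model generates a reply
        # instead of continuing the user's message
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = format_llama3_style([{"role": "user", "content": "Hi"}])
print(prompt)
```

With add_generation_prompt=True the string ends with an open assistant header; without it, the string ends at the user's <|eot_id|>, which is why the model "continues" the user turn.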
Fine-Grained Generation Control
For agent applications, you need precise control over how the model generates text. The generate method exposes all the knobs:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You respond with JSON only."},
    {"role": "user", "content": "Extract the name and age from: John is 30 years old."},
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.1,         # Low temperature for deterministic agent behavior
    top_p=0.9,               # Nucleus sampling threshold
    repetition_penalty=1.1,  # Penalize repeated tokens
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the generated tokens (exclude the prompt)
response = tokenizer.decode(
    outputs[0][inputs.shape[-1]:],
    skip_special_tokens=True,
)
print(response)
Key generation parameters for agents:
- temperature=0.1-0.3: Keeps agent outputs consistent and predictable
- repetition_penalty=1.1: Prevents the model from getting stuck in loops
- max_new_tokens: Set this based on your expected output length to save compute
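These defaults can be bundled into a small helper so every agent call uses the same settings. The values below are the starting points suggested above, not universal constants; tune them per task:

```python
def agent_generation_kwargs(max_new_tokens=256):
    """Conservative sampling settings for agent use; tune per task."""
    return {
        "max_new_tokens": max_new_tokens,
        "temperature": 0.2,          # low randomness for predictable outputs
        "top_p": 0.9,                # nucleus sampling threshold
        "repetition_penalty": 1.1,   # discourage loops
        "do_sample": True,
    }

# Usage: outputs = model.generate(inputs, **agent_generation_kwargs(512))
```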
Streaming for Responsive Agents
Agents that interact with users benefit from streaming output. Use the TextStreamer or TextIteratorStreamer for real-time token output:
from transformers import TextIteratorStreamer
from threading import Thread
# skip_prompt=True keeps the echoed prompt out of the stream
streamer = TextIteratorStreamer(
    tokenizer, skip_prompt=True, skip_special_tokens=True
)

generation_kwargs = {
    "input_ids": inputs,
    "max_new_tokens": 512,
    "streamer": streamer,
    "temperature": 0.7,
    "do_sample": True,
}

# Run generation in a background thread so tokens can be consumed as they arrive
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for text_chunk in streamer:
    print(text_chunk, end="", flush=True)

thread.join()
Building an Agent Loop with Transformers
Here is a minimal agent loop that handles a simple calculator tool using Transformers directly:
import json
def agent_generate(model, tokenizer, messages, max_tokens=512):
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True
    ).to(model.device)
    outputs = model.generate(
        inputs,
        max_new_tokens=max_tokens,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(
        outputs[0][inputs.shape[-1]:], skip_special_tokens=True
    )
def run_agent(model, tokenizer, user_query: str):
    messages = [
        {"role": "system", "content": "You are a helpful agent. "
         "If you need to calculate something, output JSON: "
         '{"tool": "calculate", "expression": "..."}'},
        {"role": "user", "content": user_query},
    ]
    for step in range(5):  # Max 5 agent steps
        response = agent_generate(model, tokenizer, messages)
        messages.append({"role": "assistant", "content": response})
        if '{"tool"' in response:
            tool_call = json.loads(response)
            # WARNING: eval() on model output is unsafe; fine for a local
            # demo, but use a restricted evaluator in anything real.
            result = str(eval(tool_call["expression"]))
            messages.append({"role": "user", "content": f"Result: {result}"})
        else:
            return response
    return response
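The eval() call above is the weak point of this demo: a model could emit arbitrary code. A safer drop-in for the calculator tool walks the expression's AST and allows only arithmetic. This is a sketch of one common approach, not the only way to sandbox tool execution:

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else raises
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def safe_calculate(expression: str):
    """Evaluate a basic arithmetic expression without eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval"))

print(safe_calculate("2 + 3 * 4"))  # 14
```

Swapping `eval(tool_call["expression"])` for `safe_calculate(tool_call["expression"])` keeps function calls, attribute access, and imports out of reach.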
FAQ
When should I use Transformers directly versus Ollama or vLLM?
Use Transformers directly when you need fine-grained control over generation, are integrating custom model architectures, or are doing research. Use Ollama for simple local development. Use vLLM for production serving. Many developers prototype with Transformers, then deploy with vLLM.
How do I load a model that does not fit in GPU memory?
Use device_map="auto" with the Accelerate library. It will split the model across GPU and CPU RAM automatically. Alternatively, load a quantized version using BitsAndBytesConfig for 4-bit or 8-bit loading directly within Transformers.
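For the 4-bit route, a typical configuration looks like the sketch below. It uses the bitsandbytes integration, which requires the bitsandbytes package and a CUDA GPU; parameter values here are common defaults, not the only valid choices:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dtype used during matmuls
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```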
Why does my model generate garbage after switching from one model to another?
Each model has a unique chat template. If you switch from Llama to Mistral, the prompt format changes. Always use tokenizer.apply_chat_template() rather than manually constructing prompts. This ensures the correct format regardless of which model you load.