
The Transformer Architecture Explained: Attention Is All You Need

A clear, code-driven explanation of the transformer architecture including self-attention, multi-head attention, positional encoding, and how encoder-decoder models work.

Why the Transformer Changed Everything

Before 2017, the dominant architectures for language processing were recurrent neural networks (RNNs) and their variants (LSTMs, GRUs). These models processed text sequentially — one word at a time, left to right. This sequential nature created two fundamental problems: training was slow because you could not parallelize across the sequence, and long-range dependencies were hard to learn because information had to pass through every intermediate step.

The transformer, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., solved both problems by replacing recurrence entirely with attention mechanisms. Every modern LLM — GPT-4, Claude, Gemini, Llama, Mistral — is built on transformers.

The Core Idea: Self-Attention

Self-attention lets every token in a sequence look at every other token to decide what information is relevant. Consider the sentence: "The animal didn't cross the street because it was too tired." What does "it" refer to? A human knows "it" refers to "the animal." Self-attention computes this by having the "it" token attend strongly to the "animal" token.

Here is self-attention implemented from scratch:

import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """
    Scaled dot-product self-attention.

    X: input matrix (seq_len x d_model) — one row per token
    W_Q, W_K, W_V: learned projection matrices (d_model x d_k)
    """
    Q = X @ W_Q   # Queries: what is each token looking for?
    K = X @ W_K   # Keys: what does each token advertise?
    V = X @ W_V   # Values: what information does each token carry?

    d_k = K.shape[-1]

    # Attention scores: how much should token i attend to token j?
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len x seq_len)

    # Convert to probabilities
    attention_weights = softmax(scores)  # rows sum to 1

    # Weighted combination of values
    output = attention_weights @ V  # (seq_len x d_k)

    return output, attention_weights

# Example: 4 tokens, embedding dimension 8, head dimension 4
np.random.seed(42)
seq_len, d_model, d_k = 4, 8, 4

X = np.random.randn(seq_len, d_model)  # 4 token embeddings
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

output, weights = self_attention(X, W_Q, W_K, W_V)
print("Attention weights (each row shows how much a token attends to each other token):")
print(np.round(weights, 3))

The scores = Q @ K.T / sqrt(d_k) line is the heart of the transformer. The division by the square root of the key dimension prevents the dot products from becoming too large, which would cause the softmax to produce near-one-hot distributions that make gradients vanish.
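A quick, self-contained way to see the effect (random vectors with illustrative dimensions — not part of the model above): compare the softmax of raw dot products against the scaled version.

```python
import numpy as np

def softmax(x, axis=-1):
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal(d_k)           # one query
keys = rng.standard_normal((8, d_k))   # eight keys

raw = keys @ q                 # dot products have variance ~ d_k
scaled = raw / np.sqrt(d_k)    # rescaled back to variance ~ 1

print("max weight, unscaled:", round(softmax(raw).max(), 3))
print("max weight, scaled:  ", round(softmax(scaled).max(), 3))
# The unscaled distribution is peakier — closer to one-hot.
```

Larger logits always concentrate the softmax more, so the unscaled weights are at least as peaked as the scaled ones, and with variance growing as d_k they quickly become near-one-hot.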

Multi-Head Attention: Looking at Multiple Things Simultaneously

A single attention head can only focus on one type of relationship at a time. Multi-head attention runs multiple attention heads in parallel, each learning to focus on different aspects of the input:

def multi_head_attention(X, n_heads, d_model):
    """
    Multi-head attention splits the model dimension across heads,
    runs attention independently, then concatenates results.
    """
    d_k = d_model // n_heads
    heads = []

    for h in range(n_heads):
        # Each head has its own Q, K, V projections
        W_Q = np.random.randn(d_model, d_k)
        W_K = np.random.randn(d_model, d_k)
        W_V = np.random.randn(d_model, d_k)

        head_output, _ = self_attention(X, W_Q, W_K, W_V)
        heads.append(head_output)

    # Concatenate all heads: (seq_len, n_heads * d_k) = (seq_len, d_model)
    concatenated = np.concatenate(heads, axis=-1)

    # Final linear projection
    W_O = np.random.randn(d_model, d_model)
    output = concatenated @ W_O

    return output

# 4 tokens, 8-dimensional embeddings, 2 attention heads
result = multi_head_attention(X, n_heads=2, d_model=8)
print(f"Multi-head output shape: {result.shape}")  # (4, 8)

In practice, one head might learn to attend to syntactic relationships (subject-verb agreement), another to semantic relationships (pronoun resolution), and another to positional proximity. GPT-3, for example, uses 96 attention heads per layer across 96 layers.

Positional Encoding: Teaching Order to a Parallel System

Since self-attention processes all tokens simultaneously, it has no inherent notion of word order. "The cat chased the dog" and "The dog chased the cat" contain exactly the same tokens, so without positional information the model could not tell them apart. Positional encoding solves this by adding a position signal to each token embedding:


def sinusoidal_positional_encoding(seq_len, d_model):
    """
    The original positional encoding from the transformer paper.
    Uses sine and cosine functions at different frequencies.
    """
    positions = np.arange(seq_len)[:, np.newaxis]     # (seq_len, 1)
    dimensions = np.arange(d_model)[np.newaxis, :]     # (1, d_model)

    # Compute angle rates
    angle_rates = 1 / np.power(10000, (2 * (dimensions // 2)) / d_model)
    angle_rads = positions * angle_rates

    # Apply sin to even indices, cos to odd indices
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle_rads[:, 0::2])
    pe[:, 1::2] = np.cos(angle_rads[:, 1::2])

    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=8)
print("Positional encoding shape:", pe.shape)  # (10, 8)
print("Position 0:", np.round(pe[0], 3))
print("Position 1:", np.round(pe[1], 3))

# Model input = token embedding + positional encoding (element-wise sum)
token_input = X + pe[:X.shape[0]]
print("Model input shape:", token_input.shape)  # (4, 8)

Modern models often use learned positional embeddings (a trainable vector per position) or Rotary Position Embeddings (RoPE), which encode relative positions directly into the attention computation. RoPE is used by Llama, Mistral, and many recent models because it generalizes better to sequence lengths not seen during training.
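The core RoPE trick can be sketched in a few lines. This is a minimal illustration using an interleaved-pair layout — production implementations (e.g. Llama's) differ in details such as a half-split dimension layout and apply the rotation to Q and K inside attention:

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate pairs of dimensions of x (seq_len, d) by position-dependent angles.

    Pair (2i, 2i+1) is rotated by pos * theta_i, with theta_i = base^(-2i/d).
    """
    seq_len, d = x.shape
    theta = base ** (-np.arange(d // 2) * 2.0 / d)   # (d/2,) frequencies
    angles = positions[:, None] * theta[None, :]     # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # even / odd dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# The useful property: the dot product of a rotated query and key depends
# only on their relative offset, not on their absolute positions.
rng = np.random.default_rng(1)
q, k = rng.standard_normal(8), rng.standard_normal(8)

s_a = rope_rotate(q[None], np.array([3])) @ rope_rotate(k[None], np.array([5])).T
s_b = rope_rotate(q[None], np.array([10])) @ rope_rotate(k[None], np.array([12])).T
print(np.allclose(s_a, s_b))  # True: offset is 2 in both cases
```

Because rotations preserve dot-product structure under a shared offset, attention scores become a function of relative distance, which is why RoPE extrapolates better to longer sequences.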

The Full Transformer Block

A complete transformer block combines multi-head attention with a feed-forward network, layer normalization, and residual connections:

def transformer_block(X, n_heads, d_model, d_ff):
    """
    One transformer block:
    1. Multi-head self-attention with residual connection and layer norm
    2. Feed-forward network with residual connection and layer norm
    """
    # Sub-layer 1: Multi-head attention
    attn_output = multi_head_attention(X, n_heads, d_model)
    X = layer_norm(X + attn_output)  # Residual connection + LayerNorm

    # Sub-layer 2: Feed-forward network (expand then compress)
    ff_output = feed_forward(X, d_model, d_ff)
    X = layer_norm(X + ff_output)    # Residual connection + LayerNorm

    return X

def feed_forward(X, d_model, d_ff):
    """Position-wise feed-forward network."""
    W1 = np.random.randn(d_model, d_ff)
    b1 = np.zeros(d_ff)
    W2 = np.random.randn(d_ff, d_model)
    b2 = np.zeros(d_model)

    # Expand to higher dimension, apply ReLU, compress back
    hidden = np.maximum(0, X @ W1 + b1)  # ReLU activation
    output = hidden @ W2 + b2
    return output

def layer_norm(X, eps=1e-5):
    """Layer normalization."""
    mean = np.mean(X, axis=-1, keepdims=True)
    var = np.var(X, axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)
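Because each block maps a (seq_len, d_model) matrix back to the same shape, blocks compose into arbitrarily deep networks. Here is a compact, self-contained version of the pieces above (random matrices stand in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(X, eps=1e-5):
    return (X - X.mean(-1, keepdims=True)) / np.sqrt(X.var(-1, keepdims=True) + eps)

def multi_head(X, n_heads, d_model):
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        Q = X @ rng.standard_normal((d_model, d_k))
        K = X @ rng.standard_normal((d_model, d_k))
        V = X @ rng.standard_normal((d_model, d_k))
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    return np.concatenate(heads, axis=-1) @ rng.standard_normal((d_model, d_model))

def block(X, n_heads, d_model, d_ff):
    X = layer_norm(X + multi_head(X, n_heads, d_model))       # sub-layer 1
    hidden = np.maximum(0, X @ rng.standard_normal((d_model, d_ff)))
    return layer_norm(X + hidden @ rng.standard_normal((d_ff, d_model)))  # sub-layer 2

h = rng.standard_normal((4, 8))          # 4 tokens, d_model = 8
for _ in range(3):                       # stack three blocks
    h = block(h, n_heads=2, d_model=8, d_ff=32)
print("After 3 blocks:", h.shape)        # (4, 8): shape preserved
```

This shape preservation is exactly what lets GPT-3 stack 96 such blocks end to end.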

The feed-forward network is where the model stores factual knowledge. Research has shown that specific neurons in the feed-forward layers activate for specific facts — effectively acting as a distributed key-value memory.
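The key-value reading can be made concrete with a hand-built toy FFN in which one hidden neuron's input weights act as a "key" and its output weights as a "value" (the numbers here are entirely illustrative, not from any real model):

```python
import numpy as np

d_model, d_ff = 4, 3

# Hidden neuron 0: its column of W1 is a "key" that matches the input
# pattern [1, 0, 0, 0]; its row of W2 is a "value" that writes [0, 0, 0, 9].
W1 = np.zeros((d_model, d_ff))
W1[0, 0] = 1.0
W2 = np.zeros((d_ff, d_model))
W2[0, 3] = 9.0

x_match = np.array([1.0, 0.0, 0.0, 0.0])   # activates the key
x_other = np.array([0.0, 1.0, 0.0, 0.0])   # does not

print(np.maximum(0, x_match @ W1) @ W2)  # [0. 0. 0. 9.] — value retrieved
print(np.maximum(0, x_other @ W1) @ W2)  # [0. 0. 0. 0.] — nothing fires
```

The ReLU decides which "memories" fire, and the output is a weighted sum of their values — the same mechanism as feed_forward above, just with weights chosen by hand to make the lookup visible.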

Encoder-Decoder vs Decoder-Only

The original transformer had both an encoder and a decoder. Modern LLMs diverge into two camps:

Encoder-decoder models (T5, BART) process the input with the encoder, then generate the output with the decoder. The decoder attends to both its own previous outputs and the encoder's output through cross-attention. These are strong for translation and summarization.

Decoder-only models (GPT, Claude, Llama) use only the decoder with causal masking — each token can only attend to tokens that came before it. This is the architecture behind every major conversational LLM today:

def causal_self_attention(X, W_Q, W_K, W_V):
    """Self-attention with causal mask — tokens cannot see the future."""
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V

    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)

    # Causal mask: set future positions to -infinity
    seq_len = X.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len)), k=1) * (-1e9)
    scores = scores + mask

    attention_weights = softmax(scores)
    output = attention_weights @ V
    return output, attention_weights

The causal mask ensures that when generating token 5, the model can only see tokens 1 through 4. This is what makes autoregressive generation work — the model predicts one token at a time without peeking at the answer.
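Running the masked computation makes this visible: every attention weight above the diagonal is zero. A self-contained check, with random matrices standing in for learned projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 4
X = rng.standard_normal((seq_len, d_model))
Q = X @ rng.standard_normal((d_model, d_k))
K = X @ rng.standard_normal((d_model, d_k))

scores = Q @ K.T / np.sqrt(d_k)
scores += np.triu(np.ones((seq_len, seq_len)), k=1) * (-1e9)  # causal mask
weights = softmax(scores)

print(np.round(weights, 3))
# Row 0 is [1, 0, 0, 0]: the first token can only attend to itself.
```

Each row still sums to 1, but all the probability mass sits at or before the diagonal — the future is invisible.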

FAQ

Why is self-attention O(n squared) and why does this matter?

Self-attention computes a score between every pair of tokens, which means computation grows quadratically with sequence length. For a 1,000-token sequence, that is 1 million attention scores. For 100,000 tokens, that is 10 billion. This is why context window sizes were historically limited and why techniques like sparse attention, sliding window attention, and flash attention have been developed to reduce this cost.
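As one example of reducing the quadratic cost, sliding window attention (used by Mistral, among others) restricts each token to the most recent w positions, bringing the cost down to O(n·w). A sketch of such a mask:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Additive mask: causal AND limited to the last `window` positions."""
    i = np.arange(seq_len)[:, None]   # query position
    j = np.arange(seq_len)[None, :]   # key position
    allowed = (j <= i) & (j > i - window)
    return np.where(allowed, 0.0, -1e9)

mask = sliding_window_mask(6, window=3)
print((mask == 0).astype(int))
# Each row has at most 3 ones: attention cost grows as O(n * w), not O(n^2).
```

Stacking layers restores longer-range information flow, since a token's window in layer k can reach tokens seen by earlier windows in layer k−1.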

What is the difference between attention and self-attention?

Attention in general means one sequence attending to another — for example, a decoder attending to an encoder's output (cross-attention). Self-attention means a sequence attending to itself. In a decoder-only model like GPT, the primary mechanism is causal self-attention, where each token attends to all previous tokens in the same sequence.
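The distinction is easy to see in code: cross-attention takes its queries from one sequence and its keys and values from another. A sketch with random weights and illustrative dimensions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(X_dec, X_enc, W_Q, W_K, W_V):
    """Queries come from the decoder; keys and values from the encoder."""
    Q = X_dec @ W_Q           # (dec_len, d_k)
    K = X_enc @ W_K           # (enc_len, d_k)
    V = X_enc @ W_V           # (enc_len, d_k)
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (dec_len, enc_len)
    return weights @ V        # (dec_len, d_k)

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
X_enc = rng.standard_normal((6, d_model))   # 6 encoder tokens
X_dec = rng.standard_normal((3, d_model))   # 3 decoder tokens
W = [rng.standard_normal((d_model, d_k)) for _ in range(3)]

out = cross_attention(X_dec, X_enc, *W)
print(out.shape)  # (3, 4): one output row per decoder token
```

Note the attention matrix is rectangular, (dec_len × enc_len), whereas self-attention always produces a square (seq_len × seq_len) matrix.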

How many parameters does a transformer actually have?

The parameters come primarily from the Q, K, V projection matrices in each attention head, the output projection matrix, the two matrices in the feed-forward network, and the layer normalization parameters. For a model with L layers, embedding dimension d_model, and feed-forward dimension d_ff, the rough parameter count is L × (4·d_model² + 2·d_model·d_ff). GPT-3, with 96 layers, d_model of 12,288, and d_ff of 4 × d_model, comes out to roughly 174 billion through this formula; token embeddings account for most of the remainder of the headline 175 billion figure.
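The formula is easy to check directly; plugging in the GPT-3 configuration from its paper:

```python
def transformer_params(n_layers, d_model, d_ff):
    """Rough parameter count: ignores embeddings, biases, and layer norms."""
    attention = 4 * d_model**2     # W_Q, W_K, W_V, W_O across all heads
    ffn = 2 * d_model * d_ff       # W1 and W2
    return n_layers * (attention + ffn)

# GPT-3: 96 layers, d_model = 12,288, d_ff = 4 * d_model
total = transformer_params(96, 12_288, 4 * 12_288)
print(f"{total / 1e9:.0f}B parameters")  # 174B
```

Note that 4·d_model² covers all heads at once: each head's Q, K, V matrices are (d_model × d_model/n_heads), and concatenated across heads they total d_model² each.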


#Transformers #SelfAttention #NeuralNetworks #LLMArchitecture #DeepLearning #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
