The Attention Mechanism — The Heart of Every LLM
Before transformers, sequential models (RNNs, LSTMs) processed text one token at a time. That serial dependency ruled out parallelization across the sequence and made gradients vanish over long spans. The 2017 paper "Attention Is All You Need" replaced recurrence entirely with self-attention.
Self-Attention Intuition
In the sentence "The animal didn't cross the street because it was too tired" — what does "it" refer to? A human reads the whole sentence and connects "it" to "animal." Self-attention gives the model a mechanism to do exactly this: every token attends to every other token and computes a context-aware representation.
The Attention Formula
Scaled dot-product attention computes Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V: each query is scored against every key, the scores are normalized into weights, and the weights take a weighted average of the values. In PyTorch:
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Compute similarity scores between queries and keys, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Optionally mask out positions (e.g., a causal mask for autoregressive generation)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Softmax over the key dimension turns scores into attention weights
    attn_weights = F.softmax(scores, dim=-1)
    # Output is the attention-weighted sum of the values
    return torch.matmul(attn_weights, V), attn_weights
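A quick sanity check of the function, with arbitrary illustrative shapes (batch of 2, 4 heads, sequence length 8, head dimension 16):

Q = torch.randn(2, 4, 8, 16)
K = torch.randn(2, 4, 8, 16)
V = torch.randn(2, 4, 8, 16)
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # torch.Size([2, 4, 8, 16]), one context vector per query
print(weights.shape)  # torch.Size([2, 4, 8, 8]), one weight per query-key pair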
Why Positional Encoding?
Attention is permutation-invariant: without positional information, "cat sat mat" and "mat sat cat" yield the same attention outputs, merely reordered. Positional encodings inject absolute or relative position information, via sinusoidal functions (original transformer) or learned embeddings (BERT, GPT). Many modern LLMs instead use Rotary Position Embeddings (RoPE), which rotate query and key vectors by position-dependent angles so that attention scores depend on relative position, a property that tends to help models generalize to longer contexts.
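As a concrete illustration, here is a minimal sketch of the sinusoidal variant, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), reusing the torch and math imports above (the function name and shapes are our own choices):

def sinusoidal_positional_encoding(seq_len, d_model):
    # Positions 0..seq_len-1 as a column vector
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # Geometrically decreasing frequencies across the even dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even indices: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd indices: cosine
    return pe  # (seq_len, d_model); added to the token embeddings

Each position gets a distinct pattern, and because the frequencies are fixed, nearby positions produce similar encodings.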
Encoder vs Decoder Architectures
- Encoder-only (BERT, RoBERTa): Bidirectional attention — each token sees all others. Best for classification, NER, embeddings.
- Decoder-only (GPT, LLaMA): Causal attention — each token only sees previous tokens. Best for text generation (see the causal-mask sketch after this list).
- Encoder-Decoder (T5, BART): Full attention in encoder, cross-attention in decoder. Best for seq2seq tasks like translation and summarization.
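To make the causal vs. bidirectional distinction concrete, here is one way to drive the scaled_dot_product_attention function defined earlier (the lower-triangular mask is the standard construction; the shapes are illustrative):

seq_len = 8
# Lower-triangular matrix: position i may attend only to positions 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

Q = K = V = torch.randn(1, seq_len, 64)
# Decoder-style (causal): each token sees only itself and earlier tokens
causal_out, causal_w = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
# Encoder-style (bidirectional): omit the mask so every token sees all others
full_out, full_w = scaled_dot_product_attention(Q, K, V)

The upper triangle of causal_w is exactly zero, while full_w is dense; at this level, the only difference between the two attention styles is the mask.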