
Transformer Basics

Build multi-head self-attention from scratch, implement positional encodings, and understand why Attention is All You Need.

Self-Attention · Multi-Head Attention · Positional Encoding · Transformer Architecture

The Attention Mechanism — The Heart of Every LLM

Before transformers, recurrent models (RNNs, LSTMs) processed text one token at a time. That sequential dependency made parallelization across the sequence impossible and made long-range dependencies hard to learn because of vanishing gradients. The 2017 paper "Attention is All You Need" replaced recurrence entirely with self-attention.

Self-Attention Intuition

In the sentence "The animal didn't cross the street because it was too tired" — what does "it" refer to? A human reads the whole sentence and connects "it" to "animal." Self-attention gives the model a mechanism to do exactly this: every token attends to every other token and computes a context-aware representation.

The Attention Formula

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); mask broadcasts against the (..., seq_len, seq_len) scores
    d_k = Q.size(-1)
    # Compute similarity scores between every query and every key
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Apply causal mask for autoregressive generation
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Softmax over key dimension
    attn_weights = F.softmax(scores, dim=-1)

    # Weighted sum of values
    return torch.matmul(attn_weights, V), attn_weights
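
A quick usage sketch (shapes and values are illustrative, reusing the imports above): build random Q, K, V, construct a lower-triangular causal mask with torch.tril, and check the output shapes.

B, T, d_k = 2, 5, 64                          # illustrative batch size, sequence length, head dim
Q = torch.randn(B, T, d_k)
K = torch.randn(B, T, d_k)
V = torch.randn(B, T, d_k)
causal_mask = torch.tril(torch.ones(T, T))    # 1 = may attend, 0 = future position (masked)
out, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape, weights.shape)               # torch.Size([2, 5, 64]) torch.Size([2, 5, 5])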

Why Positional Encoding?

Attention is permutation-invariant — "cat sat mat" and "mat sat cat" produce identical attention patterns without positional information. Positional encodings inject absolute or relative position information using sinusoidal functions (original transformer) or learned embeddings (BERT, GPT). Modern LLMs use Rotary Position Embeddings (RoPE) which enable length generalization.
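
A minimal sketch of the original sinusoidal scheme from the 2017 paper (dimensions are illustrative; it reuses the torch and math imports above): even channels get a sine, odd channels a cosine, at geometrically spaced wavelengths.

def sinusoidal_positional_encoding(max_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    position = torch.arange(max_len).unsqueeze(1)                                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                  # added to the token embeddings

pos = sinusoidal_positional_encoding(max_len=128, d_model=512)   # (128, 512)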

Encoder vs Decoder Architectures

  • Encoder-only (BERT, RoBERTa): Bidirectional attention — each token sees all others. Best for classification, NER, embeddings.
  • Decoder-only (GPT, LLaMA): Causal attention — each token only sees previous tokens. Best for text generation.
  • Encoder-Decoder (T5, BART): Full attention in encoder, cross-attention in decoder. Best for seq2seq tasks like translation and summarization.
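
To make the difference concrete, here is a small sketch (sequence length 4, reusing torch from above) of the attention masks the first two families use; encoder-decoder cross-attention is omitted:

T = 4
bidirectional_mask = torch.ones(T, T)        # encoder-only: every token sees every token
causal_mask = torch.tril(torch.ones(T, T))   # decoder-only: token i sees only tokens <= i
print(causal_mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])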

Key Takeaways

  • Attention computes a weighted sum of values, where weights come from query-key similarity
  • Multi-head attention runs h parallel attention heads and concatenates results
  • Positional encoding injects sequence order since attention is permutation-invariant
  • Transformer compute scales as O(N²·d) with sequence length N — the core bottleneck
  • Flash Attention reduces memory from O(N²) to O(N) via tiling — essential for long context
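
On the last two points: PyTorch 2.x ships a fused F.scaled_dot_product_attention that can dispatch to a FlashAttention-style kernel when hardware and dtypes allow, so you rarely need to materialize the full N×N weight matrix yourself. A hedged sketch with illustrative shapes, reusing the imports above:

B, h, T, d_k = 2, 8, 1024, 64
q = torch.randn(B, h, T, d_k)
k = torch.randn(B, h, T, d_k)
v = torch.randn(B, h, T, d_k)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # (B, h, T, d_k)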

Core Concepts

Scaled Dot-Product Attention

The core operation: Attention(Q,K,V) = softmax(QK^T / √d_k) · V. The √d_k scaling prevents softmax saturation when the key dimension d_k is large. Without scaling, the dot products have magnitude that grows with √d_k, pushing softmax into regions with near-zero gradients.

def attention(Q, K, V):
    d_k = Q.size(-1)
    # Divide by sqrt(d_k) so score variance stays near 1 regardless of head dimension
    scores = Q @ K.transpose(-2, -1) / d_k**0.5
    return F.softmax(scores, dim=-1) @ V
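
A quick numeric check of that claim (random vectors, illustrative dimensions): the standard deviation of an unscaled dot product grows like √d_k, while the scaled scores stay near 1.

for d_k in (16, 64, 256, 1024):
    q, k = torch.randn(10000, d_k), torch.randn(10000, d_k)
    raw = (q * k).sum(-1)                                    # unscaled dot products
    print(d_k, raw.std().item(), (raw / d_k**0.5).std().item())
# raw std is roughly sqrt(d_k): ~4, ~8, ~16, ~32; the scaled std stays ~1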

Multi-Head Attention

Instead of one attention function, run h attention heads in parallel on projected subspaces. Each head can learn to attend to different relationships (e.g., syntactic vs. positional patterns). Outputs are concatenated and linearly projected: MultiHead(Q,K,V) = Concat(head_1,...,head_h) · W_O.

# h=8 heads, d_model=512, d_k = d_model/h = 64
# x: (B, T, d_model) token embeddings; W_q, W_k, W_v: nn.Linear(d_model, d_model) projections
Q = W_q(x).view(B, T, h, d_k).transpose(1, 2)  # (B, h, T, d_k)
K = W_k(x).view(B, T, h, d_k).transpose(1, 2)
V = W_v(x).view(B, T, h, d_k).transpose(1, 2)
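
Putting the pieces together, a minimal sketch of a complete multi-head attention layer (the module name and hyperparameters are illustrative; it reuses the scaled_dot_product_attention function defined earlier):

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection after concatenation

    def forward(self, x, mask=None):
        B, T, d_model = x.shape
        # Project and split into heads: (B, h, T, d_k)
        Q = self.W_q(x).view(B, T, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.h, self.d_k).transpose(1, 2)
        out, _ = scaled_dot_product_attention(Q, K, V, mask)   # (B, h, T, d_k)
        # Concatenate heads back to (B, T, d_model), then apply W_O
        out = out.transpose(1, 2).contiguous().view(B, T, d_model)
        return self.W_o(out)

mha = MultiHeadAttention()
x = torch.randn(2, 16, 512)
print(mha(x).shape)   # torch.Size([2, 16, 512])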