Stage 02 · 45 min

LoRA Theory

Understand LoRA's low-rank matrix decomposition, W = W₀ + BA·(α/r), and implement it from scratch.

LoRA · Low-Rank Decomposition · Rank · Alpha Scaling

The Problem LoRA Solves

Fully fine-tuning a 7B-parameter model requires storing the model itself (~14 GB in FP16), gradients (~14 GB), and AdamW optimizer state (~56 GB for the two FP32 moment estimates). Total: ~84 GB, beyond most single-GPU setups. LoRA reduces trainable parameters by up to 10,000x while matching full fine-tuning quality.
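A quick back-of-envelope check of these numbers (a rough sketch; real runs also need memory for activations, buffers, and framework overhead):

params = 7e9
fp16_bytes, fp32_bytes = 2, 4

weights_gb = params * fp16_bytes / 1e9       # ~14 GB
grads_gb   = params * fp16_bytes / 1e9       # ~14 GB
adamw_gb   = params * fp32_bytes * 2 / 1e9   # ~56 GB (two FP32 moments)

print(weights_gb + grads_gb + adamw_gb)      # ~84 GB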

Mathematical Intuition

The hypothesis: the fine-tuning weight update ΔW lives in a low-dimensional subspace of the full weight space. Even for a 4096×4096 weight matrix, the actual meaningful change during fine-tuning has an intrinsic rank much lower than 4096. LoRA exploits this by parameterizing ΔW = BA where both B and A are thin matrices.
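To make the constraint concrete, here is a minimal PyTorch sketch showing that the product BA of two thin matrices has rank at most r, even though it fills a full d×k matrix:

import torch

d, k, r = 4096, 4096, 8
B = torch.randn(d, r)
A = torch.randn(r, k)
delta_W = B @ A                                       # full d×k update, structurally low-rank
print(torch.linalg.matrix_rank(delta_W))              # ≤ 8
print(B.numel() + A.numel(), "trainable vs", d * k)   # 65,536 vs 16,777,216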

Why B=0 Initialization?

B is initialized to zero so that at the start of training, the LoRA update ΔW = BA = 0, leaving the pre-trained model unchanged. This ensures stable training from a good initialization point rather than a random perturbation.
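A two-line check of this property (a sketch, reusing the shapes from above): with B = 0, the update BA is exactly zero, so at step 0 the adapted layer reproduces the frozen layer's output.

import torch

d, k, r = 4096, 4096, 8
A = torch.randn(r, k) * 0.01   # small random init for A
B = torch.zeros(d, r)          # zero init for B
assert torch.all(B @ A == 0)   # ΔW = BA = 0 ⇒ pretrained behavior at step 0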

Applying LoRA with HuggingFace PEFT

from peft import LoraConfig, get_peft_model, TaskType

config = LoraConfig(
    r=8,                          # rank — higher = more capacity
    lora_alpha=16,                # scaling factor α
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
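After training, the adapter can be saved on its own or merged back into the base weights using PEFT's standard API (the path below is a placeholder):

# Save only the small adapter weights (a few MB, not the full model)
model.save_pretrained("my-lora-adapter")

# Or fold ΔW = BA·(α/r) into W₀ for zero-overhead inference
merged = model.merge_and_unload()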

Rank Selection Guide

  • r=4: Minimal capacity. Good for simple style/format adaptation.
  • r=8: Standard. Best quality/parameter tradeoff for most tasks.
  • r=16–32: Higher capacity. For complex task adaptation or small datasets.
  • r=64+: Approaches full fine-tuning. Use if LoRA quality is insufficient.
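To see what each rank costs in practice, here is a small sketch computing LoRA parameter counts for a Llama-7B-like model (32 layers, hidden size 4096, adapting q_proj and v_proj; these shapes are assumptions matching the PEFT example above):

hidden, layers, modules = 4096, 32, 2  # q_proj + v_proj per layer

for r in (4, 8, 16, 32, 64):
    per_module = hidden * r + r * hidden   # B (d×r) + A (r×k)
    total = per_module * modules * layers
    print(f"r={r:>2}: {total:,} trainable params")
# r=8 gives 4,194,304, matching print_trainable_parameters() above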

Key Takeaways

  1. LoRA decomposes weight updates as ΔW = BA where B ∈ R^(d×r), A ∈ R^(r×k), r ≪ min(d,k)
  2. Typical LoRA trains only 0.1–1% of parameters vs full fine-tuning
  3. Alpha scaling (α/r) controls the magnitude of LoRA updates
  4. B is initialized to zero so ΔW starts at zero, and the adapted model initially behaves exactly like the pretrained one
  5. Target attention layers (q_proj, v_proj) for the best quality/parameter tradeoff

Core Concepts

Low-Rank Decomposition

The core insight: pre-trained model weight matrices W₀ are high-rank but the task-specific update ΔW has low intrinsic rank. LoRA constrains ΔW = BA where rank r ≪ d, reducing parameters from d×k to (d+k)×r. For a 4096×4096 matrix with r=8: 16M → 65K parameters (99.6% reduction).

import torch
import torch.nn as nn

# Standard linear: W ∈ R^(4096×4096) = 16M params
# LoRA: B ∈ R^(4096×8), A ∈ R^(8×4096) = 65K params
class LoRALinear(nn.Module):
    def __init__(self, d, k, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight (random here as a stand-in)
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # small random init, trainable
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init ⇒ ΔW = 0 at step 0
        self.scale = alpha / r                           # α/r scaling

    def forward(self, x):
        # y = x·W₀ᵀ + x·Aᵀ·Bᵀ·(α/r); never materializes the full d×k ΔW
        return x @ self.W0.T + (x @ self.A.T @ self.B.T) * self.scale
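A quick usage check of the class above (shapes chosen to match the running 4096×4096 example):

layer = LoRALinear(d=4096, k=4096)
x = torch.randn(2, 4096)            # batch of 2 inputs
y = layer(x)                        # (2, 4096); equals x @ W0.T at init since B = 0
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                    # 65,536 = (4096 + 4096) × 8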