FlashAttention, DeepSpeed ZeRO, FSDP, gradient checkpointing, and instruction tuning at scale.
FlashAttention vs standard attention: an IO-aware CUDA kernel that tiles the attention computation so the full N×N score matrix is never materialized, reducing memory from O(N²) to O(N).
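A minimal sketch contrasting naive attention with PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention-style kernel on supported GPUs; the tensor shapes below are purely illustrative.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes the full (N x N) score matrix: O(N^2) memory.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# (batch, heads, seq_len, head_dim); sizes are illustrative.
q, k, v = (torch.randn(2, 8, 1024, 64) for _ in range(3))

out_naive = naive_attention(q, k, v)
# PyTorch >= 2.0 fused kernel; on supported hardware it dispatches to a
# FlashAttention-style implementation that never stores the N x N scores.
out_fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_naive, out_fused, atol=1e-5))
```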
ZeRO Stages 1/2/3: progressively partition optimizer states (Stage 1), then gradients (Stage 2), then parameters (Stage 3) across data-parallel GPUs.
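A hedged sketch of a DeepSpeed configuration selecting ZeRO Stage 3, written as a Python dict; the values are placeholders rather than tuned settings, and the dict would typically be handed to deepspeed.initialize.

```python
# Illustrative DeepSpeed config selecting ZeRO Stage 3 (partition optimizer
# states, gradients, and parameters). Values are placeholders, not tuned.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                 # 1: optimizer states, 2: +gradients, 3: +parameters
        "overlap_comm": True,       # overlap collectives with compute
        "contiguous_gradients": True,
    },
}

# Typical usage (assumes a model is already constructed):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```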
PyTorch-native Fully Sharded Data Parallel (FSDP), the framework-native counterpart to DeepSpeed ZeRO-3.
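A minimal sketch of wrapping a model with FSDP under full sharding. It assumes launch via torchrun (which sets the rank environment variables), and the tiny Sequential model is a stand-in for a real transformer.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes launch via torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(          # stand-in for a real transformer
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# FULL_SHARD partitions parameters, gradients, and optimizer state,
# roughly matching DeepSpeed ZeRO-3 semantics.
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```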
Gradient checkpointing: trade compute for memory by recomputing activations during the backward pass instead of storing them.
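A small sketch using torch.utils.checkpoint to recompute a block's activations during backward; the residual feed-forward block is illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    # Stand-in for a transformer block.
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

block = Block()
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside the block are not stored; they are recomputed
# during backward, trading extra compute for lower memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```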
AdamW vs Lion vs Sophia: comparing adaptive, sign-based, and second-order-informed optimizers for LLM training.
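A brief sketch constructing AdamW with hyperparameters commonly reported for LLM pretraining; the values are illustrative, and Lion and Sophia come from third-party packages rather than torch itself.

```python
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for a full LLM

# Common practice: apply weight decay to weight matrices only,
# not to biases or normalization parameters.
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if p.ndim < 2 else decay).append(p)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.1},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=3e-4,            # illustrative peak learning rate
    betas=(0.9, 0.95),  # beta2=0.95 is a common choice for LLM pretraining
    eps=1e-8,
)
```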
Learning rate schedules: warmup + cosine decay, linear decay, and cyclical schedules, and why warmup is essential for stable LLM training.
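A sketch of linear warmup followed by cosine decay implemented with LambdaLR; the step counts and minimum-LR floor are illustrative.

```python
import math
import torch

def warmup_cosine(step, warmup_steps=2000, total_steps=100_000, min_ratio=0.1):
    # Linear warmup to the peak LR, then cosine decay to min_ratio * peak.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# In the training loop, call optimizer.step() then scheduler.step() each step.
```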
Sequence packing, dynamic batching, and streaming datasets for memory-efficient training.
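A sketch of greedy sequence packing, concatenating tokenized examples into fixed-length blocks with an EOS separator; the token IDs and block size are illustrative.

```python
def pack_sequences(tokenized_examples, block_size=2048, eos_id=2):
    """Concatenate variable-length token lists into fixed-length blocks.

    Every example ends with EOS so the model sees document boundaries;
    leftover tokens shorter than block_size are dropped here for simplicity.
    """
    buffer, blocks = [], []
    for ids in tokenized_examples:
        buffer.extend(ids + [eos_id])
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks

# Illustrative usage with fake token IDs.
examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
print(pack_sequences(examples, block_size=4))
```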
Instruction tuning data: the Alpaca, Dolly, and ShareGPT formats, and how to fine-tune models to follow instructions.
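A sketch of the Alpaca-style record layout and a prompt template for formatting it; the template wording below is one common variant, not the only one in use.

```python
# Alpaca-style record: "input" is optional context and may be empty.
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "FlashAttention is an IO-aware attention kernel ...",
    "output": "FlashAttention avoids materializing the full attention score matrix.",
}

def format_alpaca(rec):
    # One common prompt template; exact wording varies between projects.
    if rec.get("input"):
        prompt = (
            f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Input:\n{rec['input']}\n\n### Response:\n"
        )
    else:
        prompt = f"### Instruction:\n{rec['instruction']}\n\n### Response:\n"
    return prompt, rec["output"]

prompt, target = format_alpaca(record)
print(prompt + target)
```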
RoPE scaling and YaRN for extending context length beyond the original training window.
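A sketch of linear position interpolation for RoPE, compressing positions so a longer sequence maps back into the trained range of rotation angles; YaRN additionally rescales the frequency bands and is not shown here.

```python
import torch

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    """Rotary embedding angles with linear position interpolation.

    scale > 1 compresses positions (position / scale) so sequences longer
    than the original context reuse the trained range of rotation angles.
    YaRN instead adjusts the per-frequency bands; that is omitted here.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / scale
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)

# Illustrative: trained at 4k context, extended to 16k with a 4x factor.
angles = rope_angles(seq_len=16384, head_dim=128, scale=4.0)
cos, sin = angles.cos(), angles.sin()
```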
SimCSE, triplet loss, and hard negative mining for embedding models and retrieval.
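A sketch of an InfoNCE-style contrastive loss with in-batch negatives, as used in SimCSE-style embedding training; the random embeddings and temperature are illustrative, and mined hard negatives would simply add extra columns of logits.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, pos_emb, temperature=0.05):
    """InfoNCE over in-batch negatives.

    Row i's positive is pos_emb[i]; every other row in the batch acts as a
    negative. Mined hard negatives would be appended as extra columns.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.t() / temperature          # cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Illustrative usage with random tensors standing in for encoder outputs.
q, p = torch.randn(32, 768), torch.randn(32, 768)
print(in_batch_contrastive_loss(q, p).item())
```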