Stage 03

Advanced Optimization

FlashAttention, DeepSpeed ZeRO, FSDP, gradient checkpointing, and instruction tuning at scale.

10 notebooks · 7h estimated

Notebook 30 · 40 min

Flash Attention

FlashAttention vs. standard attention: an IO-aware CUDA kernel that computes exact attention tile by tile, never materializing the N×N score matrix, cutting attention memory from O(N²) to O(N).

FlashAttention · IO-Aware · Memory Efficiency · CUDA Kernels
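
A minimal sketch of the contrast, assuming a recent PyTorch: the naive path below builds the full score matrix, while `F.scaled_dot_product_attention` can dispatch to a FlashAttention kernel on supported GPUs (on CPU it falls back to a math backend, so the numeric check still passes). Shapes are illustrative.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch 2, 8 heads, 1024 tokens, head dim 64.
q, k, v = (torch.randn(2, 8, 1024, 64) for _ in range(3))

def naive_attention(q, k, v):
    # Materializes the full N x N score matrix: O(N^2) memory.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Fused SDPA computes the same result tile by tile in O(N) memory
# when a FlashAttention-style kernel is available.
out_fused = F.scaled_dot_product_attention(q, k, v)

assert torch.allclose(naive_attention(q, k, v), out_fused, atol=1e-4)
```
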
Notebook 31 · 50 min

DeepSpeed ZeRO

ZeRO Stages 1/2/3 progressively partition optimizer state (Stage 1), then gradients (Stage 2), then the parameters themselves (Stage 3) across data-parallel GPUs.

DeepSpeed · ZeRO · Distributed Training · Memory Partitioning
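
A hedged sketch of what a ZeRO config looks like in practice; the stand-in model and hyperparameters are illustrative, and a real run goes through the `deepspeed` launcher with a full training loop.

```python
import torch.nn as nn
import deepspeed

model = nn.Linear(4096, 4096)  # stand-in model for illustration

# Stage 1 shards optimizer state, stage 2 adds gradients,
# stage 3 adds the parameters themselves.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 3},
}

# initialize() returns a wrapped engine whose backward()/step()
# handle the cross-GPU partitioning transparently.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```
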
Notebook 32 · 45 min

FSDP PyTorch

PyTorch's native Fully Sharded Data Parallel, the in-framework counterpart to DeepSpeed ZeRO-3.

FSDP · Sharding · Distributed Training · PyTorch DDP
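
A minimal wrapping sketch, assuming a `torchrun` launch and using a stock `nn.TransformerEncoderLayer` as the unit of sharding; model sizes are illustrative.

```python
import functools
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Assumes launch via `torchrun --nproc_per_node=<gpus> train.py`.
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
model = nn.TransformerEncoder(layer, num_layers=6)

# Shard at the granularity of one transformer layer: each layer's
# weights are gathered just-in-time for compute, then freed again.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={nn.TransformerEncoderLayer},
)

# FULL_SHARD shards parameters, gradients, and optimizer state (ZeRO-3);
# SHARD_GRAD_OP would shard only gradients + optimizer state (ZeRO-2).
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)
```
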
Notebook 33 · 35 min

Gradient Checkpointing

Trade compute for memory: recompute activations during the backward pass instead of storing them.

Gradient Checkpointing · Activation Recomputation · Memory-Compute Tradeoff
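
A small sketch of the tradeoff using `torch.utils.checkpoint`; block width and depth are arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(12)]
        )

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are NOT stored for backward;
            # they are recomputed on the fly, costing roughly one extra
            # forward pass in exchange for much lower peak memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

x = torch.randn(8, 512, requires_grad=True)
CheckpointedMLP()(x).sum().backward()
```
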
Notebook 34 · 40 min

Optimizer Comparison

AdamW vs. Lion vs. Sophia. Understanding how adaptive, sign-based, and curvature-aware optimizers differ for LLM training.

AdamW · Lion · Sophia · Optimizer · +1
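
Since the notebook compares update rules, here is a minimal, unoptimized sketch of Lion's sign-based step (per Chen et al., 2023); hyperparameters are illustrative, and in practice Lion is usually run with a noticeably smaller learning rate than AdamW.

```python
import torch

class Lion(torch.optim.Optimizer):
    """Sketch of the Lion update: one momentum buffer and a sign() step,
    versus AdamW's variance-normalized step with two moment buffers."""

    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
        super().__init__(params, dict(lr=lr, betas=betas, weight_decay=weight_decay))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, (b1, b2), wd = group["lr"], group["betas"], group["weight_decay"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                m = self.state[p].setdefault("m", torch.zeros_like(p))
                p.mul_(1 - lr * wd)  # decoupled weight decay, as in AdamW
                # Step by the sign of an interpolation of momentum and gradient.
                p.add_(torch.sign(m.mul(b1).add(p.grad, alpha=1 - b1)), alpha=-lr)
                m.mul_(b2).add_(p.grad, alpha=1 - b2)  # update momentum buffer

# Illustrative usage; Lion typically wants a ~3-10x smaller LR than AdamW.
opt = Lion([torch.nn.Parameter(torch.zeros(3))], lr=3e-5, weight_decay=0.1)
```
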
Notebook 35 · 35 min

LR Schedules

Warmup + cosine decay, linear decay, and cyclical schedules. Why warmup is essential for LLMs.

Cosine Decay · Linear Warmup · Cyclical LR · LR Schedule
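
A self-contained warmup-plus-cosine schedule expressed as a `LambdaLR`; `warmup_steps`, `total_steps`, and `min_ratio` are illustrative knobs, not values from the notebook.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(optimizer, warmup_steps, total_steps, min_ratio=0.0):
    """Linear warmup from 0 to the peak LR, then cosine decay toward
    min_ratio * peak. Warmup keeps early Adam updates small while the
    moment estimates are still noisy."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)

opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=3e-4)
sched = warmup_cosine(opt, warmup_steps=100, total_steps=1000)
# Call sched.step() once per optimizer step.
```
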
Notebook 36 · 40 min

Advanced Data Loading

Sequence packing, dynamic batching, and streaming datasets for memory-efficient training.

Sequence Packing · Dynamic Batching · Streaming · Data Pipeline
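
A greedy sequence-packing sketch in plain Python; the separator token id and window size are placeholders.

```python
def pack_sequences(token_lists, max_len=2048, sep_id=2):
    """Concatenate short examples (separated by an assumed EOS/sep token)
    until the context window is full, so almost no tokens are wasted
    on padding. Greedy first-fit; production packers can do better."""
    packs, current = [], []
    for tokens in token_lists:
        if len(current) + len(tokens) + 1 > max_len and current:
            packs.append(current)
            current = []
        current.extend(tokens + [sep_id])
    if current:
        packs.append(current)
    return packs

packs = pack_sequences([[5, 6], [7, 8, 9], [10]], max_len=8)
# -> [[5, 6, 2, 7, 8, 9, 2], [10, 2]]
```
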
Notebook 37 · 45 min

Instruction Tuning

Alpaca, Dolly, and ShareGPT data formats. Fine-tune models to follow instructions.

Instruction Tuning · Alpaca Format · ChatML · FLAN
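
A paraphrased Alpaca-style formatter, assuming records shaped like `{'instruction', 'input', 'output'}`; the original dataset's exact preamble differs slightly for records with an `input` field, and in training the loss is typically masked to the response tokens.

```python
PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "{maybe_input}"
    "### Response:\n"
)

def format_alpaca(example):
    """Render one Alpaca-format record into a single training string."""
    maybe_input = (
        f"### Input:\n{example['input']}\n\n" if example.get("input") else ""
    )
    prompt = PROMPT.format(
        instruction=example["instruction"], maybe_input=maybe_input
    )
    return prompt + example["output"]

print(format_alpaca({"instruction": "Name a prime number.", "input": "", "output": "7"}))
```
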
Notebook 38 · 45 min

Long Context Training

RoPE scaling and YaRN for extending context length beyond the original training window.

RoPE Scaling · YaRN · Long Context · Position Encoding
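
A sketch of linear position interpolation on the RoPE angle table; YaRN refines this idea by interpolating the frequency bands unevenly, which this toy function does not do. All sizes are illustrative.

```python
import torch

def rope_frequencies(head_dim, max_pos, base=10000.0, scale=1.0):
    """RoPE angle table with linear position interpolation: dividing
    positions by `scale` squeezes a longer sequence back into the
    position range the model saw during training."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2) / head_dim))
    positions = torch.arange(max_pos) / scale  # scale > 1 extends context
    return torch.outer(positions, inv_freq)    # (max_pos, head_dim // 2)

# e.g. a model trained at 4k positions run at 16k with scale=4.0:
angles = rope_frequencies(head_dim=64, max_pos=16384, scale=4.0)
```
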
Notebook 39 · 40 min

Contrastive Learning

SimCSE, triplet loss, and hard negative mining for embedding models and retrieval.

SimCSE · Triplet Loss · Hard Negatives · Embeddings · +1
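
A minimal in-batch InfoNCE loss in the SimCSE style, with random tensors standing in for encoder outputs; hard-negative mining would add extra columns to the logits matrix.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Every other example in the batch serves as a negative for each
    (anchor, positive) pair; the diagonal of the cosine-similarity
    matrix holds the true positives."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature   # (B, B) similarity matrix
    labels = torch.arange(len(a))    # positive for row i is column i
    return F.cross_entropy(logits, labels)

loss = in_batch_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```
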
Previous: Parameter-Efficient Fine-Tuning
Next: Alignment & Specialized Techniques