RLHF, DPO, Constitutional AI, reward models, and safety evaluation for aligned LLMs.
Build a reward model using pairwise ranking loss. The backbone of RLHF pipelines.
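A minimal sketch of that pairwise ranking loss, assuming a transformer backbone with a Hugging Face-style `last_hidden_state` output; the `RewardModel` wrapper and its scalar head are illustrative, not a specific library class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Illustrative reward model: any transformer plus a scalar scoring head."""
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                    # assumed to return .last_hidden_state
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        last_idx = attention_mask.sum(dim=1) - 1    # last non-padding position per sequence
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.score(pooled).squeeze(-1)       # one scalar reward per sequence

def pairwise_ranking_loss(rewards_chosen, rewards_rejected):
    """Bradley-Terry pairwise loss: push r(chosen) above r(rejected)."""
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```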
Full RLHF pipeline with Proximal Policy Optimization (PPO) using the TRL library.
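TRL's trainer abstracts the rollout, reward, and update steps, and its exact API varies by version, so the sketch below shows only the core pieces it optimizes, in plain PyTorch with illustrative tensor names: the clipped surrogate loss and a per-token KL penalty against a frozen reference model.

```python
import torch

def ppo_policy_loss(logprobs, old_logprobs, advantages, clip_range=0.2):
    """Clipped PPO surrogate: ratio of new to old token log-probs times the advantage."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()

def kl_penalized_reward(reward, logprobs, ref_logprobs, kl_coef=0.1):
    """Subtract a KL penalty against the frozen reference policy from the scalar reward,
    which keeps the fine-tuned policy from drifting too far from its starting point."""
    kl = logprobs - ref_logprobs
    return reward - kl_coef * kl
```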
Direct Preference Optimization: alignment without a separate reward model. Simpler than RLHF.
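A sketch of the DPO objective under its usual formulation, assuming summed per-sequence log-probabilities from the policy and a frozen reference model; the variable names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on sequence log-probs: the implicit reward is beta times the
    log-ratio between the policy and the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```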
Anthropic's self-critique and revision approach: use the model to evaluate and improve its own outputs.
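A rough sketch of the critique-then-revise loop; the `generate` callable and the prompt templates are placeholders, not Anthropic's published prompts.

```python
def critique_and_revise(generate, prompt, principles, rounds=1):
    """Self-critique loop: generate, critique against each principle, then revise.

    `generate(text) -> str` stands in for any chat/completion call.
    """
    response = generate(prompt)
    for _ in range(rounds):
        for principle in principles:
            critique = generate(
                f"Response:\n{response}\n\nCritique this response against the principle: "
                f"{principle}. List any violations."
            )
            response = generate(
                f"Response:\n{response}\n\nCritique:\n{critique}\n\n"
                "Rewrite the response to address the critique while staying helpful."
            )
    return response
```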
Continued pre-training on domain corpora (medical, legal, code) before task-specific fine-tuning.
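A minimal sketch of continued pre-training as plain next-token prediction on raw domain text, assuming a Hugging Face causal LM; the checkpoint name, batching, and hyperparameters are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the base model being adapted
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(domain_texts):
    """One continued pre-training step: standard causal LM loss on raw domain text."""
    batch = tokenizer(domain_texts, return_tensors="pt",
                      padding=True, truncation=True, max_length=512)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100   # ignore padding positions in the loss
    loss = model(input_ids=batch["input_ids"],
                 attention_mask=batch["attention_mask"],
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```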
Train one model on multiple tasks simultaneously with task mixing and loss balancing.
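One common recipe, sketched below: temperature-scaled sampling over task datasets plus a weighted sum of per-task losses. The dictionaries and weights are illustrative; the balancing scheme itself can be fixed, hand-tuned, or learned.

```python
import random

def sample_task(task_sizes, temperature=0.5):
    """Temperature-scaled task mixing: sampling weight = dataset_size ** temperature,
    which up-weights small tasks relative to purely proportional mixing."""
    names = list(task_sizes)
    weights = [task_sizes[name] ** temperature for name in names]
    return random.choices(names, weights=weights, k=1)[0]

def balanced_loss(per_task_losses, loss_weights):
    """Weighted sum of per-task losses; weights may be tuned or learned."""
    return sum(loss_weights[name] * loss for name, loss in per_task_losses.items())
```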
EWC (Elastic Weight Consolidation) and replay buffers to mitigate catastrophic forgetting and preserve performance on old tasks.
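A compact EWC sketch: estimate a diagonal Fisher on the old task, then add a quadratic penalty that anchors important weights during new-task training. The `loss_fn(model, batch)` callable and the lambda value are assumptions.

```python
import torch

def fisher_diagonal(model, dataloader, loss_fn):
    """Estimate the diagonal Fisher information on the old task (mean squared gradients)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for batch in dataloader:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(dataloader) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """Quadratic penalty keeping important weights close to their old-task values;
    add this to the new task's loss."""
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty
```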
In-batch negatives, hard negative mining, and curriculum negatives for retrieval models.
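A sketch of the in-batch-negatives (InfoNCE) loss with optional mined hard negatives appended as extra score columns; embedding shapes and the temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb, doc_emb, hard_neg_emb=None, temperature=0.05):
    """InfoNCE with in-batch negatives: each query's positive document is its own row,
    and every other document in the batch serves as a negative."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    scores = query_emb @ doc_emb.t()                       # (B, B); diagonal = positives
    if hard_neg_emb is not None:                           # mined hard negatives, if any
        hard_neg_emb = F.normalize(hard_neg_emb, dim=-1)
        scores = torch.cat([scores, query_emb @ hard_neg_emb.t()], dim=1)
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(scores / temperature, labels)
```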
Toxicity detection, bias benchmarks, and jailbreak testing for responsible AI deployment.
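A toy harness along these lines, with `generate`, `toxicity_score`, and the refusal markers all placeholders for the model under test, some toxicity classifier, and whatever refusal heuristics the benchmark uses.

```python
def run_safety_eval(generate, toxicity_score, prompts, refusal_markers, tox_threshold=0.5):
    """Tiny safety harness: generate responses to adversarial prompts, flag toxic
    completions, and count refusals."""
    results = {"toxic": 0, "refused": 0, "total": len(prompts)}
    for prompt in prompts:
        response = generate(prompt)
        if toxicity_score(response) > tox_threshold:
            results["toxic"] += 1
        if any(marker.lower() in response.lower() for marker in refusal_markers):
            results["refused"] += 1
    results["toxic_rate"] = results["toxic"] / results["total"]
    results["refusal_rate"] = results["refused"] / results["total"]
    return results
```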
MoE architecture: sparse gating, top-k routing, and expert load balancing.
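A simplified MoE layer, sketched below: softmax gating, top-k expert selection with renormalized weights, and a Switch-Transformer-style load-balancing auxiliary loss. Expert width, expert count, and the aux-loss form are illustrative choices, and the routing loop is written for clarity rather than speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse MoE layer: softmax gate, top-k routing, renormalized expert combination."""
    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (num_tokens, dim)
        probs = F.softmax(self.gate(x), dim=-1)              # (num_tokens, num_experts)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for slot in range(self.k):                           # dense loop for clarity
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])

        # Load-balancing auxiliary loss: fraction of tokens whose top-1 choice is each
        # expert, times that expert's mean gate probability, summed over experts.
        frac_tokens = F.one_hot(topk_idx[:, 0], probs.size(-1)).float().mean(dim=0)
        mean_probs = probs.mean(dim=0)
        aux_loss = probs.size(-1) * (frac_tokens * mean_probs).sum()
        return out, aux_loss
```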