Stage 07

2024–2025 Techniques

GRPO reasoning models, ORPO/KTO alignment, Unsloth acceleration, SGLang serving, synthetic data pipelines, model merging, and standardized evaluation.

9 notebooks · 6 h estimated

Notebook 60 · 40 min

ORPO: Reference-Free Alignment

ORPO (Odds Ratio Preference Optimization) folds SFT and preference optimization into a single monolithic loss: no reference model is needed, cutting memory use by about 50% compared with DPO.

ORPO · Preference Optimization · Reference-Free · Odds Ratio (+1 more)
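The odds-ratio loss described above can be sketched in a few lines. This is a minimal single-pair illustration, assuming the length-normalized log-probabilities of the chosen and rejected responses under the current policy are already computed; `lam` stands in for ORPO's λ weighting hyperparameter.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """ORPO = SFT loss on the chosen response + lam * odds-ratio loss.
    Both log-probs come from the *current* policy, so no frozen
    reference model is ever evaluated (hence the memory savings)."""
    def log_odds(logp):
        # odds(y) = p / (1 - p), computed from log p for stability
        return logp - math.log1p(-math.exp(logp))

    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -math.log(sigmoid(log_odds_ratio))  # push chosen odds above rejected
    l_sft = -logp_chosen                       # plain NLL on the chosen response
    return l_sft + lam * l_or
```

Because the SFT term is part of the same loss, ORPO trains from a base model directly, without a separate SFT stage.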
Notebook 61 · 35 min

KTO: Binary Feedback Alignment

Kahneman-Tversky Optimization aligns models from binary thumbs-up/thumbs-down signals — no paired data required.

KTO · Binary Feedback · Prospect Theory · TRL (+1 more)
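A sketch of the per-example KTO loss makes the "no paired data" point concrete: each example carries only a binary label, not a chosen/rejected pair. Here `z0` stands in for the batch-level KL reference point, which the full algorithm estimates from other examples in the batch; it is assumed given.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(logp_policy, logp_ref, desirable, beta=0.1, z0=0.0,
             lam_d=1.0, lam_u=1.0):
    """Single-example KTO loss from a binary thumbs-up/down label."""
    r = beta * (logp_policy - logp_ref)  # implied reward vs. the reference
    if desirable:
        # value of a 'gain': saturating, echoing prospect theory
        return lam_d * (1.0 - sigmoid(r - z0))
    # value of a 'loss': penalize the policy for preferring a bad output
    return lam_u * (1.0 - sigmoid(z0 - r))
```

The asymmetric `lam_d` / `lam_u` weights let you rebalance when thumbs-up and thumbs-down examples are unequally common.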
Notebook 62 · 55 min

GRPO: Reasoning Model Training

Group Relative Policy Optimization — the technique behind DeepSeek-R1. Train reasoning models with verifiable rewards, no critic network.

GRPO · Reasoning Models · DeepSeek-R1 · Verifiable Rewards (+1 more)
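The "no critic network" claim comes from how GRPO computes advantages: sample a group of G responses to the same prompt, score each with a verifiable reward, and normalize within the group. A minimal sketch, with exact-match used as an illustrative verifiable reward:

```python
from statistics import mean, pstdev

def verifiable_reward(answer: str, gold: str) -> float:
    """Binary verifiable reward: exact match on the final answer, the kind
    of checkable signal (math answers, unit tests) GRPO relies on."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def group_advantages(rewards):
    """GRPO's critic-free baseline: normalize each sampled response's
    reward against its group's mean and std. The normalized score serves
    as the advantage in the PPO-style policy update."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    if sd == 0.0:
        return [0.0 for _ in rewards]  # uniform group: no gradient signal
    return [(r - mu) / sd for r in rewards]
```

Note the degenerate case: if every response in the group gets the same reward, the advantages are all zero and the prompt contributes no learning signal, which is why prompt difficulty matters in practice.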
Notebook 63 · 45 min

Preference Algorithm Comparison

DPO vs ORPO vs KTO vs IPO vs SimPO — which alignment algorithm to use when. Side-by-side comparison on the same dataset.

DPO · ORPO · KTO · IPO (+2 more)
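One axis of the comparison is whether a frozen reference model is required. Sketching two of the compared losses side by side makes that concrete; log-probabilities and sequence lengths are assumed precomputed, and the hyperparameter values are illustrative defaults.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(lp_c, lp_r, ref_c, ref_r, beta=0.1):
    """DPO: the margin is measured relative to a frozen reference model,
    which must be kept in memory alongside the policy."""
    margin = beta * ((lp_c - ref_c) - (lp_r - ref_r))
    return -math.log(sigmoid(margin))

def simpo_loss(lp_c, lp_r, len_c, len_r, beta=2.0, gamma=0.5):
    """SimPO: drops the reference model, length-normalizes the log-probs,
    and instead demands a fixed target margin gamma."""
    margin = beta * (lp_c / len_c - lp_r / len_r) - gamma
    return -math.log(sigmoid(margin))
```

The same chosen/rejected dataset feeds both, which is what makes a controlled side-by-side comparison possible.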
Notebook 64 · 40 min

LM Evaluation Harness

Evaluate models on MMLU, ARC, HellaSwag, GSM8K using EleutherAI's lm-evaluation-harness — the standard for the HuggingFace Open LLM Leaderboard.

lm-eval · MMLU · ARC · HellaSwag (+2 more)
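Under the hood, the harness scores multiple-choice tasks like MMLU, ARC, and HellaSwag by the log-likelihood the model assigns each answer continuation. A sketch of that scoring pattern, assuming per-choice log-probs and continuation lengths are already computed (the function name is ours, not the harness's):

```python
def score_multiple_choice(examples):
    """Harness-style multiple-choice scoring. Each example is
    (per_choice_logprobs, per_choice_lengths, gold_index).
    'acc' picks the highest raw log-likelihood; 'acc_norm' divides by
    continuation length first, so longer answers are not penalized."""
    acc_hits = norm_hits = 0
    for logprobs, lengths, gold in examples:
        idxs = range(len(logprobs))
        if max(idxs, key=lambda i: logprobs[i]) == gold:
            acc_hits += 1
        if max(idxs, key=lambda i: logprobs[i] / lengths[i]) == gold:
            norm_hits += 1
    n = len(examples)
    return {"acc": acc_hits / n, "acc_norm": norm_hits / n}
```

In practice you run the harness itself rather than reimplementing this, e.g. `lm_eval --model hf --model_args pretrained=<model> --tasks mmlu,arc_easy,hellaswag`.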
Notebook 65 · 40 min

Unsloth: 2× Faster Fine-Tuning

Unsloth's custom Triton kernels deliver 2× training speed and 70% less VRAM. Drop-in replacement for standard HuggingFace fine-tuning.

Unsloth · Triton Kernels · Memory Efficiency · FastLanguageModel (+1 more)
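The "drop-in replacement" claim refers to the loading flow: `FastLanguageModel` mirrors the familiar HuggingFace pattern. A typical setup looks like the sketch below (wrapped in a function and not executed here, since it needs a GPU and the `unsloth` package; the model name and LoRA hyperparameters are illustrative choices, not requirements).

```python
def load_for_finetuning():
    """Typical Unsloth setup sketch; Unsloth patches the model's layers
    with its Triton kernels behind this HuggingFace-style interface."""
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit weights
        max_seq_length=2048,
        load_in_4bit=True,
    )
    # Attach LoRA adapters for parameter-efficient fine-tuning.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    return model, tokenizer
```

The returned model then plugs into a standard TRL/`Trainer`-style training loop unchanged.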
Notebook 66 · 45 min

SGLang: Production Inference

SGLang's RadixAttention and zero-overhead scheduler deliver higher throughput than vLLM; it runs in production at xAI, NVIDIA, and AMD at 400K+ GPU scale.

SGLang · RadixAttention · Prefix Caching · Throughput (+1 more)
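The core idea behind RadixAttention is simple to sketch: KV caches for previously processed sequences live in a prefix tree, so a new request reuses the longest already-computed prefix instead of re-running prefill over it. A toy token-id version (real SGLang stores KV tensors and does LRU eviction; this only tracks structure):

```python
class RadixCache:
    """Toy prefix tree standing in for RadixAttention's KV-cache index."""

    def __init__(self):
        self.root = {}

    def insert(self, token_ids):
        """Record a processed sequence so later requests can share it."""
        node = self.root
        for t in token_ids:
            node = node.setdefault(t, {})

    def match_prefix(self, token_ids):
        """Return how many leading tokens already have cached KV."""
        node, matched = self.root, 0
        for t in token_ids:
            if t not in node:
                break
            node, matched = node[t], matched + 1
        return matched
```

Workloads with a shared system prompt or few-shot prefix get very high hit rates from this structure, which is where much of the throughput gain comes from.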
Notebook 67 · 45 min

Synthetic Data Generation

LLM-as-annotator, self-instruct, and knowledge distillation pipelines. Build high-quality fine-tuning datasets for under $5.

Synthetic Data · Self-Instruct · LLM-as-Annotator · Distilabel (+1 more)
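The self-instruct pattern is: grow an instruction pool by asking a teacher LLM for new instructions, keeping only candidates that are sufficiently different from what is already kept. A minimal skeleton, with `generate` as any caller-supplied LLM callable and difflib's ratio standing in for the ROUGE-L similarity the original pipeline uses:

```python
from difflib import SequenceMatcher

def is_novel(candidate, pool, threshold=0.7):
    """Diversity filter: keep a generated instruction only if it is not
    too similar to anything already in the pool."""
    return all(
        SequenceMatcher(None, candidate.lower(), kept.lower()).ratio() < threshold
        for kept in pool
    )

def grow_pool(seed_instructions, generate, rounds=3):
    """Minimal self-instruct loop: `generate(pool)` asks a teacher LLM
    for one new instruction given the current pool."""
    pool = list(seed_instructions)
    for _ in range(rounds):
        candidate = generate(pool)
        if is_novel(candidate, pool):
            pool.append(candidate)
    return pool
```

The low cost quoted above comes from the same place: the expensive part is a modest number of teacher-model calls, and filtering is nearly free.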
Notebook 68 · 40 min

Model Merging with MergeKit

Combine specialized fine-tuned models using SLERP, TIES, and DARE algorithms. No GPU required — runs on CPU.

MergeKit · TIES · DARE · SLERP (+2 more)
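The "no GPU required" point follows from what merging actually computes: elementwise arithmetic over weight tensors. As one example, the geometry behind the SLERP merge method can be sketched on plain Python lists (MergeKit itself operates on full tensors, layer by layer):

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between two flattened weight
    vectors: interpolate along the arc between them rather than the
    straight line, preserving vector norm for unit-norm inputs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:
        # Nearly parallel weights: fall back to ordinary linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(theta)
    w_a = math.sin((1 - t) * theta) / s
    w_b = math.sin(t * theta) / s
    return [w_a * x + w_b * y for x, y in zip(a, b)]
```

TIES and DARE are likewise CPU-friendly: they add sign election and random parameter dropping on top of this kind of per-element combination.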