Why DPO Instead of RLHF/PPO?
Traditional RLHF requires three steps: (1) train a reward model on human preference data, (2) optimize the policy against that reward model with PPO, and (3) carefully tune a KL penalty to prevent reward hacking. This pipeline is complex, often unstable, and keeps four models in memory at once (actor, critic, reference, and reward model). DPO collapses all of this into a single supervised training step on preference pairs.
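The trick is that the reward model becomes implicit: DPO trains the policy to widen the gap in log-probability ratios (against a frozen reference model) between the chosen and rejected responses. A minimal sketch of the per-pair loss in PyTorch, assuming the inputs are the summed token log-probabilities of each response (variable names are illustrative, not TRL internals):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward of each response: beta-scaled log-ratio vs. the frozen reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the margin: push the chosen response above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()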
RLHF vs DPO Comparison
| | RLHF/PPO | DPO |
|---|---|---|
| Reward model needed? | Yes (separate training) | No |
| Models in memory | 4 (actor, critic, ref, reward) | 2 (policy + reference) |
| Training stability | Difficult | Straightforward |
| Data needed | Reward model data + preference data | Preference pairs only |
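Concretely, a single preference record is just a prompt plus a preferred and a dispreferred completion. The contents below are invented purely for illustration:

preference_example = {
    "prompt": "Summarize what DPO does in one sentence.",
    "chosen": "DPO fine-tunes a model directly on preference pairs, favoring the chosen response over the rejected one without a separate reward model.",
    "rejected": "DPO is a training method.",
}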
DPO Training Setup
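The training code below assumes a few objects already exist. One minimal, illustrative way to create them is shown here; the checkpoint name and data file are placeholders, and the reference model is simply a second copy of the SFT checkpoint that stays frozen:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_checkpoint = "your-org/sft-model"  # placeholder: path to your SFT checkpoint
policy_model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
reference_model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)  # frozen during DPO
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)
# placeholder file of {"prompt", "chosen", "rejected"} records
preference_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")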
from trl import DPOTrainer, DPOConfig
# Dataset format: {"prompt": ..., "chosen": ..., "rejected": ...}
training_args = DPOConfig(
beta=0.1, # KL penalty coefficient
output_dir="./dpo-out",
num_train_epochs=1,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=5e-7, # much lower than SFT — fine-grained alignment
fp16=True,
)
trainer = DPOTrainer(
model=policy_model,
ref_model=reference_model, # frozen copy of SFT model
args=training_args,
train_dataset=preference_dataset,
tokenizer=tokenizer,
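    # note: newer TRL releases rename this argument to processing_class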
)
trainer.train()
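Once training finishes, the aligned policy and tokenizer can be saved with the standard Trainer utilities (the path here just mirrors output_dir above):

trainer.save_model("./dpo-out")         # writes the aligned policy weights
tokenizer.save_pretrained("./dpo-out")  # keep the tokenizer with the checkpoint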