
DPO Implementation

Direct Preference Optimization — alignment without a reward model. Simpler than RLHF.

DPO · Direct Preference Optimization · Preference Learning · Rafailov et al., 2023

Why DPO Instead of RLHF/PPO?

Traditional RLHF requires three steps: (1) train a reward model on preference data, (2) use PPO to optimize the policy against the reward model, (3) carefully tune KL penalties to prevent reward hacking. This pipeline is complex, unstable, and keeps four models in memory simultaneously (actor, critic, reference, reward). DPO collapses it to a single supervised training step on preference pairs.

RLHF vs DPO Comparison

                      RLHF/PPO                              DPO
Reward model needed?  Yes (separate training)               No
Models in memory      4 (actor, critic, ref, reward)        2 (policy + reference)
Training stability    Difficult                             Straightforward
Data needed           Reward model data + preference data   Preference pairs only

DPO Training Setup

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

# "your-sft-model" and "your-preference-dataset" are placeholders for your own checkpoint and data
policy_model = AutoModelForCausalLM.from_pretrained("your-sft-model")
reference_model = AutoModelForCausalLM.from_pretrained("your-sft-model")
tokenizer = AutoTokenizer.from_pretrained("your-sft-model")

# Dataset format: {"prompt": ..., "chosen": ..., "rejected": ...}
preference_dataset = load_dataset("your-preference-dataset", split="train")

training_args = DPOConfig(
    beta=0.1,              # KL penalty coefficient
    output_dir="./dpo-out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,    # much lower than SFT — fine-grained alignment
    fp16=True,
)

trainer = DPOTrainer(
    model=policy_model,
    ref_model=reference_model,  # frozen copy of SFT model
    args=training_args,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
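
For reference, a small preference dataset can be built in memory with datasets.Dataset.from_list; the rows below are illustrative placeholders, and any dataset with these three fields works.

from datasets import Dataset

# Illustrative placeholder rows; each record needs "prompt", "chosen", and "rejected" fields
preference_dataset = Dataset.from_list([
    {
        "prompt": "Explain what DPO is in one sentence.",
        "chosen": "DPO aligns a language model directly on preference pairs, with no separate reward model.",
        "rejected": "DPO is a database optimization technique.",
    },
    # ... more preference pairs
])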

Key Takeaways

  • DPO eliminates the need for a separate reward model and PPO training loop
  • DPO directly optimizes the policy from preference pairs (chosen, rejected)
  • The implicit reward is defined as r(x,y) = β·log(π(y|x)/π_ref(y|x)) (a short numeric sketch follows this list)
  • β controls the KL divergence penalty — higher β = stay closer to reference model
  • DPO requires a frozen reference model (copy of SFT model) during training
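
As a rough numeric illustration of the implicit reward and the role of β, the sketch below uses made-up per-sequence log-probability values; only torch is assumed.

import torch

beta = 0.1

# Made-up per-sequence log-probabilities for one (chosen, rejected) pair
policy_chosen_logp, ref_chosen_logp = torch.tensor(-42.0), torch.tensor(-45.0)
policy_rejected_logp, ref_rejected_logp = torch.tensor(-50.0), torch.tensor(-47.0)

# Implicit rewards: r(x, y) = beta * log(pi(y|x) / pi_ref(y|x))
r_chosen = beta * (policy_chosen_logp - ref_chosen_logp)        # 0.3
r_rejected = beta * (policy_rejected_logp - ref_rejected_logp)  # -0.3

# The DPO loss pushes the reward margin (r_chosen - r_rejected) to be positive;
# a larger beta scales the log-ratios more aggressively, penalizing drift from the reference.
margin = r_chosen - r_rejected
print(r_chosen.item(), r_rejected.item(), margin.item())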

Core Concepts

DPO Loss Function

DPO (Rafailov et al., 2023) shows that the KL-constrained RLHF objective has a closed-form optimal policy, which lets the preference objective be rewritten directly in terms of the policy. Training minimizes: L_DPO = -E[log σ(β·(log π(y_w|x)/π_ref(y_w|x) - log π(y_l|x)/π_ref(y_l|x)))], where y_w is the preferred (chosen) response and y_l is the rejected response.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio between chosen and rejected responses under the policy and the frozen reference
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid of the scaled margin: increases the policy's preference for y_w over y_l
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
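
The per-sequence log-probabilities fed into dpo_loss are the summed token log-probs of each response under the policy and the reference model. A minimal sketch, assuming prompt and padding positions are labeled -100 so only response tokens count:

import torch

def sequence_logps(logits, labels):
    # Shift so that logits at position t predict the token at position t+1
    logits, labels = logits[:, :-1, :], labels[:, 1:]
    # Mask out prompt/padding positions (labeled -100), keep only response tokens
    mask = labels != -100
    labels = labels.clamp(min=0)
    token_logps = torch.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1)

# Example wiring (models and batch tensors assumed to exist; reference logps computed under torch.no_grad()):
# policy_chosen_logps = sequence_logps(policy_model(chosen_ids).logits, chosen_labels)
# policy_rejected_logps = sequence_logps(policy_model(rejected_ids).logits, rejected_labels)
# loss = dpo_loss(policy_chosen_logps, policy_rejected_logps, ref_chosen_logps, ref_rejected_logps)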