
DPO Implementation

Direct Preference Optimization — alignment without a reward model. Simpler than RLHF.

DPO · Direct Preference Optimization · Preference Learning · Rafailov et al., 2023

Why DPO Instead of RLHF/PPO?

Traditional RLHF requires three steps: (1) train a reward model on preference data, (2) use PPO to optimize the policy against the reward model, (3) carefully tune KL penalties to prevent reward hacking. This pipeline is complex, unstable, and keeps four models in memory simultaneously (actor, critic, reference, reward). DPO collapses it to a single supervised training step on preference pairs.

RLHF vs DPO Comparison

                      RLHF/PPO                              DPO
Reward model needed?  Yes (separate training)               No
Models in memory      4 (actor, critic, ref, reward)        2 (policy + reference)
Training stability    Difficult                             Straightforward
Data needed           Reward model data + preference data   Preference pairs only

DPO Training Setup

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

# "your-sft-model" and "your-preference-dataset" are placeholders for your own checkpoint and data
policy_model = AutoModelForCausalLM.from_pretrained("your-sft-model")
reference_model = AutoModelForCausalLM.from_pretrained("your-sft-model")
tokenizer = AutoTokenizer.from_pretrained("your-sft-model")

# Dataset format: {"prompt": ..., "chosen": ..., "rejected": ...}
preference_dataset = load_dataset("your-preference-dataset", split="train")

training_args = DPOConfig(
    beta=0.1,              # KL penalty coefficient
    output_dir="./dpo-out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,    # much lower than SFT — fine-grained alignment
    fp16=True,
)

trainer = DPOTrainer(
    model=policy_model,
    ref_model=reference_model,  # frozen copy of SFT model
    args=training_args,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
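
For reference, a small preference dataset can be built in memory with datasets.Dataset.from_list; the rows below are illustrative placeholders, and any dataset with these three fields works.

from datasets import Dataset

# Illustrative placeholder rows; each record needs "prompt", "chosen", and "rejected" fields
preference_dataset = Dataset.from_list([
    {
        "prompt": "Explain what DPO is in one sentence.",
        "chosen": "DPO aligns a language model directly on preference pairs, with no separate reward model.",
        "rejected": "DPO is a database optimization technique.",
    },
    # ... more preference pairs
])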

Key Takeaways

  • DPO eliminates the need for a separate reward model and PPO training loop
  • DPO directly optimizes the policy from preference pairs (chosen, rejected)
  • The implicit reward is defined as r(x,y) = β·log(π(y|x)/π_ref(y|x)) (a short numeric sketch follows this list)
  • β controls the KL divergence penalty — higher β = stay closer to reference model
  • DPO requires a frozen reference model (copy of SFT model) during training
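
As a rough numeric illustration of the implicit reward and the role of β, the sketch below uses made-up per-sequence log-probability values; only torch is assumed.

import torch

beta = 0.1

# Made-up per-sequence log-probabilities for one (chosen, rejected) pair
policy_chosen_logp, ref_chosen_logp = torch.tensor(-42.0), torch.tensor(-45.0)
policy_rejected_logp, ref_rejected_logp = torch.tensor(-50.0), torch.tensor(-47.0)

# Implicit rewards: r(x, y) = beta * log(pi(y|x) / pi_ref(y|x))
r_chosen = beta * (policy_chosen_logp - ref_chosen_logp)        # 0.3
r_rejected = beta * (policy_rejected_logp - ref_rejected_logp)  # -0.3

# The DPO loss pushes the reward margin (r_chosen - r_rejected) to be positive;
# a larger beta scales the log-ratios more aggressively, penalizing drift from the reference.
margin = r_chosen - r_rejected
print(r_chosen.item(), r_rejected.item(), margin.item())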

Core Concepts

DPO Loss Function

DPO (Rafailov et al., 2023) shows that the KL-constrained RLHF objective has a closed-form optimal policy, which lets the preference objective be rewritten directly in terms of the policy. Training minimizes: L_DPO = -E[log σ(β·(log π(y_w|x)/π_ref(y_w|x) - log π(y_l|x)/π_ref(y_l|x)))], where y_w is the preferred (chosen) response and y_l is the rejected response.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio between chosen and rejected responses under the policy and the frozen reference
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid of the scaled margin: increases the policy's preference for y_w over y_l
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
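
The per-sequence log-probabilities fed into dpo_loss are the summed token log-probs of each response under the policy and the reference model. A minimal sketch, assuming prompt and padding positions are labeled -100 so only response tokens count:

import torch

def sequence_logps(logits, labels):
    # Shift so that logits at position t predict the token at position t+1
    logits, labels = logits[:, :-1, :], labels[:, 1:]
    # Mask out prompt/padding positions (labeled -100), keep only response tokens
    mask = labels != -100
    labels = labels.clamp(min=0)
    token_logps = torch.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1)

# Example wiring (models and batch tensors assumed to exist; reference logps computed under torch.no_grad()):
# policy_chosen_logps = sequence_logps(policy_model(chosen_ids).logits, chosen_labels)
# policy_rejected_logps = sequence_logps(policy_model(rejected_ids).logits, rejected_labels)
# loss = dpo_loss(policy_chosen_logps, policy_rejected_logps, ref_chosen_logps, ref_rejected_logps)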