Direct Preference Optimization: Your Language Model is Secretly a Reward Model



This content originally appeared on DEV Community and was authored by Takara Taniguchi

Rafael Rafailov is the first author; the work is from Stanford.

The proposed method improves on RLHF fine-tuning based on Proximal Policy Optimization (PPO).
Direct Preference Optimization (DPO) changes how the policy is updated: it optimizes the policy directly on preference data, without fitting a separate reward model.

RL fine-tuning is conducted by maximizing the learned reward while a KL penalty keeps the policy close to the reference model, as in the objective below.
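As presented in the DPO paper, $r_\phi$ is the learned reward model, $\pi_{\mathrm{ref}}$ is the reference (SFT) policy, and $\beta$ controls the strength of the KL penalty:

$$\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(y\mid x)}\bigl[r_\phi(x,y)\bigr]\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\bigl[\pi_\theta(y\mid x)\,\|\,\pi_{\mathrm{ref}}(y\mid x)\bigr]$$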

Using the partition function, the optimal policy for this objective can be written in closed form, as shown below.
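With the partition function $Z(x)=\sum_y \pi_{\mathrm{ref}}(y\mid x)\exp\!\bigl(\tfrac{1}{\beta}r(x,y)\bigr)$, the optimal policy from the paper is

$$\pi_r(y\mid x)=\frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\exp\!\Bigl(\frac{1}{\beta}r(x,y)\Bigr).$$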

We can eliminate $Z(x)$, which is intractable to compute: solving the expression above for the reward $r(x,y)$ and substituting it into the Bradley-Terry preference model makes $Z(x)$ cancel out.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i40dsvgp9jyboh93n2at.png)
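Written out, the resulting DPO loss from the paper is

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})=-\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Bigl[\log\sigma\Bigl(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Bigr)\Bigr]$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses and $\sigma$ is the sigmoid function.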

Then we do not have to fit an explicit reward model; we can directly optimize this loss function on preference data.
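A minimal PyTorch-style sketch of this loss, assuming the summed log-probabilities of each chosen and rejected response under the policy and the reference model are precomputed; the function and argument names are illustrative, not from the paper's released code:

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss; each argument has shape (batch,) and holds the summed
    log-probability of a response under the policy or reference model."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry preference: -log sigmoid(reward margin), averaged over the batch.
    logits = chosen_rewards - rejected_rewards
    return -F.logsigmoid(logits).mean()
```

In practice, the log-probabilities are sums of per-token log-probs over each response, and $\beta$ is commonly a small constant such as 0.1.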

