Direct Preference Optimization: Your Language Model is Secretly a Reward Model



This content originally appeared on DEV Community and was authored by Takara Taniguchi

Rafael Rafailov is the first author; the work is from Stanford.

The proposed method improves on RLHF fine-tuning based on Proximal Policy Optimization (PPO).
Direct Preference Optimization (DPO) changes how the policy is updated: it optimizes the policy directly on preference data, without fitting a separate reward model.

RL fine-tuning is conducted by maximizing the learned reward while a KL penalty keeps the policy close to the reference model, as in the objective below.
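As presented in the DPO paper, $r_\phi$ is the learned reward model, $\pi_{\mathrm{ref}}$ is the reference (SFT) policy, and $\beta$ controls the strength of the KL penalty:

$$\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(y\mid x)}\bigl[r_\phi(x,y)\bigr]\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\bigl[\pi_\theta(y\mid x)\,\|\,\pi_{\mathrm{ref}}(y\mid x)\bigr]$$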

Using the partition function, the optimal policy for this objective can be written in closed form, as shown below.
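With the partition function $Z(x)=\sum_y \pi_{\mathrm{ref}}(y\mid x)\exp\!\bigl(\tfrac{1}{\beta}r(x,y)\bigr)$, the optimal policy from the paper is

$$\pi_r(y\mid x)=\frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\exp\!\Bigl(\frac{1}{\beta}r(x,y)\Bigr).$$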

We can eliminate $Z(x)$, which is intractable to compute: solving the expression above for the reward $r(x,y)$ and substituting it into the Bradley-Terry preference model makes $Z(x)$ cancel out.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i40dsvgp9jyboh93n2at.png)
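Written out, the resulting DPO loss from the paper is

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})=-\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Bigl[\log\sigma\Bigl(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Bigr)\Bigr]$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses and $\sigma$ is the sigmoid function.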

Then we do not have to fit an explicit reward model; we can directly optimize this loss function on preference data.
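A minimal PyTorch-style sketch of this loss, assuming the summed log-probabilities of each chosen and rejected response under the policy and the reference model are precomputed; the function and argument names are illustrative, not from the paper's released code:

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss; each argument has shape (batch,) and holds the summed
    log-probability of a response under the policy or reference model."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry preference: -log sigmoid(reward margin), averaged over the batch.
    logits = chosen_rewards - rejected_rewards
    return -F.logsigmoid(logits).mean()
```

In practice, the log-probabilities are sums of per-token log-probs over each response, and $\beta$ is commonly a small constant such as 0.1.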

