This content originally appeared on DEV Community and was authored by Takara Taniguchi
Rafael Rafailov (Stanford) is the first author.
The proposed method improves on RLHF fine-tuning based on Proximal Policy Optimization (PPO).
Direct Preference Optimization (DPO) simplifies the way the policy is updated.
RL fine-tuning is conducted as follows:
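As a reference point, here is a sketch of the standard KL-constrained objective used in this RL fine-tuning stage, written in the usual RLHF notation (the policy being trained, a frozen reference policy, a reward model r, and a KL coefficient β):

```latex
\max_{\pi_{\theta}} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\big[\, r(x, y) \,\big]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_{\theta}(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
```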
Using the partition function Z(x), the optimal policy for this objective can be written in closed form. We can then eliminate Z(x), which is difficult to compute.
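Concretely, the derivation proceeds roughly as follows (standard DPO notation assumed): the optimal policy for reward r involves Z(x), but once the reward is rewritten in terms of the policy and plugged into the Bradley-Terry preference model, the β log Z(x) terms cancel, so Z(x) never has to be evaluated:

```latex
\pi_{r}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x)
\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)

% Rearranging for the reward:
r(x, y) = \beta \log \frac{\pi_{r}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% In the Bradley-Terry preference probability, the \log Z(x) terms cancel:
p(y_{w} \succ y_{l} \mid x) =
\sigma\!\Big( \beta \log \frac{\pi_{r}(y_{w} \mid x)}{\pi_{\mathrm{ref}}(y_{w} \mid x)}
- \beta \log \frac{\pi_{r}(y_{l} \mid x)}{\pi_{\mathrm{ref}}(y_{l} \mid x)} \Big)
```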
![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i40dsvgp9jyboh93n2at.png)
Then we do not need to train a separate reward model; we can directly optimize the loss function.
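As a rough illustration, here is a minimal PyTorch-style sketch of the DPO loss, assuming the per-example log-probabilities of the chosen and rejected responses under the policy and the frozen reference model are already computed (the function and variable names below are my own, not from the paper):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal sketch of the DPO loss.

    Each argument is a tensor of per-example sequence log-probabilities
    (summed over tokens) for the chosen / rejected responses under the
    policy being trained and the frozen reference model.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry style objective: maximize the margin between the
    # chosen and rejected implicit rewards (Z(x) has already cancelled).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

if __name__ == "__main__":
    # Dummy log-probabilities for a batch of two preference pairs.
    loss = dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-11.0, -13.0]),
                    torch.tensor([-10.5, -12.5]), torch.tensor([-10.8, -12.2]))
    print(loss.item())
```

In practice the log-probabilities would come from a forward pass of the language model over each response, but the loss itself reduces to this simple binary-classification-style expression.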