This content originally appeared on DEV Community and was authored by Takara Taniguchi
Rafael Rafailov is the first author; the work is from Stanford.
The proposed method improves on RLHF fine-tuning based on Proximal Policy Optimization (PPO).
Direct Preference Optimization (DPO) improves the policy-update step: instead of training a reward model and running RL, it optimizes the policy directly on preference data with a classification-style loss.
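The per-pair DPO loss can be sketched as a small standalone Python function. This is a minimal illustration, not the paper's reference implementation: the function name and the convention that the caller supplies sequence log-probabilities under the policy and the frozen reference model are assumptions here.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Inputs are total log-probabilities of the chosen/rejected responses
    under the current policy and the frozen reference policy (assumed API).
    """
    # Log-ratios of the policy against the reference model.
    ratio_chosen = logp_chosen - ref_logp_chosen
    ratio_rejected = logp_rejected - ref_logp_rejected
    z = beta * (ratio_chosen - ratio_rejected)
    # Numerically stable -log(sigmoid(z)), i.e. softplus(-z).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and reference agree on both responses, the loss is log 2; it shrinks as the policy puts more relative mass on the chosen response than the reference does.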
RL fine-tuning is conducted as follows:
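Concretely, the RL fine-tuning stage in RLHF maximizes the learned reward while penalizing divergence from a reference policy (notation follows the DPO paper):

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}
\bigl[ r_\phi(x, y) \bigr]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\bigl[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \bigr]
```

Here $r_\phi$ is the reward model, $\pi_{\mathrm{ref}}$ the reference (SFT) policy, and $\beta$ controls how far the policy may drift from the reference.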
Using the partition function Z(x), the optimal policy of the KL-constrained objective can be written in closed form. Rearranging that expression lets us eliminate Z(x), which is intractable to compute.
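The cancellation works as follows (this is the standard derivation from the DPO paper). The KL-constrained objective has the closed-form optimum

```latex
\pi^{*}(y \mid x)
= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
\exp\!\Bigl( \tfrac{1}{\beta}\, r(x, y) \Bigr),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,
\exp\!\Bigl( \tfrac{1}{\beta}\, r(x, y) \Bigr).
```

Inverting this gives the reward in terms of the policy,

```latex
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
+ \beta \log Z(x),
```

and under the Bradley-Terry preference model $p(y_w \succ y_l \mid x) = \sigma\!\bigl(r(x, y_w) - r(x, y_l)\bigr)$ the $\beta \log Z(x)$ terms cancel, leaving the DPO loss

```latex
\mathcal{L}_{\mathrm{DPO}}
= -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\Bigl[ \log \sigma\!\Bigl(
\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\Bigr) \Bigr].
```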

