Papers
DPO: https://arxiv.org/pdf/2305.18290 (loss sketched below)
DPO Failure Mode: https://arxiv.org/pdf/2402.13228
SimPO: https://arxiv.org/pdf/2405.14734 (loss sketched below)
CPO: https://arxiv.org/pdf/2401.08417
KTO: https://arxiv.org/pdf/2402.01306
RPO: https://arxiv.org/pdf/2402.10958
PPO: https://www.adaptive-ml.com/post/from-zero-to-ppo
SPIN: https://arxiv.org/pdf/2401.01335
Online vs. Offline Alignment: https://arxiv.org/abs/2405.08448
DPO vs. PPO: https://arxiv.org/pdf/2404.10719, https://arxiv.org/pdf/2406.09279
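Several of the offline methods above (DPO, SimPO, CPO) reduce to a per-pair loss on (chosen, rejected) completions. Below is a minimal PyTorch sketch of the DPO and SimPO objectives as stated in their papers; the function names, tensor layout, and the beta/gamma defaults are illustrative assumptions, not values from any reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: logistic loss on the difference of policy-vs-reference
    log-ratios for the chosen and rejected completions.
    Inputs are summed token log-probs per completion, shape (batch,)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths, beta=2.0, gamma=0.5):
    """SimPO: reference-free; scores each completion by its
    length-averaged log-prob and requires a target margin gamma.
    beta/gamma defaults here are ballpark values, not tuned settings."""
    chosen_avg = policy_chosen_logps / chosen_lengths
    rejected_avg = policy_rejected_logps / rejected_lengths
    logits = beta * (chosen_avg - rejected_avg) - gamma
    return -F.logsigmoid(logits).mean()

if __name__ == "__main__":
    # Toy batch: summed log-probs of two chosen/rejected completions.
    pc = torch.tensor([-12.0, -9.5]);  pr = torch.tensor([-14.0, -11.0])
    rc = torch.tensor([-12.5, -10.0]); rr = torch.tensor([-13.5, -11.2])
    print("DPO:", dpo_loss(pc, pr, rc, rr).item())
    print("SimPO:", simpo_loss(pc, pr, torch.tensor([20.0, 15.0]),
                               torch.tensor([22.0, 18.0])).item())
```

The structural contrast visible here is the core of the SimPO paper's claim: DPO needs log-probs from a frozen reference model, while SimPO drops the reference and instead length-normalizes the policy log-probs and adds a fixed margin gamma.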
Discussions
https://x.com/Teknium1/status/1869136010053140926
https://x.com/Teknium1/status/1818012735210405920
https://x.com/kalomaze/status/1834402347755143168
https://x.com/kalomaze/status/1876302592202195035
https://x.com/EsotericCofe/status/1876266464468189252
https://www.blog.chai-research.com/post/chai-gpt-rlhf-part-i-reward-modelling
Frameworks
https://github.com/axolotl-ai-cloud/axolotl