Preference alignment
Preference alignment encompasses a family of techniques for fine-tuning models on preference data. In this section, we provide an overview of this field and then focus on the technique we will implement: Direct Preference Optimization (DPO).
Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) combines reinforcement learning (RL) with human input to align models with human preferences and values. RLHF emerged as a response to challenges in traditional RL methods, particularly the difficulty of specifying reward functions for complex tasks and the potential for misalignment between engineered rewards and intended objectives.
The origins of RLHF can be traced back to the field of preference-based reinforcement learning (PbRL), which was independently introduced by Akrour et al. and Cheng et al. in 2011. PbRL aimed to infer objectives from qualitative feedback, such as pairwise preferences between behaviors, rather than relying on hand-engineered reward functions.
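To make the pairwise-preference idea concrete, here is a minimal sketch of the Bradley-Terry formulation commonly used to turn preference comparisons into a trainable scoring signal: the probability that the preferred ("chosen") behavior beats the "rejected" one is modeled as the sigmoid of their score difference. The function name and example values are illustrative, not from the original text, and the scalar scores are assumed to come from some reward model.

```python
# Sketch: learning from pairwise preferences with a Bradley-Terry-style loss.
import torch
import torch.nn.functional as F

def preference_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the preferred behavior under the Bradley-Terry model."""
    # P(chosen > rejected) = sigmoid(score_chosen - score_rejected)
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Example: scalar scores for three preference pairs (hypothetical values).
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(preference_loss(chosen, rejected))  # smaller when chosen scores exceed rejected ones
```

Minimizing this loss pushes the model to assign higher scores to preferred behaviors, which is the same mechanism RLHF uses to train a reward model from human comparisons.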