Theoretical background
Let’s take a look at the original RLHF method published in 2017 by OpenAI and Google researchers [Chr+17]. Since its publication (and especially after ChatGPT’s release), this method has been an area of active research; for recent developments, you can check the papers collected at https://github.com/opendilab/awesome-RLHF. In addition, we’ll discuss the role of RLHF in the LLM training process.
Method overview
The authors of the paper experimented with two classes of problems: several simulated robotics environments from MuJoCo (similar to the continuous control problems we discussed in Chapter 15 and Chapter 16) and several Atari games.
The core idea is to keep the original RL method, but replace the reward from the environment with a neural network called the reward predictor, which is trained on data gathered from humans. This network (denoted r̂(o, a) in the paper) takes the observation and action and returns the estimated reward.
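To make this idea concrete, here is a minimal sketch of how such a reward predictor could be wired into the training loop: a small network estimating r̂(o, a) and an environment wrapper that substitutes its output for the environment reward. The class names, network size, and the assumption of a Gymnasium-style environment with continuous observation and action vectors (as in MuJoCo) are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import gymnasium as gym


class RewardPredictor(nn.Module):
    """Small MLP estimating r_hat(o, a) from an observation-action pair."""
    def __init__(self, obs_size: int, act_size: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size + act_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Concatenate observation and action and predict a scalar reward
        return self.net(torch.cat([obs, act], dim=1))


class PredictedRewardWrapper(gym.Wrapper):
    """Replaces the environment reward with the reward predictor's output."""
    def __init__(self, env: gym.Env, predictor: RewardPredictor):
        super().__init__(env)
        self.predictor = predictor

    def step(self, action):
        # The true environment reward is discarded; the agent only sees r_hat
        obs, _, terminated, truncated, info = self.env.step(action)
        with torch.no_grad():
            obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            act_t = torch.as_tensor(action, dtype=torch.float32).unsqueeze(0)
            reward = self.predictor(obs_t, act_t).item()
        return obs, reward, terminated, truncated, info
```

With such a wrapper in place, any of the policy optimization methods from the earlier chapters can be trained as usual, while the reward predictor itself is fitted separately from human feedback, as we discuss next.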