Reinforcement Learning with Human Feedback
In this chapter, we’ll take a look at a relatively recent method that addresses situations when the desired behavior is hard to define via an explicit reward function: reinforcement learning with human feedback (RLHF). The method is also related to the exploration problem we discussed in Chapter 18, as it allows humans to push learning in a new direction. Surprisingly, the method, initially developed for a very specific subproblem in the RL domain, turned out to be enormously successful for large language models (LLMs). Nowadays, RLHF is at the core of modern LLM training pipelines, and without it, the fascinating recent progress wouldn’t have been possible.
As this book is not about LLMs and modern chatbots, we will focus purely on the original paper from OpenAI and DeepMind by Christiano et al., Deep reinforcement learning from human preferences [Chr+17], which describes the RLHF method...