Summary
In this chapter, we took a look at RLHF, a recent addition to the RL toolbox. This method, which sits at the core of the LLM training pipeline, improves model quality by learning a reward model from human preference data and using it to guide policy optimization. We implemented RLHF and applied it to the Atari Seaquest game, which illustrated how the method can be used in RL pipelines for model improvement.
In the next chapter, we’ll discuss a different family of RL methods, which combine learning with search-based planning: AlphaGo, AlphaZero, and MuZero.