RLHF experiments
To get a better understanding of the pipeline we’ve just discussed, let’s implement it ourselves (as “doing is the best way to learn something”). In the previous chapter, we experimented with the Atari SeaQuest environment, which is tricky from the exploration point of view, so it is a logical choice for checking what we can achieve with human feedback.
To limit the scope of the chapter and make the example more reproducible, I made the following modifications to the experiments described in the RLHF paper [Chr+17]:
- I focused on a single SeaQuest environment. The goal was to improve the agent’s gameplay in comparison to the A2C results we got in Chapter 18: an average score of 400 and episodes of 500 steps (due to the lack of oxygen).
- Instead of performing labeling and reward model training asynchronously, I split them into separate steps:
- ...