Reward functions in complex environments
Before we go into the details of the RLHF method, let’s start by discussing the motivation behind it. As we discussed in Chapter 1, reward is the core concept in RL. Without a reward, we are blind; all the methods we have already discussed depend heavily on the reward value provided by the environment:
- In value-based methods (Part 2 of the book), we used the reward to approximate the Q-value, which we used to evaluate actions and choose the most promising one.
- In policy-based methods (Part 3), the reward was used even more directly, as a scaling factor for the Policy Gradient. With all the math removed, we basically optimized our policy to prefer actions that bring more accumulated future reward.
- In black-box methods (Chapter 17), we used the reward to make a decision about agent variants: should they be kept...