When we previously had a model, our algorithm could plan and improve a policy offline. Without a model, the algorithm must become an agent: it has to explore the environment and, while exploring, learn and improve its policy at the same time. In other words, the agent now learns by trial and error. Let's jump back into the Chapter_3_3.py code example and follow the exercise:
- We will start right where we left off and review the last couple of lines, including the call to the play_game function (a sketch of the surrounding loop follows the code):
episode = play_game(env=env, policy=policy, display=False)
evaluate_policy_check(env, e, policy, test_policy_freq)
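To keep the context clear, here is a minimal sketch, not the book's exact code, of how these two calls typically sit inside the outer training loop. The episode count n_episodes and the helper test_policy are assumptions for illustration (Chapter_3_3.py may name them differently), and the sketch uses the classic Gym reset/step API:

```python
def test_policy(env, policy, episodes=100):
    # Hypothetical helper: run the current policy for a number of episodes
    # and return the fraction that end successfully.
    wins = 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, _ = env.step(policy[state])
        wins += reward > 0  # FrozenLake-style envs reward 1 only on success
    return wins / episodes

def evaluate_policy_check(env, episode, policy, test_policy_freq):
    # Assumed behavior: every test_policy_freq episodes, report how well
    # the current policy performs when it drives the agent.
    if episode % test_policy_freq == 0:
        print(f"Episode {episode}: win rate = {test_policy(env, policy):.2%}")

for e in range(n_episodes):
    # Sample one episode of experience with the current policy...
    episode = play_game(env=env, policy=policy, display=False)
    # ...and periodically evaluate how well that policy is doing.
    evaluate_policy_check(env, e, policy, test_policy_freq)
```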
- Inside evaluate_policy_check, we check whether the episode count has reached the test_policy_freq interval. If it has, we output the agent's current progress. In effect, what we are evaluating is how well the current policy runs the agent. The evaluate_policy_check...