The procedure for training the Q-learning agent may already look familiar to you, because it shares much of its structure, and many of its lines of code, with the boilerplate code we used before. Instead of choosing a random action from the environment's action space, we now get the action from the agent using the agent.get_action(obs) method. We also call the agent.learn(obs, action, reward, next_obs) method after sending the agent's action to the environment and receiving the feedback. The training function is listed here:
def train(agent, env):
    best_reward = -float('inf')
    for episode in range(MAX_NUM_EPISODES):
        done = False
        obs = env.reset()
        total_reward = 0.0  # Cumulative reward collected over this episode
        while not done:
            # Get the action from the agent rather than sampling randomly
            action = agent.get_action(obs)
            next_obs, reward, done, info = env.step(action)
            # Let the agent update its estimates from this transition
            agent.learn(obs, action, reward, next_obs)
            obs = next_obs
            total_reward += reward
        best_reward = max(best_reward, total_reward)
        print("Episode#:{} reward:{} best_reward:{}".format(
            episode, total_reward, best_reward))
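To make the interface that train() expects concrete, here is a minimal sketch that drives the function with a trivial stand-in agent that just samples random actions. The MountainCar-v0 environment, the RandomStubAgent class, and the episode budget are assumptions for illustration, not part of the original listing; substitute your own Q-learning agent and environment:

import gym

MAX_NUM_EPISODES = 5  # Assumed small budget, just to exercise the loop

class RandomStubAgent:
    """Stand-in with the same interface; replace with your Q-learning agent."""
    def __init__(self, env):
        self.action_space = env.action_space

    def get_action(self, obs):
        # Sample a random action; a real agent would consult its Q-values
        return self.action_space.sample()

    def learn(self, obs, action, reward, next_obs):
        pass  # A real agent would update its Q-values from this transition

if __name__ == "__main__":
    env = gym.make("MountainCar-v0")
    train(RandomStubAgent(env), env)
    env.close()

Any agent that implements get_action() and learn() with these signatures can be dropped into the same loop, which is what makes the boilerplate reusable across agents.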