In this recipe, we will train the DQN model to play Flappy Bird.
In each step of the training, we take an action following the epsilon-greedy policy: with probability epsilon, we take a random action, flapping or not flapping in our case; otherwise, we select the action with the highest predicted Q-value. We also anneal the value of epsilon at each step, since we favor more exploration at the beginning of training and more exploitation as the DQN model matures.
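As a rough sketch, the epsilon-greedy selection and the epsilon schedule might look like the following. The model interface, the number of actions, and the start/end values of epsilon and the number of annealing steps are illustrative assumptions, not the exact settings of this recipe:

```python
import random
import torch

def gen_epsilon_greedy_policy(model, n_action):
    """Return a policy function that explores with probability epsilon
    and otherwise picks the action with the highest predicted Q-value."""
    def policy_function(state, epsilon):
        if random.random() < epsilon:
            # Explore: flap (1) or do nothing (0) at random
            return random.randint(0, n_action - 1)
        # Exploit: query the DQN and take the greedy action
        with torch.no_grad():
            q_values = model(state)
        return torch.argmax(q_values).item()
    return policy_function

def anneal_epsilon(step, eps_start=0.1, eps_end=1e-4, n_steps=2_000_000):
    """Linearly decay epsilon from eps_start to eps_end over n_steps
    (all three values here are assumed, example settings)."""
    slope = (eps_end - eps_start) / n_steps
    return max(eps_end, eps_start + slope * step)
```

In the training loop, we would call anneal_epsilon(step) once per step and pass the result to the policy function, so that early steps act mostly at random while later steps rely mostly on the learned Q-values.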
As we have seen, the observation for each step is a two-dimensional image of the screen. We need to transform the observation images into states. A single image from one step does not provide enough information to tell the agent how to react (for instance, the bird's direction of movement cannot be inferred from a single frame). Hence, we form a state using images from four adjacent steps. We will first resize each image to the expected size, then concatenate...
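A minimal sketch of this preprocessing and frame-stacking step is shown below. The 84 x 84 target size, the use of OpenCV for resizing, grayscale conversion, and thresholding, and the dummy observation arrays are assumptions made for illustration:

```python
import cv2
import numpy as np
import torch

def pre_processing(image, width=84, height=84):
    """Resize the raw screen to width x height, convert it to grayscale,
    and binarize it; returns a float32 array of shape (1, height, width)."""
    image = cv2.cvtColor(cv2.resize(image, (width, height)), cv2.COLOR_BGR2GRAY)
    _, image = cv2.threshold(image, 1, 255, cv2.THRESH_BINARY)
    return image[None, :, :].astype(np.float32)

# Dummy arrays standing in for raw Flappy Bird screen frames (assumed shape)
first_observation = np.zeros((288, 512, 3), dtype=np.uint8)
next_observation = np.zeros((288, 512, 3), dtype=np.uint8)

# Initial state: repeat the first preprocessed frame four times
image = torch.from_numpy(pre_processing(first_observation))
state = torch.cat([image] * 4)[None, :, :, :]        # shape (1, 4, 84, 84)

# At each subsequent step: drop the oldest frame and append the newest one
next_image = torch.from_numpy(pre_processing(next_observation))
next_state = torch.cat((state[0, 1:, :, :], next_image))[None, :, :, :]
```

Stacking four consecutive frames in this way lets the network infer motion, such as whether the bird is rising or falling, from a single state tensor.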