It is easier to work with the architecture in which the network is fed a state and outputs the Q values for all actions in that state, as illustrated on the right-hand side of Figure 9.3. We let the agent interact with the environment and collect states and rewards, based on which we learn the Q function. The network learns the Q function by minimizing the difference between its predicted Q values for a given state s and the corresponding target Q values. Each training record is a tuple (s(t), a(t), r(t), s(t+1)). A sketch of such a network follows.
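The following is a minimal sketch (not the book's code, and using PyTorch as an assumed framework) of the right-hand architecture of Figure 9.3: the network takes a state and outputs one Q value per action. The names state_dim, n_actions, and the hidden-layer size are illustrative assumptions.

```python
import collections
import torch
import torch.nn as nn

# One training record: (s(t), a(t), r(t), s(t+1))
Transition = collections.namedtuple("Transition", ["s_t", "a_t", "r_t", "s_t1"])

class QNetwork(nn.Module):
    """Maps a state to a vector of Q values, one per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> Q values: (batch, n_actions)
        return self.net(state)
```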
Bear in mind that the target Q values are computed from the network itself. Consider that the network is parametrized by the weights W ∈ R^d and learns a mapping from states to the Q values of each action given that state. For a set of n actions, the network would predict n Q values...
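As a hedged illustration of how the targets come from the network itself, the sketch below (continuing the assumed QNetwork above, with an assumed discount factor gamma) forms the target y(t) = r(t) + gamma * max over a' of Q(s(t+1), a'; W) and compares it with the predicted Q value of the action taken.

```python
import torch
import torch.nn.functional as F

def td_targets(q_net, rewards, next_states, gamma=0.99):
    """Target Q values computed from the network itself (bootstrapped)."""
    with torch.no_grad():                      # targets are not backpropagated
        next_q = q_net(next_states)            # (batch, n_actions)
        max_next_q = next_q.max(dim=1).values  # greedy value of s(t+1)
    return rewards + gamma * max_next_q

def q_loss(q_net, states, actions, targets):
    """Loss between predicted Q(s(t), a(t); W) and the target values."""
    q_all = q_net(states)                                    # (batch, n_actions)
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_taken, targets)
```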