The battle between equal actors
The final example in this chapter is a situation in which a single policy drives the fight between two groups of identical agents. This version is implemented in Chapter25/battle_dqn.py. The code is straightforward and won't be shown here.
I did only a couple of experiments with the code, so the hyperparameters could likely be improved. In addition, you can experiment with the training process. In the code, both groups are driven by the same policy that we are optimizing, which may not be the best approach. Instead, you can experiment with an AlphaGo Zero style of training, in which the best policy so far drives one group and the other group is driven by the policy that we are currently optimizing. Once the best policy starts to lose consistently, it is replaced by the policy being optimized. In this case, the optimized policy may have time to learn all the tricks and weaknesses of the current best policy, which may start an improvement loop.
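The best-versus-current scheme described above needs some rule for deciding when the best policy is "consistently losing". The following is a minimal sketch of one such rule, not something taken from battle_dqn.py: the class name BestPolicyGate, the window of recent matches, and the win-rate threshold are all assumptions made for illustration.

```python
from collections import deque


class BestPolicyGate:
    """Hypothetical helper: tracks recent match outcomes between the policy
    being optimized and the frozen best policy, and signals when the
    optimized policy wins often enough to replace the best one."""

    def __init__(self, window=100, win_rate_threshold=0.6):
        # Both values are illustrative assumptions, not tuned settings.
        self.window = window
        self.win_rate_threshold = win_rate_threshold
        self.results = deque(maxlen=window)  # 1 = optimized policy won

    def record(self, optimized_won):
        """Store the outcome of one finished battle."""
        self.results.append(1 if optimized_won else 0)

    def win_rate(self):
        """Fraction of recorded battles won by the optimized policy."""
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def should_promote(self):
        """True only when the window is full and the threshold is reached."""
        is_full = len(self.results) == self.window
        return is_full and self.win_rate() >= self.win_rate_threshold

    def reset(self):
        """Forget old results after the best policy has been replaced."""
        self.results.clear()
```

In a training loop, you would call record() after every battle and check should_promote(). When it returns True, copy the weights of the optimized network into the best network (in PyTorch, for example, with best_net.load_state_dict(net.state_dict())) and call reset(), so that the next comparison starts from scratch against the new best policy.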
In my experiments, the training wasn't very stable...