Our goal is to create a pole-balancing controller that keeps the system in a stable state within the defined constraints for as long as possible, and at least for the expected number of time steps specified in the experiment configuration (500,000). The objective function must therefore optimize the duration of stable pole balancing. It can be defined through the logarithmic difference between the expected number of steps and the actual number of steps obtained during the evaluation of the phenotype ANN. The loss function is given as follows:

$$\mathcal{L} = \frac{\log t_{max} - \log t_{eval}}{\log t_{max}}$$

In this experiment, $t_{max}$ is the expected number of time steps from the configuration of the experiment, and $t_{eval}$ is the actual number of time steps during which the controller was able to keep the pole balancer in a stable state within the allowed bounds (refer to the reinforcement signal definition...
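
As a minimal sketch, the loss could be computed in Python as shown next. It assumes the loss is the difference between the logarithms of the expected and actual step counts, divided by the logarithm of the expected count so that the result lies between 0 and 1. The function name `balancing_loss` and the handling of the zero-step case are illustrative assumptions, not taken from the experiment's source code:

```python
import math


def balancing_loss(t_eval: int, t_max: int = 500_000) -> float:
    """Normalized logarithmic loss in [0, 1] (hypothetical helper).

    t_eval: number of steps the controller kept the pole balanced.
    t_max:  expected number of steps from the experiment configuration.
    """
    if t_eval <= 0:
        # log(0) is undefined; treat an immediate failure as the worst case
        # (an assumption made for this sketch).
        return 1.0
    if t_eval >= t_max:
        # The controller balanced for the full expected duration.
        return 0.0
    return (math.log(t_max) - math.log(t_eval)) / math.log(t_max)


if __name__ == "__main__":
    for steps in (1, 100, 10_000, 500_000):
        print(steps, balancing_loss(steps))
```

Because of the logarithm, early gains count for more than late ones: going from 100 to 1,000 balanced steps lowers the loss by the same amount as going from 10,000 to 100,000. If a score to be maximized is needed, one option is to use `1.0 - balancing_loss(steps)`.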