Throughout this book, we have covered two main types of model-free algorithms: those based on the gradient of the policy, and those based on the value function. From the first family, we saw REINFORCE, actor-critic, PPO, and TRPO. From the second, we saw Q-learning, SARSA, and DQN. Apart from the way in which the two families learn a policy (policy gradient algorithms use stochastic gradient ascent in the direction of the steepest increase in the estimated return, whereas value-based algorithms learn an action value for each state-action pair and then build a policy from those values), there are key differences that lead us to prefer one family over the other. These are the on-policy or off-policy nature of the algorithms, and their ability to handle large action spaces. We already discussed the differences between on-policy and off-policy in the previous...
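To make the contrast between the two families concrete, here is a minimal sketch of one update step from each, written in plain Python. It is an illustration under stated assumptions rather than the implementation used elsewhere in the book: both learners are tabular, the policy is a softmax over per-state action preferences, and the function names (`q_learning_update`, `greedy_policy`, `policy_gradient_update`) and default hyperparameters are hypothetical.

```python
import math


def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Value-based step: move Q(s, a) toward the bootstrapped target."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q


def greedy_policy(q, state):
    """The policy is built afterwards from the learned action values."""
    values = q[state]
    return max(range(len(values)), key=values.__getitem__)


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_update(theta, state, action, ret, lr=0.1):
    """REINFORCE-style step: gradient ascent on ret * log pi(action | state).

    Assumes a tabular softmax policy, for which the gradient of
    log pi(action | state) with respect to theta[state][a] is
    1[a == action] - pi(a | state).
    """
    probs = softmax(theta[state])
    for a in range(len(probs)):
        grad_log = (1.0 if a == action else 0.0) - probs[a]
        theta[state][a] += lr * ret * grad_log
    return theta


# Two states, two actions, everything initialized to zero (illustrative values).
q = [[0.0, 0.0], [0.0, 0.0]]
q_learning_update(q, state=0, action=1, reward=1.0, next_state=1)

theta = [[0.0, 0.0], [0.0, 0.0]]
policy_gradient_update(theta, state=0, action=1, ret=1.0)
```

In the value-based step the policy never appears in the update; it is derived afterwards by taking the action with the highest learned value. In the policy gradient step the policy parameters are adjusted directly, so the probability of an action that was followed by a positive return increases.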