In the previous chapter, we concluded a comprehensive overview of all the major policy gradient algorithms. Due to their capacity to deal with continuous action spaces, they are applied to very complex and sophisticated control systems. Policy gradient methods can also use a second-order derivative, as is done in TRPO, or use other strategies, in order to limit the policy update by preventing unexpected bad behaviors. However, the main concern when dealing with this type of algorithm is their poor efficiency, in terms of the quantity of experience needed to hopefully master a task. This drawback comes from the on-policy nature of these algorithms, which makes them require new experiences each time the policy is updated. In this chapter, we will introduce a new type of off-policy actor-critic algorithm that learns a target deterministic policy, while exploring...
United States
Great Britain
India
Germany
France
Canada
Russia
Spain
Brazil
Australia
Singapore
Hungary
Ukraine
Luxembourg
Estonia
Lithuania
South Korea
Turkey
Switzerland
Colombia
Taiwan
Chile
Norway
Ecuador
Indonesia
New Zealand
Cyprus
Denmark
Finland
Poland
Malta
Czechia
Austria
Sweden
Italy
Egypt
Belgium
Portugal
Slovenia
Ireland
Romania
Greece
Argentina
Netherlands
Bulgaria
Latvia
South Africa
Malaysia
Japan
Slovakia
Philippines
Mexico
Thailand