Pieter Abbeel has been a professor at UC Berkeley since 2008, and he was also a Research Scientist at OpenAI (2016-2017). Pieter is one of the pioneers of deep reinforcement learning for robotics, including learning locomotion and visuomotor skills. His current research focuses on robotics and machine learning, with particular emphasis on deep reinforcement learning, meta-learning, and AI safety.
Deep reinforcement learning combines deep learning with reinforcement learning to create artificial agents that can achieve human-level performance across many challenging domains. This article discusses one of Pieter’s top accepted research papers in the field of deep reinforcement learning at the 6th annual ICLR conference, held April 30 to May 3, 2018.
This paper is about the exploration challenge in deep reinforcement learning (RL) algorithms. The main purpose of exploration is to ensure that the agent’s behavior does not converge prematurely to a local optimum. Enabling efficient and effective exploration is difficult, since exploration is not directed by the reward function of the underlying Markov decision process (MDP).
A large number of methods have been proposed to tackle this challenge in high-dimensional and/or continuous-action MDPs. These methods encourage exploration either by adding temporally-correlated noise to the actions or by adding noise to the parameters. Their main limitation is that they are either proposed and evaluated only for the on-policy setting with relatively small and shallow function approximators, or they disregard all temporal structure and gradient information.
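For context, the most common form of temporally-correlated action noise is an Ornstein-Uhlenbeck process, as popularized with DDPG. A minimal sketch (the parameter values below are illustrative, not taken from the paper):

```python
import numpy as np

def ou_noise(n_steps, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
    """Ornstein-Uhlenbeck process: noise that is correlated across
    time steps, typically added to a deterministic policy's actions
    to encourage exploration in continuous-action tasks."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_steps)
    for t in range(1, n_steps):
        # Mean-reverting drift toward 0 plus Gaussian diffusion.
        x[t] = x[t - 1] + theta * (0.0 - x[t - 1]) * dt \
               + sigma * np.sqrt(dt) * rng.normal()
    return x

noise = ou_noise(1000)
```

Because consecutive samples are correlated, the perturbed actions drift smoothly rather than jittering independently at each step.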
This paper proposes adding noise to the parameters of a deep network (parameter space noise) when taking actions in deep reinforcement learning, in order to encourage exploration. The effectiveness of this approach is demonstrated through empirical analysis across a variety of reinforcement learning algorithms (i.e., DQN, DDPG, and TRPO) and tasks. It answers the following questions:
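The core idea can be sketched as follows: perturb every weight of the policy network once (e.g., at the start of an episode) and then act with the perturbed policy, so the exploratory behavior is consistent within the episode. This is only an illustration of the mechanism; the paper's full method additionally adapts the noise scale to a target distance in action space. The linear policy and parameter shapes below are hypothetical:

```python
import numpy as np

def perturb_params(params, sigma=0.1, seed=None):
    """Parameter space noise: add spherical Gaussian noise to each
    weight tensor of the policy (a sketch of the core idea)."""
    rng = np.random.default_rng(seed)
    return [w + sigma * rng.standard_normal(w.shape) for w in params]

def policy_action(params, obs):
    """Tiny linear policy used purely for illustration."""
    W, b = params
    return np.tanh(obs @ W + b)

# Hypothetical policy with a 4-dim observation and 2-dim action.
params = [np.zeros((4, 2)), np.zeros(2)]
noisy = perturb_params(params, sigma=0.1, seed=0)

obs = np.ones(4)
action = policy_action(noisy, obs)  # act with the perturbed policy
```

Unlike per-step action noise, the same perturbed weights are reused for many steps, which yields structured, state-dependent exploration.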
Overall Score: 20/30
Average Score: 6.66
The reviewers were pleased with the paper. They described it as a simple exploration strategy that is empirically effective. The paper was found to be clear and well written, with thorough experiments across deep RL domains. The authors also released open-source code alongside the paper for reproducibility, which the reviewers appreciated.
However, a common thread among the reviews was that the authors overstated their claims and contributions. The reviewers called out some statements in particular (e.g., the discussion of evolution strategies (ES) versus RL). They also felt that the paper lacked a strong justification for the method beyond its being empirically effective and intuitive.