Pieter Abbeel has been a professor at UC Berkeley since 2008, and he was also a Research Scientist at OpenAI (2016-2017). Pieter is one of the pioneers of deep reinforcement learning for robotics, including learning locomotion and visuomotor skills. His current research focuses on robotics and machine learning, with particular emphasis on deep reinforcement learning, meta-learning, and AI safety.
Deep reinforcement learning combines deep learning with reinforcement learning to create artificial agents that can achieve human-level performance across many challenging domains. This article discusses one of Pieter’s top accepted research papers in the field of deep reinforcement learning at the 6th annual ICLR conference, held April 30 to May 3, 2018.
This paper is about the exploration challenge in deep reinforcement learning (RL) algorithms. The main purpose of exploration is to ensure that the agent’s behavior does not converge prematurely to a local optimum. Enabling efficient and effective exploration is difficult, since exploration is not directed by the reward function of the underlying Markov decision process (MDP).
A large number of methods have been proposed to tackle this challenge in high-dimensional and/or continuous-action MDPs. These methods encourage exploration either by adding temporally-correlated noise to the actions or by adding noise to the parameters. Their main limitation is that they are either proposed and evaluated only for the on-policy setting with relatively small and shallow function approximators, or they disregard all temporal structure and gradient information.
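For context, the most common form of temporally-correlated action noise is an Ornstein-Uhlenbeck process, as popularized with DDPG. A minimal sketch (the parameter values below are illustrative, not taken from the paper):

```python
import numpy as np

def ou_noise(n_steps, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
    """Ornstein-Uhlenbeck process: noise that is correlated across
    time steps, typically added to a deterministic policy's actions
    to encourage exploration in continuous-action tasks."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_steps)
    for t in range(1, n_steps):
        # Mean-reverting drift toward 0 plus Gaussian diffusion.
        x[t] = x[t - 1] + theta * (0.0 - x[t - 1]) * dt \
               + sigma * np.sqrt(dt) * rng.normal()
    return x

noise = ou_noise(1000)
```

Because consecutive samples are correlated, the perturbed actions drift smoothly rather than jittering independently at each step.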
This paper proposes adding noise to the parameters of a deep network (parameter space noise) when taking actions in deep reinforcement learning, in order to encourage exploration. The effectiveness of this approach is demonstrated through empirical analysis across a variety of reinforcement learning algorithms (i.e., DQN, DDPG, and TRPO) and tasks. It answers the following questions:
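The core idea can be sketched as follows: perturb every weight of the policy network once (e.g., at the start of an episode) and then act with the perturbed policy, so the exploratory behavior is consistent within the episode. This is only an illustration of the mechanism; the paper's full method additionally adapts the noise scale to a target distance in action space. The linear policy and parameter shapes below are hypothetical:

```python
import numpy as np

def perturb_params(params, sigma=0.1, seed=None):
    """Parameter space noise: add spherical Gaussian noise to each
    weight tensor of the policy (a sketch of the core idea)."""
    rng = np.random.default_rng(seed)
    return [w + sigma * rng.standard_normal(w.shape) for w in params]

def policy_action(params, obs):
    """Tiny linear policy used purely for illustration."""
    W, b = params
    return np.tanh(obs @ W + b)

# Hypothetical policy with a 4-dim observation and 2-dim action.
params = [np.zeros((4, 2)), np.zeros(2)]
noisy = perturb_params(params, sigma=0.1, seed=0)

obs = np.ones(4)
action = policy_action(noisy, obs)  # act with the perturbed policy
```

Unlike per-step action noise, the same perturbed weights are reused for many steps, which yields structured, state-dependent exploration.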
Overall Score: 20/30
Average Score: 6.66
The reviewers were pleased with the paper. They described it as a simple exploration strategy that is empirically effective. The paper was found to be clear and well written, with thorough experiments across deep RL domains. The authors also released open-source code alongside the paper for reproducibility, which the reviewers appreciated.
However, a common thread among the reviews was that the authors overstated their claims and contributions. The reviewers called out some statements in particular (e.g., the discussion of evolution strategies (ES) versus RL). They also felt that the paper lacked a strong justification for the method beyond its being empirically effective and intuitive.