To carry out imitation, the expert should be able to simply demonstrate tasks capably, without extensive effort, instrumentation, or engineering: collecting many demonstrations is time-consuming, exact state-action knowledge is often impractical to obtain, and reward design requires more than task expertise. The agent, in turn, should be able to achieve the demonstrated goals without spending time learning each and every task from scratch. To address these issues, the authors recast learning from demonstration as doing from demonstration by
(1) providing demonstrations only at inference time, and
(2) restricting demonstrations to visual observations alone rather than full state-action trajectories.
In other words, instead of doing imitation learning, the agent must learn to imitate; this is the goal the authors set out to achieve.
The paper explains how existing approaches to imitation learning distill both what to do (the goal) and how to do it (the skills) from expert demonstrations. This supervision is effective but expensive: it is not always practical to collect many detailed demonstrations. The authors argue that if an agent has access to its environment alongside the expert, it can learn skills from its own experience and rely on the expert only for the goals. They therefore propose a 'zero-shot' method that uses no expert actions or demonstrations during learning: the zero-shot imitator has no prior knowledge of the environment and makes no use of the expert during training, but learns from its own experience to follow experts at test time. The authors demonstrate this in experiments such as navigating an office with a TurtleBot and manipulating rope with a Baxter robot.
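To make 'doing from demonstration' concrete, the following is a minimal Python sketch of the inference-time loop this setup implies: the expert supplies only a sequence of images, and the agent uses a goal-conditioned skill policy learned from its own experience to reach each demonstration frame in turn. The names here (env, goal_policy, goal_reached, demo_images) are hypothetical placeholders, not the authors' code.

```python
# Hypothetical inference loop: follow a visual demonstration given only images.
def follow_demonstration(env, goal_policy, goal_reached, demo_images, max_steps=100):
    """Treat each expert frame as a subgoal and reach the frames in sequence."""
    obs = env.reset()
    for goal_img in demo_images:              # the expert provides observations only, no actions
        for _ in range(max_steps):
            if goal_reached(obs, goal_img):   # a learned goal recognizer (or a distance check)
                break
            action = goal_policy(obs, goal_img)   # skill learned from the agent's own exploration
            obs = env.step(action)
    return obs
```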
Overall Score: 25/30
Average Score: 8
As per one of the reviewers, the proposed approach is well founded and the experimental evaluations are promising; the paper is well written and easy to follow. The skill function uses an RNN as the function approximator and minimizes the sum of two losses: the state-mismatch loss over the trajectory (computed with an explicitly learned forward model) and the action-mismatch loss (computed with a model-free action prediction module). This is hard to optimize in practice because the forward model and the action predictor must be learned jointly, so the two are first learned separately and then fine-tuned together.
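To make the two-term objective concrete, below is a minimal PyTorch sketch under simplifying assumptions: discrete actions, pre-extracted state features, and a one-step forward model. The module names (SkillPolicy, ForwardModel), the dimensions, and the soft-action trick used to keep the state-mismatch term differentiable are illustrative choices, not the authors' exact implementation. In line with the reviewer's remark, the forward model could first be pretrained on the agent's own exploration data before the joint fine-tuning step shown here.

```python
# Minimal sketch of the two-term skill objective: action mismatch + state mismatch
# through a learned forward model. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACT_DIM, HID = 128, 4, 256

class SkillPolicy(nn.Module):
    """RNN that predicts the action taking the agent from the current state toward the goal."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(2 * STATE_DIM, HID, batch_first=True)
        self.head = nn.Linear(HID, ACT_DIM)

    def forward(self, states, goals):
        h, _ = self.rnn(torch.cat([states, goals], dim=-1))  # (B, T, HID)
        return self.head(h)                                   # action logits, (B, T, ACT_DIM)

class ForwardModel(nn.Module):
    """Predicts the next state features from the current state and an action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACT_DIM, HID), nn.ReLU(), nn.Linear(HID, STATE_DIM))

    def forward(self, states, action_probs):
        return self.net(torch.cat([states, action_probs], dim=-1))

def skill_losses(policy, fwd, states, goals, actions, next_states):
    """Action-mismatch loss plus state-mismatch loss through the forward model."""
    logits = policy(states, goals)
    action_loss = F.cross_entropy(logits.reshape(-1, ACT_DIM), actions.reshape(-1))
    # Feed the policy's (soft) predicted action through the forward model and
    # compare the predicted next state against the one actually observed.
    pred_next = fwd(states, F.softmax(logits, dim=-1))
    state_loss = F.mse_loss(pred_next, next_states)
    return action_loss, state_loss

# Toy joint fine-tuning step on random stand-in data (batch B, horizon T).
policy, fwd = SkillPolicy(), ForwardModel()
B, T = 8, 10
states = torch.randn(B, T, STATE_DIM)
goals = states[:, -1:, :].expand(-1, T, -1)   # use the final state as the goal at every step
actions = torch.randint(0, ACT_DIM, (B, T))
next_states = torch.randn(B, T, STATE_DIM)

opt = torch.optim.Adam(list(policy.parameters()) + list(fwd.parameters()), lr=1e-4)
action_loss, state_loss = skill_losses(policy, fwd, states, goals, actions, next_states)
(action_loss + state_loss).backward()
opt.step()
```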