Suppose you are training a network to classify pictures of cats and dogs. Over the course of training, the intermediate layers learn different representations, or features, from the input values (such as cat ears, dog eyes, and so on), combining them probabilistically to detect the presence of an output class (that is, whether a picture is of a cat or a dog).
Yet, while performing inference on an individual image, do we need the feature that detects cat ears to ascertain that this particular image is of a dog? In almost all cases, the answer is a resounding no. We can safely assume that most of the features a network learns during training are not relevant to any single prediction. Hence, we want our network to learn a sparse representation for each input: a resulting tensor representation in which most values are zero.
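As a minimal sketch of this idea, the snippet below (an illustration, not taken from any particular network) treats a random vector as the pre-activations of an intermediate layer and applies ReLU, the activation most commonly associated with sparse representations. Because ReLU zeroes out every negative input, roughly half of the resulting activations are exactly zero for symmetrically distributed pre-activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations of one intermediate layer for a single input.
pre_activations = rng.normal(size=1000)

# ReLU sets all negative values to zero, producing a sparse activation vector.
activations = np.maximum(pre_activations, 0.0)

# Sparsity: the fraction of entries that are exactly zero.
sparsity = np.mean(activations == 0.0)
print(f"Fraction of zero activations: {sparsity:.2f}")
```

In practice, sparsity like this means that for any given image, only a small subset of learned features (say, the dog-eye detectors but not the cat-ear detectors) actively contribute to the prediction.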