A VAE is extremely similar in nature to the more basic autoencoder; it learns how to encode the data that it is fed into a simplified representation, and it is then able to recreate it on the other side based on that encoding. Unfortunately, standard autoencoders are usually limited to tasks such as denoising. Using standard autoencoders for generation is problematic, as the latent space in standard autoencoders does not lend itself to this purpose. The encodings they produce may not be continuous—they may cluster around very specific portions, and may be difficult to perform interpolation on.
However, as we want to build a more generative model, and we don't want to replicate the same image that we put in, we need variations on the input. If we attempt to do this with a standard autoencoder, there is a good chance that the end result will be rather...