The Kullback-Leibler divergence
In the previous section, we learned about the similarities between random variables from an information theory perspective. But let's return to just a single random variable, $X$, and ask what would happen if we had two different distributions for that variable, $P(X)$ and $Q(X)$. As ever, we'll start with the discrete case.
Wait a moment! How can we have two different distributions for the same random variable? The variable is either random, and so has a given distribution, or it is not random. Yes, that's correct, but imagine that we have its true distribution, $P(X)$, and an approximation to it, which we'll call $Q(X)$. It may be that the true distribution is too complex to work with in practice in a data science algorithm, so we want to replace it with a more tractable distribution, $Q(X)$. Ideally, we'd like some way of measuring the difference between $P(X)$ and $Q(X)$, so that we can tell how good an approximation $Q(X)$ is. We need the Kullback-Leibler divergence.
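To make the idea concrete before the formal development, here is a minimal sketch in Python. The two distributions `p` and `q` below are made-up examples, and the calculation uses the standard discrete definition of the divergence, $\sum_x P(x)\log\frac{P(x)}{Q(x)}$, which this section goes on to introduce.

```python
import numpy as np

# A (hypothetical) true distribution P and an approximation Q
# over the same three discrete outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Discrete KL divergence: sum over outcomes of P(x) * log(P(x) / Q(x)).
kl_pq = np.sum(p * np.log(p / q))

print(f"D_KL(P || Q) = {kl_pq:.4f}")  # a small positive number; 0 only when Q matches P exactly
```

Running this gives a small positive value, and it shrinks to zero as `q` is made to match `p`, which is exactly the behaviour we want from a measure of how good the approximation is.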