The Kullback-Leibler divergence
In the previous section, we learned about the similarities between random variables from an information theory perspective. But let's return to just a single random variable, X, and ask what would happen if we had two different distributions for that variable, P(x) and Q(x). As ever, we'll start with the discrete case.
Wait a moment! How can we have two different distributions for the same random variable? The variable is either random, and so has a given distribution, or it is not random. Yes, that's correct, but imagine that we have its true distribution, P(x), and an approximation to that true distribution. We'll call the approximation Q(x). It may be that the true distribution is too complex to work with practically in a data science algorithm, so we want to replace it with a more tractable distribution, Q(x). Ideally, we'd like some way of measuring the difference between P(x) and Q(x) so that we can tell how good an approximation Q(x) is. We need...
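To make the setup concrete, here is a minimal Python sketch of the kind of comparison we're after. It assumes the standard discrete Kullback-Leibler divergence that this section builds toward, computed in base 2 so the result is in bits; the two distributions P and Q are made-up numbers, chosen purely for illustration.

```python
import numpy as np

# A hypothetical "true" distribution P(x) over four discrete outcomes,
# and a simpler approximation Q(x) over the same outcomes.
# The probabilities are illustrative only; each set sums to 1.
p = np.array([0.50, 0.25, 0.15, 0.10])
q = np.array([0.40, 0.30, 0.20, 0.10])

# Standard discrete Kullback-Leibler divergence (an assumption here,
# previewing the formula this section is building toward):
#   D_KL(P || Q) = sum over x of P(x) * log2(P(x) / Q(x))
kl_pq = np.sum(p * np.log2(p / q))
kl_qp = np.sum(q * np.log2(q / p))

print(f"D_KL(P || Q) = {kl_pq:.4f} bits")
print(f"D_KL(Q || P) = {kl_qp:.4f} bits")  # note: the two directions differ
</antml>```

Running this gives two small, non-negative numbers that quantify how far Q(x) sits from P(x); the closer Q(x) tracks P(x), the closer the result is to zero.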