A machine learning problem can also be analyzed in terms of information transfer or exchange. Our dataset is composed of n features, which are considered independent (for simplicity, although this is often a realistic assumption) and drawn from n different statistical distributions. Therefore, there are n probability density functions p_i(x) which must be approximated through n other functions q_i(x). In any machine learning task, it's very important to understand how two corresponding distributions diverge and how much information we lose when approximating the original dataset.
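As a rough illustration of this idea, the following sketch compares a true discrete distribution p with an approximation q using the Kullback-Leibler divergence, a standard measure of how much information is lost when q is used in place of p. The example distributions and the helper function kl_divergence are hypothetical choices made only for this illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two discrete
    distributions, in bits (logarithm base 2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Clip to avoid log(0) when a distribution assigns zero probability
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log2(p / q)))

# True distribution p and an approximation q over four outcomes
p = [0.50, 0.25, 0.15, 0.10]
q = [0.40, 0.30, 0.20, 0.10]

# Positive value: the information lost by approximating p with q
print(kl_divergence(p, q))
```

The divergence is zero only when the two distributions coincide, so it directly quantifies the approximation loss mentioned above.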
Elements of information theory
Entropy
The most useful measure in information theory (as well as in machine learning) is called entropy.
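For a discrete random variable X, entropy is defined as H(X) = -Σ p(x) log2 p(x), measured in bits when the logarithm is base 2. A minimal sketch of its computation could look as follows; the example distributions are arbitrary choices for illustration.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(X) = -sum(p * log2(p)), in bits."""
    p = np.asarray(p, dtype=float)
    p = np.clip(p, eps, 1.0)  # avoid log(0) for zero-probability outcomes
    return float(-np.sum(p * np.log2(p)))

# A fair coin has maximum entropy for two outcomes: 1 bit
print(entropy([0.5, 0.5]))   # 1.0
# A biased coin is more predictable, hence lower entropy
print(entropy([0.9, 0.1]))   # ~0.469
```

Intuitively, entropy is highest when all outcomes are equally likely (maximum uncertainty) and drops toward zero as the distribution concentrates on a single outcome.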