Sampling from distributions
So far, we’ve learned a lot about random variables and probability distributions, including how to calculate key characteristics of a distribution such as its mean and variance, and we’ve met some commonly occurring distributions. But it doesn’t yet feel like we’ve learned much about data. We’ll now change that.
How datasets relate to random variables and probability distributions
We said at the beginning of this chapter that all data is random. This means when data is captured or generated, we are drawing or sampling values from some underlying probability distribution. This is illustrated schematically in Figure 2.10:
Figure 2.10: Diagram illustrating how real data is generated as samples from a population
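To make the idea of data as draws from an underlying distribution concrete, here is a minimal sketch in Python using NumPy. The choice of a normal distribution, its parameters (mean 5.0, standard deviation 2.0), the sample size of 1,000, and the seed are all illustrative assumptions, not values taken from the text.

```python
import numpy as np

# Seeded generator so the draws are reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=42)

# Hypothetical underlying ("population") distribution: a normal distribution.
# These parameters are assumptions chosen purely for illustration.
population_mean = 5.0
population_std = 2.0

# A dataset is a finite set of values drawn from the underlying distribution
sample = rng.normal(loc=population_mean, scale=population_std, size=1000)

# Summary statistics computed from the sample approximate, but do not
# exactly equal, the parameters of the underlying distribution
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation

print(f"Population mean: {population_mean}, sample mean: {sample_mean:.3f}")
print(f"Population std:  {population_std}, sample std:  {sample_std:.3f}")
```

Running this with a different seed gives a different sample and slightly different sample statistics, which is the sense in which a dataset is a random draw from the distribution behind it.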
A sample is finite. It represents a snapshot or subset of the entirety of possible outcomes; for example, a subset of all users who might visit a website. But from...