Sampling from data
Sampling methods and their caveats are worth knowing as a data scientist. For example, we can use sampling to downsize a large dataset for analysis or for prototyping code, to estimate confidence intervals, or to rebalance an imbalanced dataset for machine learning. Let's begin with a few fundamental tenets of sampling.
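As a quick sketch of the first use case, here is one way to downsize a dataset by drawing a uniform random sample without replacement. The dataset here is a hypothetical list of row indices, standing in for whatever table or file you are working with:

```python
import random

random.seed(42)  # seed for reproducibility

# Stand-in for a large dataset: 100,000 row indices
full_dataset = list(range(100_000))

# random.sample draws uniformly without replacement,
# so each row appears at most once in the subset
subset = random.sample(full_dataset, k=1_000)

print(len(subset))  # a 1% sample: 1,000 rows instead of 100,000
```

Libraries such as pandas offer the same idea through methods like `DataFrame.sample`, but the principle is identical: a uniform random subset preserves the broad statistical character of the data while being much cheaper to work with.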
The law of large numbers
The law of large numbers is a mathematical theorem which, in essence, says that the sample mean of a random variable approaches its expected value as the number of samples increases. An example is useful here: as we roll a fair six-sided die many times, the average value of the rolls approaches 3.5, which is exactly what we would expect from a uniform distribution over the values 1-6, whose average is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.
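The die-rolling example above can be simulated in a few lines. This is a minimal sketch using Python's standard library; the helper function name is just for illustration:

```python
import random

random.seed(0)  # reproducible rolls

def mean_of_rolls(n):
    """Average of n simulated rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# The sample mean drifts toward the expected value of 3.5
# as the number of rolls grows
for n in (10, 1_000, 100_000):
    print(n, mean_of_rolls(n))
```

With only 10 rolls the average can land far from 3.5, but by 100,000 rolls it sits within a few hundredths of it, which is the law of large numbers at work.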
In general, this means we should expect the average of a measurement to approach a fixed value as we increase our sampling, assuming the underlying process is random and follows some fixed underlying distribution.