It’s important to understand that there are two different types of statistics: descriptive statistics (methods used to summarize or describe observations) and inferential statistics (methods that use those observations as a basis for estimates or predictions, that is, inferences about a situation that has not yet been investigated).
Look at the following two example statements. Which of them is a “descriptive” statistic and which is “inferential”?
- Based on our forecasting, we expect sales revenue next year to increase by 35%.
- Our average rating within our customer base was 8 out of 10.
The first statement is inferential as it goes beyond what has been observed in the past to make inferences about the future, while the second statement is descriptive as it summarizes historical observations.
In data science, data is often first explored with descriptive statistics as part of exploratory data analysis (EDA), which aims to profile and understand the data. Following this, statistical or ML models trained on a set of data (known as model training) can be used to make inferences on unseen data (known as model inference or execution). We will revisit this topic later in this book when we cover the basics of ML.
The distinction between descriptive and inferential statistics depends on the differences between samples and populations, two more important terms within statistics.
In statistical terminology, a population does not only refer to a population of people; it may equally refer to a population of transactions, products, or retail stores. The main point is that “population” refers to every example within a studied group. A data scientist may not be interested in every attribute of a population; they may only be interested in the sales revenues of retail stores or the prices of products.
However, even if a data scientist is interested in only one characteristic of a population, they will rarely have the luxury of studying all of its members. Usually, they will have to study a sample, a relatively small selection from within the population. This is often due to limitations of time and expense, or because data has only been collected for part of the population.
In this case, descriptive statistics can be used to summarize the sampled data, and it is with inference that a data scientist can attempt to go beyond the available data to generalize information to the entire population.
So, to summarize, descriptive statistics involves summarizing a sample, whereas inferential statistics is concerned with generalizing from a sample to make inferences about the entire population.
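The difference can be sketched in a few lines of Python. This is only an illustration: the ratings below are made-up values, and the interval uses a simple normal approximation (z = 1.96), which is an assumption rather than a method prescribed in this chapter.

```python
import math
import statistics

# Hypothetical sample of customer ratings (0-10 scale); values are illustrative only.
ratings = [8, 9, 7, 8, 10, 6, 8, 9, 7, 8]

# Descriptive: summarize what was actually observed in the sample.
sample_mean = statistics.mean(ratings)
sample_std = statistics.stdev(ratings)  # sample standard deviation (n - 1)

# Inferential: estimate a plausible range for the population mean.
# Normal approximation with z = 1.96 (roughly 95%); for a sample this small,
# a t-based interval would be somewhat wider.
standard_error = sample_std / math.sqrt(len(ratings))
margin = 1.96 * standard_error
interval = (sample_mean - margin, sample_mean + margin)

print(f"Sample mean (descriptive): {sample_mean:.2f}")
print(f"Approx. 95% interval for the population mean (inferential): "
      f"{interval[0]:.2f} to {interval[1]:.2f}")
```

The first number only describes the ten ratings we have; the interval is a statement about all customers, including those we never asked.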
How accurate are these generalizations from the sample to the population? This is a large part of what statistics is about: measuring uncertainty and error. When working with the results of statistical models or even ML models, it is useful to be comfortable with the idea of uncertainty and how to measure it, rather than shying away from it. Sometimes, business stakeholders may not want to see margins of error alongside the outputs of even simple statistical techniques because they want to know things with complete certainty, and any degree of uncertainty shown alongside results might be blown out of proportion.
However, we can rarely observe an entire population when making inferences, or have a model generalize to every possible edge case, to have absolute certainty in any result.
Even so, we can do a lot better than human intuition, and it is better to take a more scientific stance and to understand and measure the margin of error and uncertainty in our inferences and predictions. Unconsciously, we make decisions every day with partial information and some uncertainty. For example, if you’ve ever booked a hotel, you may have looked at a sample of hotels and read a sample of customer reviews, and then had to decide which hotel to book based on this sample. You may have seen one hotel with a single five-star review and another with 1,000 reviews averaging 4.8 stars. Although the first hotel had a higher average rating, which hotel would you book? Probably the latter, because you could infer that the margin of error in its rating was smaller. Importantly, though, there is still some margin of error, as not every customer may have left a review.
In the data science, ML, and AI worlds, the ability to investigate and understand uncertainty, and to set criteria for what margin of error is acceptable for your business use case, is critical to knowing when to proceed with deploying a model to production.
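To make the hotel intuition concrete, here is a rough sketch of how the margin of error of an average rating shrinks as the number of reviews grows. It assumes that individual ratings vary with a standard deviation of one star and uses a normal approximation; both are simplifying assumptions (the approximation is poor for a single review, and star ratings are bounded), so the numbers are only indicative.

```python
import math

def margin_of_error(std_dev: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for the mean of n independent ratings."""
    return z * std_dev / math.sqrt(n)

# Assumption: individual ratings vary with a standard deviation of 1 star.
assumed_std = 1.0

for n_reviews in (1, 10, 100, 1000):
    moe = margin_of_error(assumed_std, n_reviews)
    print(f"{n_reviews:>5} reviews -> margin of error of about +/- {moe:.2f} stars")
```

Because the margin of error falls with the square root of the sample size, the 4.8-star average from 1,000 reviews is far more trustworthy than a single five-star review.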
Sampling strategies
In data science, sampling is the process of selecting a subset of data from a larger population. Sampling can be a powerful tool for decision-makers to draw inferences and make predictions about the population, but it is important to choose the right sampling strategy to ensure the validity and reliability of the results.
Random sampling
Random sampling is the most common and straightforward sampling strategy. In its simplest form, each member of the population has an equal chance of being selected for the sample. Random selection can be applied in several ways, such as simple random sampling, stratified random sampling, or cluster sampling.
Simple random sampling involves randomly selecting individuals from the population without any restrictions or stratification. Stratified random sampling involves dividing the population into strata or subgroups based on certain characteristics and then randomly selecting individuals from each stratum. Cluster sampling involves dividing the population into clusters and randomly selecting entire clusters to be included in the sample.
Simple random sampling can be useful when the population is large and homogeneous, meaning that all members have similar characteristics. However, it may not be the best strategy when the population is diverse and there are significant differences between subgroups.
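Simple random sampling, as described above, can be sketched with Python's standard library. The population of store IDs and the sample size of 50 are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical population: 1,000 store IDs (illustrative, not from the text).
population = [f"store_{i:04d}" for i in range(1000)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Draw 50 stores without replacement; every store is equally likely to be picked.
sample = rng.sample(population, k=50)

print(len(sample), "stores sampled, for example:", sample[:3])
```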
Convenience sampling
Convenience sampling involves selecting individuals from the population who are easily accessible or available. This can include individuals who are in a convenient location, such as in the same office or building, or individuals who are readily available to participate in the study.
While convenience sampling can be a quick and easy way to gather data, it is not the most reliable strategy. The sample may not be representative of the population as it may exclude certain subgroups or over-represent others.
Stratified sampling
Stratified sampling involves dividing the population into subgroups based on certain characteristics and then selecting individuals from each subgroup to be included in the sample. This strategy can be useful when the population is diverse and there are significant differences between subgroups.
In proportionate stratified sampling, the number of individuals drawn from each subgroup is proportional to the size of that subgroup in the population. This ensures that each subgroup is adequately represented in the sample, so the results can be extrapolated to the population with greater accuracy.
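A minimal sketch of proportional allocation follows. The helper `stratified_sample`, the region labels, and the 10% sampling fraction are all hypothetical names and values introduced for this example.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, rng):
    """Draw the same fraction at random from every stratum (proportional allocation)."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    sample = []
    for members in strata.values():
        # Take at least one member per stratum so that small groups are not dropped.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of (customer_id, region) pairs.
population = (
    [(i, "north") for i in range(600)]
    + [(i, "south") for i in range(600, 900)]
    + [(i, "west") for i in range(900, 1000)]
)

sample = stratified_sample(population, lambda row: row[1], 0.10, random.Random(0))

counts = defaultdict(int)
for _, region in sample:
    counts[region] += 1
print(dict(counts))  # {'north': 60, 'south': 30, 'west': 10}
```

The sample keeps the 60/30/10 split of the population, which a simple random sample of 100 customers would only match approximately.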
Cluster sampling
Cluster sampling involves dividing the population into clusters, which are typically based on geographic proximity or other shared characteristics. From these clusters, a random sample of clusters is selected, and all members within the selected clusters are included in the sample. This strategy can be useful when the population is geographically dispersed or when it is more feasible to access and survey entire clusters than individual participants.
Cluster sampling is often more cost-effective and efficient than other sampling methods, especially when dealing with large, spread-out populations. However, it may lead to higher sampling error compared to simple random sampling if the clusters are not representative of the entire population:
Figure 1.2: Stratified random sampling and cluster sampling
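The two steps of cluster sampling can be sketched as follows. The city names, store IDs, and the choice of two clusters are hypothetical and serve only to illustrate the mechanics.

```python
import random

# Hypothetical clusters: each city maps to the stores located in it.
clusters = {
    "city_a": ["a1", "a2", "a3"],
    "city_b": ["b1", "b2"],
    "city_c": ["c1", "c2", "c3", "c4"],
    "city_d": ["d1", "d2"],
    "city_e": ["e1", "e2", "e3"],
}

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Step 1: randomly select whole clusters rather than individual stores.
chosen_clusters = rng.sample(sorted(clusters), k=2)

# Step 2: include every member of each selected cluster.
sample = [store for city in chosen_clusters for store in clusters[city]]

print("Chosen clusters:", chosen_clusters)
print("Sampled stores:", sample)
```

Note that the sample size is not fixed in advance here; it depends on how large the selected clusters happen to be.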
Sampling is an important tool for decision-makers to draw inferences and make predictions about a population. The choice of sampling strategy will depend on the characteristics of the population and the research question being asked. Random sampling, stratified sampling, and cluster sampling are all useful strategies, but it is important to consider the potential biases and limitations of each method. By selecting the appropriate sampling strategy, decision-makers can ensure that their results are reliable and valid and can make better-informed decisions based on the data.
Random variables
What do we do with the members of a sample once we have them?
This is where the concept of random variables comes in.
In data science, a random variable is a variable whose value is determined by chance. Random variables are often used to model uncertain events or outcomes, and they play a crucial role in statistical analysis, ML, and decision-making.
More formally, a random variable is a function that assigns a numerical value to each possible outcome of a random process. For example, when flipping a coin, the value 0 can be assigned to tails and 1 to heads, so that the random variable, X, takes the value 0 or 1:
$$X = \begin{cases} 1, & \text{if heads} \\ 0, & \text{if tails} \end{cases}$$
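The same mapping can be written as a small Python function. This is an illustrative sketch: the simulated flips assume a fair coin, and the function name `X` simply mirrors the notation above.

```python
import random

def X(outcome: str) -> int:
    """Random variable for one coin flip: heads -> 1, tails -> 0."""
    return 1 if outcome == "heads" else 0

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Simulate 1,000 flips of a fair coin and map each outcome to a number.
flips = [rng.choice(["heads", "tails"]) for _ in range(1000)]
values = [X(flip) for flip in flips]

# The average of X over many flips estimates the probability of heads.
print("Proportion of heads:", sum(values) / len(values))
```

Turning outcomes into numbers is what makes it possible to compute averages, variances, and other statistics on them.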
There are two types of random variables: discrete and continuous. Discrete random variables can only take on a finite or countably infinite number of values, while continuous random variables can take on any value within a specified range.
For example, the outcome of rolling a six-sided die is a discrete random variable as it can only take on the values 1, 2, 3, 4, 5, or 6. On the other hand, the height of a person is a continuous random variable as it can take on any value within a certain range.
Random variables are often used in the context of sampling strategies as they provide a way to model and analyze uncertain outcomes in a sample.
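For example, suppose a decision-maker wants to estimate the average height of students at a university. One possible sampling strategy would be simple random sampling, in which a random sample of students is selected from the population of all students at the university. The height of each selected student can then be treated as a random variable, and the average height in the sample serves as an estimate of the average height in the population.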
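The following sketch simulates this scenario. The population is synthetic: the 20,000 students, the mean of 170 cm, and the standard deviation of 10 cm are assumptions made purely for illustration.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 20,000 students, simulated from a
# normal distribution. The mean of 170 and spread of 10 are assumptions.
population = [rng.gauss(170, 10) for _ in range(20_000)]
population_mean = statistics.fmean(population)

# Each simple random sample of 100 students gives a different sample mean,
# so the sample mean is itself a random variable.
sample_means = [
    statistics.fmean(rng.sample(population, k=100)) for _ in range(5)
]

print(f"Population mean: {population_mean:.1f} cm")
print("Sample means:", [round(m, 1) for m in sample_means])
```

The sample means scatter around the population mean, and the size of that scatter is exactly the uncertainty that inferential statistics sets out to measure.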
Probability distribution
The probability distribution of a random variable describes the likelihood of each possible value of the variable. For a discrete random variable, the probability distribution is typically represented by a probability mass function (PMF), which gives the probability of each possible value. For a continuous random variable, the probability distribution is typically represented by a probability density function (PDF), which gives the probability density at each point in the range.
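As a brief illustration, the sketch below builds the PMF of a fair six-sided die and uses a normal distribution as a stand-in PDF for height. Modeling height as normal with a mean of 170 cm and a standard deviation of 10 cm is an assumption made for this example only.

```python
from fractions import Fraction
from statistics import NormalDist

# PMF of a fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
print("P(roll = 3) =", die_pmf[3])
print("Sum of PMF  =", sum(die_pmf.values()))  # probabilities sum to 1

# PDF of height, modeled (as an assumption) as normal with mean 170 cm, sd 10 cm.
height = NormalDist(mu=170, sigma=10)
print("Density at 170 cm:", round(height.pdf(170), 4))

# A density is not a probability; probabilities come from areas under the curve.
p_between = height.cdf(180) - height.cdf(160)
print("P(160 <= height <= 180):", round(p_between, 3))
```

For the die, the PMF gives the probability of each value directly. For height, the probability of any single exact value is zero, so probabilities are obtained over intervals instead.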