Describing our samples
Now that we understand the concepts of populations, samples, and random variables, what tools can we use to describe and understand our data sample?
Measures of central tendency
The expected value is a statistical measure that represents the average value of a random variable, weighted by its probability of occurrence. It provides a way to estimate the central tendency of a probability distribution and is useful for decision-making and predicting uncertain events or outcomes.
Measures of central tendency, including mean, median, and mode, are statistical measures that describe the central or typical value of a dataset.
The mean is the arithmetic average of a dataset, calculated by adding up all the values and dividing the sum by the number of values. It is the most common measure of central tendency but is sensitive to outliers (values that fall far above or below the majority of the data points). Because extreme values pull the mean toward them, it may not be representative of the entire dataset when outliers are present.
The median is the middle value of a dataset, with an equal number of values above and below it. It is a robust measure of central tendency and is less sensitive to outliers than the mean. The median is useful for skewed datasets, where the mean may not accurately represent the center of the data.
The mode is the value that occurs most frequently in a dataset. It is another measure of central tendency and is useful for datasets with discrete values or when the most frequent value is of particular interest. The mode can be used for both categorical and numerical data.
The following figure shows the differences between the mean, median, and mode for two different distributions of data. Imagine that this dataset shows the range of prices of a consumer product, say bottles of wine on an online wine merchant.
For symmetrical distributions, these three measures are equal; however, for asymmetrical data, they differ. The choice of which measure to use may depend on the distribution of your data. The mean can often be skewed by extreme outliers – for example, one really expensive bottle of wine is not reflective of most of the bottles being sold on the site, so you may want to use the median to better understand the average value within your dataset, and not get scared away from buying from the store!
Figure 1.4: The mode, median, and mean for a symmetrical distribution and an asymmetrical distribution
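As a quick illustration of this effect, consider the following sketch in Python's built-in statistics module, using made-up wine prices in which a single expensive bottle acts as an outlier:

```python
import statistics

# Hypothetical wine prices (in dollars) from an online merchant;
# the $400 bottle is an outlier.
prices = [12, 15, 15, 18, 20, 22, 25, 30, 35, 400]

mean_price = statistics.mean(prices)      # pulled upward by the outlier
median_price = statistics.median(prices)  # robust to the outlier
mode_price = statistics.mode(prices)      # most frequent price

print(f"mean: {mean_price}, median: {median_price}, mode: {mode_price}")
# → mean: 59.2, median: 21.0, mode: 15
```

The mean ($59.20) sits far above what a typical bottle costs, while the median ($21) and mode ($15) better reflect the bulk of the inventory.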
Overall, the expected value and measures of central tendency are important statistical concepts that play a critical role in data science, ML, and decision-making. They provide you with a way to understand and describe the characteristics of a dataset, and they help decision-makers make better-informed decisions based on the analysis of uncertain events or outcomes.
Measures of dispersion
Measures of dispersion are statistical measures that describe how spread out or varied a dataset is. They provide us with a way to understand the variability of the data and can be used to compare datasets.
Range
The range is a simple measure of dispersion that represents the difference between the highest and lowest values in a dataset. It is easy to calculate and provides a rough estimate of the spread of the data. For example, the range of the heights of students in a class would be the difference between the tallest and shortest students.
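With hypothetical heights in centimetres, the range is just the maximum minus the minimum:

```python
# Hypothetical student heights in centimetres.
heights = [152, 160, 165, 170, 171, 175, 180, 188]

# Range: difference between the tallest and shortest students.
data_range = max(heights) - min(heights)
print(data_range)  # → 36
```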
Variance and standard deviation
Variance and standard deviation are more informative measures of dispersion than the range: because they take every value into account rather than just the two extremes, they give a fuller picture of the variability of the data.
Variance is a measure of how far each value in a set of data is from the mean value. It is calculated by taking the sum of the squared differences between each value and the mean, divided by the total number of values:

σ² = Σ(xᵢ − μ)² / N

Here, μ is the mean and N is the total number of values.
Standard deviation is the square root of the variance:

σ = √(Σ(xᵢ − μ)² / N)
For example, suppose a company wants to compare the salaries of two different departments. The standard deviation of the salaries in each department can be calculated to determine the variability of the salaries within each department. The department with a higher standard deviation would have more variability in salaries than the department with a lower standard deviation.
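The salary comparison can be sketched with illustrative numbers. Note that the statistics module's pstdev implements the population formula above (dividing by N), while stdev divides by N − 1 for a sample:

```python
import statistics

# Hypothetical annual salaries (in thousands) for two departments.
dept_a = [50, 52, 51, 49, 53, 50]   # tightly clustered
dept_b = [30, 45, 60, 75, 90, 40]   # widely spread

# Population standard deviation (divides by N, as in the formula above).
std_a = statistics.pstdev(dept_a)
std_b = statistics.pstdev(dept_b)

print(f"Department A std dev: {std_a:.2f}")  # small: consistent salaries
print(f"Department B std dev: {std_b:.2f}")  # large: highly variable salaries
```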
Interquartile range
The interquartile range (IQR) is a measure of dispersion that represents the difference between the 75th and 25th percentiles of a dataset. In other words, it is the range of the middle 50% of the data. It is useful for datasets with outliers as it is less sensitive to extreme values than the range.
For example, suppose a teacher wants to compare the test scores of two different classes. One class has a few students with very high or very low scores, while the other class has a more consistent range of scores. The IQR of each class can be calculated to determine the range of scores that most students fall into.
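A minimal sketch with made-up test scores, using statistics.quantiles to obtain the quartile cut points (its default "exclusive" interpolation method), shows how the IQR stays compact even when the full range is stretched by extreme scores:

```python
import statistics

# Hypothetical test scores for a class with a few extreme results.
scores = [35, 62, 64, 66, 68, 70, 72, 74, 76, 98]

# quantiles(..., n=4) returns the three quartile cut points;
# Q1 and Q3 bound the middle 50% of the data.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
full_range = max(scores) - min(scores)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, range={full_range}")
```

Here the range (63) is dominated by the two outlying scores, while the IQR (11.0) describes where most students actually fall.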
Measures of dispersion are important statistical measures that provide insight into the variability of a dataset.
Degrees of freedom
Degrees of freedom is a fundamental concept in statistics that refers to the number of independent values or quantities that can vary in an analysis without breaking any constraints. It is essential to understand degrees of freedom when working with various statistical tests and models, such as t-tests, ANOVA, and regression analysis.
In simpler terms, degrees of freedom represents the amount of information in your data that is free to vary when estimating statistical parameters. The concept is used in hypothesis testing to determine the probability of obtaining your observed results if the null hypothesis is true.
For example, let’s say you have a sample of ten observations and you want to calculate the sample mean. Once you have calculated the mean, you have nine degrees of freedom remaining (10 - 1 = 9). This is because if you know the values of nine observations and the sample mean, you can always calculate the value of the 10th observation.
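This constraint is easy to verify directly: given nine (made-up) observations and the sample mean of all ten, the tenth value is fully determined, so only n − 1 = 9 values are free to vary:

```python
# Nine known observations plus a known sample mean pin down the tenth value.
known = [4, 7, 2, 9, 5, 6, 3, 8, 1]   # nine hypothetical observations
sample_mean = 5.0                      # mean of all ten observations
n = 10

# The constraint sum(values) = n * mean forces the tenth value.
tenth = n * sample_mean - sum(known)
print(tenth)  # → 5.0
```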
The general formula for calculating degrees of freedom is as follows:
df = n − p
Here, we have the following:
- n is the number of observations in the sample
- p is the number of parameters estimated from the data
Degrees of freedom is used in various statistical tests to determine the critical values for test statistics and p-values. For instance, in a t-test for comparing two sample means, degrees of freedom is used to select the appropriate critical value from the t-distribution table.
Understanding degrees of freedom is crucial for data science leaders as it helps them interpret the results of statistical tests and make informed decisions based on the data. It also plays a role in determining the complexity of models and avoiding overfitting, which occurs when a model is too complex and starts to fit the noise in the data rather than the underlying patterns.
Correlation, causation, and covariance
Correlation, causation, and covariance are important concepts in data science, ML, and decision-making. They are all related to the relationship between two or more variables and can be used to make predictions and inform decision-making.
Correlation
Correlation is a measure of the strength and direction of the relationship between two variables. It is a statistical measure that ranges from -1 to 1. A correlation of 1 indicates a perfect positive correlation, a correlation of 0 indicates no correlation, and a correlation of -1 indicates a perfect negative correlation.
For example, suppose we want to understand the relationship between a person’s age and their income. If we observe that as a person’s age increases, their income also tends to increase, this would indicate a positive correlation between age and income.
Causation
Causation refers to the relationship between two variables in which one variable causes a change in the other variable. Causation is often inferred from correlation, but it is important to note that correlation does not necessarily imply causation.
For example, suppose we observe a correlation between the number of ice cream cones sold and the number of drownings in a city. While these two variables are correlated, it would be incorrect to assume that one causes the other. Rather, there may be a third variable, such as temperature, that causes both the increase in ice cream sales and the increase in drownings.
Covariance
Covariance is a measure of the joint variability of two variables. It measures how much two variables change together. A positive covariance indicates that the two variables tend to increase or decrease together, while a negative covariance indicates that the two variables tend to change in opposite directions.
For example, suppose we want to understand the relationship between a person’s height and their weight. If we observe that as a person’s height increases, their weight also tends to increase, this would indicate a positive covariance between height and weight.
Correlation, causation, and covariance are important concepts in data science. By understanding these concepts, decision-makers can better understand the relationships between variables and make better-informed decisions based on the analysis of the data.
Covariance measures how two variables change together, indicating the direction of the linear relationship between them. However, covariance values are difficult to interpret because they are affected by the scale of the variables. Correlation, on the other hand, is a standardized measure that ranges from -1 to +1, making it easier to understand and compare the strength and direction of linear relationships between variables.
It is important to note that correlation does not necessarily imply causation and that other factors may be responsible for observed relationships between variables. A strong correlation between two variables does not automatically mean that one variable causes the other as there may be hidden confounding factors influencing both variables simultaneously.
The shape of data
When working with samples of data, it is helpful to understand the “shape” of the data, or how the data is distributed. In this respect, we can consider distributions of probabilities for both continuous and discrete data. These probability distributions can be used to describe and understand your data. Probability distributions can help you identify patterns or trends in the data. For example, if your data follows a normal distribution, it suggests that most values are clustered around the mean, with fewer values at the extremes. Recognizing these patterns can help inform decision-making or further analysis.
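The clustering behavior of a normal distribution can be checked on a simulated sample: for symmetric data, the mean and median nearly coincide, and roughly 68% of values fall within one standard deviation of the mean (the parameters below are arbitrary):

```python
import random
import statistics

random.seed(0)

# Simulated sample from a normal distribution: mean 100, std dev 15.
sample = [random.gauss(100, 15) for _ in range(10_000)]

mean_val = statistics.mean(sample)
median_val = statistics.median(sample)

# Share of values within one standard deviation of the true mean.
within_one_sd = sum(1 for x in sample if 85 <= x <= 115) / len(sample)

print(f"mean≈{mean_val:.1f}, median≈{median_val:.1f}, "
      f"share within one SD: {within_one_sd:.2%}")
```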