Defining a proper sample size and population
Defining a proper sample size for machine learning is crucial for getting accurate results, and not knowing how much training data is needed is a common problem. Having a correct sample size is important for several reasons:
- Generalization: Machine-learning models are trained on a sample of data with the expectation that they will generalize to new, unseen data. If the sample size is too small, the model may not capture the full complexity of the problem, resulting in poor generalization performance.
- Overfitting: Overfitting occurs when a model fits the training data too closely, resulting in poor generalization performance. Overfitting is more likely to occur when the sample size is small because the model has fewer examples to learn from and may be more likely to fit the noise in the data.
- Statistical significance: In statistical inference, sample size is an important factor in determining the statistical significance of the results. A larger sample size provides more reliable estimates of model parameters and reduces the likelihood of errors due to random variation.
- Resource efficiency: Machine-learning models can be computationally expensive to train, especially with large datasets. Having a correct sample size can help optimize the use of computing resources by reducing the time and computational resources required to train the model.
- Decision-making: Machine-learning models are often used to make decisions that have real-world consequences. Having a correct sample size ensures that the model is reliable and trustworthy, reducing the risk of making incorrect or biased decisions based on faulty or inadequate data.
Defining a sample size
The sample size depends on several factors, including the complexity of the problem, the quality of the data, and the algorithm being used. “How much training data do I need?” is a common question at the beginning of a machine-learning project. Unfortunately, there is no single correct answer to that question, but there are some guidelines.
Generally, the following factors should be addressed when defining a sample:
- Have a representative sample: It is essential to have a representative sample of the population to train a machine-learning model. The sample size should be large enough to capture the variability in the data and ensure that the model is not biased toward a particular subset of the population.
- Avoid overfitting: Overfitting occurs when a model is too complex and fits the training data too closely. To avoid overfitting, it is important to have a sufficient sample size to ensure that the model generalizes well to new data.
- Consider the number of features: The number of features or variables in the dataset also affects the sample size. As the number of features increases, the sample size required to train the model also increases.
- Use power analysis: Power analysis is a statistical technique used to determine the sample size required to detect a significant effect. It can be used to determine the sample size needed for a machine-learning model to achieve a certain level of accuracy or predictive power.
- Use cross-validation: Cross-validation is a technique used to evaluate the performance of a machine-learning model. It involves splitting the data into training and testing sets and using the testing set to evaluate the model’s performance. The sample size should be large enough to ensure that the testing set is representative of the population and provides a reliable estimate of the model’s performance (a short sketch follows this list).
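To make the cross-validation idea concrete, here is a minimal Python sketch. It assumes the scikit-learn library is available; the synthetic dataset and logistic regression model are purely illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Create a small synthetic dataset (500 rows, 10 features) for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=123)

# Evaluate the model with 5-fold cross-validation
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

print(scores)          # accuracy of each fold
print(scores.mean())   # average accuracy across the folds

If the fold scores vary widely, that is a hint that the sample may be too small or not representative enough.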
There are several statistical heuristic methods available to estimate a sample size. Let’s take a closer look at some of these.
Power analysis
Power analysis is one of the key concepts in machine learning. It is mainly used to determine whether a statistical test has a sufficient probability of finding an effect and to estimate the sample size required for an experiment, given the significance level, effect size, and desired statistical power.
In this context, power is the probability that a statistical test will reject a false null hypothesis (H0) or, in other words, the probability of detecting an effect when the effect is actually there. A bigger sample size results in greater power. The main output of power analysis is an estimate of the appropriate sample size.
To understand the basics of power analysis, we need to get familiar with the following concepts:
- A type I error (α) is rejecting the null hypothesis (H0) when it is actually true (a false positive).
- A type II error (β) is failing to reject a false H0 or, in other words, the probability of missing an effect that is in the data (a false negative).
- The power is the probability of detecting an effect that is in the data.
- There is a direct relationship between the power and type II error:
Power = 1 – β
Generally, β should never be more than 20%, which gives us the commonly accepted minimum power level of 80%.
- The significance level (α) is the maximum risk of rejecting a true null hypothesis (H0) that you are willing to take. This is typically set to 5% (p < 0.05).
- The effect size is the measure of the strength of a phenomenon in the dataset (independent of sample size). The effect size is typically the hardest value to determine. An example of an effect size would be the height difference between men and women: the greater the height difference, the greater the effect size. The effect size is typically marked with the letter d in formulas (a small sketch for estimating d from pilot data follows this list).
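If pilot or historical data is available, the effect size d (Cohen’s d) can be estimated directly from the two groups before running the power analysis. The following Python sketch is only an illustration and the group values are hypothetical:

import numpy as np

# Hypothetical pilot data: prices paid by two customer groups
group_a = np.array([52.0, 61.5, 48.0, 75.2, 66.1, 58.3])
group_b = np.array([44.5, 50.2, 39.9, 61.0, 47.8, 55.4])

# Pooled standard deviation of the two groups
n_a, n_b = len(group_a), len(group_b)
pooled_sd = np.sqrt(((n_a - 1) * group_a.std(ddof=1) ** 2 +
                     (n_b - 1) * group_b.std(ddof=1) ** 2) / (n_a + n_b - 2))

# Cohen's d: difference of the group means divided by the pooled standard deviation
d = (group_a.mean() - group_b.mean()) / pooled_sd
print(d)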
Now that we have defined the key concepts, let’s look at how to use power analysis in R and Python to calculate the sample size for an experiment with a simple example. In R, we will utilize a package called pwr, and in Python we will utilize the NumPy and statsmodels.stats.power libraries.
Let’s assume that we would like to create a model of customer behavior. We want to know whether there is a difference in the mean price of what our preferred customers and other customers pay at our online shop. How many transactions in each group should we investigate to get a power level of 80%?
R:
library(pwr)

# Look up the conventional "medium" effect size for a t-test (d = 0.5)
ch <- cohen.ES(test = "t", size = "medium")
print(ch)

# Solve for the sample size per group, given effect size, power, and significance level
test <- pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05)
print(test)
The calculation will give us the following result:
     Two-sample t test power calculation

              n = 63.76561
              d = 0.5
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
So, we will need a sample of 64 transactions in each group.
Python:
import numpy as np
from statsmodels.stats.power import TTestIndPower

# Solve for the required sample size per group at 80% power
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size = 0.5, alpha = 0.05, power = 0.8)
print(str(sample_size))
Our Python code will produce the same result as our earlier R code, giving us 64 transactions in each group.
Note
Power analysis is a wide and complex topic, but it’s important to understand the basics, since it is widely utilized in many machine-learning tools. In this chapter, we have only scratched the surface of this topic.
Sampling
Sampling is a method that makes it possible to get information about the population (dataset) based on the statistics from a subset of population (sample), without having to investigate every individual value. Sampling is particularly useful if a dataset is large and can’t be analyzed in full. In this case, identifying and analyzing a representative sample is important. In some cases, a small sample can be enough to reveal the most important information, but generally, using a larger sample can increase the likelihood of representing the data as a whole.
When performing sampling, there are some aspects to consider:
- Sample goal: A property that you wish to estimate or predict
- Population: A domain from which observations are made
- Selection criteria: A method to determine whether an individual value will be accepted as a part of the sample
- Sample size: The number of data points that will form the final sample data
Sampling methods can be divided into two main categories:
- Probability sampling is a technique where every element of the dataset has an equal chance of being selected. These methods typically give the best chance of creating a sample that truly represents the population. Examples of probability sampling algorithms are simple random sampling, cluster sampling, systematic sampling, and stratified random sampling (a short stratified sampling sketch follows this list).
- Non-probability sampling is a method where not all elements have an equal chance of being selected. With these methods, there is a significant risk that the sample is non-representative. Examples of non-probability sampling algorithms are convenience sampling, selective sampling, snowball sampling, and quota sampling.
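As an example of one probability sampling method, the following Python sketch performs stratified random sampling with pandas; the segment column and proportions are made up for the illustration. Each segment contributes the same fraction of rows, so the sample keeps the population’s segment proportions:

import pandas as pd

# Hypothetical dataset with a grouping column ("segment")
df = pd.DataFrame({
    "id": range(1, 101),
    "segment": ["preferred"] * 30 + ["regular"] * 70,
})

# Stratified random sampling: draw 20% of the rows from each segment
sample = df.groupby("segment").sample(frac=0.2, random_state=123)
print(sample["segment"].value_counts())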
When using sampling as a methodology for training set creation, it is recommended to utilize the built-in sampling functions or a specialized sampling library in either R or Python. This will automate the process and produce a sample based on the selected algorithm and specifications. In R, we can utilize the built-in sample function, and in Python, the standard random module provides the random.sample function. Here is a simple example of random sampling with both languages:
R:
# Build a small example dataset and draw a random sample of 5 rows
dataset <- data.frame(id = 1:20, fact = letters[1:20])
set.seed(123)
sample <- dataset[sample(1:nrow(dataset), size = 5), ]
The content of the sample frame will look like this:
   id fact
15 15    o
19 19    s
14 14    n
3   3    c
10 10    j
Python:
import random

# Draw a random sample of 3 elements from a small example dataset
random.seed(123)
dataset = [[1,'v'],[5,'b'],[7,'f'],[4,'h'],[0,'l']]
sample = random.sample(dataset, 3)
print(sample)
The resulting sample list will look like the following:
[[1, 'v'], [7, 'f'], [0, 'l']]
Note
There is a lot of material covering different sampling techniques and how to use those with R and Python on the internet. Take some time to practice these techniques with simple datasets.
Sampling errors
In all sampling methods, errors are bound to occur. There are two types of sampling errors:
- Selection bias is introduced when the values included in the sample are not selected at random. In this case, the sample is not representative of the dataset that we are looking to analyze.
- Sampling error is a statistical error that occurs when the selected sample does not represent the entire population of data. In this case, the results of the prediction or model will not generalize to the entire dataset.
Training datasets will always contain some sampling error, since a sample can never fully represent the entire dataset. Sample error in the context of binary classification can be calculated using the following simplified formula:
Sample error = (False positives + False negatives) / (True positives + False positives + True negatives + False negatives)
If we have, for example, a dataset containing 45 values and 12 of these are misclassified (false positives and false negatives combined), we will get a sample error of 12/45 = 26.67%.
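A quick way to verify this calculation is a few lines of Python; the confusion-matrix counts below are hypothetical and chosen to match the example:

# Hypothetical confusion-matrix counts for a binary classifier (45 predictions in total)
tp, fp, tn, fn = 20, 7, 13, 5

# Sample error = misclassified predictions divided by all predictions
sample_error = (fp + fn) / (tp + fp + tn + fn)
print(round(sample_error * 100, 2))   # 26.67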
This formula can only be utilized in the context of binary classification. When estimating the population mean (μ) from a sample mean (x̄), the standard error is calculated as follows:
SE = σ / √n
- SE (Standard Error): The standard error is a measure of the variability or uncertainty in a sample statistic. It quantifies how much the sample statistic is expected to vary from the true population parameter. In other words, it gives you an idea of how reliable or precise your sample estimate is.
- σ (population standard deviation): This is the standard deviation of the entire population you’re trying to make inferences about. It represents the amount of variability or spread in the population data. In practice, the population standard deviation is often unknown, so you may estimate it using the sample standard deviation (s) when working with sample data.
- n (sample size): The number of observations or data points in your sample.
Example
We are conducting a survey to estimate the average age (mean) of residents in a small town. We collect a random sample of 50 residents and find the following statistics:
- Sample mean (x̄): 35 years
- Sample standard deviation (s): 10 years (an estimate of the population standard deviation)
- Sample size (n): 50 residents
SE = 10 / √50 ≈ 1.41 years
So, the standard error of the sample mean is approximately 1.41 years. This means that if you were to take multiple random samples of the same size from the population and calculate the mean for each sample, you would expect those sample means to vary around 35 years, with an average amount of variation of about 1.41 years.
Standard error is often used to construct confidence intervals. For instance, you might use this standard error to calculate a 95% confidence interval for the average age of residents in the town, which would allow you to estimate the range within which the true population mean age is likely to fall with 95% confidence.
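Here is a minimal Python sketch of this calculation, using the survey numbers above and the normal approximation (z = 1.96) for the 95% interval; with n = 50, a t-based interval would be only slightly wider:

import math

# Survey example: sample mean 35 years, sample standard deviation 10, n = 50
x_bar, s, n = 35.0, 10.0, 50

# Standard error of the mean
se = s / math.sqrt(n)

# 95% confidence interval using the normal approximation
ci_low = x_bar - 1.96 * se
ci_high = x_bar + 1.96 * se

print(round(se, 2))                         # about 1.41
print(round(ci_low, 2), round(ci_high, 2))  # about 32.23 to 37.77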
As we can see, sample error, often referred to as “sampling error,” is not represented by a single formula. Instead, it is a concept that reflects the variability or uncertainty in the estimates or measurements made on a sample of data when trying to infer information about a larger population. The specific formula for sampling error depends on the statistic or parameter you are estimating and the characteristics of your data. In practice, you would use statistical software or tools to calculate the standard error for the specific parameter or estimate you are interested in.
Training and test data in machine learning
The preceding methods for defining a sample size work well when we need to determine how much data is required and we do not have a large collection of historical data covering the phenomenon we are investigating. In many cases, however, we have a large dataset and would like to produce training and test datasets from that historical data. Training datasets are used to train our machine-learning model, and test datasets are used to validate the accuracy of our model. Training and test datasets are key concepts in machine learning.
We can utilize power analysis and sampling to create training and testing datasets, but sometimes there is no need for a complex analysis if our sample has already been created. The training dataset is the biggest subset of the original dataset and is used to fit the machine-learning model. The test dataset is another subset of the original data and is always independent of the training dataset.
Test data should be well organized and contain data for each type of scenario that the model will face in the production environment. Usually, it is 20–25% of the total original dataset. The exact split can be adjusted based on the requirements of the problem or the characteristics of the dataset.
Generating training and testing datasets from an original dataset can also be done using R or Python. Qlik functions can be used to perform this action in the load script.
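As one illustration, a common way to do this split in Python is scikit-learn’s train_test_split function; the small DataFrame below is purely hypothetical, and the same split could equally be implemented in R or in a Qlik load script:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical historical dataset
data = pd.DataFrame({
    "price": [10.5, 22.0, 13.8, 45.2, 8.9, 31.1, 27.4, 19.9, 55.0, 12.3],
    "is_preferred": [1, 0, 0, 1, 0, 1, 1, 0, 1, 0],
})

# Reserve 20% of the rows as an independent test set
train, test = train_test_split(data, test_size=0.2, random_state=123)
print(len(train), len(test))   # 8 training rows, 2 test rows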
Now that we have investigated some of the concepts used to define a good sample, we can focus on the concepts used to analyze model performance and reliability. These concepts are important, since using these techniques allows us to develop our model further and make sure that it gives proper results.