Basic statistical concepts with Qlik solutions
Now that we have been introduced to Qlik tools, we will explore some of the statistical concepts that are used with them. Statistical principles play a crucial role in the development of machine-learning algorithms. These principles provide the mathematical framework for analyzing and modeling data, making predictions, and improving the accuracy of machine-learning models over time. In this section, we will become familiar with some of the key concepts that will be needed when building machine-learning solutions.
Types of data
Different data types are handled differently, and each requires different techniques. There are two major data types in typical machine-learning solutions: categorical and numerical.
Categorical data typically defines a group or category using a name or a label. Each data point is assigned to exactly one category, and the categories are mutually exclusive. Categorical data can be further divided into nominal data and ordinal data. Nominal data simply names or labels a category, with no inherent order. Ordinal data is constructed from elements with rankings, orders, or rating scales; it can be ordered or counted but not measured. Some machine-learning algorithms can’t handle categorical variables unless these are converted (encoded) to numerical values.
Numerical data can be divided into discrete and continuous data. Discrete data is countable and formed from natural numbers, for example, age or the number of employees in a company. Continuous data can take any value within a range; examples are a person’s height or a student’s score. One type of data to pay attention to is datetime information. Dates and times are typically useful in machine-learning models but require some work to turn them into numerical data.
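As a minimal sketch of both preparation steps, the following Python snippet label-encodes a small, hypothetical nominal column and breaks a datetime into numeric features; the sample values and feature names are illustrative, not from any particular dataset:

```python
from datetime import datetime

# Hypothetical nominal data: a small categorical column.
colors = ["red", "green", "blue", "green"]

# Simple label encoding: map each distinct category to an integer code.
codes = {c: i for i, c in enumerate(sorted(set(colors)))}
encoded = [codes[c] for c in colors]
print(encoded)  # [2, 1, 0, 1]

# Turn a datetime into numeric features a model can use.
ts = datetime(2023, 5, 17, 14, 30)
features = {"year": ts.year, "month": ts.month,
            "weekday": ts.weekday(), "hour": ts.hour}
print(features)
```

Note that plain label encoding imposes an artificial order on nominal categories; one-hot encoding is often preferred for algorithms sensitive to that.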
Mean, median, and mode
The mean is calculated by dividing the sum of all values in a dataset by the number of values. The simplified equation can be formed like this:
mean = (sum of all data points) / (number of data points)
The following is a simple example to calculate the mean of a set of data points:
X = [5,15,30,45,50]
X̅ = (5+15+30+45+50)/5
X̅ = 29
The mean is sensitive to outliers, which can significantly affect its value. The mean is typically written as X̅.
The median is the middle value of the sorted dataset. Using the dataset in the previous example, our median is 30. The main advantage of the median over the mean is that the median is less affected by outliers. If there is a high chance for outliers, it’s better to use the median instead of the mean. If we have an even number of data points in our dataset, the median is the average of two middle points.
The mode represents the most common value in a dataset. It is mostly used when there is a need to understand clustering or, for example, encoded categorical data. Calculating the mode is quite simple. First, we need to order all values and count how many times each value appears in a set. The value that appears the most is the mode. Here is a simple example:
X = [1,4,4,5,7,9]
The mode = 4 since it appears two times and all other values appear only one time. A dataset can also have multiple modes (multimodal dataset). In this case, two or more values occur with the highest frequency.
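The calculations above can be reproduced with Python’s standard statistics module, using the same example datasets:

```python
import statistics

X = [5, 15, 30, 45, 50]
print(statistics.mean(X))    # 29
print(statistics.median(X))  # 30

Y = [1, 4, 4, 5, 7, 9]
print(statistics.mode(Y))    # 4

# multimode returns every value tied for the highest count,
# which handles the multimodal case.
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```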
Variance
Variance (σ2) is a statistical measure that describes the degree of variability or spread in a set of data. It is the average of the squared differences from the mean of the dataset.
In other words, variance measures how much each data point deviates from the mean of the dataset. A low variance indicates that the data points are closely clustered around the mean, while a high variance indicates that the data points are more widely spread out from the mean.
The formula for variance is as follows:
σ² = Σ(xᵢ − x̄)² / (n − 1)
where σ² is the variance of the dataset, n is the number of data points in the set, and Σ(xᵢ − x̄)² is the sum of the squared differences between each data point (xᵢ) and the mean (x̄). The square root of the variance is the standard deviation.
Variance is an important concept in statistics and machine learning, as it is used in the calculation of many other statistical measures, including standard deviation and covariance. It is also commonly used to evaluate the performance of models and to compare different datasets.
Variance is used to see how individual values relate to each other within a dataset. The advantage is that variance treats all deviations from the mean as the same, regardless of direction.
Example
We have a stock that returns 15% in year 1, 25% in year 2, and -10% in year 3. The mean of the returns is 10%. The difference of each year’s return from the mean is 5%, 15%, and -20%. Squaring these differences (treated as decimals) gives 0.0025, 0.0225, and 0.04, that is, 0.25%, 2.25%, and 4%. Adding these together gives 6.5%. Dividing by 2 (3 observations − 1) gives a variance of 3.25%.
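The same worked example in Python, with the returns expressed as decimals:

```python
import statistics

returns = [0.15, 0.25, -0.10]       # 15%, 25%, -10% as decimals
mean = sum(returns) / len(returns)  # 0.10

# Sample variance divides by n - 1, matching the formula above.
var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
print(round(var, 4))                # 0.0325, i.e. 3.25%

# statistics.variance computes the same sample variance.
assert abs(var - statistics.variance(returns)) < 1e-12
```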
Standard deviation
Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of data. It measures how much the individual data points deviate from the mean of the dataset.
A low standard deviation indicates that the data points are close to the mean, while a high standard deviation indicates that the data points are more spread out from the mean.
The formula for standard deviation is as follows:
σ = √( Σ(xᵢ − x̄)² / (n − 1) )
where σ is the standard deviation, Σ(xᵢ − x̄)² is the sum of the squared differences between each data point (xᵢ) and the mean (x̄), and n is the number of data points.
Example
Continuing from our previous example, we got a variance of 3.25% for our stock. Taking the square root of the variance yields a standard deviation of approximately 18%.
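In Python, statistics.stdev gives the sample standard deviation directly, so we can confirm the result for the stock-returns example:

```python
import statistics

returns = [0.15, 0.25, -0.10]
sd = statistics.stdev(returns)  # square root of the sample variance
print(round(sd, 4))             # 0.1803, i.e. about 18%
```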
Standardization
Standardization, or Z-score normalization, is the process of rescaling different variables to the same scale. This method allows comparison of scores between different types of variables. The z-score expresses a value as the number of standard deviations it lies from the mean. We can calculate the z-score using the following formula:
z = (x − x̄) / σ
In the formula, x is the observed value, x̅ is the mean, and σ is the standard deviation of the data.
Basically, the z-score describes how many standard deviations away a specific data point is from the mean. If the absolute z-score of a data point is high, the data point is most likely an outlier. Z-score normalization is one of the most popular feature-scaling techniques in data science and an important preprocessing step. Many machine-learning algorithms attempt to find trends in data and compare features of data points, which is problematic if the features are on different scales; this is why we need standardization.
Note
Standardized datasets will have a mean of 0 and standard deviation of 1, but there are no specific boundaries for maximum and minimum values.
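A minimal sketch of z-score normalization, applied to the dataset from the mean example, shows the property stated in the note:

```python
import statistics

X = [5, 15, 30, 45, 50]
mean = statistics.mean(X)   # 29
sd = statistics.stdev(X)    # sample standard deviation

# Standardize: each value becomes its distance from the mean in
# units of standard deviation.
z = [(x - mean) / sd for x in X]

# The standardized data has mean ~0 and standard deviation ~1,
# but no fixed minimum or maximum.
print(statistics.mean(z))   # ≈ 0.0
print(statistics.stdev(z))  # ≈ 1.0
```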
Correlation
Correlation is a statistical measure that describes the relationship between two variables. It measures the degree to which changes in one variable are associated with changes in another variable.
There are two types of correlation: positive and negative. Positive correlation means that the two variables move in the same direction, while negative correlation means that the two variables move in opposite directions. A correlation of 0 indicates that there is no relationship between the variables.
The most used measure of correlation is the Pearson correlation coefficient, which ranges from -1 to 1. A value of -1 indicates a perfect negative correlation, a value of 0 indicates no correlation, and a value of 1 indicates a perfect positive correlation.
The Pearson correlation coefficient can be used when the relationship of variables is linear and both variables are quantitative and normally distributed. There should be no outliers in the dataset.
Correlation can be calculated using the cor() function in R or the scipy.stats or NumPy libraries in Python.
Probability
Probability is a fundamental concept in machine learning that is used to quantify the uncertainty associated with events or outcomes. Basic concepts of probability include the following:
- Random variables: A variable whose value is determined by chance. Random variables can be discrete or continuous.
- Probability distribution: A function that describes the likelihood of different values for a random variable. Common probability distributions include the normal distribution, the binomial distribution, and the Poisson distribution.
- Bayes’ theorem: A fundamental theorem in probability theory that describes the relationship between conditional probabilities. Bayes’ theorem is used in many machine-learning algorithms, including naive Bayes classifiers and Bayesian networks.
- Conditional probability: The probability of an event occurring given that another event has occurred. Conditional probability is used in many machine-learning algorithms, including decision trees and Markov models.
- Expected value: The average value of a random variable, weighted by its probability distribution. Expected value is used in many machine-learning algorithms, including reinforcement learning.
- Maximum likelihood estimation: A method of estimating the parameters of a probability distribution based on observed data. Maximum likelihood estimation is used in many machine-learning algorithms, including logistic regression and hidden Markov models.
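To make Bayes’ theorem and conditional probability concrete, here is a short numeric sketch. The scenario and numbers are illustrative assumptions (a diagnostic test with 99% sensitivity, a 5% false-positive rate, and 1% prevalence), not from the text:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_condition = 0.01               # prior: prevalence of the condition
p_pos_given_condition = 0.99     # sensitivity
p_pos_given_no_condition = 0.05  # false-positive rate

# Total probability of a positive test (law of total probability).
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_no_condition * (1 - p_condition))

# Posterior: probability of the condition given a positive test.
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos
print(round(p_condition_given_pos, 3))  # 0.167
```

Despite the accurate test, the posterior is only about 17%, because the condition is rare; this is exactly the kind of reasoning a naive Bayes classifier automates.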
Note
Probability is a wide field in its own right, and many books have been written about it. In this book, we will not go deeper into the details, but it is important to understand these terms at a high level.
We have now investigated the basic statistical principles that play a crucial role in Qlik tools. Next, we will focus on the concept of defining a proper sample size. This is an important step, since we are not always able to train our model with all the data, and we want our training dataset to represent the full data as closely as possible.