Two common errors that we are going to look at in this article are bias and variance, and how a trade-off can be achieved between the two in order to build a successful ML model.
Let’s first have a look at what creates these kinds of errors.
Machine learning techniques, or more precisely supervised learning techniques, involve training, which is often the most important stage in the ML workflow. The machine learning model is trained using the training data.
How is this training data prepared? It comes from a dataset for which the correct outputs are already known. During the training stage, the algorithm analyzes the training data it is fed and captures the patterns it finds in an inferred function. This inferred function, derived from analysis of the training dataset, is the model that will later be used to map new examples.
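To make this concrete, here is a minimal sketch (using scikit-learn and made-up toy values, not any dataset referenced in this article) of training a model on data with known outputs and then using the inferred function on a new example:

```python
from sklearn.linear_model import LinearRegression

# Training data: inputs X_train with known outputs y_train (toy values)
X_train = [[1], [2], [3], [4], [5]]
y_train = [2.1, 4.0, 6.2, 7.9, 10.1]

# Training: the algorithm analyzes the data and captures the pattern
# in an inferred function, here a fitted linear model.
model = LinearRegression().fit(X_train, y_train)

# The inferred function (the model) is then used to map a new example.
print(model.predict([[6]]))  # roughly 12, following the learned pattern
```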
An ideal model generated from this training data should be able to generalize well. That is, it should learn from the training data and correctly predict or classify data in any new problem instance.
In general, the more complex the model is, the better it fits the training data. However, if the model is too complex, it also picks up random features, i.e. noise, in the training data; this is the case of overfitting, and the model is said to overfit. On the other hand, if the model is not complex enough and misses important dynamics present within the data, it is a case of underfitting.
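One hedged way to see the difference is to fit polynomial models of increasing degree to the same small, noisy dataset (a sketch with scikit-learn; the degrees and noise level are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)

for degree in (1, 4, 15):
    # degree 1 is too simple and underfits, degree 15 chases the noise and
    # overfits, while a moderate degree captures the underlying shape.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    # The training score keeps improving with complexity, even once the
    # extra complexity is only fitting noise.
    print(degree, model.score(X, y))
```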
Both overfitting and underfitting are basically errors in ML models or algorithms. It is generally impossible to minimize both of these errors at the same time, and this leads to a condition called the Bias-Variance Tradeoff.
Before getting into how to achieve the trade-off, let’s first understand how bias and variance errors occur.
Let’s understand each error with the help of an example. Suppose you have three training datasets, say T1, T2, and T3, and you pass each of them through a supervised learning algorithm. The algorithm generates three different models, say M1, M2, and M3, one from each training dataset.
Now let’s say you have a new input A, and the idea is to apply each model to this new input. Two types of errors can occur here. If the outputs generated by the models for input A differ from one another (B1, B2, B3), the algorithm is said to have a high Variance Error. On the other hand, if the output from all three models is the same (B) but incorrect, the algorithm is said to have a high Bias Error.
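The same thought experiment can be sketched in code: train one model per training set and compare their predictions for the same new input A (a toy illustration with scikit-learn; the random datasets below stand in for T1, T2, and T3):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def true_fn(x):
    return np.sin(2 * np.pi * x)

rng = np.random.RandomState(1)
A = np.array([[0.5]])               # the new input we want predictions for

predictions = []
for _ in range(3):                  # three training datasets: T1, T2, T3
    X = rng.uniform(0, 1, 40).reshape(-1, 1)
    y = true_fn(X).ravel() + rng.normal(scale=0.3, size=40)
    model = DecisionTreeRegressor().fit(X, y)      # one model per dataset
    predictions.append(model.predict(A)[0])        # outputs B1, B2, B3

# A large spread between B1, B2 and B3 points to high variance; predictions
# that agree with each other but sit far from the true value point to high bias.
print(predictions, "true value:", true_fn(0.5))
```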
High variance also means that the algorithm produces a model that is too specific to the training data, which is a typical case of Overfitting. On the other hand, high bias means that the algorithm has not picked up the defining patterns in the dataset, which is a case of Underfitting.
Some examples of high-bias ML algorithms are Linear Regression, Linear Discriminant Analysis, and Logistic Regression.
Examples of high-variance ML algorithms are Decision Trees, k-Nearest Neighbors, and Support Vector Machines.
For any supervised algorithm, having a high bias error usually means it has a low variance error, and vice versa. To be more specific, parametric or linear ML algorithms often have high bias but low variance, while non-parametric or non-linear algorithms often have low bias but high variance.
The goal of any ML model is to reach a state of both low variance and low bias, which is often difficult because of how machine learning algorithms are parametrized.
So how can we achieve a trade-off between the two?
Following are some ways to achieve the Bias-Variance Tradeoff:
The optimal point for any model is the level of complexity at which the increase in bias is equivalent to the reduction in variance. In practice, there is no analytical method to find this level. Instead, use an accurate measure of prediction error, explore different levels of model complexity, and then choose the complexity level that minimizes the overall error.
Generally, resampling-based measures such as cross-validation should be preferred over theoretical measures such as Akaike's Information Criterion; a short sketch of this approach follows below.
Source: http://scott.fortmann-roe.com/docs/BiasVariance.html
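For reference, the standard decomposition behind this discussion writes the expected prediction error of a model as the sum of three components:

Total Error = Bias² + Variance + Irreducible Error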
(The irreducible error is the noise that cannot be reduced by any algorithm, though it can sometimes be reduced with better data cleaning.)
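As a hedged sketch of this approach (using scikit-learn; the polynomial model family and the range of degrees are arbitrary illustrative choices), cross-validation can be used to estimate the error at each complexity level and pick the level with the lowest overall error:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=100)

errors = {}
for degree in range(1, 11):                       # candidate complexity levels
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold cross-validation: a resampling-based estimate of prediction error
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    errors[degree] = mse

best = min(errors, key=errors.get)                # complexity with lowest error
print("chosen degree:", best)
```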
Bagging and other resampling techniques can be used to reduce the variance in model predictions. In bagging (Bootstrap Aggregating), several replicas of the original dataset are created using random selection with replacement, a separate model is trained on each replica, and the models' predictions are averaged.
One modeling algorithm that makes use of bagging is Random Forests. In the Random Forest algorithm, the bias of the full model is equivalent to the bias of a single decision tree, which itself has high variance. By creating many of these trees, in effect a "forest", and then averaging their predictions, the variance of the final model can be greatly reduced compared with that of a single tree.
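As a small sketch of why this works (scikit-learn on toy data; none of the numbers are meaningful beyond illustration), compare a single deep tree with a forest of bagged trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(2)
X = rng.uniform(0, 1, 200).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=200)

# A single fully grown tree: low bias, but high variance.
tree = DecisionTreeRegressor().fit(X, y)

# A random forest: each tree is trained on a bootstrap replica of the data
# (random selection with replacement) and the trees' predictions are averaged.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

A = np.array([[0.25]])
print("single tree:", tree.predict(A))   # tends to follow the noise
print("forest:", forest.predict(A))      # averaging smooths the variance away
```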
Both the k-Nearest Neighbors and Support Vector Machine (SVM) algorithms have low bias and high variance, but the trade-off in both cases can be adjusted. In k-Nearest Neighbors, the value of k can be increased, which increases the number of neighbors that contribute to each prediction; this in turn increases the bias of the model while lowering its variance. In the SVM algorithm, the trade-off can be changed through the C parameter, which controls how many margin violations are tolerated in the training data: allowing more violations (a softer margin) increases the bias but decreases the variance.
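Here is a minimal sketch of both knobs (scikit-learn on a synthetic dataset; the particular values of k and C are illustrative, not recommendations). Note that in scikit-learn's SVC, a smaller C corresponds to a softer margin, i.e. more tolerated violations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# k-Nearest Neighbors: a larger k means more neighbors vote on each
# prediction, which raises bias and lowers variance.
for k in (1, 5, 25):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print("kNN  k=%d: %.3f" % (k, score))

# SVM: a smaller C tolerates more margin violations (softer margin, higher
# bias, lower variance); a larger C fits the training data more tightly.
for C in (0.01, 1.0, 100.0):
    score = cross_val_score(SVC(C=C), X, y, cv=5).mean()
    print("SVM C=%s: %.3f" % (C, score))
```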
Finally, you also have to make sure the model is trained properly, on data that is representative of the problem it will be used for.
So, these are some of the ways in which you can achieve a trade-off between the two.
Bias and variance are related to each other: if you increase one, the other decreases, and vice versa. With a proper trade-off, there is an optimal balance between bias and variance that gives us a model which is neither underfit nor overfit. And finally, the ultimate goal of any supervised machine learning algorithm is to isolate the signal in the dataset while eliminating the noise.