Handling Missing Values, Duplicates, and Outliers
In any dataset, we might have missing values, duplicate values, or outliers. We need to ensure that these are handled appropriately so that the data used by the model is clean.
Handling Missing Values
Missing values in a data frame can distort or break model training, so they need to be identified and handled during the pre-processing stage. In R, a missing value is represented as NA. Using the example that follows, we will see how to identify missing values in a dataset.
Using the is.na(), complete.cases(), and md.pattern() functions, we will identify the missing values.
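The is.na() function, as the name suggests, returns TRUE for those elements marked NA (or, for numeric or complex vectors, NaN, meaning Not a Number) and FALSE otherwise. The complete.cases() function returns TRUE for each row that contains no missing values and FALSE for each row that contains at least one. The md.pattern() function, which comes from the mice package, gives a summary of the missing-value patterns in the data.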
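As a quick illustration of how these functions behave, here is a minimal sketch in R. The data frame df and its columns are hypothetical and are not the dataset used in the exercise. The md.pattern() call assumes the mice package is installed (with a version that supports the plot argument), so it is guarded and skipped otherwise.

```r
# Hypothetical data frame for illustration only (not the exercise data)
df <- data.frame(
  age    = c(25, NA, 31, 47),
  income = c(52000, 61000, NA, 58000),
  city   = c("Pune", "Oslo", "Lima", NA),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix: TRUE wherever a cell is NA
na_mask <- is.na(df)

# Count the missing values in each column
na_per_column <- colSums(na_mask)
print(na_per_column)

# complete.cases() returns TRUE for rows with no missing values
complete_rows <- complete.cases(df)
print(complete_rows)

# Keep only the fully observed rows
print(df[complete_rows, ])

# For numeric vectors, is.na() is also TRUE for NaN
nan_check <- is.na(c(1, NaN, NA))
print(nan_check)

# md.pattern() lives in the mice package; run it only if mice is available
if (requireNamespace("mice", quietly = TRUE)) {
  print(mice::md.pattern(df, plot = FALSE))
}
```

Note that complete.cases() works row by row, whereas is.na() works cell by cell, so the two are often combined: one to locate the gaps and the other to filter the rows.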
Exercise 12: Identifying the Missing Values
In the following example, we are adding...