Missing Values
When no value (that is, a null value) is recorded for a particular feature in a data point, we say that the data is missing. Missing values are inevitable in real datasets; no dataset is ever perfect. However, it is important to understand why the data is missing, and whether some factor influenced which values were lost. Recognizing this allows us to handle the remaining data appropriately. For example, if the data is missing randomly, then it's highly likely that the remaining data is still representative of the population. However, if the data is not missing at random and we assume that it is, that assumption could bias our analysis and subsequent modeling.
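As a rough illustration of the first check one might make, the sketch below counts the missing values per feature and compares how often a feature is missing across the values of another variable. It uses plain Python with `None` as the null marker; the records, the feature names, and the helpers `missing_counts` and `missing_rate_by` are all made up for this example. In practice, a library such as pandas offers the same kind of check through `isna()`.

```python
# Toy records (hypothetical data); None marks a missing value.
records = [
    {"age": 25, "income": 48000, "group": "A"},
    {"age": 31, "income": None, "group": "B"},
    {"age": None, "income": 52000, "group": "A"},
    {"age": 47, "income": None, "group": "B"},
    {"age": 52, "income": 61000, "group": "A"},
    {"age": 38, "income": None, "group": "B"},
]


def missing_counts(rows):
    """Count missing (None) values for each feature."""
    counts = {}
    for row in rows:
        for feature, value in row.items():
            # `value is None` is a bool, which adds as 0 or 1.
            counts[feature] = counts.get(feature, 0) + (value is None)
    return counts


def missing_rate_by(rows, feature, by):
    """Fraction of rows with `feature` missing, within each value of `by`."""
    totals, missing = {}, {}
    for row in rows:
        key = row[by]
        totals[key] = totals.get(key, 0) + 1
        missing[key] = missing.get(key, 0) + (row[feature] is None)
    return {key: missing[key] / totals[key] for key in totals}


print(missing_counts(records))
print(missing_rate_by(records, "income", by="group"))
```

In this toy data, `income` is missing only for group B. A pattern like that suggests the missingness depends on another variable, so treating it as random could bias any analysis built on the remaining rows.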
Let's look at the common reasons (or mechanisms) for missing data:
- Missing Completely at Random (MCAR): Values in a dataset are said to be MCAR if there is no correlation whatsoever between a value being missing and any other recorded variable or external...