Null values
You need to do something about the null values. Many machine learning algorithms (see Chapter 11, Machine Learning) require numerical input and will fail when they encounter nulls. There are several popular choices when dealing with null values:
- Eliminate the rows. This is a reasonable approach if the rows containing null values make up a very small percentage of the total dataset, around 1% or less.
- Replace the null value with a representative value, such as the median or the mean. This works well if the rows are too valuable to discard and the column itself is reasonably balanced, without heavy skew or extreme outliers.
- Replace the null value with the most likely value (the mode), perhaps a 0 or 1. This is preferable to averages when the median or mean would be unrealistic, for example in a binary column where a mean such as 0.3 is not a valid value.
Note
Mode is the statistical term for the value that occurs most frequently in a set of values.
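
The three options above can be sketched with pandas. This is a minimal illustration rather than a definitive recipe: the DataFrame and its column names (temp, humidity, holiday) are hypothetical, and which column gets which treatment is an assumption made purely for demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "temp": [10.0, np.nan, 14.0, 12.0],
    "humidity": [0.5, 0.7, np.nan, 0.9],
    "holiday": [0.0, 0.0, np.nan, 1.0],
})

# Option 1: eliminate every row that contains at least one null value.
dropped = df.dropna()

# Option 2: replace nulls with the median or the mean of the column.
filled = df.copy()
filled["temp"] = filled["temp"].fillna(filled["temp"].median())
filled["humidity"] = filled["humidity"].fillna(filled["humidity"].mean())

# Option 3: replace nulls with the mode (the most frequent value).
# Series.mode() returns a Series because ties are possible; take the first entry.
filled["holiday"] = filled["holiday"].fillna(filled["holiday"].mode()[0])

print(dropped)
print(filled)
```

Note that median(), mean(), and mode() skip null values by default, so each fill value is computed only from the entries that are present.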
As you can see, the right option depends on the data. That’s a general theme that holds true for data science: no one method fits all...