A dataset can sometimes contain missing features. There are a few options for handling them:
- Removing the whole row (sample)
- Creating a submodel to predict those features
- Using an automatic strategy to impute them according to the other known values
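To make the first and third options concrete, here is a minimal sketch using plain NumPy. The toy array and the variable names are made up for illustration, and the column mean is only one possible automatic strategy:

```python
import numpy as np

# Toy dataset (made up for illustration): np.nan marks a missing feature.
data = np.array([
    [1.0, np.nan, 2.0],
    [2.0, 3.0, np.nan],
    [-1.0, 4.0, 2.0],
])

# Option 1: drop every row that contains at least one missing value.
complete_rows = data[~np.isnan(data).any(axis=1)]

# Option 3: replace each missing value with the mean of the known
# values in the same column (one possible automatic strategy).
column_means = np.nanmean(data, axis=0)
imputed = np.where(np.isnan(data), column_means, data)
```

On this tiny array, dropping rows keeps only one of the three samples, while the fill-in approach keeps all of them. That trade-off is what makes the choice between the options matter.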
The first option is the most drastic one and should only be considered when the dataset is quite large, the number of missing features is high, and any prediction could be risky. The second option is much more difficult because it requires choosing a supervised strategy, training a model for each feature and, finally, predicting the missing values. Considering all pros and cons, the third option is likely to be the best choice. scikit-learn offers the Imputer class (superseded by SimpleImputer in the sklearn.impute module since version 0.20), which is responsible for filling the holes using a strategy based on the mean (default choice), median, or frequency (the most frequent...