Clean data
One of the most problematic aspects of datasets, when it comes to machine learning, is the presence of empty data points or empty values of features for data points. Let’s illustrate that with the example of the features extracted in the previous section. In the following table, I introduced an empty data point – the NaN
value in the middle column. This means that the value does not exist.
Hello |
printf |
return |
|
printf(“Hello world!”); |
1 |
NaN |
0 |
return 1 |
0 |
0 |
1 |
Figure 5.2 – Extracted features with a NaN value in the table
If we use this data as input to a machine learning algorithm, we’ll get an error...