Handling missing values
Checking for missing values and handling them properly is an important step in the data preparation process. If they are left untreated, they can:
Lead to the relationships between the variables being analyzed incorrectly
Lead to incorrect interpretation and inference from the data
To see how, go back a few pages to where the describe method is explained and look at the output table. Why do the counts differ for many of the variables? There are 1310 rows in the dataset, as we saw earlier in the section. Why, then, is the count 1046 for age, 1309 for pclass, and 121 for body? This is because the dataset doesn't have a value for 264 (1310-1046) entries in the age column, 1 (1310-1309) entry in the pclass column, and 1189 (1310-121) entries in the body column. In other words, that many entries have missing values in their respective columns. If a column has a count value less than the number of rows in the dataset, it is most certainly because the...
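
The subtraction used above (number of rows minus the count for a column) can be reproduced in a few lines of pandas. The following is only a minimal sketch under stated assumptions: the DataFrame is a small made-up table, not the Titanic dataset, and only the column names age, pclass, and body are borrowed from the text. It shows that the per-column count excludes missing entries, and that the difference from the row count matches a direct tally of missing values.

```python
import numpy as np
import pandas as pd

# Made-up toy data (NOT the Titanic dataset); only the column
# names are borrowed from the text for illustration.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, np.nan],
    "age": [29.0, np.nan, 30.0, np.nan, 25.0],
    "body": [np.nan, 135.0, np.nan, np.nan, np.nan],
})

n_rows = len(df)                 # total number of rows in the table
non_missing = df.count()         # non-missing entries per column
missing = n_rows - non_missing   # same subtraction as in the text

# isnull().sum() tallies missing entries directly; it should agree
# with the subtraction for every column.
assert (missing == df.isnull().sum()).all()

# Side-by-side summary of non-missing and missing entries per column
print(pd.DataFrame({"count": non_missing, "missing": missing}))
```

The count row produced by describe holds the same per-column numbers as df.count() for numeric columns, so either can be compared against the number of rows to spot columns with missing values.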