Cleaning the data
During data integration, we also took care of some level I data cleaning: the data is now in one standard data structure, and the attributes have codable and intuitive titles. However, because in_df is integrated from five different sources, different data recording practices were likely used, which may lead to inconsistencies across in_df.
For instance, the following figure shows how varied data collection for the Gender attribute has been:
We need to go over every attribute and make sure that there is no repetition of the same possibilities in a slightly different wording due to varying data collection or misspellings.
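The check described above can be sketched as follows. This is a minimal illustration rather than the exact code used for in_df: the sample values, the gender_map dictionary, and the raw_counts and clean_counts names are assumptions made for the example, since the actual recorded values are only shown in the figure.

```python
import pandas as pd

# Hypothetical stand-in for in_df; the real recorded values are not listed in the text.
in_df = pd.DataFrame({
    "Gender": ["Male", "male", "M", "Female", "F", "female ", "Male"],
})

# Step 1: go over every attribute and list its distinct values,
# so that reworded or misspelled duplicates become visible.
for col in in_df.columns:
    print(col, in_df[col].unique())

raw_counts = in_df["Gender"].value_counts()

# Step 2: strip whitespace and lowercase, then map each variant to one label.
# gender_map is an assumed mapping; build it from what Step 1 reveals.
gender_map = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
normalized = in_df["Gender"].str.strip().str.lower()

# Values missing from the map stay unchanged so they can be reviewed by hand.
in_df["Gender"] = normalized.map(gender_map).fillna(in_df["Gender"])

clean_counts = in_df["Gender"].value_counts()
print(clean_counts)
```

Keeping unmapped values unchanged, instead of turning them into missing values, is a deliberate choice here: an unexpected spelling then shows up in the next value_counts() check rather than silently disappearing.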
Detecting and dealing with outliers and errors
As our AQs are only going to rely on data visualization for answers, we don't need to detect outliers, as our addressing them would be adopting...