Summary
This was the first chapter in our book, Data Science Projects with Python. Here, we made extensive use of pandas to load and explore the case study data. We learned how to check for basic consistency and correctness by using a combination of statistical summaries and visualizations. We answered such questions as "Are the unique account IDs truly unique?", "Is there any missing data that has been given a fill value?", and "Do the values of the features make sense given their definition?"
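The three questions above can each be turned into a short pandas check. The following is a minimal sketch on a toy DataFrame, not the case study data itself. The column names (ID, AGE, PAY_1), the use of 0 as a fill value, and the set of allowed status codes are all assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for the case study; all names and values are hypothetical.
df = pd.DataFrame({
    "ID": ["a1", "b2", "b2", "c3"],
    "AGE": [34, 0, 0, 51],
    "PAY_1": [0, -1, -1, 9],
})

# 1. Are the unique account IDs truly unique?
#    Any ID that appears more than once is a duplicate worth investigating.
id_counts = df["ID"].value_counts()
duplicated_ids = id_counts[id_counts > 1].index.tolist()

# 2. Has missing data been given a fill value?
#    Here we assume 0 was used as the fill value for AGE.
zero_age_rows = int((df["AGE"] == 0).sum())

# 3. Do the feature values make sense given their definition?
#    allowed_status is an assumed data dictionary, not the real one.
allowed_status = {-1, 0, 1, 2}
unexpected_status = (
    df.loc[~df["PAY_1"].isin(allowed_status), "PAY_1"].unique().tolist()
)

print(duplicated_ids, zero_age_rows, unexpected_status)
```

Each check produces a small result that is easy to show to a client: a list of duplicated IDs, a count of suspicious rows, and a list of values that the data dictionary does not explain.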
You may notice that we spent nearly all of this chapter identifying and correcting issues with our dataset. This is often the most time-consuming stage of a data science project. While it is not always the most exciting part of the job, it gives you the raw materials necessary to build exciting models and insights. These will be the subjects of most of the rest of this book.
Mastery of software tools and mathematical concepts is what allows you to execute data science projects at a technical level. However, managing your relationships with clients, who are relying on your services to generate insights from their data, is just as important to a successful project. You must make as much use as you can of your business partner's understanding of the data. They are likely to be more familiar with it than you are, unless you are already a subject matter expert on the data for the project you are completing. Even in that case, your first step should be a thorough and critical review of the data you are using.
In our data exploration, we discovered an issue that could have undermined our project: the data we had received was not internally consistent. Most of the months of the payment status features were plagued by a data reporting issue, included nonsensical values, and were not representative of the most recent month of data, or of the data that would be available to the model going forward. We only uncovered this issue by taking a careful look at all of the features. Examining every feature is not always possible, especially when there is a very large number of them, but you should always take the time to spot check as many features as you can. If you can't examine every feature, it's useful to check a few from every category of feature (if the features fall into categories, such as financial or demographic).
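One way to spot check a few features from every category is to sample column names per category and summarize only those. This is a rough sketch under stated assumptions: the spot_check helper, the feature_groups mapping, and the column names are all hypothetical, and the toy values are invented for illustration.

```python
import random

import pandas as pd

# Hypothetical grouping of feature names into categories.
feature_groups = {
    "financial": ["LIMIT_BAL", "BILL_AMT1", "BILL_AMT2"],
    "demographic": ["AGE", "EDUCATION"],
}

# Toy data with invented values; the real dataset would be loaded instead.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000],
    "BILL_AMT1": [3913, 2682, 29239],
    "BILL_AMT2": [3102, 1725, 14027],
    "AGE": [24, 26, 34],
    "EDUCATION": [2, 2, 1],
})


def spot_check(data, groups, n_per_group=2, seed=0):
    """Return describe() summaries for a random sample of columns per group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    summaries = {}
    for group, columns in groups.items():
        chosen = rng.sample(columns, min(n_per_group, len(columns)))
        summaries[group] = data[chosen].describe()
    return summaries


summaries = spot_check(df, feature_groups)
for group, summary in summaries.items():
    print(group)
    print(summary)
```

Statistical summaries like these are only a starting point. Histograms of the sampled features would complement them in the same way that visualizations complemented summaries in this chapter.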
When discussing data issues like this with your client, make sure you are respectful and professional. The client may simply have forgotten about the issue when presenting you with the data. Or, they may have known about it but assumed it wouldn't affect your analysis for some reason. In any case, you are doing them an essential service by bringing it to their attention and explaining why it would be a problem to use flawed data to build a model. You should back up your claims with results if possible, showing that using the incorrect data leads to either decreased or unchanged model performance. Alternatively, you could explain that if the data available in the future will be of a different kind from the data available now for training, the model built now will not be useful. Be as specific as you can, presenting the kinds of graphs and tables that we used to discover the data issue here.
In the next chapter, we will examine the response variable for our case study problem, which completes the initial data exploration. Then we will start to get some hands-on experience with machine learning models and learn how we can decide whether a model is useful or not.