Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Data Science Projects with Python

You're reading from   Data Science Projects with Python A case study approach to successful data science projects using Python, pandas, and scikit-learn

Arrow left icon
Product type Paperback
Published in Apr 2019
Publisher Packt
ISBN-13 9781838551025
Length 374 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Stephen Klosterman Stephen Klosterman
Author Profile Icon Stephen Klosterman
Stephen Klosterman
Arrow right icon
View More author details
Toc

Table of Contents (9) Chapters Close

Data Science Projects with Python
Preface
1. Data Exploration and Cleaning 2. Introduction toScikit-Learn and Model Evaluation FREE CHAPTER 3. Details of Logistic Regression and Feature Exploration 4. The Bias-Variance Trade-off 5. Decision Trees and Random Forests 6. Imputation of Missing Data, Financial Analysis, and Delivery to Client Appendix

Summary


This was the first chapter in our book, Data Science Projects with Python. Here, we made extensive use of pandas to load and explore the case study data. We learned how to check for basic consistency and correctness by using a combination of statistical summaries and visualizations. We answered such questions as "Are the unique account IDs truly unique?", "Is there any missing data that has been given a fill value?", and "Do the values of the features make sense given their definition?"

You may notice that we spent nearly all of this chapter identifying and correcting issues with our dataset. This is often the most time consuming stage of a data science project. While it is not always the most exciting part of the job, it gives you the raw materials necessary to build exciting models and insights. These will be the subjects of most of the rest of this book.

Mastery of software tools and mathematical concepts is what allows you execute data science projects, at a technical level. However, managing your relationships with clients, who are relying on your services to generate insights from their data, is just as important to a successful project. You must make as much use as you can of your business partner's understanding of the data. They are likely going to be more familiar with it than you, unless you are already a subject matter expert on the data for the project you are completing. However, even in that case, your first step should be a thorough and critical review of the data you are using.

In our data exploration, we discovered an issue that could have undermined our project: the data we had received was not internally consistent. Most of the months of the payment status features were plagued by a data reporting issue, included nonsensical values, and were not representative of the most recent month of data, or the data that would be available to the model going forward. We only uncovered this issue by taking a careful look at all of the features. While this is not always possible in different projects, especially when there is a very large number of features, you should always take the time to spot check as many features as you can. If you can't examine every feature, it's useful to check a few of every category of feature (if the features fall into categories, such as financial or demographic).

When discussing data issues like this with your client, make sure you are respectful and professional. The client may simply have forgotten about the issue when presenting you with the data. Or, they may have known about it but assumed it wouldn't affect your analysis for some reason. In any case, you are doing them an essential service by bringing it to their attention and explaining why it would be a problem to use flawed data to build a model. You should back up your claims with results if possible, showing that using the incorrect data either leads to decreased, or unchanged, model performance. Or, alternatively, you could explain that if only a different kind of data would be available in the future, compared to what's available now for training a model, the model built now will not be useful. Be as specific as you can, presenting the kinds of graphs and tables that we used to discover the data issue here.

In the next chapter, we will examine the response variable for our case study problem, which completes the initial data exploration. Then we will start to get some hands-on experience with machine learning models and learn how we can decide whether a model is useful or not.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image