Data Wrangling with Python
Creating actionable data from raw sources

Authors: Dr. Tirthajyoti Sarkar and Shubhadeep Roychowdhury
Publisher: Packt
Published in: Feb 2019
ISBN-13: 9781789800111
Length: 452 pages
Edition: 1st Edition
Table of Contents

Preface
1. Introduction to Data Wrangling with Python
2. Advanced Data Structures and File Handling
3. Introduction to NumPy, Pandas, and Matplotlib
4. A Deep Dive into Data Wrangling with Python
5. Getting Comfortable with Different Kinds of Data Sources
6. Learning the Hidden Secrets of Data Wrangling
7. Advanced Web Scraping and Data Gathering
8. RDBMS and SQL
9. Application of Data Wrangling in Real Life
Appendix

Introduction


Data science and analytics are taking over the whole world, and the job of data scientist is routinely called the coolest job of the 21st century. But for all the emphasis on data, it is the science that makes you, the practitioner, truly valuable.

To practice high-quality science with data, you need to make sure it is properly sourced, cleaned, formatted, and pre-processed. This book teaches you the essentials of this invaluable component of the data science pipeline: data wrangling. In short, data wrangling is the process of ensuring that your data is clean, accurate, consistently formatted, and ready to be used for analysis.

A prominent example of data wrangling at scale is the work conducted at the supercomputer center of the University of California San Diego (UCSD). Wildfires are very common in California, mainly because of the dry weather and extreme heat, especially during the summers. Data scientists at the UCSD Supercomputer Center gather data to predict the nature of a fire and the direction in which it will spread. The data comes from diverse sources such as weather stations, sensors in the forest, fire stations, satellite imagery, and Twitter feeds, and it can arrive incomplete or with missing values. It needs to be cleaned and formatted so that it can be used to predict future occurrences of wildfires.

This is an example of how data wrangling and data science can prove to be helpful and relevant.

Importance of Data Wrangling

Oil does not come out of the rig in its final form; it has to be refined. Similarly, data must be curated, massaged, and refined before it can be used in intelligent algorithms and consumer products. This is known as data wrangling, and most data scientists spend the majority of their time on it.

Data wrangling is generally done at the very first stage of a data science/analytics pipeline. After data scientists identify useful data sources for solving the business problem (for instance, an in-house database, the internet, or streaming sensor data), they proceed to extract, clean, and format the necessary data from those sources.
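A minimal sketch of that extraction step, using pandas and assuming hypothetical file names for a sensor-data dump and a downloaded web feed, could look like this:

import pandas as pd

# Hypothetical flat-file dump of streaming sensor readings
sensor_readings = pd.read_csv("sensor_readings.csv")

# Hypothetical feed gathered from a web API, one JSON record per line
tweets = pd.read_json("tweets.json", lines=True)

# Quick sanity check on what was extracted
print(sensor_readings.shape)
print(tweets.shape)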

Generally, the task of data wrangling involves the following steps:

  • Scraping raw data from multiple sources (including web and database tables)

  • Imputing, formatting, and transforming – basically making it ready to be used in the modeling process (such as advanced machine learning)

  • Handling read/write errors

  • Detecting outliers

  • Performing quick visualizations (plotting) and basic statistical analysis to judge the quality of your formatted data (the last two steps are sketched in the short example after this list)
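The outlier check and the quick quality assessment can often be done in a few lines of pandas and Matplotlib. Here is a minimal sketch, assuming a hypothetical formatted CSV file with a numeric temperature column:

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical formatted dataset; the file and column names are placeholders
df = pd.read_csv("formatted_data.csv")

# Basic statistical summary (count, mean, spread, quartiles) for each numeric column
print(df.describe())

# Quick plot to eyeball the distribution and spot obvious anomalies
df["temperature"].hist(bins=30)
plt.xlabel("temperature")
plt.ylabel("frequency")
plt.show()

# Rule-of-thumb outlier check: flag values more than three standard
# deviations away from the mean
z = (df["temperature"] - df["temperature"].mean()) / df["temperature"].std()
print(df[z.abs() > 3])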

This is an illustrative representation of the position and essential role of data wrangling in a typical data science pipeline:

Figure 1.1: Process of data wrangling

The process of data wrangling begins with finding the data that's appropriate for the analysis. This data can come from one or multiple sources, such as tweets, bank transaction statements in a relational database, sensor readings, and so on. It then needs to be cleaned: if values are missing, we either delete them or substitute them using one of several techniques; if there are outliers, we first detect them and then handle them appropriately; and if the data comes from multiple sources, we perform join operations to combine it.
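As a concrete illustration, here is a minimal pandas sketch of those cleaning and combining steps; the file names, column names, and join key are hypothetical:

import pandas as pd

# Hypothetical inputs: transactions exported from a relational database
# and customer records from a second source
transactions = pd.read_csv("transactions.csv")
customers = pd.read_csv("customers.csv")

# Missing data: drop transactions with no amount, and substitute a
# placeholder for missing customer names
transactions = transactions.dropna(subset=["amount"])
customers["name"] = customers["name"].fillna("unknown")

# Combine the two sources with a join on a shared key
combined = transactions.merge(customers, on="customer_id", how="left")
print(combined.head())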

In rare situations, data wrangling may not be needed at all. For example, if the data that's necessary for a machine learning task is already stored in an acceptable format in an in-house database, then a simple SQL query may be enough to extract it into a table that is ready to be passed on to the modeling stage.
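For instance, something along the following lines (the database file, table, and column names are hypothetical) might be all that is required before modeling:

import sqlite3
import pandas as pd

# Hypothetical in-house database; if the stored table is already clean
# and correctly typed, this single query yields a modeling-ready table
conn = sqlite3.connect("inhouse.db")
features = pd.read_sql(
    "SELECT customer_id, age, income, churned FROM customer_features", conn
)
conn.close()

print(features.head())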
