Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Practical Data Wrangling
Practical Data Wrangling

Practical Data Wrangling: Expert techniques for transforming your raw data into a valuable source for analytics

Arrow left icon
Profile Icon Visochek
Arrow right icon
€8.99 €19.99
eBook Nov 2017 204 pages 1st Edition
eBook
€8.99 €19.99
Paperback
€24.99
Subscription
Free Trial
Renews at €18.99p/m
Arrow left icon
Profile Icon Visochek
Arrow right icon
€8.99 €19.99
eBook Nov 2017 204 pages 1st Edition
eBook
€8.99 €19.99
Paperback
€24.99
Subscription
Free Trial
Renews at €18.99p/m
eBook
€8.99 €19.99
Paperback
€24.99
Subscription
Free Trial
Renews at €18.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Table of content icon View table of contents Preview book icon Preview Book

Practical Data Wrangling

Programming with Data

It takes a lot of time and effort to deliver data in a format that is ready for its end use. Let's use an example of an online gaming site that wants to post the high score for each of its games every month. In order to make this data available, the site's developers would need to set up a database to keep data on all of the scores. In addition, they would need a system to retrieve the top scores every month from that database and display it to the end users.

For the users of our hypothetical gaming site, getting this month's high scores is fairly straightforward. This is because finding out what the high scores are is a rather general use case. A lot of people will want that specific data in that specific form, so it makes sense to develop a system to deliver the monthly high scores.

Unlike the users of our hypothetical gaming site, data programmers have very specialized use cases for the data that they work with. A data journalist following politics may want to visualize trends in government spending over the last few years. A machine learning engineer working in the medical industry may want to develop an algorithm to predict a patient's likelihood of returning to the hospital after a visit. A statistician working for the board of education may want to investigate the correlation between attendance and test scores. In the gaming site example, a data analyst may want to investigate how the distribution of scores changes based on the time of the day.

A short side note on terminology
Data science as an all encompassing term can be a bit elusive. As it is such a new field, the definition of a data scientist can change depending on who you ask. To be more general, the term data programmer will be used in this book to refer to anyone who will find data wrangling useful in their work.

Drawing insight from data requires that all the information that is needed is in a format that you can work with. Organizations that produce data (for example, governments, schools, hospitals, and web applications) can't anticipate the exact information that any given data programmer might need for their work. There are too many possible scenarios to make it worthwhile. Data is therefore generally made available in its raw format. Sometimes this is enough to work with, but usually it is not. Here are some common reasons:

  • There may be extra steps involved in getting the data
  • The information needed may be spread across multiple sources
  • Datasets may be too large to work with in their original format
  • There may be far more fields or information in a particular dataset than needed
  • Datasets may have misspellings, missing fields, mixed formats, incorrect entries, outliers, and so on
  • Datasets may be structured or formatted in a way that is not compatible with a particular application

Due to this, it is often the responsibility of the data programmer to perform the following functions:

  • Discover and gather the data that is needed (getting data)
  • Merge data from different sources if necessary (merging data)
  • Fix flaws in the data entries (cleaning data)
  • Extract the necessary data and put it in the proper structure (shaping data)
  • Store it in the proper format for further use (storing data)

This perspective helps give some context to the relevance and importance of data wrangling. Data wrangling is sometimes seen as the grunt work of the data programmer, but it is nevertheless an integral part of drawing insights from data. This book will guide you through the various skill sets, most common tools, and best practices for data wrangling. In the following section, I will break down the tasks involved in data wrangling and provide a broad overview of the rest of the book. I will discuss the following steps in detail and provide some examples:

  • Getting data
  • Cleaning data
  • Merging and shaping data
  • Storing data

Following the high-level overview, I will briefly discuss Python and R, the tools used in this book to conduct data wrangling. 

Understanding data wrangling

Data wrangling, broadly speaking, is the process of gathering data in its raw form and molding it into a form that is suitable for its end use. Preparing data for its end use can branch out into a number of different tasks based on the exact use case. This can make it rather hard to pin down exactly what data wrangling entails, and formulate how to go about it. Nevertheless, there are a number of common steps in the data wrangling process, as outlined in the following subsections. The approach that I will take in this book is to introduce a number of tools and practices that are often involved in data wrangling. Each of the chapters will consist of one or more exercises and/or projects that will demonstrate the application of a particular tool or approach. 

Getting and reading data

The first step is to retrieve a dataset and open it with a program capable of manipulating the data. The simplest way of retrieving a dataset is to find a data file. Python and R can be used to open, read, modify, and save data stored in static files. In Chapter 3, Reading, Exploring, and Modifying Data - Part I, I will introduce the JSON data format and show how to use Python to read, write and modify JSON data. In Chapter 4Reading, Exploring, and Modifying Data - Part II, I will walk through how to use Python to work with data files in the CSV and XML data formats. In Chapter 6, Cleaning Numerical Data - An Introduction to R and Rstudio, I will introduce R and Rstudio, and show how to use R to read and manipulate data. 

Larger data sources are often made available through web interfaces called application programming interfaces (APIs). APIs allow you to retrieve specific bits of data from a larger collection of data. Web APIs can be great resources for data that is otherwise hard to get. In Chapter 8, Getting Data from the Web, I discuss APIs in detail and walk through the use of Python to extract data from APIs.

Another possible source of data is a database. I won't go into detail on the use of databases in this book, though in Chapter 9, Working with Large Datasets, I will show how to interact with a particular database using Python.

Databases are collections of data that are organized to optimize the quick retrieval of data. They can be particularly useful when we need to work incrementally on very large datasets, and of course may be a source of data.

Cleaning data

When working with data, you can generally expect to find human errors, missing entries, and numerical outliers. These types of errors usually need to be corrected, handled, or removed to prepare a dataset for analysis.

In Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions, I will demonstrate how to use regular expressions, a tool to identify, extract, and modify patterns in text data. Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions, includes a project to use regular expressions to extract street names.

In Chapter 6Cleaning Numerical Data - An Introduction to R and Rstudio, I will demonstrate how to use RStudio to conduct two common tasks for cleaning numerical data: outlier detection and NA handling.

Shaping and structuring data

Preparing data for its end use often requires both structuring and organizing the data in the correct manner. 

To illustrate this, suppose you have a hierarchical dataset of city populations, as shown in Figure 01:

Figure 01: Hierarchical structure of the population of cities

If the goal is to create a histogram of city populations, the previous data format would be hard to work with. Not only is the information of the city populations nested within the data structure, but it is nested to varying degrees of depth. For the purposes of creating a histogram, it is better to represent the data as a list of numbers, as shown in Figure 02:

Figure 02: List of populations for histogram visualization

Making structural changes like this for large datasets requires you to build programs that can extract the data from one format and put it into another format. Shaping data is an important part of data wrangling because it ensures that the data is compatible with its intended use. In Chapter 4Reading, Exploring, and Modifying Data - Part II, I will walk through exercises to convert between data formats.

Changing the form of data does not necessarily need to involve changing its structure. Changing the form of a dataset can involve filtering the data entries, reducing the data by category, changing the order of the rows, and changing the way columns are set up.

All of the previously mentioned tasks are features of the dplyr package for R. In Chapter 7, Simplifying Data Manipulation with dplyr, I will show how to use dplyr to easily and intuitively manipulate data.

Storing data

The last step after manipulating a dataset is to store the data for future use. The easiest way to do this is to store the data in a static file. I show how to output the data to a static file in Python in Chapters 3Reading, Exploring, and Modifying Data - Part I and Chapter 4, Reading, Analyzing, Modifying, and Writing Data - Part II. I show how to do this in R in Chapter 6, Cleaning Numerical Data - An Introduction to R and Rstudio.

When working with large datasets, it can be helpful to have a system that allows you to store and quickly retrieve large amounts of data when needed.

In addition to being a potential source of data, databases can be very useful in the process of data wrangling as a means of storing data locally. In Chapter 9Working with Large Datasets, I will briefly demonstrate the use of databases to store data.

The tools for data wrangling

The most popular languages used for data wrangling are Python and R. I will use the remaining part of this chapter to introduce Python and R, and briefly discuss the differences between them.

Python

Python is a generalized programming language used for everything from web development (Django and Flask) to game development, and for scientific and numerical computation. See Python.org/about/apps/.

Python is really useful for data wrangling and scientific computing in general because it emphasizes simplicity, readability, and modularity.

To see this, take a look at a Python implementation of the hello world program, which prints the words Hello World!:

Print("Hello World!")

To do the same thing in Java, another popular programming language, we need something a bit more verbose:

System.out.println("Hello World!");

While this may not seem like a huge difference, extra research and consultation of documentation can add up, adding time to the data wrangling process.

Python also has built-in data structures that are relatively flexible in the way that they handle data.

Data structures are abstractions that help organize the data in a program for easy manipulation. We will explore the various data structures in Python and R in Chapter 2, Introduction to Programming in Python.

This contributes to Python's relative ease of use, particularly when working with data on a low level.

Finally, because of Python's modularity and popularity within the scientific community, there are a number of packages built around Python that can be quite useful to us in data wrangling.

Packages/modules/libraries are extensions of a language, or prewritten code in that language--typically built by individual users and the open source community--that add on functionality that is not built into the language. They can be imported in a program to include new tools. We will be leveraging packages throughout the book, both in R and Python, to extract, read, clean, shape, and store data.

R

R is both a programming language and an environment built specifically for statistical computing. This definition has been taken from the R website, r-project.org/about.html:

The term 'environment' is intended to characterize [R] as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

In other words, one of the major differences between R and Python is that some of the most common functionalities for working with data--data handling and storage, visualization, statistical computation, and so on--come built in. A good example of this is linear modeling, a basic statistical method for modelling numerical data.

In R, linear modeling is a built-in functionality that is made very intuitive and straightforward, as we will see in Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions. There are a number of ways to do linear modeling in Python, but they all require using external libraries and often doing extra work to get the data in the right format.

R also has a built-in data structure called a dataframe that can make manipulation of tabular data more intuitive. 

The big takeaway here is that there are benefits and trade-offs to both languages. In general, being able to use the right tool for the job can save an immense amount of time spent on data wrangling. It is therefore quite useful as a data programmer to have a good working knowledge of each language and know when to use one or the other. 

Summary

This chapter has provided an overall context for the purpose, subject matter, and programming languages in this book. In summary, data wrangling is important because data in its original raw format is rarely prepared for its end use to begin with. Data wrangling involves getting and reading data, cleaning data, merging and shaping data, and storing data. In this book, data wrangling will be conducted using the R and Python programming languages.

In the next chapter, I will dive into Python, with an introduction to Python programming. I will introduce basic principals of programming and features of the Python language that will be used throughout the rest of the book. If you are already familiar with Python, you may want to skip ahead or skim through the following chapter.

In Chapter 3Reading, Exploring, and Modifying Data - Part I, and Chapter 4Reading, Exploring, and Modifying Data - Part II, I will take a generalized programming approach to data wrangling. Chapter 3Reading, Exploring, and Modifying Data - Part I, and Chapter 4Reading, Exploring, and Modifying Data - Part II, will discuss how to use Python programming to read, write, and manipulate data using Python.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • ? This easy-to-follow guide takes you through every step of the data wrangling process in the best possible way
  • ? Work with different types of datasets, and reshape the layout of your data to make it easier for analysis
  • ? Get simple examples and real-life data wrangling solutions for data pre-processing

Description

Around 80% of time in data analysis is spent on cleaning and preparing data for analysis. This is, however, an important task, and is a prerequisite to the rest of the data analysis workflow, including visualization, analysis and reporting. Python and R are considered a popular choice of tool for data analysis, and have packages that can be best used to manipulate different kinds of data, as per your requirements. This book will show you the different data wrangling techniques, and how you can leverage the power of Python and R packages to implement them. You’ll start by understanding the data wrangling process and get a solid foundation to work with different types of data. You’ll work with different data structures and acquire and parse data from various locations. You’ll also see how to reshape the layout of data and manipulate, summarize, and join data sets. Finally, we conclude with a quick primer on accessing and processing data from databases, conducting data exploration, and storing and retrieving data quickly using databases. The book includes practical examples on each of these points using simple and real-world data sets to give you an easier understanding. By the end of the book, you’ll have a thorough understanding of all the data wrangling concepts and how to implement them in the best possible way.

Who is this book for?

If you are a data scientist, data analyst, or a statistician who wants to learn how to wrangle your data for analysis in the best possible manner, this book is for you. As this book covers both R and Python, some understanding of them will be beneficial.

What you will learn

  • ? Read a csv file into python and R, and print out some statistics on the data
  • ? Gain knowledge of the data formats and programming structures involved in retrieving API data
  • ? Make effective use of regular expressions in the data wrangling process
  • ? Explore the tools and packages available to prepare numerical data for analysis
  • ? Find out how to have better control over manipulating the structure of the data
  • ? Create a dexterity to programmatically read, audit, correct, and shape data
  • ? Write and complete programs to take in, format, and output data sets

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Nov 15, 2017
Length: 204 pages
Edition : 1st
Language : English
ISBN-13 : 9781787283671
Vendor :
RStudio
Category :
Languages :
Concepts :
Tools :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Product Details

Publication date : Nov 15, 2017
Length: 204 pages
Edition : 1st
Language : English
ISBN-13 : 9781787283671
Vendor :
RStudio
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
€18.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
€189.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts
€264.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total 90.97
Statistics for Data Science
€32.99
Practical Data Wrangling
€24.99
Python Machine Learning, Second Edition
€32.99
Total 90.97 Stars icon
Banner background image

Table of Contents

9 Chapters
Programming with Data Chevron down icon Chevron up icon
Introduction to Programming in Python Chevron down icon Chevron up icon
Reading, Exploring, and Modifying Data - Part I Chevron down icon Chevron up icon
Reading, Exploring, and Modifying Data - Part II Chevron down icon Chevron up icon
Manipulating Text Data - An Introduction to Regular Expressions Chevron down icon Chevron up icon
Cleaning Numerical Data - An Introduction to R and RStudio Chevron down icon Chevron up icon
Simplifying Data Manipulation with dplyr Chevron down icon Chevron up icon
Getting Data from the Web Chevron down icon Chevron up icon
Working with Large Datasets Chevron down icon Chevron up icon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.