You're reading from Practical Data Wrangling Expert techniques for transforming your raw data into a valuable source for analytics

Product type Paperback

Published in Nov 2017

Publisher Packt

ISBN-13 9781787286139

Length 204 pages

Edition 1st Edition

Languages

Python

Tools

RStudio

Concepts

Data Analysis

Author (1):

Allan Visochek

View More author details

The tools for data wrangling

The most popular languages used for data wrangling are Python and R. I will use the remaining part of this chapter to introduce Python and R, and briefly discuss the differences between them.

Python

Python is a generalized programming language used for everything from web development (Django and Flask) to game development, and for scientific and numerical computation. See Python.org/about/apps/.

Python is really useful for data wrangling and scientific computing in general because it emphasizes simplicity, readability, and modularity.

To see this, take a look at a Python implementation of the hello world program, which prints the words Hello World!:

Print("Hello World!")

To do the same thing in Java, another popular programming language, we need something a bit more verbose:

System.out.println("Hello World!");

While this may not seem like a huge difference, extra research and consultation of documentation can add up, adding time to the data wrangling process.

Python also has built-in data structures that are relatively flexible in the way that they handle data.

Data structures are abstractions that help organize the data in a program for easy manipulation. We will explore the various data structures in Python and R in Chapter 2, Introduction to Programming in Python.

This contributes to Python's relative ease of use, particularly when working with data on a low level.

Finally, because of Python's modularity and popularity within the scientific community, there are a number of packages built around Python that can be quite useful to us in data wrangling.

Packages/modules/libraries are extensions of a language, or prewritten code in that language--typically built by individual users and the open source community--that add on functionality that is not built into the language. They can be imported in a program to include new tools. We will be leveraging packages throughout the book, both in R and Python, to extract, read, clean, shape, and store data.

R

R is both a programming language and an environment built specifically for statistical computing. This definition has been taken from the R website, r-project.org/about.html:

The term 'environment' is intended to characterize [R] as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

In other words, one of the major differences between R and Python is that some of the most common functionalities for working with data--data handling and storage, visualization, statistical computation, and so on--come built in. A good example of this is linear modeling, a basic statistical method for modelling numerical data.

In R, linear modeling is a built-in functionality that is made very intuitive and straightforward, as we will see in Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions. There are a number of ways to do linear modeling in Python, but they all require using external libraries and often doing extra work to get the data in the right format.

R also has a built-in data structure called a dataframe that can make manipulation of tabular data more intuitive.

The big takeaway here is that there are benefits and trade-offs to both languages. In general, being able to use the right tool for the job can save an immense amount of time spent on data wrangling. It is therefore quite useful as a data programmer to have a good working knowledge of each language and know when to use one or the other.

You're reading from Practical Data Wrangling Expert techniques for transforming your raw data into a valuable source for analytics

Table of Contents (10) Chapters

The tools for data wrangling

Python

R

Authors (1)

Personalised recommendations for you