The tools for data wrangling
The most popular languages for data wrangling are Python and R. I will use the rest of this chapter to introduce both languages and briefly discuss the differences between them.
Python
Python is a general-purpose programming language used for everything from web development (Django and Flask) to game development, as well as scientific and numerical computation. See python.org/about/apps/.
Python is really useful for data wrangling and scientific computing in general because it emphasizes simplicity, readability, and modularity.
To see this, take a look at a Python implementation of the hello world program, which prints the words Hello World!:
print("Hello World!")
To do the same thing in Java, another popular programming language, we need something a bit more verbose:
System.out.println("Hello World!");
While this may not seem like a huge difference, the extra research and consultation of documentation that more verbose code requires can add up, lengthening the data wrangling process.
Python also has built-in data structures that are relatively flexible in the way that they handle data.
This contributes to Python's relative ease of use, particularly when working with data on a low level.
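As a rough illustration of that flexibility, the following sketch uses only built-in lists and dictionaries. The records and field names are made up for the example and do not come from any real dataset.

```python
# Hypothetical survey records; the field names are illustrative only.
records = [
    {"name": "Ada", "age": 36, "scores": [88, 92]},
    {"name": "Grace", "age": None, "scores": []},  # missing values are allowed
]

# A list can grow in place, and each dictionary can hold values of mixed types.
records.append({"name": "Alan", "age": 41, "scores": [75]})

# Build a new list from the records that have a known age.
known_ages = [r["age"] for r in records if r["age"] is not None]

# Group names by whether any scores were recorded.
by_has_scores = {}
for r in records:
    by_has_scores.setdefault(bool(r["scores"]), []).append(r["name"])

print(known_ages)
print(by_has_scores)
```

Nothing here needs a schema or a type declaration up front, which is convenient when the shape of incoming data is not yet fully known.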
Finally, because of Python's modularity and popularity within the scientific community, there are a number of packages built around Python that can be quite useful to us in data wrangling.
R
R is both a programming language and an environment built specifically for statistical computing. This definition is taken from the R website, r-project.org/about.html.
In other words, one of the major differences between R and Python is that in R, some of the most common functionalities for working with data (data handling and storage, visualization, statistical computation, and so on) come built in. A good example of this is linear modeling, a basic statistical method for modeling numerical data.
In R, linear modeling is built in and is made very intuitive and straightforward, as we will see in Chapter 5, Manipulating Text Data - An Introduction to Regular Expressions. There are a number of ways to do linear modeling in Python, but most rely on external libraries and often require extra work to get the data into the right format.
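To give a sense of the extra work involved when this functionality is not at hand, here is a minimal sketch of a simple least-squares line fit written by hand in plain Python. The function name fit_line and the data are hypothetical, and this is only one way to do it. In practice a dedicated library would usually be a better choice.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on the line y = 2x + 1.
hours = [1, 2, 3, 4]
score = [3, 5, 7, 9]
slope, intercept = fit_line(hours, score)
print(slope, intercept)
```

Even this small version has to handle input checks and the arithmetic itself, which is the kind of detail a built-in modeling function takes care of.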
R also has a built-in data structure called a data frame that can make manipulating tabular data more intuitive.
The big takeaway here is that there are benefits and trade-offs to both languages. In general, being able to use the right tool for the job can save an immense amount of time spent on data wrangling. It is therefore quite useful as a data programmer to have a good working knowledge of each language and know when to use one or the other.