Applied Data Visualization with R and ggplot2: Create useful, elaborate, and visually appealing plots

By Dr. Tania Moulik
Paperback, Sep 2018, 140 pages, 1st Edition

Basic Plotting in ggplot2

This chapter will cover basic concepts of ggplot2 and the Grammar of Graphics, using illustrative examples. You will generate solutions to problems of increasing complexity throughout the book. Lastly, you will master advanced plotting techniques, which will enable you to add more detail and increase the quality of your graphics.

In order to use ggplot2, you will first need to install R and RStudio. R is a programming language that is widely used for advanced modeling, statistical computing, and graphic production. R provides the base language and its core packages, while RStudio is an integrated development environment (IDE) that runs on top of R. Visualization is one of the most important aspects of data analysis, and it has its own underlying grammar (similar to that of the English language). So, before we go further, let's discuss visualization in more detail.

By the end of this chapter, you will be able to:

  • Distinguish between different kinds of variables
  • Create simple plots and geometric objects, using qplot and ggplot2
  • Determine the most appropriate visualization by comparing variables
  • Utilize Grammar of Graphics concepts to improve plots in ggplot2

Introduction to ggplot2

ggplot2 is a visualization package in R. It was developed in 2005 and it uses the concept of the Grammar of Graphics to build a plot in layers and scales. The Grammar of Graphics is the syntax used for the different components (aesthetics) of a geometric object, together with the grammatical rules for creating a visualization.

ggplot2 has grown in popularity over the years. It's a very powerful package, and its impressive scope has been enabled by the underlying grammar, which gives the user a very fine level of control - making it perfect for a range of scenarios. Another great feature of ggplot2 is that it is programmatic; hence, its visuals are reproducible. The ggplot2 package is open source, and its use is rapidly growing across various industries. Its visuals are flexible, professional, and can be created very quickly.

Read more about the top companies using R at https://www.listendata.com/2016/12/companies-using-r.html.

You can find out more about the role of a data scientist at https://www.innoarchitech.com/what-is-data-science-does-data-scientist-do/.

Similar Packages

Other visualization packages exist, such as matplotlib (in Python) and Tableau. The matplotlib and ggplot2 packages are both popular, and they have similar features. Both are open source and widely used. Which one you use may be a matter of preference. However, although both are programmatic and easy to use, R was built with statisticians in mind, so ggplot2 is often considered to have more powerful statistical graphics. More discussion on this topic can be found later in the chapter. Tableau is also very powerful, but it is limited in terms of statistical summaries and advanced data analytics. Tableau is not programmatic, and it is more memory intensive because it is completely interactive.

Excel has also been used for data analysis in the past, but it is not well suited to processing the large amounts of data encountered in modern technology. It is interactive and not programmatic; hence, charts and graphs have to be built by hand and updated every time more data is added. Packages such as ggplot2 are more powerful in that once the code is written, it can be rerun unchanged as the data grows, as long as the data structure is maintained. Also, ggplot2 provides a greater number of advanced plots that are not available in Excel.
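
To make the point about programmatic plots concrete, here is a minimal sketch. It assumes that the built-in airquality dataset stands in for a data source that grows over time, and plot_temp_hist is a hypothetical helper name introduced only for this illustration. The plotting syntax itself is explained later in the chapter.

```r
library(ggplot2)

# Hypothetical helper: the plotting code is written once and reused as-is.
plot_temp_hist <- function(data) {
  ggplot(data, aes(x = Temp)) + geom_histogram(bins = 20)
}

# Pretend that only the first 100 observations were available at first...
p_early <- plot_temp_hist(airquality[1:100, ])
# ...and that the full dataset (153 rows) arrived later.
p_full <- plot_temp_hist(airquality)

print(p_early)
print(p_full)
```

The same function call handles both versions of the data because the column structure did not change, which is the property the comparison with Excel relies on.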

Read more about Excel versus R at https://www.jessesadler.com/post/excel-vs-r/.

Read more about matplotlib versus R at http://pbpython.com/visualization-tools-1.html.

Read more about matplotlib versus ggplot at https://shiring.github.io/r_vs_python/2017/01/22/R_vs_Py_post.html.

The RStudio Workspace

Our first task is to load a dataset. To do so, we need to load certain packages in RStudio. Take a look at the screenshot of a typical RStudio layout, as follows:

Loading and Exploring a Dataset Using R Functions

In this section, we'll load and explore a dataset using R functions. Before starting with the implementation, check the version by typing version in the console and checking the details, as follows:

Let's begin by following these steps:

  1. Install the following packages:
install.packages("ggplot2")
install.packages("tibble")
install.packages("dplyr")
install.packages("Lock5Data")
  2. Get the current working directory by using the getwd() command, which takes no arguments:
[1] "C:/Users/admin/Documents/GitHub/Applied-DataVisualization-with-ggplot2-and-R"
  3. Set the current working directory to Lesson1 by using the following command:
setwd("C:/Users/admin/Documents/GitHub/Applied-DataVisualization-with-ggplot2-and-R/Lesson1")
  4. Open the template_Lesson1.R file, which uses the require command to load the necessary libraries.
  5. Read the following data file, provided in the data directory:
df_hum <- read.csv("data/historical-hourly-weather-data/humidity.csv")
When we used read.csv, R created a structure called a data frame, which most readers will already be familiar with. Let's type some commands to get an overall impression of our data.

Let's retrieve some parameters of the dataset (such as the number of rows and columns) and display the different variables and their data types.

The following libraries have now been loaded:

  • Graphical visualization package:
require("ggplot2") 
  • Build a data frame or list and some other useful commands:
require("tibble") 
  • A package of example datasets for R:
require("Lock5Data") 

Use the following commands to determine the data frame details:

#Display the column names
colnames(df_hum)

Take a look at the output screenshot, as shown here:

Use the following command:

#Number of rows and columns
dim(df_hum)

A summary of the data frame can be seen with the following code:

str(df_hum)

Take a look at the output screenshot, as shown here:
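
The same inspection commands can be tried without the humidity file. The following is a small sketch that assumes the built-in airquality dataset as a stand-in for df_hum; the variable names n_dims and col_names are introduced only for this example.

```r
# airquality ships with R, so no file needs to be read
# (assumption: it stands in for df_hum here).
df <- airquality

n_dims <- dim(df)          # number of rows, then number of columns
col_names <- colnames(df)  # names of the variables

print(n_dims)
print(col_names)
str(df)                    # compact summary: one line per variable with its type
```

dim returns the row count first and the column count second, so it answers both questions in one call.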

The Main Concepts of ggplot2

ggplot2 is based on two main concepts: geometric objects and the Grammar of Graphics. The geometric objects in ggplot2 are the different visual structures that are used to visualize data. We will be going over them one by one. The Grammar of Graphics is the syntax that we use for the different aesthetics of a graph, such as the coordinate scale, the fonts, the color themes, and so on. ggplot2 uses a layered Grammar of Graphics concept, which allows us to build a plot in layers. We will work on some aspects of the Grammar of Graphics in this chapter, and will go into further details in the next chapter.

Types of Variables

Variables can be of different types and, sometimes, different software uses different names for the same variables. So, let's get familiar with the different kinds of variables that we will work with:

  • Continuous: A continuous variable can take an infinite number of values, such as time or weight. They are of the numerical type.
  • Discrete: A variable whose values are whole numbers (counts) is called a discrete variable. For example, the number of items bought by a customer in a supermarket is discrete.
  • Categorical: The values of a categorical variable are selected from a small group of categories. Examples include gender (male or female) and make of car (Mazda, Hyundai, Toyota, and so on). Categorical variables can be further categorized into ordinal and nominal variables, as follows:
    • Ordinal categorical variable: A categorical variable whose categories can be meaningfully ordered is called ordinal. For example, credit grades (AA, A, B, C, D, and E) are ordinal.
    • Nominal categorical variable: It does not matter which way the categories are ordered in tabular or graphical displays of the data; all orderings are equally meaningful. An example would be different kinds of fruit (bananas, oranges, apples, and so on).
  • Logical: A logical variable can only take two values (T/F).

The following table lists variables and the names that R uses for them; make sure to familiarize yourself with both nomenclatures.

The variable names used in R are as follows:


In R, whenever the factor data is listed, the number of levels is also given. A dataset can contain different kinds of variables, as discussed previously.
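
As a rough illustration of how these variable types appear in R, the following sketch builds a small data frame by hand. All of the values and column names are invented for the example; the comments note the abbreviation that str prints for each type.

```r
# Hypothetical data, one column per kind of variable.
df_types <- data.frame(
  weight = c(61.5, 72.3, 80.1),                  # continuous: shown as num
  items  = c(2L, 5L, 1L),                        # discrete: shown as int
  fruit  = factor(c("apple", "banana", "apple")),  # nominal: shown as Factor
  grade  = factor(c("A", "C", "B"),
                  levels = c("C", "B", "A"),
                  ordered = TRUE),               # ordinal: shown as Ord.factor
  member = c(TRUE, FALSE, TRUE)                  # logical: shown as logi
)

# str lists each variable with its type; factors also show their number of levels.
str(df_types)

# The first class of each column, collected into a named character vector.
col_classes <- sapply(df_types, function(col) class(col)[1])
print(col_classes)
```

Note that the number of levels is reported only for the two factor columns, which matches the remark above.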

Exploring Datasets

In this section, we will use the built-in datasets, such as airquality, to investigate the types of the variables they contain (for example, continuous variables such as temperature). We'll explore and understand the datasets available in R.

Let's begin by executing the following steps:

  1. Type data() in the command line to list the datasets available in R. You should see something like the following:

  2. Choose the following datasets: mtcars, airquality, rock, and sleep.

The number of levels only applies to factor data.
  3. List two variables of each type, the dataset names, and the other columns of this table.
  4. To view the data type, use the str command (for example, str(airquality)).

    Take a look at the following output screenshot:

  5. After viewing the preceding datasets, fill in the following table. The first entry has been completed for you. The following table includes all variables of the types num and int:

The outcome should be a completed table, similar to the following:


More details about variables can be found at http://www.statisticshowto.com/types-variables/.
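
One possible way to check your table is to collect the column types programmatically. This is only a sketch; column_types is a hypothetical helper name, and the four datasets are the built-in ones named in the steps above.

```r
# Hypothetical helper: return the first class of every column in a data frame.
column_types <- function(data) {
  sapply(data, function(col) class(col)[1])
}

datasets_to_check <- list(mtcars = mtcars, airquality = airquality,
                          rock = rock, sleep = sleep)
types_by_dataset <- lapply(datasets_to_check, column_types)
print(types_by_dataset)

# The number of levels is only defined for factor columns.
sleep_levels <- sapply(Filter(is.factor, sleep), nlevels)
print(sleep_levels)
```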

Making Your First Plot

The ggplot2 function qplot (quick plot) is similar to the basic plot() function from base R. It is called as qplot(), for example, qplot(airquality$Temp). It can be used to build and combine a range of useful graphs; however, it does not have the same flexibility as the ggplot() function.

Plotting with qplot and R

Suppose that we want to visualize some of the variables in the built-in datasets. A dataset can contain different kinds of variables, as discussed previously. Here, the climate data includes numerical data, such as the temperature, and categorical data, such as hot or cold. In order to visualize and correlate different kinds of data, we need to understand the nomenclature of the dataset. We'll load a dataset and examine its structure and its variables by using qplot and the base R graphics package. Let's begin by executing the following steps:

  1. Plot the temperature variable from the airquality dataset, with hist(airquality$Temp).

hist is part of the built-in R graphics package.

  Take a look at the following output screenshot:

  2. Use qplot (which is part of the ggplot2 package) to plot a graph, using the same variables.
  3. Type the qplot(airquality$Temp) command to obtain the output, as shown in the following screenshot:

Analysis

The first plot was made in the built-in graphics package in R, while the second one was made using qplot, which is a plotting command in ggplot2. Both plots are histograms of the temperature, but we can see that they look very different.

We will discuss geometric objects later in this chapter, in order to understand the different types of histograms. 

The built-in graphics package in R offers fewer features for building up a plot in layers, so ggplot2 has become the package of choice for many users. For the next exercises, we will continue to investigate making plots using ggplot2.
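
For reference, the comparison can be sketched with the full ggplot() call instead of qplot. This is an illustration rather than the book's code: it assumes the built-in airquality dataset, and it uses ggplot() because recent ggplot2 releases (3.4.0 and later) mark qplot() as deprecated.

```r
library(ggplot2)

# Base R graphics: one function call draws the finished histogram.
hist(airquality$Temp)

# ggplot2: the data, the aesthetic mapping, and the geom are separate pieces.
p_temp <- ggplot(airquality, aes(x = Temp)) +
  geom_histogram(bins = 30)
print(p_temp)
```

Both calls bin the same 153 temperature readings; the visual differences come from the default bin boundaries and styling of each package.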

Geometric Objects

In your mathematics class, you likely studied geometry, examining different shapes and the characteristics of those shapes, such as area, perimeter, and other factors. The geometric objects in ggplot2 are visual structures that are used to visualize data. They can be lines, bars, points, and so on.

Geometric objects are constructed from datasets. Before we construct some geometric objects, let's examine some datasets to understand the different kinds of variables.

Analyzing Different Datasets

We all love to talk about the weather. So, let's work with some weather-related datasets. The datasets contain approximately five years' worth of high-temporal resolution (hourly measurements) data for various weather attributes, such as temperature, humidity, air pressure, and so on. We'll analyze and compare the humidity and weather datasets.

Let's begin by implementing the following steps:

  1. Load the humidity dataset by using the following command:
df_hum <- read.csv("data/historical-hourly-weather-data/humidity.csv")
  2. Load the weather description dataset by using the following command:
df_desc <- read.csv("data/historical-hourly-weather-data/weather_description.csv")
  3. Compare the two datasets by using the str command.

The outcome will be the humidity levels of different cities, as follows:

The weather descriptions of different cities are shown as follows:

The different geometric objects that we will be working with in this chapter are as follows:

One-dimensional objects are used to understand and visualize the characteristics of a single variable, as follows:

  • Histogram
  • Bar chart

Two-dimensional objects are used to visualize the relationship between two variables, as follows:

  • Bar chart
  • Boxplot
  • Line chart
  • Scatter plot

Although geometric objects are also used in base R, they don't follow the structure of the Grammar of Graphics and have different naming conventions, as compared to ggplot2. This is an important distinction, which we will look at in detail later.

Histograms

Histograms are used to group and represent numerical (continuous) variables. For example, you may want to know the distribution of voters' ages in an election. A histogram is often confused with a bar chart; however, a bar chart is more general, and we will cover those later. In a histogram, a continuous variable is grouped into bins of specific sizes and the bins have a range that covers the maximum and minimum of the variable in question.

Histograms can be classified as follows:

  • Unimodal: A distribution with a single maximum or mode; for example, a normal distribution:
    • A normal distribution (or a bell-shaped curve) is symmetrical. An example is the grade distribution of students in a class. A unimodal distribution may or may not be symmetrical. It can be positively or negatively skewed, as well.
    • Positively or negatively skewed (also known as right-skewed or left-skewed): Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, negative, or undefined.
    • A left-skewed distribution has a long tail to the left while a right-skewed distribution has a long tail to the right. An example of a right-skewed distribution might be the US household income, with a long tail of higher-income groups.
  • Bimodal: Bimodal distribution resembles the back of a two-humped camel. It shows the outcomes of two processes, with different distributions that are combined into one set of data. For example, you might expect to see a bimodal distribution in the height distribution of an entire population. There would be a peak around the average height of a man, and a peak around the typical height of a woman.
  • Uniform distribution: This distribution follows a uniform pattern that has approximately the same number of values in each group. In the real world, one can only find approximately uniform distributions. An example is the speed of a car versus time if moving at constant speed (zero acceleration), or the uniform distribution of heat in a microwave:

Let's take a look at another image:

It's important to study the shapes of distributions, as they can reveal a lot about the nature of data. We will see some real-world examples of histograms in the datasets that we will explore.
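
The shapes described above can be reproduced with simulated data. The sketch below is an illustration under stated assumptions: the distribution parameters are arbitrary, and skewness is a helper defined here (the third central moment divided by the cubed standard deviation), not a base R function.

```r
set.seed(42)  # fixed seed so the simulated samples are reproducible
n <- 10000

symmetric    <- rnorm(n, mean = 50, sd = 10)        # unimodal, bell-shaped
right_skewed <- rexp(n, rate = 1 / 10)              # long tail to the right
left_skewed  <- 100 - right_skewed                  # mirror image: long tail to the left
bimodal      <- c(rnorm(n / 2, mean = 40, sd = 5),
                  rnorm(n / 2, mean = 70, sd = 5))  # two separate peaks
uniform      <- runif(n, min = 0, max = 100)        # roughly equal counts per bin

# Sample skewness: positive for a right tail, negative for a left tail.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_values <- sapply(list(symmetric = symmetric,
                           right_skewed = right_skewed,
                           left_skewed = left_skewed), skewness)
print(round(skew_values, 2))

# Draw the five histograms in one window using base R graphics.
old_par <- par(mfrow = c(2, 3))
hist(symmetric)
hist(right_skewed)
hist(left_skewed)
hist(bimodal)
hist(uniform)
par(old_par)
```

The sign of the skewness value matches the direction of the long tail, which is a quick numerical check on what the histogram shows.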

We discussed the different kinds of geometric objects that we will be working on, and we created our first plot using two different methods (qplot and hist). Now, let's use another command: ggplot. We will use the humidity data that we loaded previously.

As seen in the preceding section, we can create a default histogram by using one of the commands in the base R package: hist. This is seen in the following command:

hist(df_hum$Vancouver)

The default histogram that will be created is as follows:

Creating a Histogram Using qplot and ggplot

In this section, we want to visualize the humidity distribution for the city of Vancouver. We'll create a histogram for humidity data using qplot and ggplot.

Let's begin by implementing the following steps:

  1. Create a plot with RStudio by using the following command: qplot(df_hum$Vancouver):

  2. Use ggplot to create the same plot, starting with the following command:
ggplot(df_hum,aes(x=Vancouver))
On its own, this command draws only an empty panel with axes; ggplot2 also requires the geometric object that we wish to draw. To make a histogram, we have to specify the geom type (in other words, a histogram). aes stands for aesthetics, or the quantities that get plotted on the x- and y-axes, and their qualities. We will work on changing the aesthetics later, in order to visualize the plot more effectively.

Notice that plotting the histogram produces some messages, as follows:

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
Removed 1826 rows containing non-finite values (stat_bin).

You can ignore these messages; ggplot automatically detects and removes null or NA values.

  3. Obtain the histogram with ggplot by using the following command:
ggplot(df_hum, aes(x=Vancouver)) + geom_histogram()

You'll see the following output:

Here's the complete code:

require("ggplot2")
require("tibble")
#Load a data file - Read the Humidity Data
df_hum <- read.csv("data/historical-hourly-weather-data/humidity.csv")
#Display the summary
str(df_hum)
qplot(df_hum$Vancouver)
ggplot(df_hum, aes(x=Vancouver)) + geom_histogram()
Refer to the complete code at https://goo.gl/tu7t4y.

In order for ggplot to work, you will need to specify the geometric object. Note that the column name should not be enclosed in quotes.
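
The difference between the bare ggplot() call and the call with a geom can be seen in a short sketch. It assumes the built-in airquality dataset, whose Ozone column contains missing values, as a stand-in for the humidity data.

```r
library(ggplot2)

# Ozone contains missing values, so it triggers the same kind of warning.
n_missing <- sum(is.na(airquality$Ozone))
print(n_missing)

# Data and aesthetics only: printing this draws an empty panel with an x-axis.
p_base <- ggplot(airquality, aes(x = Ozone))
print(p_base)

# Adding a geom layer is what actually draws the histogram.
# Ozone is written as a bare column name, not as "Ozone".
p_hist <- p_base + geom_histogram(bins = 30)
print(p_hist)  # expect a warning about rows removed because of missing values
```

Storing the base plot in a variable and adding layers to it later is a common pattern, because the data and aesthetics only have to be written once.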

Activity: Creating a Histogram and Explaining its Features

Scenario

Histograms are useful when you want to find the peak and spread in a distribution. For example, suppose that a company wants to see what its client age distribution looks like. A two-dimensional distribution can show relationships; for example, one can create a scatter plot of the incomes and ages of credit card holders.

Aim

To create and analyze histograms for the given dataset.

Prerequisites

You should be able to use ggplot2 to create a histogram.


The template is an empty script in which the libraries are already loaded. You will be writing your code there.

Steps for Completion

  1. Use the template code and load the required datasets.
  2. Create the histogram for two cities.
  3. Analyze and compare two histograms to determine the point of difference.

Outcome

Two histograms should be created and compared. The complete code is as follows:

df_t <- read.csv("data/historical-hourly-weather-data/temperature.csv")
ggplot(df_t,aes(x=Vancouver))+geom_histogram()
ggplot(df_t,aes(x=Miami))+geom_histogram()

Refer to the complete code at https://goo.gl/tu7t4y.

Take a look at the following output histogram:

From the preceding plot, we can determine the following information:

  • Vancouver's temperature distribution peaks at around 280 (the values appear to be in kelvin).
  • It ranges between 260 and 300.
  • It's a right-skewed distribution.

Take a look at the following output histogram:

From the preceding plot, we can determine the following information:

  • Miami's temperature distribution peaks at around 300.
  • It ranges between 280 and 308.
  • It's a left-skewed distribution.

Differences

  1. Miami's temperature plot is skewed to the left, while Vancouver's is skewed to the right.
  2. Miami's distribution peaks at a higher temperature, and its maximum temperature is also higher.

Creating Bar Charts

Bar charts are more general than histograms, and they can represent both discrete and continuous data. They can even be used to represent categorical variables. A bar chart uses a horizontal or vertical rectangular bar that levels off at an appropriate level. A bar chart can be used to represent various quantities, such as frequency counts and percentages.

We will use the weather description data to create a bar chart. To create a bar chart, the geometric object used is geom_bar().

The syntax is as follows:

ggplot(….) + geom_bar(…)

If we use the glimpse or str command to view the weather data, we will get the following results:


You cannot use a histogram for a categorical type of variable.

Creating a One-Dimensional Bar Chart

Use the ggplot(df_vanc,aes(x=Vancouver)) + geom_bar() command, where df_vanc is the data frame that holds the weather description data, to obtain the following chart:

Observations

Vancouver has clear weather, for the most part. It rained about 10,000 times for the dataset provided. Snowy periods are much less frequent.
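
To show what geom_bar counts, here is a small sketch with made-up data. The weather data frame and its ten observations are invented purely for illustration and are not taken from the weather description dataset.

```r
library(ggplot2)

# Hypothetical observations, invented purely for illustration.
weather <- data.frame(
  description = c("clear", "rain", "clear", "cloudy", "rain",
                  "clear", "snow", "cloudy", "clear", "rain")
)

# table() gives the same per-category counts that geom_bar() draws.
counts <- table(weather$description)
print(counts)

# Share of rainy observations among rainy and clear ones combined.
rain_share <- counts[["rain"]] / (counts[["rain"]] + counts[["clear"]])
print(rain_share)

p_bar <- ggplot(weather, aes(x = description)) + geom_bar()
print(p_bar)
```

Reading the counts from table() is a convenient way to answer questions about a bar chart precisely, instead of estimating bar heights by eye.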

We will now perform two exercises, creating a one-dimensional bar chart and a two-dimensional bar chart. A one-dimensional bar chart can give us the counts or frequency of a given variable. A two-dimensional bar chart can give us the relationship between the variables.

In this section, we'll count the number of times each type of weather occurs in Seattle and compare it to Vancouver.

Let's begin by following these steps:

  1. Use ggplot2 and geom_bar in conjunction to create the bar chart.
  2. Use the data frame that we just created, but with Seattle instead of Vancouver, as follows:
ggplot(df_vanc,aes(x=Seattle)) + geom_bar()
  3. Now, compare the two bar charts and answer the following questions:
  • Approximately how many times was Vancouver cloudy? (Round to 2 significant figures.)
  • Which of the two cities sees a greater amount of rain?
  • What is the percentage of rainy days versus clear days? (Add the two counts and give the percentage of days it rains.)
  • According to this dataset, which city gets a greater amount of snow?

You should see the following output:


Refer to the complete code at https://goo.gl/tu7t4y.

Answers

  • Vancouver was cloudy 13,000 times. (Note that 12,000 is also acceptable.)
  • Seattle sees a greater amount of rain.
  • It rained on approximately 40% of the days.
  • Vancouver gets a greater amount of snow.

A two-dimensional bar chart can be used to plot the sum of a continuous variable versus a categorical or discrete variable. For example, you might want to plot the total amount of rainfall in different weather conditions, or the total amount of sales in different months.
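
The following sketch illustrates how a two-dimensional bar chart shows sums. The sales data frame is hypothetical and invented for this example; January deliberately has two rows so that the summing behavior is visible.

```r
library(ggplot2)

# Hypothetical sales records, invented for illustration; January has two rows.
sales <- data.frame(
  month  = factor(c("Jan", "Jan", "Feb", "Mar"),
                  levels = c("Jan", "Feb", "Mar")),
  amount = c(10, 5, 8, 12)
)

# Explicit per-month totals, for comparison with the bar heights.
totals <- tapply(sales$amount, sales$month, sum)
print(totals)

# stat = "identity" uses the y values as given; rows that share an x value
# are stacked, so each bar's total height equals the sum for that month.
p_sales <- ggplot(sales, aes(x = month, y = amount)) +
  geom_bar(stat = "identity")
print(p_sales)
```

Without stat = "identity", geom_bar would count rows instead of using the amount column, which is why the argument is needed here.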

Creating a Two-Dimensional Bar Chart

In this section, we'll create a two-dimensional bar chart for the total sales of a company in different months.

Let's begin by following these steps:

  1. Load the data. Add the line require(Lock5Data) into your code. You should have installed this package previously.
  2. Review the data with the glimpse(RetailSales) command.
  3. Plot a graph of Sales versus Month.

Here, Month is a categorical variable, while Sales is a continuous variable of the type <dbl>.
  4. Use ggplot + geom_bar(..) to plot this data, as follows:
ggplot(RetailSales,aes(x=Month,y=Sales)) + geom_bar(stat="identity")

A screenshot of the expected outcome is as follows:

Analyzing and Creating Boxplots

A boxplot (also known as a box and whisker diagram) is a standard way of displaying the distribution of data based on a five-number summary: minimum, first quartile, median, third quartile, and maximum. Boxplots can represent how a continuous variable is distributed for different categories; one of the axes will be a categorical variable, while the other will be a continuous variable. In the default boxplot, the central rectangle spans the first quartile to the third quartile (called the interquartile range, or IQR). A segment inside of the rectangle shows the median, and the lines (whiskers) above and below the box indicate the locations of the minimum and maximum, as shown in the following diagram:

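In ggplot2, the upper whisker extends from the upper hinge to the largest value that is no further than 1.5 * IQR from the hinge, and the lower whisker extends from the lower hinge to the smallest value that is no further than 1.5 * IQR from the hinge. Here, we can see the humidity data as a function of the month. Data beyond the end of the whiskers are called outliers, and are represented as circles, as seen in the following chart: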

You'll get the preceding chart by using the following code:

ggplot(df_hum,aes(x=month,y=Vancouver)) + geom_boxplot() 
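
The quantities behind a boxplot can also be computed by hand. The sketch below assumes the built-in airquality dataset (temperature by month) rather than the humidity data, and the variable names are introduced only for this example.

```r
library(ggplot2)

temps_may <- airquality$Temp[airquality$Month == 5]

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
# The hinges are close to, but not always identical to, the quartiles.
five <- fivenum(temps_may)

# Whisker ends following the 1.5 * IQR rule.
q1 <- quantile(temps_may, 0.25, names = FALSE)
q3 <- quantile(temps_may, 0.75, names = FALSE)
iqr <- q3 - q1
upper_whisker <- max(temps_may[temps_may <= q3 + 1.5 * iqr])
lower_whisker <- min(temps_may[temps_may >= q1 - 1.5 * iqr])

# Anything beyond the whiskers would be drawn as an individual point.
outliers <- temps_may[temps_may > upper_whisker | temps_may < lower_whisker]

print(five)
print(c(lower = lower_whisker, upper = upper_whisker))
print(outliers)

# Month is stored as an integer, so it is converted to a factor for the x-axis.
p_box <- ggplot(airquality, aes(x = factor(Month), y = Temp)) + geom_boxplot()
print(p_box)
```

Converting the month to a factor matters: with a numeric x and no grouping, ggplot2 would draw a single box for all months combined.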

Creating a Boxplot for a Given Dataset

In this section, we'll create boxplots of the monthly humidity data for Seattle and San Francisco, and provide two points of comparison between them.

Let's begin by implementing the following steps:

  1. Create the two boxplots.
  2. Display them side by side in your Word document.
  3. Provide two points of comparison between the two. You can comment on how the medians compare, how uniform the distributions look, the maximum and minimum humidity, and so on.

Refer to the complete code at https://goo.gl/tu7t4y.

The following observations can be noted:

The humidity is more uniform for San Francisco:

The median humidity for San Francisco is about 75:

Compare this to the humidity data for Seattle and San Francisco on the following websites (scroll down and look for the humidity plots). You should see a similar trend:

https://weather-and-climate.com/average-monthly-Rainfall-Temperature-Sunshine,Seattle,United-States-of-America

https://weather-and-climate.com/average-monthly-Rainfall-Temperature-Sunshine,San-Francisco,United-States-of-America

Scatter Plots

A scatter plot shows the relationship between two continuous variables. Let's create a scatter plot of distance versus time for a car that is accelerating and traveling with an initial velocity. We will generate some random time points and compute the distance at each of them. The relationship between distance and time for a car with initial velocity v0 and constant acceleration a is as follows:

distance = v0 * time + (1/2) * a * time^2

We can draw a scatter plot to show the relationship between distance and time by using geom_point(). Take a look at the following code, which generates the data and draws the plot:

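#Acceleration and initial velocity
a <- 3.4
v0 <- 27
#Generate 50 random time points between 0 and 200
time <- runif(50, min=0, max=200)
#Distance traveled at each time point
distance <- sapply(time, function(x) v0*x + 0.5*a*x^2)
df <- data.frame(time,distance)
ggplot(df,aes(x=time,y=distance)) + geom_point()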

The outcome is a positive correlation: as time increases, distance increases:

The correlation can also be zero (for no relationship) or negative (as x increases, y decreases).
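
The three cases can be checked numerically with cor. The sketch below reuses the distance formula from this section; the fixed seed, the mirrored variable, and the noise variable are assumptions added for the illustration.

```r
set.seed(1)  # fixed seed so the random time points are reproducible

a <- 3.4    # acceleration
v0 <- 27    # initial velocity
time <- runif(50, min = 0, max = 200)
distance <- v0 * time + 0.5 * a * time^2

r_positive <- cor(time, distance)              # distance grows with time
r_negative <- cor(time, -distance)             # a mirrored variable that shrinks with time
r_noise    <- cor(time, rnorm(length(time)))   # unrelated noise: expected to be near zero

print(c(positive = r_positive, negative = r_negative, noise = r_noise))
```

cor measures linear association, so the positive value is high but slightly below 1 here because the relationship is quadratic rather than a straight line.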

Line Charts

A line chart shows the relationship between two variables; it is similar to a scatter plot, but the points are connected by line segments. One difference between the usage of a scatter plot and a line chart is that, typically, it's more meaningful to use the line chart if the variable being plotted on the x-axis has a one-to-one relationship with the variable being plotted on the y-axis. A line chart should be used when you have enough data points, so that a smooth line is meaningful to see a functional dependence:

We could have also used a line chart for the previous plot. The advantage of using a line chart is that the discrete nature goes away and you can see trends more easily, while the functional form is more effectively visualized.

If there is more than one y value for a given x, the data needs to be grouped by the x value; then, one can show the features of interest from the grouped data, such as the mean, median, maximum, minimum, and so on. We will use grouping in the next section.

Creating a Line Chart

In this section, we'll create a line chart to plot the mean humidity against the month. Let's begin by implementing the following steps:

  1. Convert the months into numerical integers, as follows:
df_hum$monthn <- as.numeric(df_hum$month)
  2. Group the humidity by month and remove NAs, as follows:
gp1 <- group_by(df_hum,monthn)
  3. Create a summary of the group using the mean and median.
  4. Now, use the geom_line() command to plot the line chart (refer to the code).
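
The grouping and summarizing steps can be sketched as follows. This is not the book's code: it assumes the built-in airquality dataset in place of the humidity data, uses dplyr for the grouping, and the names monthly, mean_temp, and median_temp are introduced only for this example.

```r
library(dplyr)
library(ggplot2)

# Group by month, then summarise; na.rm = TRUE drops missing values
# inside each group before the mean and median are computed.
monthly <- airquality %>%
  group_by(Month) %>%
  summarise(mean_temp = mean(Temp, na.rm = TRUE),
            median_temp = median(Temp, na.rm = TRUE))
print(monthly)

# After summarising there is exactly one y value per month,
# so connecting the points with a line is meaningful.
p_line <- ggplot(monthly, aes(x = Month, y = mean_temp)) + geom_line()
print(p_line)
```

Plotting the raw data with geom_line instead would connect many readings per month in a zigzag, which is why the summary step comes first.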

The following plots are obtained:


Refer to the complete code at https://goo.gl/tu7t4y.

Take a look at the output line chart:

Activity: Creating One- and Two-Dimensional Visualizations with a Given Dataset

Scenario

Suppose that we are in a company, and we have been given an unknown dataset and would like to create similar plots. For example, we have some educational data, and we would like to know what courses are the most popular, or the gender distribution among students, or how satisfied the parents/students are with the courses. We will use the new dataset, along with our own knowledge, to get some information on the preceding points.

Aim

To create one- and two-dimensional visualizations for the new dataset and the given variables.

Steps for Completion

  1. Load the datasets.
  2. Choose the appropriate visualization.
  3. Create the desired 1D visualization.
  4. Create two-dimensional boxplots or scatter plots and note your observations.

Outcome

Three one-dimensional plots and three two-dimensional plots should be created, with the following axes (count versus topic) and observations. (Note that the students may provide different observations, so the instructor should verify the answers. The following observations are just examples.)


Refer to the complete code at https://goo.gl/tu7t4y.

One-Dimensional Plots

This visual was chosen because Topic is a categorical variable, and we wanted to see the frequency of each topic:

Observation

You can see that IT is the most popular subject:

gender is a categorical variable; you can choose a bar chart because you want to see the frequency of each gender.

Observation

You can observe that more males are registered in this institute from the following bar chart:

VisitedResources is numerical, so you can choose a histogram to visualize it.

Observation

It's a bimodal histogram with two peaks, around 12 and 85.

Two-Dimensional Plots

Take a look at the following 2D plots:

Plot 1:

Plot 2:

Plot 3:

Observations

  • I see that there is a weak positive correlation between AnnouncementsView and VisitedResources.
  • Students in Math hardly visit resources; the median is the lowest, at about 12.5.
  • Females participate in discussions more frequently, as their median and maximum are higher.
  • People in Biology visit resources the most.
  • The median number of discussions for females is 37.5.

Three-Dimensional Plots

It is also possible to plot using three-dimensional vectors. This creates a three-dimensional plot, which provides enhanced visualization for applications (for example, displaying three-dimensional spaces). Essentially, it is the graph of a function of two variables, embedded into a three-dimensional environment.


Read more about three-dimensional plots at: https://octave.org/doc/v4.2.0/Three_002dDimensional-Plots.html.
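As a small illustration in R, the following sketch draws a surface with persp() from base R graphics, because ggplot2 itself concentrates on two-dimensional graphics. The function being plotted is an arbitrary example chosen here, not something from the book.

```r
# Grid of x and y values; z holds the value of the function at every
# (x, y) pair. The bell-shaped function is an arbitrary example.
x <- seq(-3, 3, length.out = 40)
y <- seq(-3, 3, length.out = 40)
z <- outer(x, y, function(x, y) exp(-(x^2 + y^2) / 2))

# persp() draws the surface z = f(x, y); theta and phi set the viewing angle.
persp(x, y, z, theta = 30, phi = 25,
      xlab = "x", ylab = "y", zlab = "f(x, y)")
```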

The Grammar of Graphics

The Grammar of Graphics is the language used to describe the various components of a graphic that represent the data in a visualization. Here, we will explore a few aspects of the Grammar of Graphics, building upon some of the features in the graphics that we created in the previous topic. For example, a typical histogram has various components, as follows:

  • The data itself (x)
  • Bars representing the frequency of x at different values of x
  • The scaling of the data (linear)
  • The coordinate system (Cartesian)

All of these aspects are part of the Grammar of Graphics, and we will change these aspects to provide better visualization. In this chapter, we will work with some of the aspects; we will explore them further in the next chapter.
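To make these components concrete, the sketch below builds the same histogram twice: once with the defaults left implicit, and once with the scale and coordinate system written out. The data is synthetic and the variable names are chosen here for illustration only.

```r
library(ggplot2)

set.seed(3)
df <- data.frame(x = rnorm(500, mean = 60, sd = 10))  # made-up values

# Default call: the scale and coordinate system are implied.
p_default <- ggplot(df, aes(x = x)) + geom_histogram(bins = 30)

# The same plot with the usually implicit components written out.
p_explicit <- ggplot(data = df, mapping = aes(x = x)) +  # the data (x)
  geom_histogram(bins = 30) +                            # bars showing frequency
  scale_x_continuous() +                                 # linear scaling
  coord_cartesian()                                      # Cartesian coordinates

# Swapping a single component changes the graphic, for example a
# square-root scale for the counts.
p_sqrt <- p_explicit + scale_y_sqrt()

print(p_explicit)
```

Both p_default and p_explicit should render the same picture; writing the components out only shows where each one can be replaced.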


Read more about the Grammar of Graphics at https://cfss.uchicago.edu/dataviz_grammar_of_graphics.html.

Rebinning

In a histogram, data is grouped into intervals, or ranges of values, called bins. By default, geom_histogram() uses 30 bins, but the default may not be the best choice every time. Having too many bins can make the plot noisy and hide the shape of the distribution, while having too few bins can distort it. It is sometimes necessary to rebin a histogram in order to get a smooth distribution.
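The effect of the bin count can be seen by plotting the same values several times. This is a sketch on synthetic data: df_demo and its humidity column are hypothetical names with made-up values, not the book's df_hum dataset.

```r
library(ggplot2)

set.seed(4)
# Made-up, humidity-like values rounded to whole numbers, so the
# readings are discrete.
df_demo <- data.frame(humidity = round(rnorm(1000, mean = 75, sd = 12)))

# Too many bins: discrete values show up as isolated spikes.
p_many <- ggplot(df_demo, aes(x = humidity)) + geom_histogram(bins = 200)

# Too few bins: the shape of the distribution is lost.
p_few <- ggplot(df_demo, aes(x = humidity)) + geom_histogram(bins = 3)

# A moderate number of bins gives a smoother picture.
p_mid <- ggplot(df_demo, aes(x = humidity)) + geom_histogram(bins = 15)

# binwidth is an alternative to bins: it fixes the bin width in data units.
p_width <- ggplot(df_demo, aes(x = humidity)) + geom_histogram(binwidth = 5)

print(p_many)
print(p_few)
print(p_mid)
print(p_width)
```

Every version counts the same 1,000 values; only the grouping changes, which is why the choice of bins affects how the distribution appears.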

Analyzing Various Histograms

Let's use the humidity data and the first plot that we created. It looks like the humidity values are discrete, which is why you can see discrete peaks in the data. In this section, we'll analyze the differences between unbinned and binned histograms.

Let's begin by implementing the following steps:

  1. Rebin the histogram to 15 bins by using the following code:
ggplot(df_hum, aes(x = Vancouver)) + geom_histogram(bins = 15)

You'll get the following output. Graph 1:

Graph 2:


Choosing a different type of binning can make the distribution more continuous, and one can then better understand the distribution shape. We will now build upon the graph, changing some features and adding more layers.
  2. Change the fill color to white by using the following command:
ggplot(df_hum, aes(x = Vancouver)) + geom_histogram(bins = 15, fill = "white", color = 1)
  3. Add a title to the histogram by appending the following to the command:
+ ggtitle("Humidity for Vancouver city")
  4. Change the x-axis label and the axis text sizes by appending the following:
+ xlab("Humidity") + theme(axis.text.x = element_text(size = 12), axis.text.y = element_text(size = 12))

You should see the following output:


The full command should look as follows:

ggplot(df_hum, aes(x = Vancouver)) +
  geom_histogram(bins = 15, fill = "white", color = 1) +
  ggtitle("Humidity for Vancouver city") +
  xlab("Humidity") +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12))

We can see that the second plot is a visual improvement, due to the following factors:

  • There is a title
  • The axis text is larger and easier to read
  • The white fill with a black outline makes the histogram look more professional

To see what else can be changed, type ?theme.

Changing Boxplot Defaults Using the Grammar of Graphics

In this section, we'll use the Grammar of Graphics to change defaults and create a better visualization.

Let's begin by implementing the following steps:

  1. Use the humidity data to create the same boxplot seen in the previous section, for plotting monthly data.
  2. Change the x- and y-axis labels appropriately (the x-axis is the month and the y-axis is the humidity).
  3. Type ?geom_boxplot in the command line, then look for the aesthetics, including the color and the fill color.
  4. Change the color to black and the fill color to green (colors can also be given as numbers; try values from 1 to 6).
  5. Type ?theme to find out how to change the label size to 15. Change the x- and y-axis titles to size 15 and the color to red.

The outcome will be the complete code and the graphic with the correct changes:


Refer to the complete code at https://goo.gl/tu7t4y.
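One possible way to carry out these steps is sketched below. It is not the book's reference solution: the data frame df_box and its columns month and humidity are hypothetical names filled with made-up values, so only the plotting calls carry over to the real humidity data.

```r
library(ggplot2)

set.seed(5)
# Made-up monthly humidity values; month and humidity are hypothetical
# column names used only for this sketch.
df_box <- data.frame(
  month = factor(rep(month.abb, each = 30), levels = month.abb),
  humidity = pmin(100, pmax(0, rnorm(360, mean = 70, sd = 12)))
)

p_box <- ggplot(df_box, aes(x = month, y = humidity)) +
  # color sets the outline and fill sets the interior of each box.
  geom_boxplot(color = "black", fill = "green") +
  xlab("Month") +
  ylab("Humidity") +
  # axis.title.* controls the axis titles; axis.text.* would control tick labels.
  theme(
    axis.title.x = element_text(size = 15, color = "red"),
    axis.title.y = element_text(size = 15, color = "red")
  )

print(p_box)
```

Colors can also be passed as integers, for example fill = 3, in which case R looks the number up in the current palette.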

Activity: Improving the Default Visualization

Scenario

In the previous activity, you made a judicious choice of a geometric object (bar chart or histogram) for a given variable. In this activity, you will see how to improve a visualization. If you are producing plots to look at privately, you might be okay using the default settings. However, when you are creating plots for publication or giving a presentation, or if your company requires a certain theme, you will need to produce more professional plots that adhere to certain visualization rules and guidelines. This activity will help you to improve visuals and create a more professional plot.

Aim

To create improved visualizations by using the Grammar of Graphics.

Steps for Completion

  1. Create two of the plots from the previous activity.
  2. Use the Grammar of Graphics to improve your graphics by layering upon the base graphic.

Refer to the complete code at https://goo.gl/tu7t4y.
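The idea of layering upon a base graphic can be sketched as follows. This is an illustration on synthetic data, not the activity's solution: the values are made up and the title and label text are placeholders.

```r
library(ggplot2)

set.seed(6)
# Synthetic stand-in data; only the column name Topic comes from the text.
df_edu <- data.frame(
  Topic = sample(c("IT", "Math", "Biology"), 150, replace = TRUE)
)

# Step 1: the base graphic with default settings.
p_base <- ggplot(df_edu, aes(x = Topic)) + geom_bar()

# Step 2: add a title, axis labels, and theme changes on top of the base.
p_titled <- p_base +
  ggtitle("Records per topic") +
  xlab("Topic") +
  ylab("Count") +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14))

# Fill and outline are settings of the geometric object itself, so they
# are set in a geom_bar() call rather than added afterwards.
p_improved <- ggplot(df_edu, aes(x = Topic)) +
  geom_bar(fill = "white", color = "black") +
  ggtitle("Records per topic") +
  xlab("Topic") +
  ylab("Count")

print(p_titled)
print(p_improved)
```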

Take a look at the following output, histogram 1:

Histogram 2:

Summary

In this chapter, we covered the basics of ggplot2, distinguishing between different types of variables and introducing the best practices for visualizing them. You created basic one- and two-dimensional plots, then analyzed the differences between them. You used the Grammar of Graphics to change a basic visual into a better, more professional-looking visual.

In the next chapter, we will build upon these skills, uncovering correlations between variables and using statistical summaries to create more advanced plots.


Key benefits

  • Discover the structure of ggplot2, the Grammar of Graphics, and geometric objects
  • Study how to design and implement visualization from scratch
  • Explore the advantages of using advanced plots

Description

Applied Data Visualization with R and ggplot2 introduces you to the world of data visualization by taking you through the basic features of ggplot2. To start with, you’ll learn how to set up the R environment, followed by getting insights into the grammar of graphics and geometric objects before you explore the plotting techniques. You’ll discover what layers, scales, coordinates, and themes are, and study how you can use them to transform your data into aesthetically pleasing graphs. Once you’ve grasped the basics, you’ll move on to studying simple plots such as histograms and advanced plots such as superimposed and density plots. You’ll also get to grips with plotting trends, correlations, and statistical summaries. By the end of this book, you’ll have created data visualizations that will impress your clients.

Who is this book for?

Applied Data Visualization with R and ggplot2 is for you if you are a professional working with data and R. This book is also for students who want to enhance their data analysis skills by adding informative and professional visualizations. It is assumed that you know the basics of the R language and its commands and objects.

What you will learn

  • Set up the R environment and RStudio, and understand the structure of ggplot2
  • Distinguish variables and use best practices to visualize them
  • Change visualization defaults to reveal more information about data
  • Implement the grammar of graphics in ggplot2 such as scales and faceting
  • Build complex and aesthetic visualizations with ggplot2 analysis methods
  • Logically and systematically explore complex relationships
  • Compare variables in a single visual, with advanced plotting methods

Product Details

Publication date: Sep 28, 2018
Length: 140 pages
Edition: 1st
Language: English
ISBN-13: 9781789612158
Vendor: RStudio

Table of Contents

5 Chapters
Basic Plotting in ggplot2
Grammar of Graphics and Visual Components
Advanced Geoms and Statistics
Solutions
Other Books You May Enjoy