Plotting with pandas and seaborn
Now that we have a basic sense of how to load and handle data in a pandas DataFrame object, let's get started with making some simple plots from data. While there are several plotting libraries in Python (including matplotlib, plotly, and seaborn), in this chapter, we will mainly explore the pandas and seaborn libraries, which are extremely useful, popular, and easy to use.
Creating Simple Plots to Visualize a Distribution of Variables
matplotlib is a plotting library available in most Python distributions and is the foundation for several plotting packages, including the built-in plotting functionality of pandas and seaborn. matplotlib enables control of every single aspect of a figure and is known to be verbose. Both seaborn and pandas visualization functions are built on top of matplotlib. The built-in plotting tool of pandas is a useful exploratory tool for generating figures that are not ready for prime time but help you understand the dataset you are working with. seaborn, on the other hand, has APIs to draw a wide variety of aesthetically pleasing plots.
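To make the relationship with matplotlib concrete, here is a minimal sketch showing that a pandas plotting call hands back a matplotlib object, which can then be customized with ordinary matplotlib methods. The toy_df values are invented for illustration, and the Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy_df = pd.DataFrame({"carat": [0.23, 0.31, 0.70, 0.71, 1.01, 1.50]})

# pandas draws the histogram and returns a matplotlib Axes object
ax = toy_df["carat"].plot(kind="hist", bins=3)

# From here on, ordinary matplotlib calls customize the figure
ax.set_xlabel("carat")
ax.set_title("Toy histogram")

returns_matplotlib_axes = isinstance(ax, matplotlib.axes.Axes)
print(returns_matplotlib_axes)
plt.close("all")
```

In a notebook you would normally skip the backend line and let the figure render inline.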
To illustrate certain key concepts and explore the diamonds dataset, we will start with two simple visualizations in this chapter: histograms and bar plots.
Histograms
A histogram of a feature is a plot with the range of the feature on the x-axis and the count of data points with the feature in the corresponding range on the y-axis.
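The counting behind a histogram can be sketched without drawing anything. The example below uses NumPy's histogram function on a handful of invented carat-like values (they are not taken from the diamonds dataset) and prints how many values land in each of three equally sized bins.

```python
import numpy as np

# Hypothetical carat-like values, invented for illustration
values = np.array([0.2, 0.3, 0.3, 0.4, 0.7, 0.9, 1.0, 1.0, 1.2, 2.0])

# Split the range from min to max into 3 equally sized bins and count per bin
counts, bin_edges = np.histogram(values, bins=3)

# Each bin includes its left edge; the last bin also includes its right edge
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count} values")
```

A histogram plot is essentially these counts drawn as adjacent bars, one per bin.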
Let's look at the following exercise of plotting a histogram with pandas.
Exercise 8: Plotting and Analyzing a Histogram
In this exercise, we will create a histogram of the number of diamonds in the dataset, with their carat values on the x-axis:
- Import the necessary modules:
    import seaborn as sns
    import pandas as pd
- Import the diamonds dataset from seaborn:
    diamonds_df = sns.load_dataset('diamonds')
- Plot a histogram using the diamonds dataset, with carat on the x-axis:
    diamonds_df.hist(column='carat')
The output is as follows:
Figure 1.14: Histogram plot
The y axis in this plot denotes the number of diamonds in the dataset with the carat specification on the x-axis.
The hist function has a parameter called bins, which literally refers to the number of equally sized bins into which the data points are divided. By default, the bins parameter is set to 10 in pandas. We can change this to a different number, if we wish.
- Change the bins parameter to 50:
    diamonds_df.hist(column='carat', bins=50)
The output is as follows:
Figure 1.15: Histogram with bins = 50
This is a histogram with 50 bins. Notice how we can see a more fine-grained distribution as we increase the number of bins. It is helpful to try several bin counts to get a reliable picture of the distribution of the feature. The number of bins can range from 1 (where all values are in the same bin) to the number of values (where each value of the feature is in its own bin).
- Now, let's look at the same function using seaborn (in seaborn 0.11 and later, distplot is deprecated in favor of histplot and displot):
    sns.distplot(diamonds_df.carat)
The output is as follows:
Figure 1.16: Histogram plot using seaborn
There are two noticeable differences between the pandas hist function and seaborn distplot:
  - pandas sets the bins parameter to a default of 10, but seaborn infers an appropriate bin size based on the statistical distribution of the dataset.
  - By default, the distplot function also includes a smoothed curve over the histogram, called a kernel density estimation.
The kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Usually, a KDE doesn't tell us anything more than what we can infer from the histogram itself. However, it is helpful when comparing multiple histograms on the same plot. If we want to remove the KDE and look at the histogram alone, we can use the kde=False parameter.
- Set kde=False to remove the KDE:
    sns.distplot(diamonds_df.carat, kde=False)
The output is as follows:
Figure 1.17: Histogram plot with KDE = false
Also note that the bins parameter rendered a more detailed plot when the number of bins was increased from 10 to 50. Now, let's try to increase it to 100.
- Increase the number of bins to 100:
    sns.distplot(diamonds_df.carat, kde=False, bins=100)
The output is as follows:
Figure 1.18: Histogram plot with increased bin size
The histogram with 100 bins shows a better visualization of the distribution of the variable: we see there are several peaks at specific carat values. Another observation is that most carat values are concentrated toward lower values and the tail is on the right; in other words, it is right-skewed.
A log transformation helps in identifying more trends. For instance, in the following graph, the x-axis shows log-transformed values of the price variable, and we see that there are two peaks indicating two kinds of diamonds: one with a high price and another with a low price.
- Use a log transformation on the histogram:
    import numpy as np
    sns.distplot(np.log(diamonds_df.price), kde=False)
The output is as follows:
Figure 1.19: Histogram using a log transformation
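One way to see why a log transformation helps with right-skewed features is to measure the skewness before and after. The sketch below does this on synthetic log-normal "prices" generated with a fixed seed; these are stand-in values chosen for illustration, not the diamonds prices, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical right-skewed "prices" (log-normal), not the diamonds data
prices = pd.Series(rng.lognormal(mean=8.0, sigma=0.9, size=5000))
log_prices = np.log(prices)

# Positive skewness means a long right tail; values near 0 mean rough symmetry
raw_skew = prices.skew()
log_skew = log_prices.skew()
print(f"skewness of raw prices: {raw_skew:.2f}")
print(f"skewness of log prices: {log_skew:.2f}")
```

Because the logarithm compresses large values more than small ones, the long right tail is pulled in and structure near the bulk of the data becomes easier to see.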
That's pretty neat. Looking at the histogram, even a naive viewer immediately gets a picture of the distribution of the feature. Specifically, three observations are important in a histogram:
- Which feature values are more frequent in the dataset (in this case, there is a peak at around 6.8 and another peak between 8.5 and 9; note that the values on the x-axis are log(price))
- How many peaks exist in the data (the peaks need to be further inspected for possible causes in the context of the data)
- Whether there are any outliers in the data
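Returning to the kernel density estimate that distplot draws over a histogram, the idea can be sketched by hand: place one small Gaussian bump on every data point and average them. The function name gaussian_kde_sketch, the synthetic samples, and the hand-picked bandwidth of 0.3 are all assumptions made for illustration; plotting libraries normally choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # Scaled distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return bumps.mean(axis=1) / bandwidth

rng = np.random.default_rng(seed=1)
samples = rng.normal(loc=0.0, scale=1.0, size=500)  # hypothetical data

grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.3)

# A density should integrate to roughly 1 over its support
step = grid[1] - grid[0]
area = float(density.sum() * step)
print(f"area under the estimated density: {area:.3f}")
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths over peaks, which is the same trade-off as choosing the number of bins in a histogram.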
Bar Plots
Another type of plot we will look at in this chapter is the bar plot.
In their simplest form, bar plots display counts of categorical variables. More broadly, bar plots are used to depict the relationship between a categorical variable and a numerical variable. Histograms, meanwhile, are plots that show the statistical distribution of a continuous numerical feature.
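The two uses of a bar plot described above correspond to two different summaries of the data. The sketch below computes both on a tiny invented table (toy_df is hypothetical, not the diamonds dataset): a count per category, and a mean of a numerical column per category.

```python
import pandas as pd

# Hypothetical miniature of a diamonds-like table
toy_df = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Good", "Premium", "Ideal", "Good"],
    "price": [500, 700, 400, 900, 600, 600],
})

# Simplest bar plot: how many rows fall in each category
cut_counts = toy_df["cut"].value_counts()

# More general bar plot: one numerical summary per category
mean_price_by_cut = toy_df.groupby("cut")["price"].mean()

print(cut_counts)
print(mean_price_by_cut)
```

Each of these results has one number per category, which is exactly what the height of each bar encodes.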
Let's see an exercise of bar plots in the diamonds dataset. First, we shall present the counts of diamonds of each cut quality that exist in the data. Second, we shall look at the price associated with the different types of cut quality (Ideal, Good, Premium, and so on) in the dataset and find out the mean price distribution. We will use both pandas and seaborn to get a sense of how to use the built-in plotting functions in both libraries.
Before generating the plots, let's look at the unique values in the cut and clarity columns, just to refresh our memory.
Exercise 9: Creating a Bar Plot and Calculating the Mean Price Distribution
In this exercise, we'll learn how to create a table using the pandas crosstab function. We'll use a table to generate a bar plot. We'll then explore a bar plot generated using the seaborn library and calculate the mean price distribution. To do so, let's go through the following steps:
- Import the necessary modules:
    import seaborn as sns
    import pandas as pd
- Import the diamonds dataset from seaborn:
    diamonds_df = sns.load_dataset('diamonds')
- Print the unique values of the cut column:
    diamonds_df.cut.unique()
The output will be as follows:
array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)
- Print the unique values of the clarity column:
    diamonds_df.clarity.unique()
The output will be as follows:
array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'], dtype=object)
Note
unique() returns an array. There are five unique cut qualities and eight unique values in clarity. The number of unique values can be obtained using nunique() in pandas.
- To obtain the counts of diamonds of each cut quality, we first create a table using the pandas crosstab() function:
    cut_count_table = pd.crosstab(index=diamonds_df['cut'], columns='count')
    cut_count_table
The output will be as follows:
Figure 1.20: Table using the crosstab function
- Pass these counts to another pandas function, plot(kind='bar'):
    cut_count_table.plot(kind='bar')
The output will be as follows:
Figure 1.21: Bar plot using a pandas DataFrame
We see that most of the diamonds in the dataset are of the Ideal cut quality, followed by Premium, Very Good, Good, and Fair. Now, let's see how to generate the same plot using seaborn.
- Generate the same bar plot using seaborn (the column is passed by keyword as x="cut", since recent seaborn versions no longer accept it positionally):
    sns.catplot(x="cut", data=diamonds_df, aspect=1.5, kind="count", color="b")
The output will be as follows:
Figure 1.22: Bar plot using seaborn
Notice how the catplot() function does not require us to create the intermediate count table (using pd.crosstab()), which removes one step from the plotting process.
- Next, here is how we obtain the mean price distribution of different cut qualities using seaborn:
    import seaborn as sns
    from numpy import mean
    sns.set(style="whitegrid")
    ax = sns.barplot(x="cut", y="price", data=diamonds_df, estimator=mean)
The output will be as follows:
Figure 1.23: Bar plot with the mean price distribution
Here, the black lines (error bars) on the rectangles indicate the uncertainty (or spread of values) around the mean estimate. By default, this value is set to 95% confidence. How do we change it? We use the ci=68 parameter, for instance, to set it to 68%. We can also plot the standard deviation in the prices using ci='sd'. In seaborn 0.12 and later, the ci parameter is deprecated and the same settings are written as errorbar=('ci', 68) and errorbar='sd'.
- Reorder the x axis bars using order:
    ax = sns.barplot(x="cut", y="price", data=diamonds_df, estimator=mean, ci=68, order=['Ideal','Good','Very Good','Fair','Premium'])
The output will be as follows:
Figure 1.24: Bar plot with proper order
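The confidence intervals drawn as error bars are estimated by bootstrapping, that is, by resampling the data many times and looking at how much the mean moves. The following is a minimal sketch of a percentile bootstrap written by hand; it is not seaborn's own code, and the bootstrap_ci helper and the synthetic prices are assumptions made for illustration.

```python
import numpy as np

def bootstrap_ci(values, level=95, n_boot=1000, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

rng = np.random.default_rng(seed=42)
prices = rng.gamma(shape=2.0, scale=2000.0, size=300)  # hypothetical prices

ci95 = bootstrap_ci(prices, level=95)
ci68 = bootstrap_ci(prices, level=68)
print("mean:", round(prices.mean(), 1))
print("95% interval:", ci95)
print("68% interval:", ci68)
```

The 68% interval is narrower than the 95% interval, which is why the error bars shrink when the confidence level is lowered.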
Grouped bar plots can be very useful for visualizing the variation of a particular feature within different groups. Now that you have looked into tweaking the plot parameters in a bar plot, let's see how to generate a bar plot grouped by a specific feature.
Exercise 10: Creating Bar Plots Grouped by a Specific Feature
In this exercise, we will use the diamonds dataset to generate the distribution of prices with respect to color for each cut quality. In Exercise 9, Creating a Bar Plot and Calculating the Mean Price Distribution, we looked at the price distribution for diamonds of different cut qualities. Now, we would like to look at the variation in each color:
- Import the necessary modules (in this case, only seaborn):
    # Import seaborn
    import seaborn as sns
- Load the dataset:
    diamonds_df = sns.load_dataset('diamonds')
- Use the hue parameter to plot nested groups:
    ax = sns.barplot(x="cut", y="price", hue='color', data=diamonds_df)
The output is as follows:
Figure 1.25: Grouped bar plot with legends
Here, we can observe that the price patterns for diamonds of different colors are similar for each cut quality. For instance, the way the mean price varies across colors for Ideal diamonds is much the same as for Premium and the other cut qualities.
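The numbers behind a grouped bar plot form a small two-way table: one row per group on the x-axis and one column per hue level. The sketch below builds such a table with pandas on an invented toy_df (hypothetical values, not the diamonds dataset), which can be handy for checking what the bars actually represent.

```python
import pandas as pd

# Hypothetical miniature of a diamonds-like table
toy_df = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Ideal", "Ideal",
            "Premium", "Premium", "Premium", "Premium"],
    "color": ["D", "D", "J", "J", "D", "D", "J", "J"],
    "price": [300, 500, 800, 1000, 400, 600, 900, 1100],
})

# Mean price for every (cut, color) pair, with colors spread across columns
mean_price_table = (
    toy_df.groupby(["cut", "color"])["price"]
    .mean()
    .unstack("color")
)
print(mean_price_table)
```

Calling mean_price_table.plot(kind="bar") on such a table would draw one cluster of bars per cut, with one bar per color in each cluster.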