
Big Data Analysis with Python: Combine Spark and Python to unlock the powers of parallel computing and machine learning


Chapter 2. Statistical Visualizations

Note

Learning Objectives

We will start our journey by understanding how Python can be used to manipulate and visualize data in order to produce useful analyses.

By the end of this chapter, you will be able to:

  • Use graphs for data analysis

  • Create graphs of various types

  • Change graph parameters such as color, title, and axis

  • Export graphs for presentation, printing, and other uses

Note

In this chapter, we will illustrate how to generate visualizations with Matplotlib and Seaborn.

Introduction


In the last chapter, we learned about the libraries that are most commonly used for data science work with Python. Although they are not big data libraries per se, the libraries of the Python Data Science Stack (NumPy, Jupyter, IPython, Pandas, and Matplotlib) are important in big data analysis.

As we will demonstrate in this chapter, no analysis is complete without visualizations, even with big datasets, so knowing how to generate images and graphs from data in Python is relevant for our goal of big data analysis. In the subsequent chapters, we will demonstrate how to process large volumes of data and aggregate it to visualize it using Python tools.

There are several visualization libraries for Python, such as Plotly and Bokeh, but one of the oldest, most flexible, and most used is Matplotlib. Before going through the details of creating a graph with Matplotlib, let's first understand what kinds of graphs are relevant for analysis.

Types of Graphs and When to Use Them


Every analysis, whether on small or large datasets, involves a descriptive statistics step, where the data is summarized and described by statistics such as mean, median, percentages, and correlation. This step is commonly the first step in the analysis workflow, allowing a preliminary understanding of the data and its general patterns and behaviors, providing grounds for the analyst to formulate hypotheses, and directing the next steps in the analysis. Graphs are powerful tools to aid in this step, enabling the analyst to visualize the data, create new views and concepts, and communicate them to a larger audience.
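As a minimal sketch of the descriptive statistics step described above, the following computes a mean, a median, percentages, and a correlation with pandas. The DataFrame and its values are invented for illustration and are not taken from the book's datasets.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
df = pd.DataFrame({
    "weight": [2100, 2500, 3000, 3500, 4000],
    "horsepower": [70, 90, 110, 150, 180],
    "origin": ["EU", "EU", "US", "US", "US"],
})

mean_weight = df["weight"].mean()                             # arithmetic mean
median_hp = df["horsepower"].median()                         # middle value
origin_pct = df["origin"].value_counts(normalize=True) * 100  # percentage per category
corr = df["weight"].corr(df["horsepower"])                    # Pearson correlation

print(mean_weight, median_hp)
print(origin_pct)
print(round(corr, 3))
```

Summaries like these usually suggest which graph to draw next: a strong correlation, for instance, is a natural candidate for a scatter plot.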

There is a vast statistical literature on visualizing information. The classic book, Envisioning Information, by Edward Tufte, demonstrates beautiful and useful examples of how to present information in graphical form. In another book, The Visual Display of Quantitative Information, Tufte enumerates a few qualities that a graph...

Components of a Graph


Each graph has a set of common components that can be adjusted. The names that Matplotlib uses for these components are demonstrated in the following graph:

Figure 2.3: Components of a graph

The components of a graph are as follows:

  • Figure: The base of the graph, where all the other components are drawn.

  • Axes: Contains the plot elements and sets the coordinate system.

  • Title: The title gives the graph its name.

  • X-axis label: The name of the x-axis, usually named with the units.

  • Y-axis label: The name of the y-axis, usually named with the units.

  • Legend: A description of the data plotted in the graph, allowing you to identify the curves and points in the graph.

  • Ticks and tick labels: They indicate the points of reference on a scale for the graph, where the values of the data are. The labels indicate the values themselves.

  • Line plots: These are the lines that are plotted with the data.

  • Markers: Markers are the pictograms that mark the point data.

  • Spines: The lines that delimit the...
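The components listed above can be sketched in code roughly as follows. This is an illustrative example with made-up data, not the code behind Figure 2.3; the non-interactive Agg backend is an assumption so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

fig, ax = plt.subplots()                 # the Figure and the Axes drawn on it
line, = ax.plot(x, y, marker="o",        # a line plot with a marker at each point
                label="y = x squared")
ax.set_title("Components of a graph")    # title
ax.set_xlabel("x (units)")               # x-axis label
ax.set_ylabel("y (units)")               # y-axis label
ax.set_xticks(x)                         # tick positions on the x-axis
ax.legend()                              # legend built from the line's label
ax.spines["top"].set_visible(False)      # hide two of the four spines
ax.spines["right"].set_visible(False)
```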

Seaborn


Seaborn (https://seaborn.pydata.org/) is part of the PyData family of tools and is a visualization library based on Matplotlib with the goal of creating statistical graphs more easily. It can operate directly on DataFrames and series, doing aggregations and mapping internally. Seaborn uses color palettes and styles to make visualizations consistent and more informative. It also has functions that can calculate some statistics, such as regression, estimation, and errors. Some specialized plots, such as violin plots and multi-facet plots, are also easy to create with Seaborn.
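As a small, hedged illustration of Seaborn operating directly on a DataFrame, the sketch below draws a boxplot from column names alone. The data is invented, and the Agg backend is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical tidy data: one row per observation.
df = pd.DataFrame({
    "origin": ["EU", "EU", "EU", "US", "US", "US"],
    "horsepower": [70, 85, 90, 130, 150, 180],
})

fig, ax = plt.subplots()
# Seaborn groups the rows by "origin" and computes the box statistics internally.
sns.boxplot(data=df, x="origin", y="horsepower", ax=ax)
```

Note that the axis labels are filled in from the column names without any extra calls.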

Which Tool Should Be Used?


Seaborn tries to make the creation of some common analysis graphs easier than using Matplotlib directly. Matplotlib can be considered more low-level than Seaborn, and although this makes it a bit more cumbersome and verbose, it gives analysts much more flexibility. Some graphs, which with Seaborn are created with one function call, would take several lines of code to achieve using Matplotlib.

There is no rule to determine whether an analyst should use only the pandas plotting interface, Matplotlib directly, or Seaborn. Analysts should keep in mind the visualization requirements and the level of configuration required to create the desired graph.

The pandas plotting interface is easier to use but is more constrained and limited. Seaborn has several graph patterns ready to use, including common statistical graphs such as pair plots and boxplots, but requires the data to be in tidy format and is more opinionated on how the graphs should look. Matplotlib...
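To make the trade-off concrete, here is a sketch that draws the same scatter plot twice: once through the pandas plotting interface and once with Matplotlib directly. The data is invented for illustration, and the Agg backend is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "weight": [2100, 2500, 3000, 3500, 4000],
    "horsepower": [70, 90, 110, 150, 180],
})

fig, (ax_pandas, ax_mpl) = plt.subplots(1, 2, figsize=(10, 4))

# pandas: one call, axis labels taken from the column names.
df.plot(kind="scatter", x="weight", y="horsepower", ax=ax_pandas)

# Matplotlib: more verbose, but every element is set explicitly.
ax_mpl.scatter(df["weight"], df["horsepower"])
ax_mpl.set_xlabel("weight")
ax_mpl.set_ylabel("horsepower")
```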

Types of Graphs


The first type of graph that we will present is the line graph or line chart. A line graph displays data as a series of interconnected points on two axes (x and y), usually Cartesian, and commonly ordered by the x-axis. Line charts are useful for showing trends in data, such as in time series.

A graph related to the line graph is the scatter plot. A scatter plot represents the data as points in Cartesian coordinates. Usually, two variables are shown in this graph, although more information can be conveyed if the data is color-coded or size-coded by category. Scatter plots are useful for showing the relationship and possible correlation between variables.

Histograms are useful for representing the distribution of data. Unlike the two previous examples, histograms show only one variable, usually on the x-axis, while the y-axis shows the frequency of occurrence of the data. The process of creating a histogram is a bit more involved than the line...
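The three graph types described so far can be sketched side by side with Matplotlib. The data below is randomly generated for illustration, and the choice of 10 bins and the Agg backend are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: a noisy upward trend over 50 time steps.
rng = np.random.default_rng(0)
x = np.arange(50)
trend = 0.5 * x + rng.normal(0, 2, size=50)

fig, (ax_line, ax_scatter, ax_hist) = plt.subplots(1, 3, figsize=(12, 3))

ax_line.plot(x, trend)            # line graph: points connected in x order
ax_line.set_title("Line")

ax_scatter.scatter(x, trend)      # scatter plot: one point per (x, y) pair
ax_scatter.set_title("Scatter")

# Histogram: a single variable is binned and the count per bin is drawn.
counts, bin_edges, _ = ax_hist.hist(trend, bins=10)
ax_hist.set_title("Histogram")
```

The values returned by `hist` show the extra step a histogram involves: the data is first divided into bins, and only the counts per bin are plotted.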

Pandas DataFrames and Grouped Data


As we learned in the previous chapter, when analyzing data with pandas, we can use the plot functions from pandas or use Matplotlib directly. Pandas uses Matplotlib under the hood, so the two integrate well. Depending on the situation, we can either plot directly from pandas or create a figure and an axes with Matplotlib and pass it to pandas to plot. For example, when doing a GroupBy, we separate the data by a GroupBy key. But how can we plot the results of GroupBy? We have a few approaches at our disposal. We can, for example, use pandas directly, if the DataFrame is already in the right format:

Note

The following code is a sample and will not get executed.

import matplotlib.pyplot as plt
import pandas as pd

fig, ax = plt.subplots()
df = pd.read_csv('data/dow_jones_index.data')
df[df.stock.isin(['MSFT', 'GE', 'PG'])].groupby('stock')['volume'].plot(ax=ax)

Or we can just plot each GroupBy key on the same plot:

fig, ax = plt.subplots()
df.groupby('stock').volume.plot(ax=ax)
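Because the samples above depend on a data file, here is a self-contained sketch of another approach: looping over the groups and drawing each one on a shared axes. The values and the `week` column are invented for illustration and do not come from the Dow Jones dataset; the Agg backend is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; "week" is an invented column used as the x-axis.
df = pd.DataFrame({
    "stock": ["MSFT", "MSFT", "MSFT", "GE", "GE", "GE"],
    "week": [1, 2, 3, 1, 2, 3],
    "volume": [100, 120, 90, 200, 180, 210],
})

fig, ax = plt.subplots()
# groupby yields (key, sub-DataFrame) pairs, sorted by key by default.
for name, group in df.groupby("stock"):
    ax.plot(group["week"], group["volume"], label=name)

ax.set_xlabel("week")
ax.set_ylabel("volume")
ax.legend()
```

Looping explicitly is more verbose than calling `plot` on the GroupBy object, but it gives direct control over the label and style of each line.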

For the following...

Changing Plot Design: Modifying Graph Components


So far, we've looked at the main graphs used in analyzing data, either directly or grouped, for comparison and trend visualization. But one thing that we can see is that the design of each graph is different from the others, and we don't have basic things such as a title and legends.

We've learned that a graph is composed of several components, such as a graph title, x and y labels, and so on. When using Seaborn, the graphs already have x and y labels, with the names of the columns. With Matplotlib, we don't have this. These changes are not only cosmetic.

Besides labels and titles, adjusting things such as line width, color, and point size can greatly improve how well a graph is understood. A graph must be able to stand on its own, so title, legends, and units are paramount. How can we apply the concepts that we described previously to make good, informative graphs with Matplotlib and Seaborn?
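One possible answer, sketched with Matplotlib under stated assumptions: the data, the units (lb and hp), and the specific color and size values are all invented for illustration, and the Agg backend is assumed so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

# Hypothetical data; the units below are assumed for the example.
weight = [2100, 2500, 3000, 3500, 4000]
horsepower = [70, 90, 110, 150, 180]

fig, ax = plt.subplots(figsize=(6, 4))
points = ax.scatter(
    weight, horsepower,
    s=60,            # point size
    c="tab:red",     # point color
    alpha=0.7,       # transparency, helps when points overlap
    label="cars",
)
ax.set_title("Horsepower versus weight")
ax.set_xlabel("Weight (lb)")        # axis labels carry the units
ax.set_ylabel("Horsepower (hp)")
ax.legend(loc="upper left")
```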

The possible number of ways that plots can...

Exporting Graphs


After generating our visualizations and configuring the details, we can export our graphs to an image file format, such as PNG, JPEG, or SVG. If we are using the interactive API in the notebook, we can just call the savefig function from the pyplot interface, and the last generated graph will be exported to the file:

df.plot(kind='scatter', x='weight', y='horsepower', figsize=(20,10))
plt.savefig('horsepower_weight_scatter.png')

Figure 2.26: Exporting the graphs

All plot configurations will be carried over to the exported file. To export a graph when using the object-oriented API, we can call savefig from the figure:

fig, ax = plt.subplots()
df.plot(kind='scatter', x='weight', y='horsepower', figsize=(20,10), ax=ax)
fig.savefig('horsepower_weight_scatter.jpg')

Figure 2.27: Saving the graph

We can change some parameters for the saved image:

  • dpi: Adjust the saved image resolution.

  • facecolor: The face color of the figure.

  • edgecolor: The edge color of the figure, around the graph.

  • format: Usually PNG...
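The parameters listed above can be passed to `savefig` as keyword arguments. The sketch below writes to an in-memory buffer instead of a file so that it leaves nothing on disk; the data, the figure size, and the parameter values are assumptions chosen for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

# Made-up data on a 4 x 3 inch figure.
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0, 1, 4])

buf = io.BytesIO()  # stands in for a file name such as 'graph.png'
fig.savefig(
    buf,
    dpi=150,             # resolution: 4 in x 150 dpi gives 600 pixels wide
    facecolor="white",   # background color of the figure
    edgecolor="black",   # color of the figure's outer edge
    format="png",        # output format, needed here since there is no file extension
)
png_bytes = buf.getvalue()
```

When a file name is given instead of a buffer, the format can usually be inferred from the extension.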

Summary


In this chapter, we have seen the importance of creating meaningful and interesting visualizations when analyzing data. A good data visualization can immensely help the analyst's job, representing data in a way that can reach larger audiences and explain concepts that could be hard to translate into words or to represent with tables.

A graph, to be effective as a data visualization tool, must show the data, avoid distortions, make understanding large datasets easy, and have a clear purpose, such as description or exploration. The main goal of a graph is to communicate data, so the analyst must keep that in mind when creating a graph. A useful graph is more desirable than a beautiful one.

We demonstrated some kinds of graphs commonly used in analysis: the line graph, the scatter plot, the histogram, and the boxplot. Each graph has its purpose and application, depending on the data and the goal. We have also shown how to create graphs directly from Matplotlib, from pandas, or a combination...


Key benefits

  • Get a hands-on, fast-paced introduction to the Python data science stack
  • Explore ways to create useful metrics and statistics from large datasets
  • Create detailed analysis reports with real-world data

Description

Processing big data in real time is challenging due to scalability, information inconsistency, and fault tolerance. Big Data Analysis with Python teaches you how to use tools that can control this data avalanche for you. With this book, you'll learn practical techniques to aggregate data into useful dimensions for later analysis, extract statistical measurements, and transform datasets into features for other systems. The book begins with an introduction to data manipulation in Python using pandas. You'll then get familiar with statistical analysis and plotting techniques. With multiple hands-on activities in store, you'll be able to analyze data that is distributed on several computers by using Dask. As you progress, you'll study how to aggregate data for plots when the entire dataset does not fit in memory. You'll also explore Hadoop (HDFS and YARN), which will help you tackle larger datasets. The book also covers Spark and explains how it interacts with other tools. By the end of this book, you'll be able to bootstrap your own Python environment, process large files, and manipulate data to generate statistics, metrics, and graphs.

Who is this book for?

Big Data Analysis with Python is designed for Python developers, data analysts, and data scientists who want to get hands-on with methods to control data and transform it into impactful insights. Basic knowledge of statistical measurements and relational databases will help you to understand various concepts explained in this book.

What you will learn

  • Use Python to read and transform data into different formats
  • Generate basic statistics and metrics using data on disk
  • Work with computing tasks distributed over a cluster
  • Convert data from various sources into storage or querying formats
  • Prepare data for statistical analysis, visualization, and machine learning
  • Present data in the form of effective visuals

Product Details

Publication date: Apr 10, 2019
Length: 276 pages
Edition: 1st
Language: English
ISBN-13: 9781789950731



Table of Contents

8 Chapters
  1. The Python Data Science Stack
  2. Statistical Visualizations
  3. Working with Big Data Frameworks
  4. Diving Deeper with Spark
  5. Handling Missing Values and Correlation Analysis
  6. Exploratory Data Analysis
  7. Reproducibility in Big Data Analysis
  8. Creating a Full Analysis Report

Customer reviews

Rating distribution
1 out of 5 stars (1 rating)
5 star 0%
4 star 0%
3 star 0%
2 star 0%
1 star 100%
RLKurtz Jul 14, 2020
Rating: 1 out of 5 stars
This book is a broken mess. The sections on theory, for example the discussion on which graph to use where in chapter 2, are fine. But then, suddenly, you are asked to do activities by yourself which the textbook hasn't prepared you for and you have to resort to the appendices to see how it should be done. Sometimes (e.g. activity 7) examples just don't work - the code is completely broken and the book doesn't prepare you for how to resolve it. No amount of googling is able to put you on the right track either. Exercise 23 just ends in a "type error" as lambda takes 0 positional arguments but 1 was given which, if the forums are to be believed, is because the function is no longer supported by Python. Next you are doing exercises and can follow along the examples in the chapter and all is fine again, but then functionality is introduced which isn't explained and you end up parroting the text. The text makes you install Hadoop, which with the problem solving took me 3 hours, and then moves unceremoniously on to Spark, which with the errors in the text is its own can of worms without ever going into Hadoop at all. This book just comes across as poorly edited and with little to no quality control resulting in frustration. Give it a pass.
Amazon Verified review

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal)
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats does Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are priced lower than print editions
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.