Hands-On Data Analysis with NumPy and pandas

Setting Up a Python Data Analysis Environment

In this chapter, we will cover the following topics:

Installing Anaconda
Exploring Jupyter Notebooks
Exploring alternatives to Jupyter
Managing the Anaconda package
Setting up a database

In this chapter, we'll discuss installing Anaconda and managing it. Anaconda is a software package we will use in the following chapters of this book.

Exploring Jupyter Notebooks

In this section, we will be exploring Jupyter Notebooks, the primary tool with which we will do data analysis with Python. We will see what Jupyter Notebooks are, and we will also talk about Markdown, which is what we use to create formatted text in Jupyter Notebooks. In a Jupyter Notebook, there are two types of blocks. There are blocks of Python code that are executable, and then there are formatted, human-readable text blocks.

Users execute the Python code blocks, and the results are inserted directly into the document. Code blocks can be rerun in any order, but rerunning one block does not update the results of later blocks unless those blocks are rerun as well. Since a Jupyter Notebook is based on IPython, there is some additional functionality, such as magic functions.

Jupyter Notebooks are included with Anaconda. They allow plain text to be intermixed with code, and that text can be formatted with a language called Markdown. Markdown is itself written in plain text, and it lets us mark up paragraphs, headings, bold and italic text, and monotype text.

The following screenshot shows a Jupyter Notebook:

As you can see, it runs in a web browser, such as Chrome or Firefox; in this case, Chrome. When we start Jupyter Notebook, we are in a file browser. We are in a newly created directory called Untitled Folder. In Jupyter Notebook there are options for creating new Notebooks, text files, and folders. As seen in the preceding screenshot, currently there is no Notebook saved. We will need a Python Notebook, which can be created by selecting the Python option in the New drop-down menu shown in the following screenshot:

When the Notebook has started, we begin with a code block. We can change this code block to a Markdown block, and we can now start entering text.

For example, we can enter a heading. We can also enter plain text along with bold and italics, as shown in the next screenshot:

As you can see, there is some hint of how the rendering will look at the end, but we can actually see the rendering by clicking on the run cell button. If we want to change this, we can double-click on the same cell. Now we're back to plain text editing. Here we add monotype and then click on Run cell again, shown as follows:

On pressing Enter, a new cell is immediately created afterwards. This cell is a Python cell, where we can enter Python code. For example, we can create a variable. We print Hello, world! multiple times, as shown in the next screenshot:

To see what happens when the cell is executed, we simply click on the run cell button. When we pressed Enter, a new cell block was also created. Let's make this cell block a Markdown block. If we want to insert an additional cell, we can press Insert cell below. In the first cell, we're going to enter some code, and in the second cell, we can enter code that depends on the code in the first cell. Notice what happens when we try to execute the code in the second cell before executing the code in the first. An error will be produced, shown as follows:

The complaint is that the variable trigger has not been defined. In order for the second cell to work, we need to run the first cell. Then, when we run the second cell, we get the expected output. Now let's suppose we were to change the code in the first cell; say, instead of trigger = False, we have trigger = True. The second cell will not be aware of the change. If we run the second cell again, we get the same output. So we need to run the first cell again, thus effecting the change; then we can run the second cell and get the expected output.
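The dependency between cells can be reproduced in plain Python. The following is a minimal sketch, not the code from the screenshots: the dictionary kernel_state stands in for the variables the kernel remembers, and the strings first_cell and second_cell are hypothetical cell contents chosen for illustration.

```python
# A dictionary standing in for the variables a kernel keeps between cells.
kernel_state = {}

# Hypothetical cell contents; the second cell depends on the first.
first_cell = "trigger = False"
second_cell = "message = 'on' if trigger else 'off'"

# Running the second cell first fails because trigger is not defined yet.
try:
    exec(second_cell, kernel_state)
except NameError as err:
    print("Error:", err)

# Running the cells in order works.
exec(first_cell, kernel_state)
exec(second_cell, kernel_state)
print(kernel_state["message"])

# Editing the first cell's source has no effect until that cell is rerun.
first_cell = "trigger = True"
exec(second_cell, kernel_state)
print(kernel_state["message"])

# Rerunning the first cell, then the second, picks up the change.
exec(first_cell, kernel_state)
exec(second_cell, kernel_state)
print(kernel_state["message"])
```

Emptying kernel_state would play the role of restarting the kernel: every name defined so far would be forgotten.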

What has happened in the background? What's going on is that there is a kernel, which is basically a running session of Python, tracking all of our variables and everything that has happened up to this point. If we click on Kernel, we can see an option to restart the kernel; this will basically restart our session of Python. We are initially warned that by restarting the kernel, all variables will be lost.

When the kernel has been restarted, it doesn't appear as if anything has changed, but if we run the second cell, an error will be produced because the variable trigger does not exist. We will need to run the previous cell first in order for this cell to work. If, instead of merely restarting the kernel, we want to restart it and also rerun all cells, we need to click on Restart & Run All. After restarting the kernel, all cell blocks will be rerun. It may not appear as if anything has happened, but we have started from the first cell, run it, run the second cell, and then run the third cell, shown as follows:

We can also import libraries. For example, we can import a module from Matplotlib. In this case, in order for Matplotlib to work interactively in a Jupyter Notebook, we will need to use what's called a magic function, which begins with a %, the name of the magic function, and any sort of parameters we need to pass to it. We'll cover these in more detail later, but first let's run that cell block. plt has now been loaded, and now we can use it. For example, in this last cell, we will type in the following code:

Notice that the output from this cell is inserted directly into the document. We can immediately see the plot that was created. Returning to magic functions, this is not the only function that we have available. Let's see some other functions:

The magic function %magic will print information about the magic system, as shown in the following screenshot:

Output of "magic" command

Another useful function is %timeit, which we can use to time code. We first type in %timeit and then the code that we wish to time, shown as follows:
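Outside of IPython, the standard library's timeit module offers comparable timing. The following is a minimal sketch; the statement being timed, sum(range(1000)), and the repetition count are illustrative choices and are not taken from the book.

```python
import timeit

# Time a small expression; number is how many times the statement is executed.
elapsed = timeit.timeit("sum(range(1000))", number=1000)
print(f"1000 runs took {elapsed:.4f} seconds")
```

The returned value is the total time for all repetitions, so dividing by the repetition count gives a per-run estimate.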

The magic function %pwd can be used to see what the working directory is, shown as follows:

The magic function %cd can be used to change the working directory, shown as follows:
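In ordinary Python code, where magic functions are not available, the os module provides comparable operations. This is a hedged sketch rather than what the magics do internally; the temporary directory is only there so that the example has somewhere safe to move to.

```python
import os
import tempfile

start_dir = os.getcwd()          # comparable to asking for the working directory
print("Started in:", start_dir)

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)                # comparable to changing the working directory
    inside = os.getcwd()
    print("Now in:", inside)
    os.chdir(start_dir)          # go back before the temporary directory is removed

print("Back in:", os.getcwd())
```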

The magic function %pylab is useful if we wish to start both Matplotlib and NumPy in interactive mode, shown as follows:

If we wish to see a list of available magic functions, we can type %lsmagic, shown as follows:

And if we wish for a quick reference sheet, we can use the magic function %quickref, shown as follows:

Now that we're done with this Notebook, let's give it a name. Let's simply call it My Notebook. This is done by clicking on the name of the Notebook at the top of the editor pane. Finally, you can save, and after saving, you can close and halt the Notebook. So this will close the Notebook and halt the Notebook's kernel. That would be the clean way to leave the Notebook. Notice now, in our tree, we can see the directory where the Notebook was saved, and we can see that the Notebook exists in that directory. It is an ipynb document.

Exploring alternatives to Jupyter

Now we will consider alternatives to Jupyter Notebooks. We will look at:

Jupyter QT Console
Spyder
Rodeo
Python interpreter
ptpython

The first alternative we will consider is the Jupyter QT Console; this is a Python interpreter with added functionality, aimed specifically at data analysis.

The following screenshot shows the Jupyter QT Console:

It is very similar to the Jupyter Notebook. In fact, it is effectively the Console version of the Jupyter Notebook. Notice here that we have some interesting syntax. We have In [1], and then let's suppose you were to type in a command, for example:

print("Hello, world!")

We see some output and then we see In [2].

Now let's try something else:

1 + 1

Right after In [2], we see Out[2]. What does this mean? This is a way to track historical commands and their outputs in a session. To access, say, the command for In [42], we type _i42. So, in this case, if we want to see the input for command 2, we type in _i2. Notice that it gives us a string, 1 + 1. In fact, we can run this string.

If we type in eval(_i2), notice that it gives us the same output as the original command, In [2], did. Now, how about Out[2]? How can we access the actual output? In this case, we type _ followed by the number of the output, so _2. This should give us 2. So this gives you a more convenient way to access historical commands and their outputs.

Another advantage of the Jupyter QT Console is that it can display images. For example, let's get Matplotlib running. First we're going to import Matplotlib with the following command:

import matplotlib.pyplot as plt

After we've imported Matplotlib, recall that we need to run a certain magic, the Matplotlib magic:

%matplotlib inline

We need to give it the inline parameter, and now we can create a Matplotlib figure. Notice that the image shows up right below the command. When we type in _8, it shows that a Matplotlib object was created, but it does not actually show the plot itself. As you can see, we can use the Jupyter console in a more advanced way than the typical Python console. For example, let's work with a dataset called Iris; import it using the following line:

from sklearn.datasets import load_iris

This is a very common dataset used in data analysis. It's often used as a way to evaluate training models. We will also use k-means clustering on this:

from sklearn.cluster import KMeans

The load_iris function isn't actually the Iris dataset; it is a function that we can use to get the Iris dataset. The following command will actually give us access to that dataset:

iris = load_iris()

Now we will train a k-means clustering scheme on this dataset:

iris_clusters = KMeans(n_clusters=3, init="random").fit(iris.data)

We can see the documentation right away when we're typing in a function. For example, I can see what the n_clusters parameter means; what is displayed is the original docstring from the function. Here, I want the number of clusters to be 3, because I know that there are actually three real clusters in this dataset. Now that a clustering scheme has been trained, we can plot it using the following code:

plt.scatter(iris.data[:, 0], iris.data[:, 1], c=iris_clusters.labels_)

Spyder

Spyder is an IDE, unlike the Jupyter Notebook or the Jupyter QT Console. It integrates NumPy, SciPy, Matplotlib, and IPython. It is extensible with plugins, and it is included with Anaconda.

The following screenshot shows Spyder, an actual IDE intended for data analysis and scientific computing:

Spyder Python 3.6

On the right, you can go to File explorer to search for new files to load. Here, we want to open up iris_kmeans.py. This is a file that contains all the commands that we used before in the Jupyter QT Console. Notice on the right that the editor has a console; that is in fact the IPython console, which you saw as the Jupyter QT Console. We can run this entire file by clicking on the Run tab. It will run in the console, shown as follows:

The following screenshot will be the output:

Notice that at the end we see the result of the clustering that we saw before. We can type in commands interactively as well; for example, we can make our computer say Hello, world!.

In the editor, let's define a new variable, say n = 5. Now let's run this file from the editor. Notice that n is now a variable that the console is aware of. Now let's make a change in the editor, say n = 6. Unless we actually run the file again, the console will be unaware of the change. So if I type n in the console again, nothing changes, and it's still 5. You would need to run this line in order to actually see a change.

We also have a variable explorer where we can see the values of variables and change them. For example, I can change the value of n from 6 to 10, shown as follows:

The following screenshot shows the output:

Then, when I go to the console and ask what n is, it will say 10:

n
10

That concludes our discussion of Spyder.

Rodeo

Rodeo is a Python IDE developed by Yhat, and is intended for data analysis applications exclusively. It is intended to emulate the RStudio IDE, which is popular among R users, and it can be downloaded from Rodeo's website.

The only advantage of the base Python interpreter is that every Python installation includes it, shown as follows:

ptpython

A lesser-known console-based Python REPL is ptpython, designed by Jonathan Slenders. It exists only in the console and is an independent project by him. You can find it on GitHub. It is lightweight, yet it includes syntax highlighting, autocompletion, and even IPython integration. It can be installed with the following command:

pip install ptpython

That concludes our discussion of alternatives to Jupyter Notebooks.

Package management with Conda

We will now discuss package management with Conda. In this section, we're going to take a look at the following topics:

What is Conda?
Managing Conda environments
Managing Python with Conda
Managing packages with Conda

What is Conda?

So what is Conda? Conda is the Anaconda package manager. Conda allows us to create and manage multiple environments, allowing multiple versions of Python, R, and their relevant packages to exist. This can be very useful if you need to develop for different systems with different versions of Python and their packages. Conda allows you to manage Python and R versions, and it also facilitates installation and management of packages.

Conda environment management

A Conda environment allows developers to use and manage different versions of Python and its packages. This can be useful for testing and development on legacy systems. Environments can be saved, cloned, and exported so that others can replicate results.

Here are some common environment management commands.

For environment creation:

conda create --name env_name prog1 prog2
conda create --name env_name python=3 prog3

For listing environments:

conda env list

To verify the environment:

conda info --envs

To clone the environment:

conda create --name new_env --clone old_env

To remove environments:

conda remove --name env_name --all

Users can share environments by creating a YAML file, which recipients can use to construct an identical environment. You can do this by hand, where you effectively replicate what Anaconda would make, but it is much easier to have Anaconda create a YAML file for you.

After you have created such a file, or if you've received this file from another user, it is very easy to create a new environment.
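As a hedged sketch of how sharing through a YAML file might be scripted from Python, the helpers below build the conda env export and conda env create commands and can run the export through subprocess. The function names, the environment name env_name, and the file name environment.yml are illustrative assumptions; the sketch only prints the commands and does not call Conda unless export_environment is invoked.

```python
import shutil
import subprocess


def export_command(env_name):
    """Build the command that prints an environment's YAML description."""
    return ["conda", "env", "export", "--name", env_name]


def create_command(yaml_path):
    """Build the command that creates an environment from a YAML file."""
    return ["conda", "env", "create", "--file", yaml_path]


def export_environment(env_name, yaml_path):
    """Write the YAML description of env_name to yaml_path (requires conda)."""
    with open(yaml_path, "w") as handle:
        subprocess.run(export_command(env_name), stdout=handle, check=True)


# Show the commands rather than running them, so the sketch works without conda.
if shutil.which("conda") is None:
    print("conda was not found on PATH; the commands would be:")
print(" ".join(export_command("env_name")), "> environment.yml")
print(" ".join(create_command("environment.yml")))
```

The recipient of environment.yml would run the second command to build a matching environment.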

Managing Python

As mentioned earlier, Anaconda allows you to manage multiple versions of Python. It is possible to search and see which versions of Python are available for installation. You can verify which version of Python is in an environment, and you can even create environments for Python 2.7. You can also update the version of Python that is in a current environment.
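One way to verify the Python version from inside an activated environment is to ask the interpreter itself. This is a minimal sketch using only the standard library; it reports on whichever interpreter runs it and does not involve Conda directly.

```python
import platform
import sys

# Full version string of the interpreter running this code.
print(platform.python_version())

# The major and minor numbers are convenient for quick checks.
major, minor = sys.version_info[:2]
print("Running Python {}.{}".format(major, minor))
```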

Package management

Let's suppose that we're interested in installing the package selenium, which is a package that is used for web scraping and also web testing. We can list the packages that are currently installed, and we can give the command to install a new package.

First, we should search to see whether the package is available from the Conda system. Not all packages that are available on pip are available from Conda. That said, it is still possible to install a package from pip. Ideally, though, the package is available from Conda, in which case we can install it with the following command:

conda install selenium

If selenium is the package we're interested in, it can be downloaded automatically from the internet, unless you have a file that Anaconda can install directly from your system.

To install packages via pip, use the following:

pip install package_name

Packages, of course, can be removed as follows:

conda remove selenium
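After installing or removing a package, it can be handy to confirm the result from Python. The helper below is a hypothetical convenience function, not part of Conda or pip; it uses importlib.util.find_spec to check whether a top-level module can be found in the current environment.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# json ships with Python, so it should always be found.
print("json:", is_installed("json"))
# The result for selenium depends on whether it has been installed.
print("selenium:", is_installed("selenium"))
```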

Setting up a database

We'll now begin discussing setting up a database for you to use. In this section, we're going to look at the following topics:

Installing MySQL
Installing MySQL connector for Python
Creating, using, and deleting databases

MySQL connector is necessary in order to use MySQL with Python. There are many SQL database implementations in existence, and while MySQL may not be the simplest database management system, it is full-featured, it is industrial-strength, it is commonly seen in real world situations, and furthermore, it is free and open source, which means it's an excellent tool to learn on. You can obtain the MySQL Community Edition, which is the free and open source version, from MySQL's website (go to https://dev.mysql.com/downloads/).

Installing MySQL

For Linux systems, if it's possible, I recommend that you install MySQL using whatever package management system is available to you. For example, use YUM on a Red Hat-based distribution, APT on a Debian-based distribution, or SUSE's repository system. If you do not have a package management system, you may need to install MySQL from source.

Windows users can install MySQL directly from MySQL's website. You should also be aware that MySQL comes in 32-bit and 64-bit binaries, but whatever program you download will likely install the correct version for your system.

Here is the web page from where you can download MySQL for Windows:

I recommend that you use the MySQL Installer. Scroll down, and when you're looking for which binary to download, be aware that this first binary says web community. This is going to be an installer that downloads MySQL from the internet as you're doing the installation. Notice that it's much smaller than the other binary. It basically includes everything you need in order to be able to install MySQL. This would be the one I would recommend you download if you're following along.

There are generally available releases; these should be stable. Next to the generally available releases tab are the development releases; I recommend that you do not download these unless you know what you're doing.

MySQL connectors

MySQL functions like a driver on your system, and other applications interact with MySQL as if it were a driver. So, you will need to download a MySQL connector in order to be able to use MySQL with Python. This will allow Python to communicate with MySQL. What you will end up doing is loading in a package, and you will start up a connection with MySQL. The Python connector can be downloaded from MySQL's website (go to https://dev.mysql.com/downloads/connector/).

This web page is the same for every operating system, so you will need to select the appropriate platform, such as Linux, OS X, or Windows. You'll need to select and download the installer that best matches your system's architecture (32-bit or 64-bit) and your version of Python. You will then use the install wizard to install it on your system.

Here is the page for downloading and installing the connector:

Notice that we can choose here which platform is appropriate. We even have platform-independent and source code versions. It may also be possible to install this using a package management system, such as APT if you're using a Debian-based system such as Ubuntu, or YUM if you're using a Red Hat-based system, and so on. We have many different installers, so we will need to be aware of which version of Python we're using. It is recommended that you use the version that is closest to the one that is actually being used in your project. You'll also need to choose between 32-bit and 64-bit. Then you click on download and follow the instructions of the installer.
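Once the connector is installed, loading the package and starting a connection might look like the following sketch. The helper name open_connection and the default host, user, and password values are placeholder assumptions; substitute the credentials of your own MySQL installation.

```python
def open_connection(host="localhost", user="root", password=""):
    """Open a connection to a MySQL server using MySQL Connector/Python."""
    # Imported here so the sketch can be loaded even where the connector is absent.
    import mysql.connector

    return mysql.connector.connect(host=host, user=user, password=password)


if __name__ == "__main__":
    try:
        connection = open_connection()
    except Exception as err:  # missing connector, no server, or bad credentials
        print("Could not connect:", err)
    else:
        print("Connected:", connection.is_connected())
        connection.close()
```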

So, database management is a major topic; to go into everything about database management would take us well beyond the scope of this book. We're not going to talk about how a good database is designed; I recommend that you go to another resource, perhaps another Packt product that would explain these topics, because they are important. Regarding SQL, we will tell you only the commands that you need to use SQL at a basic level. There's also no discussion on permissions, so we're going to assume that your database gives full permission to whichever user is using it, and there's only one user at a time.

Creating a database

After installing MySQL, we can create a database from the MySQL command line with the following command, where database_name is the name of the new database:

create database database_name;

Every command must be ended by a semicolon; otherwise, MySQL will keep waiting for more input until the command is terminated.

You can see all available databases with this command:

show databases;

We can specify which database we want to use with the following command:

use database_name;

If we wish to delete a database, we can do so with the following command:

drop database database_name;

Here is the MySQL command line:

Let's practice managing databases. We can create a database with the following command:

create database mydb;

To see all databases, we can use this command:

show databases;

There are multiple databases here, some of which are from other projects, but as you can see, the database mydb, which we just created, is shown as follows:

If we want to use this database, we can issue the command use mydb;. MySQL says the database has been changed. What this means is that when I issue commands such as creating tables, reading from tables, or adding new data, all of this will be done with the database mydb.

Let's say we want to delete the database mydb; we can do so with the following command:

drop database mydb;

This will delete the database.
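The same sequence of statements could also be issued from Python through a database cursor. The following is a sketch under stated assumptions: practice_databases is a hypothetical helper that accepts any cursor offering execute and fetchall, and RecordingCursor is a made-up stand-in that only records statements so the example runs without a MySQL server. With a real connection, the cursor would come from the connector instead.

```python
def practice_databases(cursor, name="mydb"):
    """Create, list, select, and drop a database through a cursor."""
    # The name is inserted directly into the SQL text, so use trusted names only.
    cursor.execute(f"CREATE DATABASE {name}")
    cursor.execute("SHOW DATABASES")
    databases = [row[0] for row in cursor.fetchall()]
    cursor.execute(f"USE {name}")
    cursor.execute(f"DROP DATABASE {name}")
    return databases


class RecordingCursor:
    """Hypothetical stand-in for a real cursor; it only records statements."""

    def __init__(self):
        self.statements = []

    def execute(self, statement):
        self.statements.append(statement)

    def fetchall(self):
        # Canned rows in the shape a real cursor would return.
        return [("information_schema",), ("mydb",)]


cursor = RecordingCursor()
found = practice_databases(cursor)
print(found)
print(cursor.statements)
```

Statements sent through a cursor do not need the terminating semicolon that the MySQL command line expects.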

