The Statistics and Calculus with Python Workshop

2. Python's Main Tools for Statistics

Overview

This chapter presents a practical introduction to the main libraries that most statistics practitioners use in Python. It will cover some of the most important and useful concepts, functions, and Application Programming Interfaces (APIs) of each of the key libraries. Almost all of the computational tools that will be needed for the rest of this book will be introduced in this chapter.

By the end of this chapter, you will understand the idea behind array vectorization of the NumPy library and be able to use its sampling functionalities. You'll be able to initialize pandas DataFrames to represent tabular data and manipulate their content. You'll also understand the importance of data visualization in data analysis and be able to utilize Python's two most popular visualization libraries: Matplotlib and Seaborn.

Scientific Computing and NumPy Basics

The term scientific computing has been used several times in this workshop so far; in the broadest sense of the term, it denotes the process of using computer programs (or anything with computing capabilities) to model and solve a specific problem in mathematics, engineering, or science. Examples may include mathematical models to look for and analyze patterns and trends in biological and social data, or machine learning models to make future predictions using economic data. As you may have already noticed, this definition has a significant overlap with the general fields of data science, and sometimes the terms are even used interchangeably.

The main workhorse of many (if not most) scientific computing projects in Python is the NumPy library. Since NumPy is an external library that does not come preinstalled with Python, we need to download and install it. As you may already know, installing external libraries and packages in Python can be done easily using package managers such as pip or Anaconda.

From your Terminal, run the following command to use pip to install NumPy in your Python environment:

$ pip install numpy

If you are currently in an Anaconda environment, you can run the following command instead:

$ conda install numpy

With these simple commands, all the necessary steps in the installation process are taken care of for us.

Some of NumPy's most powerful capabilities include vectorized, multi-dimensional array representations of objects; implementation of a wide range of linear algebraic functions and transformations; and random sampling. We will cover all of these topics in this section, starting with the general concept of arrays.

NumPy Arrays

We have actually already come across the concept of an array in the previous chapter, when we discussed Python lists. In general, an array is also a sequence of different elements that can be accessed individually or manipulated as a whole. As such, NumPy arrays are very similar to Python lists; in fact, the most common way to declare a NumPy array is to pass a Python list to the numpy.array() method, as illustrated here:

>>> import numpy as np
>>> a = np.array([1, 2, 3])
>>> a
array([1, 2, 3])
>>> a[1]
2

The biggest difference we need to keep in mind is that elements in a NumPy array need to be of the same type. For example, here, we are trying to create an array with two numbers and a string, which causes NumPy to forcibly convert all elements in the array into strings (the <U21 data type denotes the Unicode strings with fewer than 21 characters):

>>> b = np.array([1, 2, 'a'])
>>> b
array(['1', '2', 'a'], dtype='<U21')

Similar to the way we can create multi-dimensional Python lists, NumPy arrays support the same option:

>>> c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> c
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Note

While working with NumPy, we often refer to multi-dimensional arrays as matrices.

Apart from initialization from Python lists, we can create NumPy arrays that are in a specific form. In particular, a matrix full of zeros or ones can be initialized using np.zeros() and np.ones(), respectively, with a given dimension and data type. Let's have a look at an example:

>>> zero_array = np.zeros((2, 2))  # 2 by 2 zero matrix
>>> zero_array
array([[0., 0.],
       [0., 0.]])

Here, the tuple (2, 2) specifies that the array (or matrix) being initialized should have a two-by-two dimension. As we can see by the dots after the zeros, the default data type of a NumPy array is a float and can be further specified using the dtype argument:

>>> one_array = np.ones((2, 2, 3), dtype=int)  # 3D one integer matrix
>>> one_array
array([[[1, 1, 1],
        [1, 1, 1]],
        [[1, 1, 1],
        [1, 1, 1]]])

All-zero or all-one matrices are common objects in mathematics and statistics, so these API calls will prove to be quite useful later on. Now, let's look at a common matrix object whose elements are all random numbers. Using np.random.rand(), we can create a matrix of a given shape, whose elements are uniformly sampled between 0 (inclusive) and 1 (exclusive):

>>> rand_array = np.random.rand(2, 3)
>>> rand_array
array([[0.90581261, 0.88732623, 0.291661  ],
       [0.44705149, 0.25966191, 0.73547706]])

Notice here that we are not passing the desired shape of our matrix as a tuple anymore, but as individual parameters of the np.random.rand() function instead.

If you are not familiar with the concept of randomness and random sampling from various distributions, don't worry, as we will cover that topic later on in this chapter as well. For now, let's move forward with our discussion about NumPy arrays, particularly about indexing and slicing.

You will recall that in order to access individual elements in a Python list, we pass its index inside square brackets next to the list variable; the same goes for one-dimensional NumPy arrays:

>>> a = np.array([1, 2, 3])
>>> a[0]
1
>>> a[1]
2

However, when an array is multi-dimensional, instead of using multiple square brackets to access subarrays, we simply need to separate the individual indices using commas. For example, we access the element in the second row and the second column of a three-by-three matrix as follows:

>>> b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> b
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> b[1, 1]
5

Slicing NumPy arrays can be done in the same way: using commas. This syntax is very useful in terms of helping us access submatrices with more than one dimension in a matrix:

>>> a = np.random.rand(2, 3, 4)  # random 2-by-3-by-4 matrix
>>> a
array([[[0.54376986, 0.00244875, 0.74179644, 0.14304955],
        [0.77229612, 0.32254451, 0.0778769 , 0.2832851 ],
        [0.26492963, 0.5217093 , 0.68267418, 0.29538502]],
       [[0.94479229, 0.28608588, 0.52837161, 0.18493272],
        [0.08970716, 0.00239815, 0.80097454, 0.74721516],
        [0.70845696, 0.09788526, 0.98864408, 0.82521871]]])
>>> a[1, 0: 2, 1:]
array([[0.28608588, 0.52837161, 0.18493272],
       [0.00239815, 0.80097454, 0.74721516]])

In the preceding example, a[1, 0: 2, 1:] helps us to access the numbers in the original matrix, a; that is, in the second element in the first axis (corresponding to index 1), the first two elements in the second axis (corresponding to 0: 2), and the last three elements in the third axis (corresponding to 1:). This option is one reason why NumPy arrays are more powerful and flexible than Python lists, which do not support multi-dimensional indexing and slicing, as we have demonstrated.

Finally, another important syntax to manipulate NumPy arrays is the np.reshape() function, which, as its name suggests, changes the shape of a given NumPy array. The need for this functionality can arise on multiple occasions: when we need to display an array in a certain way for better readability, or when we need to pass an array to a built-in function that only takes in arrays of a certain shape.

We can explore the effect of this function in the following code snippet:

>>> a
array([[[0.54376986, 0.00244875, 0.74179644, 0.14304955],
        [0.77229612, 0.32254451, 0.0778769 , 0.2832851 ],
        [0.26492963, 0.5217093 , 0.68267418, 0.29538502]],
       [[0.94479229, 0.28608588, 0.52837161, 0.18493272],
        [0.08970716, 0.00239815, 0.80097454, 0.74721516],
        [0.70845696, 0.09788526, 0.98864408, 0.82521871]]])
>>> a.shape 
(2, 3, 4)
>>> np.reshape(a, (3, 2, 4))
array([[[0.54376986, 0.00244875, 0.74179644, 0.14304955],
        [0.77229612, 0.32254451, 0.0778769 , 0.2832851 ]],
       [[0.26492963, 0.5217093 , 0.68267418, 0.29538502],
        [0.94479229, 0.28608588, 0.52837161, 0.18493272]],
       [[0.08970716, 0.00239815, 0.80097454, 0.74721516],
        [0.70845696, 0.09788526, 0.98864408, 0.82521871]]])

Note that the np.reshape() function does not mutate the array that is passed in-place; instead, it returns a copy of the original array with the new shape without modifying the original. We can also assign this returned value to a variable.

Additionally, notice that while the original shape of the array is (2, 3, 4), we changed it to (3, 2, 4). This can only be done when the total numbers of elements resulting from the two shapes are the same (2 x 3 x 4 = 3 x 2 x 4 = 24). An error will be raised if the new shape does not correspond to the original shape of an array in this way, as shown here:

>>> np.reshape(a, (3, 3, 3))
-------------------------------------------------------------------------
ValueError                          Traceback (most recent call last)
...
ValueError: cannot reshape array of size 24 into shape (3,3,3)

Speaking of reshaping a NumPy array, transposing a matrix is a special form of reshaping that flips the elements in the matrix along its diagonal. Computing the transpose of a matrix is a common task in mathematics and machine learning. The transpose of a NumPy array can be computed using the [array].T syntax. For example, when we run a.T in the Terminal, we get the transpose of matrix a, as follows:

>>> a.T
array([[[0.54376986, 0.94479229],
       [0.77229612, 0.08970716],
        [0.26492963, 0.70845696]],
       [[0.00244875, 0.28608588],
        [0.32254451, 0.00239815],
        [0.5217093 , 0.09788526]],
       [[0.74179644, 0.52837161],
        [0.0778769 , 0.80097454],
        [0.68267418, 0.98864408]],
       [[0.14304955, 0.18493272],
        [0.2832851 , 0.74721516],
        [0.29538502, 0.82521871]]])

And with that, we can conclude our introduction to NumPy arrays. In the next section, we will learn about another concept that goes hand in hand with NumPy arrays: vectorization.

Vectorization

In the broadest sense, the term vectorization in computer science denotes the process of applying a mathematical operation to an array (in a general sense) element by element. For example, an add operation where every element in an array is added to the same term is a vectorized operation; the same goes for vectorized multiplication, where all elements in an array are multiplied by the same term. In general, vectorization is achieved when all array elements are put through the same function.

Vectorization is done by default when an applicable operation is performed on a NumPy array (or multiple arrays). This includes binary functions such as addition, subtraction, multiplication, division, power, and mod, as well as several unary built-in functions in NumPy, such as absolute value, square root, trigonometric functions, logarithmic functions, and exponential functions.

Before we see vectorization in NumPy in action, it is worth discussing the importance of vectorization and its role in NumPy. As we mentioned previously, vectorization is generally the application of a common operation on the elements in an array. Due to the repeatability of the process, a vectorized operation can be optimized to be more efficient than its alternative implementation in, say, a for loop. However, the trade-off for this capability is that the elements in the array would need to be of the same data type—this is also a requirement for any NumPy array.

With that, let's move on to the following exercise, where we will see this effect in action.

Exercise 2.01: Timing Vectorized Operations in NumPy

In this exercise, we will calculate the speedup achieved by implementing various vectorized operations such as addition, multiplication, and square root calculation with NumPy arrays compared to a pure Python alternative without vectorization. To do this, perform the following steps:

In the first cell of a new Jupyter notebook, import the NumPy package and the Timer class from the timeit library. The latter will be used to implement our timing functionality:
```
import numpy as np
from timeit import Timer
```
In a new cell, initialize a Python list containing numbers ranging from 0 (inclusive) to 1,000,000 (exclusive) using the range() function, as well as its NumPy array counterpart using the np.array() function:
```
my_list = list(range(10 ** 6))
my_array = np.array(my_list)
```
We will now apply mathematical operations to this list and array in the following steps. In a new cell, write a function named for_add() that returns a list whose elements are the elements in the my_list variable with 1 added to each (we will use list comprehension for this). Write another function named vec_add() that returns the NumPy array version of the same data, which is simply my_array + 1:
```
def for_add():
    return [item + 1 for item in my_list]
def vec_add():
    return my_array + 1
```
In the next code cell, initialize two Timer objects while passing in each of the preceding two functions. These objects contain the interface that we will use to keep track of the speed of the functions.
Call the repeat() function on each of the objects with the arguments 10 and 10—in essence, we are repeating the timing experiment by 100 times. Finally, as the repeat() function returns a list of numbers representing how much time passed in each experiment for a given function we are recording, we print out the minimum of this list. In short, we want the time of the fastest run of each of the functions:
```
print('For-loop addition:')
print(min(Timer(for_add).repeat(10, 10)))
print('Vectorized addition:')
print(min(Timer(vec_add).repeat(10, 10)))
```
The following is the output that this program produced:
```
For-loop addition:
0.5640330809999909
Vectorized addition:
0.006047582000007878
```
While yours might be different, the relationship between the two numbers should be clear: the speed of the for loop addition function should be many times lower than that of the vectorized addition function.

In the next code cell, implement the same comparison of speed where we multiply the numbers by 2. For the NumPy array, simply return my_array * 2:

def for_mul():
    return [item * 2 for item in my_list]
def vec_mul():
    return my_array * 2
print('For-loop multiplication:')
print(min(Timer(for_mul).repeat(10, 10)))
print('Vectorized multiplication:')
print(min(Timer(vec_mul).repeat(10, 10)))

Verify from the output that the vectorized multiplication function is also faster than the for loop version. The output after running this code is as follows:

For-loop multiplication: 0.5431750800000259
Vectorized multiplication: 0.005795304000002943

In the next code cell, implement the same comparison where we compute the square root of the numbers. For the Python list, import and use the math.sqrt() function on each element in the list comprehension. For the NumPy array, return the expression np.sqrt(my_array):
```
import math
def for_sqrt():
    return [math.sqrt(item) for item in my_list]
def vec_sqrt():
    return np.sqrt(my_array)
print('For-loop square root:')
print(min(Timer(for_sqrt).repeat(10, 10)))
print('Vectorized square root:')
print(min(Timer(vec_sqrt).repeat(10, 10)))
```
Verify from the output that the vectorized square root function is once again faster than its for loop counterpart:
```
For-loop square root:
1.1018582749999268
Vectorized square root:
0.01677640299999439
```
Also, notice that the np.sqrt() function is implemented to be vectorized, which is why we were able to pass the whole array to the function.

This exercise introduced a few vectorized operations for NumPy arrays and demonstrated how much faster they are compared to their pure Python loop counterparts.

Note

To access the source code for this specific section, please refer to https://packt.live/38l3Nk7.

You can also run this example online at https://packt.live/2ZtBSdY.

That concludes the topic of vectorization in NumPy. In the next and final section on NumPy, we'll discuss another powerful feature that the package offers: random sampling.

Random Sampling

In the previous chapter, we saw an example of how to implement randomization in Python using the random library. However, the randomization in most of the methods implemented in that library is uniform, and in scientific computing and data science projects, sometimes, we need to draw samples from distributions other than the uniform one. This area is where NumPy once again offers a wide range of options.

Generally speaking, random sampling from a probability distribution is the process of selecting an instance from that probability distribution, where elements having a higher probability are more likely to be selected (or drawn). This concept is closely tied to the concept of a random variable in statistics. A random variable is typically used to model some unknown quantity in a statistical analysis, and it usually follows a given distribution, depending on what type of data it models. For example, the ages of members of a population are typically modeled using the normal distribution (also known as the bell curve or the Gaussian distribution), while the arrivals of customers to, say, a bank are often modeled using the Poisson distribution.

By randomly sampling a given distribution that is associated with a random variable, we can obtain an actual realization of the variable, from which we can perform various computations to obtain insights and inferences about the random variable in question.

We will revisit the concept and usage of probability distributions later in this book. For now, let's simply focus on the task at hand: how to draw samples from these distributions. This is done using the np.random package, which includes the interface that allows us to draw from various distributions.

For example, the following code snippet initializes a sample from the normal distribution (note that your output might be different from the following due to randomness):

>>> sample = np.random.normal()
>>> sample
-0.43658969989465696

You might be aware of the fact that the normal distribution is specified by two statistics: a mean and a standard deviation. These can be specified using the loc (whose default value is 0.0) and scale (whose default value is 1.0) arguments, respectively, in the np.random.normal() function, as follows:

>>> sample = np.random.normal(loc=100, scale=10)
>>> sample
80.31187658687652

It is also possible to draw multiple samples, as opposed to just a single sample, at once as a NumPy array. To do this, we specify the size argument of the np.random.normal() function with the desired shape of the output array. For example, here, we are creating a 2 x 3 matrix of samples drawn from the same normal distribution:

>>> samples = np.random.normal(loc=100, scale=10, size=(2, 3))
>>> samples
array([[ 82.7834678 , 109.16410976, 101.35105681],
       [112.54825751, 107.79073472,  77.70239823]])

This option allows us to take the output array and potentially apply other NumPy-specific operations to it (such as vectorization). The alternative is to sequentially draw individual samples into a list and convert it into a NumPy array afterward.

It is important to note that each probability distribution has its own statistic(s) that define it. The normal distribution, as we have seen, has a mean and a standard deviation, while the aforementioned Poisson distribution is defined with a λ (lambda) parameter, which is interpreted as the expectation of interval. Let's see this in an example:

>>> samples = np.random.poisson(lam=10, size=(2, 2))
>>> samples
array([[11, 10],
       [15, 11]])

Generally, before drawing a sample from a probability distribution in NumPy, you should always look up the corresponding documentation to see what arguments are available for that specific distribution and what their default values are.

Aside from probability distribution, NumPy also offers other randomness-related functionalities that can be found in the random module. For example, the np.random.randint() function returns a random integer between two given numbers; np.random.choice() randomly draws a sample from a given one-dimensional array; np.random.shuffle(), on the other hand, randomly shuffles a given sequence in-place.

These functionalities, which are demonstrated in the following code snippet, offer a significant degree of flexibility in terms of working with randomness in Python in general, and specifically in scientific computing:

>>> np.random.randint(low=0, high=10, size=(2, 5))
array([[6, 4, 1, 3, 6],
       [0, 8, 8, 8, 8]])
>>> np.random.choice([1, 3, 4, -6], size=(2, 2))
array([[1, 1],
       [1, 4]])
>>> a = [1, 2, 3, 4]
>>> for _ in range(3):
...        np.random.shuffle(a)
...        print(a)
[4, 1, 3, 2]
[4, 1, 2, 3]
[1, 2, 4, 3]

A final important topic that we need to discuss whenever there is randomness involved in programming is reproducibility. This term denotes the ability to obtain the same result from a program in a different run, especially when there are randomness-related elements in that program.

Reproducibility is essential when a bug exists in a program but only manifests itself in certain random cases. By forcing the program to generate the same random numbers every time it executes, we have another way to narrow down and identify this kind of bug aside from unit testing.

In data science and statistics, reproducibility is of the utmost importance. Without a program being reproducible, it is possible for one researcher to find a statistically significant result while another is unable to, even when the two have the same code and methods. This is why many practitioners have begun placing heavy emphasis on reproducibility in the fields of data science and machine learning.

The most common method to implement reproducibility (which is also the easiest to program) is to simply fix the seed of the program (specifically its libraries) that utilizes randomness. Fixing the seed of a randomness-related library ensures that the same random numbers will be generated across different runs of the same program. In other words, this allows for the same result to be produced, even if a program is run multiple times on different machines.

To do this, we can simply pass an integer to the appropriate seed function of the library/package that produces randomness for our programs. For example, to fix the seed for the random library, we can write the following code:

>>> import random
>>> random.seed(0)  # can use any other number

For the random package in NumPy, we can write the following:

>>> np.random.seed(0)

Setting the seed for these libraries/packages is generally a good practice when you are contributing to a group or an open source project; again, it ensures that all members of the team are able to achieve the same result and eliminates miscommunication.

This topic also concludes our discussion of the NumPy library. Next, we will move on to another integral part of the data science and scientific computing ecosystem in Python: the pandas library.

Filter reviews by

All

Packt verified reviews

Amazon verified reviews

Alexander Climaco Clark Jan 06, 2021

This gives a good thorough review of several sides to data science. Each lesson has fantastic exercises that walks through problems that illustrates aspects to help the reader to follow and learn by participating. From python models and statistics to the tools that make them easier, this book gives a robust learn set that can be used by people of all levels; whether learning to go into data science or just refreshing your skills, this will give you a challenge to improve overall knowledge. If adding this to your cart, I recommend reading ‘Naked Statistics’ as well. The two do well on process and understanding of data science and give you ways to explain projects and results in your job.

Amazon Verified review

Tulie Jan 11, 2021

I love this book; it is well-writen. I started reading the chapters but had surgery. I plan on going back to finish reading the other chapters as soon as I get better. Buying this book will be a good investment..

Marleen Jan 21, 2021

If you are looking for various exercises and examples in regards to Python and Math, get this book. There are so many follow along chapters that - if you do indeed follow along - you will learn a lot. The later chapters might also be worth to look at for already experienced Python users but in general this book is suitable for beginners too. The chapters on algebra, statistics, matrices and probability form the base for advanced analytics and machine learning. However, I like this workshop because it is not the typical "Introduction to ML" book but rather an all-round math book that will help you master various programming and day to day work challenges.

M. Sarverwest Mar 07, 2022

Easy to follow and the examples are great.

Abdulmajeed Muhammad Kabir Dec 25, 2020

The title "The Statistics and Calculus with Python Workshop"is exactly as it says. This book starts from programmingfoundations; Python, and exposes the reader to programmingconstructs such as variables, control statements, loops, andfunctions. It then moves into statistics, probability theory,mathematics, and some basic CS topics that are relevantin the software world e.g. testing and version control. The content andtopics covered would be useful for any prospective datascientist and this book could be a nice companion for early undergrads.

The Statistics and Calculus with Python Workshop: A comprehensive introduction to mathematics in Python for artificial intelligence applications

What do you get with Print?

The Statistics and Calculus with Python Workshop

2. Python's Main Tools for Statistics

Introduction

Scientific Computing and NumPy Basics

NumPy Arrays

Vectorization

Exercise 2.01: Timing Vectorized Operations in NumPy

Random Sampling

Page 1 of 6

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Product Details

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the 6 authors

FAQs

The Statistics and Calculus with Python Workshop: A comprehensive introduction to mathematics in Python for artificial intelligence applications

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the 6 authors

FAQs