Here are a couple of small motivating examples that show why we need vectorization when doing any kind of scientific computing. What we mean by vectorization will become clear as we work through them.
Let's perform a couple of simple calculations with Python. We have two examples:
- First, let's say we have some distances and times and we would like to calculate the speeds:
distances = [10, 15, 17, 26]
times = [0.3, 0.47, 0.55, 1.20]
# Calculate speeds with Python
speeds = []
for i in range(len(distances)):
    speeds.append(distances[i]/times[i])
speeds
Here we have the speeds:
[33.333333333333336,
31.914893617021278,
30.909090909090907,
21.666666666666668]
An alternative, more idiomatic way to accomplish the same thing in Python is the following:
# An alternative
speeds = [d/t for d,t in zip(distances, times)]
- For our second motivating example, let's say we have a list of product quantities and their respective prices and that we would like to calculate the total of the purchase. The code in Python would look something like this:
product_quantities = [13, 5, 6, 10, 11]
prices = [1.2, 6.5, 1.0, 4.8, 5.0]
total = sum([q*p for q,p in zip(product_quantities, prices)])
total
This will give a total of 157.1.
The point of these examples is that, for this type of calculation, we need to perform operations element by element, and in Python (and most programming languages) we do so with for loops or list comprehensions (which are just a convenient way of writing for loops). Vectorization is a style of computer programming where operations are applied to whole arrays instead of individual elements; in other words, a vectorized operation applies the operation element by element without our having to write the for loop explicitly.
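To see why the loop is unavoidable with plain lists, here is a minimal sketch that tries to apply arithmetic operators to the lists directly. The variable names `division_failed`, `repeated`, and `joined` are introduced only for this illustration:

```python
distances = [10, 15, 17, 26]
times = [0.3, 0.47, 0.55, 1.20]

division_failed = False
try:
    speeds = distances / times  # lists do not define element-wise division
except TypeError as error:
    division_failed = True
    print("TypeError:", error)

# The operators that lists do support mean something else entirely
repeated = [1, 2] * 2      # repetition, not multiplication: [1, 2, 1, 2]
joined = [1, 2] + [3, 4]   # concatenation, not addition: [1, 2, 3, 4]
print(repeated, joined)
```

Because list operators work on the list as a container rather than on its elements, element-by-element arithmetic has to be spelled out with a loop or a comprehension.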
Now let's take a look at the NumPy approach to the previous operations:
- First, let's import the library:
import numpy as np
- Now let's do the speeds calculation. As you can see, this is very easy and natural: just write the mathematical definition of speed:
# calculating speeds
distances = np.array([10, 15, 17, 26])
times = np.array([0.3, 0.47, 0.55, 1.20])
speeds = distances / times
speeds
This is what the output looks like:
array([ 33.33333333, 31.91489362, 30.90909091, 21.66666667])
- Now, the purchase calculation. Again, the code for running this calculation is much simpler and more natural:
#Calculating the total of a purchase
product_quantities = np.array([13, 5, 6, 10, 11])
prices = np.array([1.2, 6.5, 1.0, 4.8, 5.0])
total = (product_quantities*prices).sum()
total
After running this calculation, you will see that we get the same total: 157.1.
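Multiplying element by element and then summing is the definition of a dot product, so the same total can probably be written even more compactly. The following sketch compares the three spellings; the names `total_sum`, `total_dot`, and `total_at` are chosen here only for illustration:

```python
import numpy as np

product_quantities = np.array([13, 5, 6, 10, 11])
prices = np.array([1.2, 6.5, 1.0, 4.8, 5.0])

# multiply element by element, then add everything up
total_sum = (product_quantities * prices).sum()

# the same calculation expressed as a dot product
total_dot = np.dot(product_quantities, prices)

# for one-dimensional arrays the @ operator computes the same dot product
total_at = product_quantities @ prices

print(total_sum, total_dot, total_at)
```

All three results agree up to floating-point rounding, so which one to use is mostly a matter of readability.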
Now let's talk about some of the basics of array creation, main attributes, and operations. This is of course by no means a complete introduction, but it will be enough for you to have a basic understanding of how NumPy arrays work.
As we saw before, we can create arrays from lists like so:
# arrays from lists
distances = [10, 15, 17, 26, 20]
times = [0.3, 0.47, 0.55, 1.20, 1.0]
distances = np.array(distances)
times = np.array(times)
If we pass a list of lists to np.array(), it will create a two-dimensional array. If we pass a list of lists of lists (three levels of nesting), it will create a three-dimensional array, and so on:
A = np.array([[1, 2], [3, 4]])
This is how A looks:
array([[1, 2], [3, 4]])
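As a small sketch of the three-level case, the following builds a three-dimensional array from nested lists. The array `B` is made up for this example, and the `ndim` and `shape` attributes it prints are described later in this section:

```python
import numpy as np

# three levels of nesting produce three axes
B = np.array([[[1, 2], [3, 4]],
              [[5, 6], [7, 8]]])

print(B.ndim)   # 3
print(B.shape)  # (2, 2, 2)
```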
Now let's take a look at some of the main attributes of arrays. First, let's create some arrays containing randomly generated numbers:
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(low=0, high=9, size=12) # 1D array
x2 = np.random.randint(low=0, high=9, size=(3, 4)) # 2D array
x3 = np.random.randint(low=0, high=9, size=(3, 4, 5)) # 3D array
print(x1, '\n')
print(x2, '\n')
print(x3, '\n')
Here are our arrays:
[5 0 3 3 7 3 5 2 4 7 6 8]
[[8 1 6 7]
[7 8 1 5]
[8 4 3 0]]
[[[3 5 0 2 3]
[8 1 3 3 3]
[7 0 1 0 4]
[7 3 2 7 2]]
[[0 0 4 5 5]
[6 8 4 1 4]
[8 1 1 7 3]
[6 7 2 0 3]]
[[5 4 4 6 4]
[4 3 4 4 8]
[4 3 7 5 5]
[0 1 5 3 0]]]
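Note that no 9 appears in any of these arrays: the `high` bound of `np.random.randint` is exclusive. As an aside, newer NumPy versions also offer a generator-based interface, `np.random.default_rng`, which is not the one used in this section. A minimal sketch of the same idea with it follows, where the names `rng`, `y1`, and `y2` are assumptions of this example; it will not reproduce the numbers shown above, because the underlying generator differs:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator for reproducibility

# as with randint, the upper bound `high` is exclusive by default
y1 = rng.integers(low=0, high=9, size=12)      # 1D array
y2 = rng.integers(low=0, high=9, size=(3, 4))  # 2D array

print(y1.min() >= 0, y1.max() <= 8)  # True True
print(y2.shape)                      # (3, 4)
```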
Important array attributes are the following:
- ndarray.ndim: The number of axes (dimensions) of the array.
- ndarray.shape: The dimensions of the array. This tuple of integers indicates the size of the array in each dimension.
- ndarray.size: The total number of elements of the array. This is equal to the product of the elements of shape.
- ndarray.dtype: An object describing the type of the elements in the array. One can create or specify dtypes using standard Python types. NumPy also provides types of its own; numpy.int32, numpy.int16, and numpy.float64 are some examples:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("x3 dtype:", x3.dtype)
The output is as follows (the dtype shown depends on the platform and NumPy version; on many systems it is int64):
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
x3 dtype: int32
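To make the dtype attribute more concrete, here is a short sketch of specifying a dtype at creation time and converting an existing array. The arrays `a`, `b`, `c`, and `x` are invented for this example:

```python
import numpy as np

a = np.array([1, 2, 3], dtype=float)     # a standard Python type
b = np.array([1, 2, 3], dtype=np.int16)  # one of NumPy's own types
c = b.astype(np.float64)                 # convert an existing array to a new dtype

print(a.dtype)  # float64
print(b.dtype)  # int16
print(c.dtype)  # float64

# size is the product of the elements of shape
x = np.zeros((3, 4, 5))
print(x.size == np.prod(x.shape))  # True
```

Note that `astype` returns a new array and leaves `b` unchanged.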
One-dimensional arrays can be indexed, sliced, and iterated over, just like lists or other Python sequences:
>>> x1
array([5, 0, 3, 3, 7, 3, 5, 2, 4, 7, 6, 8])
>>> x1[5] # element with index 5
3
>>> x1[2:5] # slice of the elements at indexes 2, 3, and 4
array([3, 3, 7])
>>> x1[-1] # the last element of the array
8
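The following sketch adds a few more one-dimensional cases: slicing with a step, iterating, and one behavior that differs from lists, namely that a basic slice of an array is a view onto the same data rather than a copy. The names `total` and `first_three` are introduced only for this illustration:

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 3, 5, 2, 4, 7, 6, 8])

print(x1[::2])   # every second element: 5, 3, 7, 5, 4, 6
print(x1[::-1])  # the array reversed

# iterating works as with any Python sequence
total = 0
for value in x1:
    total += value
print(total)  # 53

# unlike a list slice, an array slice is a view: writing to it changes x1
first_three = x1[:3]
first_three[0] = 100
print(x1[0])  # 100
```

If an independent copy is needed, calling `.copy()` on the slice avoids modifying the original array.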
Multi-dimensional arrays have one index per axis. These indices are given in a tuple separated by commas:
>>> one_to_twenty = np.arange(1,21) # integers from 1 to 20
>>> my_matrix = one_to_twenty.reshape(5,4) # transform to a 5-row by 4-column matrix
>>> my_matrix
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20]])
>>> my_matrix[2,3] # element in row 3, column 4 (remember Python is zero-indexed)
12
>>> my_matrix[:, 1] # each row in the second column of my_matrix
array([ 2, 6, 10, 14, 18])
>>> my_matrix[0:2,-1] # first and second row of the last column
array([4, 8])
>>> my_matrix[0,0] = -1 # setting the first element to -1
>>> my_matrix
The output of the preceding code is as follows:
array([[-1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16],
[17, 18, 19, 20]])
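A few more indexing patterns may be useful. The sketch below rebuilds the matrix from scratch (so the assignment to -1 is not reflected) and shows selecting a whole row, cutting out a sub-block, and iterating over rows. The names `third_row`, `block`, and `row_sums` are assumptions of this example:

```python
import numpy as np

my_matrix = np.arange(1, 21).reshape(5, 4)

# a single index selects along the first axis, that is, a whole row
third_row = my_matrix[2]
print(third_row)  # 9, 10, 11, 12

# slices on both axes cut out a sub-block (rows 1-2, columns 1-2)
block = my_matrix[1:3, 1:3]
print(block)  # 6, 7 in the first row and 10, 11 in the second

# iterating over a 2D array yields one row at a time
row_sums = [int(row.sum()) for row in my_matrix]
print(row_sums)  # [10, 26, 42, 58, 74]
```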
Now let's perform some mathematical operations on the previous matrix (recreated here with its original values), just to have some examples of how vectorization works:
>>> one_to_twenty = np.arange(1,21) # integers from 1 to 20
>>> my_matrix = one_to_twenty.reshape(5,4) # transform to a 5-row by 4-column matrix
>>> # the following operations are done to every element of the matrix
>>> my_matrix + 5 # addition
array([[ 6, 7, 8, 9],
[10, 11, 12, 13],
[14, 15, 16, 17],
[18, 19, 20, 21],
[22, 23, 24, 25]])
>>> my_matrix / 2 # division
array([[ 0.5, 1. , 1.5, 2. ],
[ 2.5, 3. , 3.5, 4. ],
[ 4.5, 5. , 5.5, 6. ],
[ 6.5, 7. , 7.5, 8. ],
[ 8.5, 9. , 9.5, 10. ]])
>>> my_matrix ** 2 # exponentiation
array([[ 1, 4, 9, 16],
[ 25, 36, 49, 64],
[ 81, 100, 121, 144],
[169, 196, 225, 256],
[289, 324, 361, 400]], dtype=int32)
>>> 2**my_matrix # powers of 2
array([[ 2, 4, 8, 16],
[ 32, 64, 128, 256],
[ 512, 1024, 2048, 4096],
[ 8192, 16384, 32768, 65536],
[ 131072, 262144, 524288, 1048576]], dtype=int32)
>>> np.sin(my_matrix) # mathematical functions like sin
array([[ 0.84147098, 0.90929743, 0.14112001, -0.7568025 ],
[-0.95892427, -0.2794155 , 0.6569866 , 0.98935825],
[ 0.41211849, -0.54402111, -0.99999021, -0.53657292],
[ 0.42016704, 0.99060736, 0.65028784, -0.28790332],
[-0.96139749, -0.75098725, 0.14987721, 0.91294525]])
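The operations above combine an array with a single number, but the same operators also work between two arrays of the same shape, again element by element. Here is a minimal sketch with two made-up 2x2 arrays, `a` and `b`. One point worth noting is that `*` is the element-wise product, while the matrix product is written with `@`:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

elementwise_sum = a + b      # [[11, 22], [33, 44]]
elementwise_product = a * b  # [[10, 40], [90, 160]]
matrix_product = a @ b       # [[70, 100], [150, 220]]
greater_than_two = a > 2     # [[False, False], [True, True]]

print(elementwise_sum)
print(elementwise_product)
print(matrix_product)
print(greater_than_two)
```

Comparisons are vectorized too, and they return an array of booleans with the same shape as the input.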
Finally, let's take a look at some useful methods commonly used in data analysis:
>>> # some useful methods for analytics
>>> my_matrix.sum()
210
>>> my_matrix.max() ## maximum
20
>>> my_matrix.min() ## minimum
1
>>> my_matrix.mean() ## arithmetic mean
10.5
>>> my_matrix.std() ## standard deviation
5.766281297335398
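These methods aggregate over the whole array by default. In data analysis it is often more useful to aggregate along one axis, which the `axis` argument allows. A short sketch follows, with `column_sums` and `row_means` as names chosen for this example:

```python
import numpy as np

my_matrix = np.arange(1, 21).reshape(5, 4)

# axis=0 collapses the rows, leaving one value per column
column_sums = my_matrix.sum(axis=0)
print(column_sums)  # 45, 50, 55, 60

# axis=1 collapses the columns, leaving one value per row
row_means = my_matrix.mean(axis=1)
print(row_means)  # 2.5, 6.5, 10.5, 14.5, 18.5
```

The same `axis` argument is accepted by `max`, `min`, and `std`.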
I don't want to reinvent the wheel here; there are many excellent resources on the basics of NumPy.
If you go through the official quick start tutorial, available at https://docs.scipy.org/doc/numpy/user/quickstart.html, you will have more than enough background to follow the materials in the book. If you want to go deeper, please also take a look at the references.