You're reading from Bioinformatics with Python Cookbook Use modern Python libraries and applications to solve real-world computational biology problems

Product type Paperback

Published in Sep 2022

Publisher Packt

ISBN-13 9781803236421

Length 360 pages

Edition 3rd Edition

Languages

Python

Tools

Dask

Concepts

Bioinformatics

Author (1):

Tiago Antao

View More author details

Table of Contents (15) Chapters

Preface

1. Chapter 1: Python and the Surrounding Software Ecology

2. Chapter 2: Getting to Know NumPy, pandas, Arrow, and Matplotlib FREE CHAPTER

3. Chapter 3: Next-Generation Sequencing

4. Chapter 4: Advanced NGS Data Processing

5. Chapter 5: Working with Genomes

6. Chapter 6: Population Genetics

7. Chapter 7: Phylogenetics

8. Chapter 8: Using the Protein Data Bank

9. Chapter 9: Bioinformatics Pipelines

10. Chapter 10: Machine Learning for Bioinformatics

11. Chapter 11: Parallel Processing with Dask and Zarr

12. Chapter 12: Functional Programming for Bioinformatics

13. Index

Why subscribe?

14. Other Books You May Enjoy

Understanding NumPy as the engine behind Python data science and bioinformatics

Most of your analysis will make use of NumPy, even if you don’t use it explicitly. NumPy is an array manipulation library that is behind libraries such as pandas, Matplotlib, Biopython, and scikit-learn, among many others. While much of your bioinformatics work may not require explicit direct use of NumPy, you should be aware of its existence as it underpins almost everything you do, even if only indirectly via the other libraries.

In this recipe, we will use VAERS data to demonstrate how NumPy is behind many of the core libraries that we use. This is a very light introduction to the library so that you are aware that it exists and that it is behind almost everything. Our example will extract the number of cases from the five US states with more adverse effects, splitting them into age bins: 0 to 19 years, 20 to 39, up to 100 to 119.

Getting ready

Once again, we will be using the data from the first recipe, so make sure it’s available. The code for it can be found in Chapter02/NumPy.py.

How to do it…

Follow these steps:

Let’s start by loading the data with pandas and reducing the data so that it’s related to the top five US states only:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
vdata = pd.read_csv(
    "2021VAERSDATA.csv.gz", encoding="iso-8859-1")
vdata["STATE"] = vdata["STATE"].str.upper()
top_states = pd.DataFrame({
    "size": vdata.groupby("STATE").size().sort_values(ascending=False).head(5)}).reset_index()
top_states["rank"] = top_states.index
top_states = top_states.set_index("STATE")
top_vdata = vdata[vdata["STATE"].isin(top_states.index)]
top_vdata["state_code"] = top_vdata["STATE"].apply(
    lambda state: top_states["rank"].at[state]
).astype(np.uint8)
top_vdata = top_vdata[top_vdata["AGE_YRS"].notna()]
top_vdata.loc[:,"AGE_YRS"] = top_vdata["AGE_YRS"].astype(int)
top_states

The top states are as follows. This rank will be used later to construct a NumPy matrix:

Figure 2.2 – US states with largest numbers of adverse effects

Now, let’s extract the two NumPy arrays that contain age and state data:

age_state = top_vdata[["state_code", "AGE_YRS"]]
age_state["state_code"]
state_code_arr = age_state["state_code"].values
type(state_code_arr), state_code_arr.shape, state_code_arr.dtype
age_arr = age_state["AGE_YRS"].values
type(age_arr), age_arr.shape, age_arr.dtype

Note that the data that underlies pandas is NumPy data (the values call for both Series returns NumPy types). Also, you may recall that pandas has properties such as .shape or .dtype: these were inspired by NumPy and behave the same.

Now, let’s create a NumPy matrix from scratch (a 2D array), where each row is a state and each column represents an age group:

age_state_mat = np.zeros((5,6), dtype=np.uint64)
for row in age_state.itertuples():
    age_state_mat[row.state_code, row.AGE_YRS//20] += 1
age_state_mat

The array has five rows – one for each state – and six columns – one for each age group. All the cells in the array must have the same type.

We initialize the array with zeros. There are many ways to initialize arrays, but if you have a very large array, initializing it may take a lot of time. Sometimes, depending on your task, it might be OK that the array is empty at the beginning (meaning it was initialized with random trash). In that case, using np.empty will be much faster. We use pandas iteration here: this is not the best way to do things from a pandas perspective, but we want to make the NumPy part very explicit.

We can extract a single row – in our case, the data for a state – very easily. The same applies to a column. Let’s take California data and then the 0-19 age group:
```
cal = age_state_mat[0,:]
kids = age_state_mat[:,0]
```

Note the syntax to extract a row or a column. It should be familiar to you, given that pandas copied the syntax from NumPy and we encountered it in previous recipes.

Now, let’s compute a new matrix where we have the fraction of cases per age group:

def compute_frac(arr_1d):
    return arr_1d / arr_1d.sum()
frac_age_stat_mat = np.apply_along_axis(compute_frac, 1, age_state_mat)

The last line applies the compute_frac function to all rows. compute_frac takes a single row and returns a new row where all the elements are divided by the total sum.

Now, let’s create a new matrix that acts as a percentage instead of a fraction – simply because it reads better:

perc_age_stat_mat = frac_age_stat_mat * 100
perc_age_stat_mat = perc_age_stat_mat.astype(np.uint8)
perc_age_stat_mat

The first line simply multiplies all the elements of the 2D array by 100. Matplotlib is smart enough to traverse different array structures. That line will work if it’s presented with an array with any dimensions and would do exactly what is expected.

Here is the result:

Figure 2.3 – A matrix representing the distribution of vaccine-adverse effects in the five US states with the most cases

Finally, let’s create a graphical representation of the matrix using Matplotlib:

fig = plt.figure()
ax = fig.add_subplot()
ax.matshow(perc_age_stat_mat, cmap=plt.get_cmap("Greys"))
ax.set_yticks(range(5))
ax.set_yticklabels(top_states.index)
ax.set_xticks(range(6))
ax.set_xticklabels(["0-19", "20-39", "40-59", "60-79", "80-99", "100-119"])
fig.savefig("matrix.png")

Do not dwell too much on the Matplotlib code – we are going to discuss it in the next recipe. The fundamental point here is that you can pass NumPy data structures to Matplotlib. Matplotlib, like pandas, is based on NumPy.

You're reading from Bioinformatics with Python Cookbook Use modern Python libraries and applications to solve real-world computational biology problems

Table of Contents (15) Chapters

Understanding NumPy as the engine behind Python data science and bioinformatics

Getting ready

How to do it…

See also

Authors (1)

Personalised recommendations for you