Accelerating pandas processing with Apache Arrow
When dealing with large amounts of data, such as in whole-genome sequencing, pandas can be both slow and memory-hungry. Apache Arrow provides faster and more memory-efficient implementations of several pandas operations and can interoperate with it.
Apache Arrow is a project co-founded by Wes McKinney, the creator of pandas. It has several objectives: it defines a language-agnostic format for working with tabular data, which allows for interoperability between languages, and it provides a memory- and computation-efficient implementation of that format. Here, we will only be concerned with the second part: getting more efficiency for large-data processing. We will do this in an integrated way with pandas.
Here, we will once again use VAERS data and show how Apache Arrow can be used to accelerate pandas data loading and reduce memory consumption.
Getting ready
Again, we will be using data from the first recipe. Be sure you download and prepare it, as explained in the Getting ready section of the Using pandas to process vaccine-adverse events recipe. The code is available in Chapter02/Arrow.py.
How to do it...
Follow these steps:
- Let’s start by loading the data using both pandas and Arrow:
```python
import gzip

import pandas as pd
from pyarrow import csv
import pyarrow.compute as pc

vdata_pd = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
columns = list(vdata_pd.columns)
vdata_pd.info(memory_usage="deep")

vdata_arrow = csv.read_csv("2021VAERSDATA.csv.gz")
tot_bytes = sum([
    vdata_arrow[name].nbytes
    for name in vdata_arrow.column_names])
print(f"Total {tot_bytes // (1024 ** 2)} MB")
```
pandas requires 1.3 GB, whereas Arrow requires 614 MB: less than half the memory. For a file this large, that may mean the difference between being able to process the data in memory and needing another solution, such as Dask. While some Arrow functions have names similar to their pandas counterparts (for example, read_csv), that is not the most common case. For example, note how we compute the total size of the Arrow table: by getting the size of each column and summing them, which is a different approach from pandas.
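For reference, pandas can report the same kind of total via memory_usage, and recent pyarrow versions also expose an nbytes property directly on the Table, so the per-column sum above is mostly illustrative. A minimal sketch, assuming the variables from the previous step are still in scope:

```python
# pandas: per-column bytes (plus the index), summed into a single total
pd_total = vdata_pd.memory_usage(index=True, deep=True).sum()

# Arrow: recent pyarrow versions expose nbytes on the whole Table,
# equivalent to summing the per-column sizes as we did above
arrow_total = vdata_arrow.nbytes

print(f"pandas: {pd_total // (1024 ** 2)} MB, Arrow: {arrow_total // (1024 ** 2)} MB")
```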
- Let’s do a side-by-side comparison of the inferred types:
```python
for name in vdata_arrow.column_names:
    arr_bytes = vdata_arrow[name].nbytes
    arr_type = vdata_arrow[name].type
    pd_bytes = vdata_pd[name].memory_usage(index=False, deep=True)
    pd_type = vdata_pd[name].dtype
    print(
        name,
        arr_type, arr_bytes // (1024 ** 2),
        pd_type, pd_bytes // (1024 ** 2))
```
Here is an abridged version of the output:
```
VAERS_ID int64 4 int64 4
RECVDATE string 8 object 41
STATE string 3 object 34
CAGE_YR int64 5 float64 4
SEX string 3 object 36
RPT_DATE string 2 object 20
DIED string 2 object 20
L_THREAT string 2 object 20
ER_VISIT string 2 object 19
HOSPITAL string 2 object 20
HOSPDAYS int64 5 float64 4
```
As you can see, Arrow's type inference is generally more specific: note, for instance, how text columns are stored as Arrow strings rather than generic Python objects. This is one of the main reasons why its memory usage is substantially lower.
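If you want to take control instead of relying on inference, pyarrow's CSV reader lets you override column types and dictionary-encode repetitive string columns (similar in spirit to pandas categoricals). The following is an illustrative sketch rather than part of the recipe; the choice of int16 for CAGE_YR is an assumption about the value range:

```python
import pyarrow as pa
from pyarrow import csv

convert_options = csv.ConvertOptions(
    # override type inference for a specific column (illustrative choice)
    column_types={"CAGE_YR": pa.int16()},
    # dictionary-encode low-cardinality string columns such as STATE or SEX
    auto_dict_encode=True,
)
vdata_arrow_typed = csv.read_csv(
    "2021VAERSDATA.csv.gz", convert_options=convert_options)
print(vdata_arrow_typed.schema)
```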
- Now, let’s do a time performance comparison:
```
%timeit pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
%timeit csv.read_csv("2021VAERSDATA.csv.gz")
```
On my computer, the results are as follows:
```
7.36 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.28 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Arrow's implementation is roughly three times faster. The results on your computer will vary, as they depend on your hardware.
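Note that %timeit is an IPython/Jupyter magic. If you are running the code as a plain script (for example, Chapter02/Arrow.py), a rough equivalent sketch using the standard timeit module, with a single run per reader to keep it short, would be:

```python
import timeit

pd_secs = timeit.timeit(
    lambda: pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1"),
    number=1)
arrow_secs = timeit.timeit(
    lambda: csv.read_csv("2021VAERSDATA.csv.gz"),
    number=1)
print(f"pandas: {pd_secs:.2f}s  Arrow: {arrow_secs:.2f}s")
```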
- Let's repeat the memory usage comparison while not loading the SYMPTOM_TEXT column. This is a fairer comparison, as most numerical datasets do not tend to have a very large text column:

```python
vdata_pd = pd.read_csv(
    "2021VAERSDATA.csv.gz", encoding="iso-8859-1",
    usecols=lambda x: x != "SYMPTOM_TEXT")
vdata_pd.info(memory_usage="deep")

columns.remove("SYMPTOM_TEXT")
vdata_arrow = csv.read_csv(
    "2021VAERSDATA.csv.gz",
    convert_options=csv.ConvertOptions(include_columns=columns))
vdata_arrow.nbytes
```
pandas requires 847 MB, whereas Arrow requires 205 MB: four times less.
- Our objective is to use Arrow to load data into pandas. For that, we need to convert the data structure:
```python
vdata = vdata_arrow.to_pandas()
vdata.info(memory_usage="deep")
```
There are two very important points to be made here. The first is that the pandas representation created by Arrow uses only 1 GB, whereas the representation produced by pandas' native read_csv is 1.3 GB. This means that even if you use pandas to process your data, Arrow can create a more compact representation to start with.
The second point concerns memory consumption during the conversion itself: while the converter is running, it needs memory to hold both the pandas and the Arrow representations, which defeats the purpose of using less memory. Arrow can self-destruct its representation while creating the pandas version, resolving the problem. The line for this is vdata = vdata_arrow.to_pandas(self_destruct=True).
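Putting the pieces together, a memory-conscious loading pattern could look like the sketch below. split_blocks=True is another to_pandas option that, together with self_destruct, helps reduce peak memory during conversion; the Arrow table must not be used afterwards, so we delete the reference:

```python
from pyarrow import csv

vdata_arrow = csv.read_csv(
    "2021VAERSDATA.csv.gz",
    convert_options=csv.ConvertOptions(include_columns=columns))
# convert to pandas, releasing Arrow buffers as the conversion proceeds
vdata = vdata_arrow.to_pandas(self_destruct=True, split_blocks=True)
del vdata_arrow  # the self-destructed table is no longer usable
vdata.info(memory_usage="deep")
```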
There’s more...
If you have a very large DataFrame that cannot be processed by pandas, even after it has been loaded by Arrow, then Arrow may be able to do all the processing, as it also has a compute engine. That being said, at the time of writing, Arrow's engine is substantially less complete in terms of functionality than pandas. Remember that Arrow has many other features, such as language interoperability, but we will not be making use of those in this book.
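As a small taste of that compute engine (and to finally use the pyarrow.compute import from the first step), here is a sketch that counts death reports without converting to pandas. It assumes the DIED column contains "Y" for deaths and that vdata_arrow still holds the loaded table (that is, it has not been self-destructed):

```python
import pyarrow.compute as pc

# boolean mask computed entirely in Arrow
died_mask = pc.equal(vdata_arrow["DIED"], "Y")
print(vdata_arrow.filter(died_mask).num_rows)

# frequency of each value in the column
print(pc.value_counts(vdata_arrow["DIED"]))
```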