Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases now! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Bioinformatics with Python Cookbook

You're reading from   Bioinformatics with Python Cookbook Use modern Python libraries and applications to solve real-world computational biology problems

Arrow left icon
Product type Paperback
Published in Sep 2022
Publisher Packt
ISBN-13 9781803236421
Length 360 pages
Edition 3rd Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Tiago Antao Tiago Antao
Author Profile Icon Tiago Antao
Tiago Antao
Arrow right icon
View More author details
Toc

Table of Contents (15) Chapters Close

Preface 1. Chapter 1: Python and the Surrounding Software Ecology 2. Chapter 2: Getting to Know NumPy, pandas, Arrow, and Matplotlib FREE CHAPTER 3. Chapter 3: Next-Generation Sequencing 4. Chapter 4: Advanced NGS Data Processing 5. Chapter 5: Working with Genomes 6. Chapter 6: Population Genetics 7. Chapter 7: Phylogenetics 8. Chapter 8: Using the Protein Data Bank 9. Chapter 9: Bioinformatics Pipelines 10. Chapter 10: Machine Learning for Bioinformatics 11. Chapter 11: Parallel Processing with Dask and Zarr 12. Chapter 12: Functional Programming for Bioinformatics 13. Index 14. Other Books You May Enjoy

Accelerating pandas processing with 
Apache Arrow

When dealing with large amounts of data, such as in whole genome sequencing, pandas is both slow and memory-consuming. Apache Arrow provides faster and more memory-efficient implementations of several pandas operations and can interoperate with it.

Apache Arrow is a project co-founded by Wes McKinney, the founder of pandas, and it has several objectives, including working with tabular data in a language-agnostic way, which allows for language interoperability while providing a memory- and computation-efficient implementation. Here, we will only be concerned with the second part: getting more efficiency for large-data processing. We will do this in an integrated way with pandas.

Here, we will once again use VAERS data and show how Apache Arrow can be used to accelerate pandas data loading and reduce memory consumption.

Getting ready

Again, we will be using data from the first recipe. Be sure you download and prepare it, as explained in the Getting ready section of the Using pandas to process vaccine-adverse events recipe. The code is available in Chapter02/Arrow.py.

How to do it...

Follow these steps:

  1. Let’s start by loading the data using both pandas and Arrow:
    import gzip
    import pandas as pd
    from pyarrow import csv
    import pyarrow.compute as pc 
    vdata_pd = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
    columns = list(vdata_pd.columns)
    vdata_pd.info(memory_usage="deep") 
    vdata_arrow = csv.read_csv("2021VAERSDATA.csv.gz")
    tot_bytes = sum([
        vdata_arrow[name].nbytes
        for name in vdata_arrow.column_names])
    print(f"Total {tot_bytes // (1024 ** 2)} MB")

pandas requires 1.3 GB, whereas Arrow requires 614 MB: less than half the memory. For large files like this, this may mean the difference between being able to process data in memory or needing to find another solution, such as Dask. While some functions in Arrow have similar names to pandas (for example, read_csv), that is not the most common occurrence. For example, note the way we compute the total size of the DataFrame: by getting the size of each column and performing a sum, which is a different approach from pandas.

  1. Let’s do a side-by-side comparison of the inferred types:
    for name in vdata_arrow.column_names:
        arr_bytes = vdata_arrow[name].nbytes
        arr_type = vdata_arrow[name].type
        pd_bytes = vdata_pd[name].memory_usage(index=False, deep=True)
        pd_type = vdata_pd[name].dtype
        print(
            name,
            arr_type, arr_bytes // (1024 ** 2),
            pd_type, pd_bytes // (1024 ** 2),)

Here is an abridged version of the output:

VAERS_ID int64 4 int64 4
RECVDATE string 8 object 41
STATE string 3 object 34
CAGE_YR int64 5 float64 4
SEX string 3 object 36
RPT_DATE string 2 object 20
DIED string 2 object 20
L_THREAT string 2 object 20
ER_VISIT string 2 object 19
HOSPITAL string 2 object 20
HOSPDAYS int64 5 float64 4

As you can see, Arrow is generally more specific with type inference and is one of the main reasons why memory usage is substantially lower.

  1. Now, let’s do a time performance comparison:
    %timeit pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
    %timeit csv.read_csv("2021VAERSDATA.csv.gz")

On my computer, the results are as follows:

7.36 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.28 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Arrow’s implementation is three times faster. The results on your computer will vary as this is dependent on the hardware.

  1. Let’s repeat the memory occupation comparison while not loading the SYMPTOM_TEXT column. This is a fairer comparison as most numerical datasets do not tend to have a very large text column:
    vdata_pd = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1", usecols=lambda x: x != "SYMPTOM_TEXT")
    vdata_pd.info(memory_usage="deep")
    columns.remove("SYMPTOM_TEXT")
    vdata_arrow = csv.read_csv(
        "2021VAERSDATA.csv.gz",
         convert_options=csv.ConvertOptions(include_columns=columns))
    vdata_arrow.nbytes

pandas requires 847 MB, whereas Arrow requires 205 MB: four times less.

  1. Our objective is to use Arrow to load data into pandas. For that, we need to convert the data structure:
    vdata = vdata_arrow.to_pandas()
    vdata.info(memory_usage="deep")

There are two very important points to be made here: the pandas representation created by Arrow uses only 1 GB, whereas the pandas representation, from its native read_csv, is 1.3 GB. This means that even if you use pandas to process data, Arrow can create a more compact representation to start with.

The preceding code has one problem regarding memory consumption: when the converter is running, it will require memory to hold both the pandas and the Arrow representations, hence defeating the purpose of using less memory. Arrow can self-destruct its representation while creating the pandas version, hence resolving the problem. The line for this is vdata = vdata_arrow.to_pandas(self_destruct=True).

There’s more...

If you have a very large DataFrame that cannot be processed by pandas, even after it’s been loaded by Arrow, then maybe Arrow can do all the processing as it has a computing engine as well. That being said, Arrow’s engine is, at the time of writing, substantially less complete in terms of functionality than pandas. Remember that Arrow has many other features, such as language interoperability, but we will not be making use of those in this book.

You have been reading a chapter from
Bioinformatics with Python Cookbook - Third Edition
Published in: Sep 2022
Publisher: Packt
ISBN-13: 9781803236421
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime