You're reading from Bioinformatics with Python Cookbook Use modern Python libraries and applications to solve real-world computational biology problems

Product type Paperback

Published in Sep 2022

Publisher Packt

ISBN-13 9781803236421

Length 360 pages

Edition 3rd Edition

Languages

Python

Tools

Dask

Concepts

Bioinformatics

Author (1):

Tiago Antao

View More author details

Table of Contents (15) Chapters

Preface

1. Chapter 1: Python and the Surrounding Software Ecology

2. Chapter 2: Getting to Know NumPy, pandas, Arrow, and Matplotlib FREE CHAPTER

3. Chapter 3: Next-Generation Sequencing

4. Chapter 4: Advanced NGS Data Processing

5. Chapter 5: Working with Genomes

6. Chapter 6: Population Genetics

7. Chapter 7: Phylogenetics

8. Chapter 8: Using the Protein Data Bank

9. Chapter 9: Bioinformatics Pipelines

10. Chapter 10: Machine Learning for Bioinformatics

11. Chapter 11: Parallel Processing with Dask and Zarr

12. Chapter 12: Functional Programming for Bioinformatics

13. Index

Why subscribe?

14. Other Books You May Enjoy

Dealing with the pitfalls of joining pandas DataFrames

The previous recipe was a whirlwind tour that introduced pandas and exposed most of the features that we will use in this book. While an exhaustive discussion about pandas would require a complete book, in this recipe – and in the next one – we are going to discuss topics that impact data analysis and are seldom discussed in the literature but are very important.

In this recipe, we are going to discuss some pitfalls that deal with relating DataFrames through joins: it turns out that many data analysis errors are introduced by carelessly joining data. We will introduce techniques to reduce such problems here.

Getting ready

We will be using the same data as in the previous recipe, but we will jumble it a bit so that we can discuss typical data analysis pitfalls. Once again, we will be joining the main adverse events table with the vaccination table, but we will randomly sample 90% of the data from each. This mimics, for example, the scenario where you only have incomplete information. This is one of the many examples where joins between tables do not have intuitively obvious results.

Use the following code to prepare our files by randomly sampling 90% of the data:

vdata = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
vdata.sample(frac=0.9).to_csv("vdata_sample.csv.gz", index=False)
vax = pd.read_csv("2021VAERSVAX.csv.gz", encoding="iso-8859-1")
vax.sample(frac=0.9).to_csv("vax_sample.csv.gz", index=False)

Because this code involves random sampling, the results that you will get will be different from the ones reported here. If you want to get the same results, I have provided the files that I used in the Chapter02 directory. The code for this recipe can be found in Chapter02/Pandas_Join.py.

How to do it...

Follow these steps:

Let’s start by doing an inner join of the individual and vaccine tables:

vdata = pd.read_csv("vdata_sample.csv.gz")
vax = pd.read_csv("vax_sample.csv.gz")
vdata_with_vax = vdata.join(
    vax.set_index("VAERS_ID"),
    on="VAERS_ID",
    how="inner")
len(vdata), len(vax), len(vdata_with_vax)

The len output for this code is 589,487 for the individual data, 620,361 for the vaccination data, and 558,220 for the join. This suggests that some individual and vaccine data was not captured.

Let’s find the data that was not captured with the following join:

lost_vdata = vdata.loc[~vdata.index.isin(vdata_with_vax.index)]
lost_vdata
lost_vax = vax[~vax["VAERS_ID"].isin(vdata.index)]
lost_vax

You will see that 56,524 rows of individual data aren’t joined and that there are 62,141 rows of vaccinated data.

There are other ways to join data. The default way is by performing a left outer join:

vdata_with_vax_left = vdata.join(
    vax.set_index("VAERS_ID"),
    on="VAERS_ID")
vdata_with_vax_left.groupby("VAERS_ID").size().sort_values()

A left outer join assures that all the rows on the left table are always represented. If there are no rows on the right, then all the right columns will be filled with None values.

Warning

There is a caveat that you should be careful with. Remember that the left table – vdata – had one entry per VAERS_ID. When you left join, you may end up with a case where the left-hand side is repeated several times. For example, the groupby operation that we did previously shows that VAERS_ID of 962303 has 11 entries. This is correct, but it’s not uncommon to have the incorrect expectation that you will still have a single row on the output per row on the left-hand side. This is because the left join returns 1 or more left entries, whereas the inner join above returns 0 or 1 entries, where sometimes, we would like to have precisely 1 entry. Be sure to always test the output for what you want in terms of the number of entries.

There is a right join as well. Let’s right join COVID vaccines – the left table – with death events – the right table:

dead = vdata[vdata.DIED == "Y"]
vax19 = vax[vax.VAX_TYPE == "COVID19"]
vax19_dead = vax19.join(dead.set_index("VAERS_ID"), on="VAERS_ID", how="right")
len(vax19), len(dead), len(vax19_dead)
len(vax19_dead[vax19_dead.VAERS_ID.duplicated()])
len(vax19_dead) - len(dead)

As you may expect, a right join will ensure that all the rows on the right table are represented. So, we end up with 583,817 COVID entries, 7,670 dead entries, and a right join of 8,624 entries.

We also check the number of duplicated entries on the joined table and we get 954. If we subtract the length of the dead table from the joined table, we also get, as expected, 954. Make sure you have checks like this when you’re making joins.

Finally, we are going to revisit the problematic COVID lot calculations since we now understand that we might be overcounting lots:

vax19_dead["STATE"] = vax19_dead["STATE"].str.upper()
dead_lot = vax19_dead[["VAERS_ID", "VAX_LOT", "STATE"]].set_index(["VAERS_ID", "VAX_LOT"])
dead_lot_clean = dead_lot[~dead_lot.index.duplicated()]
dead_lot_clean = dead_lot_clean.reset_index()
dead_lot_clean[dead_lot_clean.VAERS_ID.isna()]
baddies = dead_lot_clean.groupby("VAX_LOT").size().sort_values(ascending=False)
for i, (lot, cnt) in enumerate(baddies.items()):
    print(lot, cnt, len(dead_lot_clean[dead_lot_clean.VAX_LOT == lot].groupby("STATE")))
    if i == 10:
        break

Note that the strategies that we’ve used here ensure that we don’t get repeats: first, we limit the number of columns to the ones we will be using, then we remove repeated indexes and empty VAERS_ID. This ensures no repetition of the VAERS_ID, VAX_LOT pair, and that no lots are associated with no IDs.

There’s more...

There are other types of joins other than left, inner, and right. Most notably, there is the outer join, which assures all entries from both tables have representation.

Make sure you have tests and assertions for your joins: a very common bug is having the wrong expectations for how joins behave. You should also make sure that there are no empty values on the columns where you are joining, as they can produce a lot of excess tuples.