Dealing with the pitfalls of joining pandas DataFrames
The previous recipe was a whirlwind tour that introduced pandas and exposed most of the features that we will use in this book. While an exhaustive discussion about pandas would require a complete book, in this recipe – and in the next one – we are going to discuss topics that impact data analysis and are seldom discussed in the literature but are very important.
In this recipe, we are going to discuss some pitfalls that deal with relating DataFrames through joins: it turns out that many data analysis errors are introduced by carelessly joining data. We will introduce techniques to reduce such problems here.
Getting ready
We will be using the same data as in the previous recipe, but we will jumble it a bit so that we can discuss typical data analysis pitfalls. Once again, we will be joining the main adverse events table with the vaccination table, but we will randomly sample 90% of the data from each. This mimics, for example, the scenario where you only have incomplete information. This is one of the many examples where joins between tables do not have intuitively obvious results.
Use the following code to prepare our files by randomly sampling 90% of the data:
vdata = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1") vdata.sample(frac=0.9).to_csv("vdata_sample.csv.gz", index=False) vax = pd.read_csv("2021VAERSVAX.csv.gz", encoding="iso-8859-1") vax.sample(frac=0.9).to_csv("vax_sample.csv.gz", index=False)
Because this code involves random sampling, the results that you will get will be different from the ones reported here. If you want to get the same results, I have provided the files that I used in the Chapter02
directory. The code for this recipe can be found in Chapter02/Pandas_Join.py
.
How to do it...
Follow these steps:
- Let’s start by doing an inner join of the individual and vaccine tables:
vdata = pd.read_csv("vdata_sample.csv.gz") vax = pd.read_csv("vax_sample.csv.gz") vdata_with_vax = vdata.join( vax.set_index("VAERS_ID"), on="VAERS_ID", how="inner") len(vdata), len(vax), len(vdata_with_vax)
The len
output for this code is 589,487 for the individual data, 620,361 for the vaccination data, and 558,220 for the join. This suggests that some individual and vaccine data was not captured.
- Let’s find the data that was not captured with the following join:
lost_vdata = vdata.loc[~vdata.index.isin(vdata_with_vax.index)] lost_vdata lost_vax = vax[~vax["VAERS_ID"].isin(vdata.index)] lost_vax
You will see that 56,524 rows of individual data aren’t joined and that there are 62,141 rows of vaccinated data.
- There are other ways to join data. The default way is by performing a left outer join:
vdata_with_vax_left = vdata.join( vax.set_index("VAERS_ID"), on="VAERS_ID") vdata_with_vax_left.groupby("VAERS_ID").size().sort_values()
A left outer join assures that all the rows on the left table are always represented. If there are no rows on the right, then all the right columns will be filled with None
values.
Warning
There is a caveat that you should be careful with. Remember that the left table – vdata
– had one entry per VAERS_ID
. When you left join, you may end up with a case where the left-hand side is repeated several times. For example, the groupby
operation that we did previously shows that VAERS_ID
of 962303 has 11 entries. This is correct, but it’s not uncommon to have the incorrect expectation that you will still have a single row on the output per row on the left-hand side. This is because the left join returns 1 or more left entries, whereas the inner join above returns 0 or 1 entries, where sometimes, we would like to have precisely 1 entry. Be sure to always test the output for what you want in terms of the number of entries.
- There is a right join as well. Let’s right join COVID vaccines – the left table – with death events – the right table:
dead = vdata[vdata.DIED == "Y"] vax19 = vax[vax.VAX_TYPE == "COVID19"] vax19_dead = vax19.join(dead.set_index("VAERS_ID"), on="VAERS_ID", how="right") len(vax19), len(dead), len(vax19_dead) len(vax19_dead[vax19_dead.VAERS_ID.duplicated()]) len(vax19_dead) - len(dead)
As you may expect, a right join will ensure that all the rows on the right table are represented. So, we end up with 583,817 COVID entries, 7,670 dead entries, and a right join of 8,624 entries.
We also check the number of duplicated entries on the joined table and we get 954. If we subtract the length of the dead table from the joined table, we also get, as expected, 954. Make sure you have checks like this when you’re making joins.
- Finally, we are going to revisit the problematic COVID lot calculations since we now understand that we might be overcounting lots:
vax19_dead["STATE"] = vax19_dead["STATE"].str.upper() dead_lot = vax19_dead[["VAERS_ID", "VAX_LOT", "STATE"]].set_index(["VAERS_ID", "VAX_LOT"]) dead_lot_clean = dead_lot[~dead_lot.index.duplicated()] dead_lot_clean = dead_lot_clean.reset_index() dead_lot_clean[dead_lot_clean.VAERS_ID.isna()] baddies = dead_lot_clean.groupby("VAX_LOT").size().sort_values(ascending=False) for i, (lot, cnt) in enumerate(baddies.items()): print(lot, cnt, len(dead_lot_clean[dead_lot_clean.VAX_LOT == lot].groupby("STATE"))) if i == 10: break
Note that the strategies that we’ve used here ensure that we don’t get repeats: first, we limit the number of columns to the ones we will be using, then we remove repeated indexes and empty VAERS_ID
. This ensures no repetition of the VAERS_ID
, VAX_LOT
pair, and that no lots are associated with no IDs.
There’s more...
There are other types of joins other than left, inner, and right. Most notably, there is the outer join, which assures all entries from both tables have representation.
Make sure you have tests and assertions for your joins: a very common bug is having the wrong expectations for how joins behave. You should also make sure that there are no empty values on the columns where you are joining, as they can produce a lot of excess tuples.