Bioinformatics with Python Cookbook

Bioinformatics with Python Cookbook: Use modern Python libraries and applications to solve real-world computational biology problems , Third Edition

Getting to Know NumPy, pandas, Arrow, and Matplotlib

One of Python’s biggest strengths is its profusion of high-quality science and data processing libraries. At the core of all of them is NumPy, which provides efficient array and matrix support. Almost all of the scientific libraries are built on top of NumPy. For example, in our field, there’s Biopython. But other generic data analysis libraries can also be used in our field. For example, pandas is the de facto standard for processing tabular data. More recently, Apache Arrow has provided efficient implementations of some of pandas’ functionality, along with language interoperability. Finally, Matplotlib is the most common plotting library in the Python space and is appropriate for scientific computing. While these are general libraries with wide applicability, they are fundamental for bioinformatics processing, so we will study them in this chapter.

We will start by looking at pandas as it provides a high-level library with very broad practical applicability. Then, we’ll introduce Arrow, which we will use only in the scope of supporting pandas. After that, we’ll discuss NumPy, the workhorse behind almost everything we do. Finally, we’ll introduce Matplotlib.

Our recipes are very introductory: each of these libraries could easily occupy a full book, but the recipes should be enough to help you through this book. Because all these libraries are fundamental for data analysis, if you are using Docker, they can be found in the tiagoantao/bioinformatics_base Docker image from Chapter 1.

In this chapter, we will cover the following recipes:

  • Using pandas to process vaccine-adverse events
  • Dealing with the pitfalls of joining pandas DataFrames
  • Reducing the memory usage of pandas DataFrames
  • Accelerating pandas processing with Apache Arrow
  • Understanding NumPy as the engine behind Python data science and bioinformatics
  • Introducing Matplotlib for chart generation

Using pandas to process vaccine-adverse events

We will be introducing pandas with a concrete bioinformatics data analysis example: we will be studying data from the Vaccine Adverse Event Reporting System (VAERS, https://vaers.hhs.gov/). VAERS, which is maintained by the US Department of Health and Human Services, includes a database of vaccine-adverse events going back to 1990.

VAERS makes data available in comma-separated values (CSV) format. The CSV format is quite simple and can even be opened with a simple text editor (be careful with very large file sizes as they may crash your editor) or a spreadsheet such as Excel. pandas can work very easily with this format.

Getting ready

First, we need to download the data. It is available at https://vaers.hhs.gov/data/datasets.html. Please download the ZIP file (we will be using the 2021 file); do not download a single CSV file only. After downloading the file, unzip it, and then recompress all the files individually with gzip -9 *csv to save disk space.

Feel free to have a look at the files with a text editor, or preferably with a tool such as less (zless for compressed files). You can find documentation for the content of the files at https://vaers.hhs.gov/docs/VAERSDataUseGuide_en_September2021.pdf.

If you are using the Notebooks, code is provided at the beginning of them so that you can take care of the necessary processing. If you are using Docker, the base image is enough.

The code can be found in Chapter02/Pandas_Basic.py.

How to do it...

Follow these steps:

  1. Let’s start by loading the main data file and gathering the basic statistics:
    import pandas as pd

    vdata = pd.read_csv(
        "2021VAERSDATA.csv.gz", encoding="iso-8859-1")
    vdata.columns
    vdata.dtypes
    vdata.shape

We start by loading the data. In most cases, there is no need to worry about the text encoding as the default, UTF-8, will work, but in this case, the text encoding is legacy iso-8859-1. Then, we print the column names, which start with VAERS_ID, RECVDATE, STATE, AGE_YRS, and so on. They include 35 entries corresponding to each of the columns. Then, we print the types of each column. Here are the first few entries:

VAERS_ID          int64
RECVDATE         object
STATE            object
AGE_YRS         float64
CAGE_YR         float64
CAGE_MO         float64
SEX              object

Finally, we get the shape of the data: (654986, 35). This means 654,986 rows and 35 columns. You can use any of the preceding strategies to get the information you need regarding the metadata of the table.
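To try the loading step without the full VAERS download, here is a minimal sketch that writes a tiny gzip-compressed CSV in iso-8859-1 and reads it back. The file name, the three rows, and the reduced column set are invented for illustration; only the column names come from the recipe.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical three-row sample; the real file has 35 columns.
csv_text = (
    "VAERS_ID,RECVDATE,STATE,AGE_YRS,SEX,ALLERGIES\n"
    "1,01/01/2021,TX,33.0,F,Pénicilline\n"
    "2,01/01/2021,CA,,M,\n"
    "3,01/02/2021,,58.0,F,Latex\n"
)

# Write the sample as a gzip-compressed file in the legacy encoding.
path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(path, "wt", encoding="iso-8859-1") as handle:
    handle.write(csv_text)

# pandas infers gzip compression from the .gz extension;
# the encoding has to be stated because it is not UTF-8.
sample = pd.read_csv(path, encoding="iso-8859-1")

print(list(sample.columns))  # column names
print(sample.dtypes)         # one type per column
print(sample.shape)          # (rows, columns)
```

Empty fields are read as NaN, which is why a numeric column with gaps, such as AGE_YRS, ends up as float64 rather than an integer type.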

  2. Now, let’s explore the data:
    vdata.iloc[0]
    vdata = vdata.set_index("VAERS_ID")
    vdata.loc[916600]
    vdata.head(3)
    vdata.iloc[:3]
    vdata.iloc[:5, 2:4]

There are many ways we can look at the data. We will start by inspecting the first row, based on location. Here is an abridged version:

VAERS_ID                                       916600
RECVDATE                                       01/01/2021
STATE                                          TX
AGE_YRS                                        33.0
CAGE_YR                                        33.0
CAGE_MO                                        NaN
SEX                                            F

TODAYS_DATE                                          01/01/2021
BIRTH_DEFECT                                  NaN
OFC_VISIT                                     Y
ER_ED_VISIT                                       NaN
ALLERGIES                                       Pcn and bee venom

After we index by VAERS_ID, we can use one ID to get a row. We can use 916600 (which is the ID from the preceding record) and get the same result.

Then, we retrieve the first three rows. Notice the two different ways we can do so:

  • Using the head method
  • Using the more general array specification; that is, iloc[:3]

Finally, we retrieve the first five rows, but only the third and fourth columns (positions 2 and 3, since positions are zero-based and the end of a slice is excluded): iloc[:5, 2:4]. Here is the output:

          AGE_YRS  CAGE_YR
VAERS_ID                  
916600       33.0     33.0
916601       73.0     73.0
916602       23.0     23.0
916603       58.0     58.0
916604       47.0     47.0
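The difference between position-based and label-based access can be sketched on a miniature table. The frame below is hypothetical: it reuses the recipe's column names and the first VAERS_ID values, but the STATE values for the second and third rows are invented.

```python
import pandas as pd

# Hypothetical miniature of the table, indexed by VAERS_ID as in the recipe.
mini = pd.DataFrame(
    {
        "RECVDATE": ["01/01/2021", "01/01/2021", "01/01/2021"],
        "STATE": ["TX", "CA", "WA"],
        "AGE_YRS": [33.0, 73.0, 23.0],
        "CAGE_YR": [33.0, 73.0, 23.0],
    },
    index=pd.Index([916600, 916601, 916602], name="VAERS_ID"),
)

by_position = mini.iloc[0]   # first row, whatever its label is
by_label = mini.loc[916600]  # row whose index label is 916600
print(by_position.equals(by_label))

# Position-based slices exclude the end point: rows 0-1, columns 2-3.
ages = mini.iloc[:2, 2:4]
print(ages)

# Label-based slices include both end points.
same_ages = mini.loc[916600:916601, "AGE_YRS":"CAGE_YR"]
print(same_ages.equals(ages))
```

Note that VAERS_ID no longer counts as a column once it becomes the index, so column positions shift by one compared with the table as originally loaded.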
  3. Let’s do some basic computations now, namely computing the maximum age in the dataset:
    vdata["AGE_YRS"].max()
    vdata.AGE_YRS.max()

The maximum value is 119 years. More important than the result, notice the two dialects for accessing the AGE_YRS column: as a dictionary key and as an object field.
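The two dialects are not fully interchangeable. As a small illustration, the sketch below uses a hypothetical frame with a column named count, chosen because it collides with an existing DataFrame method; the is_adult column is likewise invented.

```python
import pandas as pd

frame = pd.DataFrame({"AGE_YRS": [33.0, 73.0, None], "count": [1, 2, 3]})

# Both dialects return the same column when the name is a plain identifier.
print(frame["AGE_YRS"].max() == frame.AGE_YRS.max())

# "count" collides with the DataFrame.count method, so attribute access
# returns the method rather than the column.
print(callable(frame.count))

# Dictionary-style access always reaches the column.
print(frame["count"].sum())

# New columns are created with dictionary-style assignment.
frame["is_adult"] = frame["AGE_YRS"] >= 18
print("is_adult" in frame.columns)
```

Dictionary-style access is the safer default in scripts; attribute access is mostly a convenience for interactive exploration.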

  4. Now, let’s plot the ages involved:
    vdata["AGE_YRS"].sort_values().plot(use_index=False)
    vdata["AGE_YRS"].plot.hist(bins=20) 

This generates two plots (a condensed version is shown in the following step). We use the pandas plotting machinery here, which uses Matplotlib underneath.

  5. While we have a full recipe for charting with Matplotlib (Introducing Matplotlib for chart generation), let’s have a sneak peek here by using it directly:
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(1, 2, sharey=True)
    fig.suptitle("Age of adverse events")
    vdata["AGE_YRS"].sort_values().plot(
        use_index=False, ax=ax[0],
        xlabel="Observation", ylabel="Age")
    vdata["AGE_YRS"].plot.hist(
        bins=20, orientation="horizontal", ax=ax[1])

This includes both figures from the previous steps. Here is the output:

Figure 2.1 – Left – the age for each observation of adverse effect; right – a histogram showing the distribution of ages

  6. We can also take a non-graphical, more analytical approach, such as counting the events for each age in whole years:
    vdata["AGE_YRS"].dropna().astype(int).value_counts()

The output will be as follows:

50     11006
65     10948
60     10616
51     10513
58     10362
      ...
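The counting step can be checked by hand on a few invented ages. The sketch below is illustrative only; the values are made up, and the sort_index call is an optional extra that is not used in the recipe.

```python
import pandas as pd

# Hypothetical ages with fractional years and one missing value.
ages = pd.Series([50.0, 50.5, 65.0, None, 50.9, 65.2, 0.5])

# Drop missing values first (NaN cannot be cast to int),
# then truncate to whole years and count each age.
per_age = ages.dropna().astype(int).value_counts()
print(per_age)

# value_counts orders by frequency; sort_index orders by age instead.
print(per_age.sort_index())
```

Truncation means that an age of 0.5 is counted under 0, so infants under one year share a single bucket.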
  7. Now, let’s see how many people died:
    vdata.DIED.value_counts(dropna=False)
    vdata["is_dead"] = (vdata.DIED == "Y")

The output of the count is as follows:

NaN    646450
Y        8536
Name: DIED, dtype: int64

Note that the type of DIED is not a Boolean. It’s more declarative to have a Boolean representation of a Boolean characteristic, so we create is_dead for it.

Tip

Here, we are assuming that NaN is to be interpreted as False. In general, we must be careful with the interpretation of NaN. It may mean False or it may simply mean – as in most cases – a lack of data. If that were the case, it should not be converted into False.
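The two interpretations of NaN can be contrasted in a short sketch. The four values below are invented, and the nullable boolean variant is an alternative that the recipe does not use; it is shown only to illustrate how missing values could be kept as missing.

```python
import pandas as pd

# Hypothetical DIED column: "Y" or missing.
died = pd.Series(["Y", None, "Y", None], name="DIED")

# Interpretation used in the recipe: anything that is not "Y" becomes False.
is_dead = died == "Y"
print(is_dead.tolist())

# Alternative: keep missing values missing by masking them out and
# converting to the nullable "boolean" dtype. This is only appropriate
# when NaN means "unknown" rather than "no".
is_dead_nullable = (died == "Y").where(died.notna()).astype("boolean")
print(is_dead_nullable)
```

With the nullable version, counts and filters have to state explicitly how unknown values are treated, which is often the point.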

  8. Now, let’s associate the individual data about deaths with the type of vaccine involved:
    dead = vdata[vdata.is_dead]
    vax = pd.read_csv("2021VAERSVAX.csv.gz", encoding="iso-8859-1").set_index("VAERS_ID")
    vax.groupby("VAX_TYPE").size().sort_values()
    vax19 = vax[vax.VAX_TYPE == "COVID19"]
    vax19_dead = dead.join(vax19)

After we get a DataFrame containing just deaths, we must read the data that contains vaccine information. First, we must do some exploratory analysis of the types of vaccines and their adverse events. Here is the abridged output:

           …
HPV9         1506
FLU4         3342
UNK          7941
VARZOS      11034
COVID19    648723

After that, we must choose just the COVID-related vaccines and join them with individual data.
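To see what join does with the index, here is a minimal sketch on two invented tables. The IDs, states, and lot names are hypothetical; the column names follow the recipe.

```python
import pandas as pd

# Hypothetical reports: one row per VAERS_ID.
dead = pd.DataFrame(
    {"STATE": ["TX", "CA"]},
    index=pd.Index([1, 2], name="VAERS_ID"),
)

# Hypothetical vaccine table: ID 1 lists two vaccines, ID 2 has none,
# and ID 3 does not appear in the reports at all.
vax = pd.DataFrame(
    {"VAX_TYPE": ["COVID19"] * 3, "VAX_LOT": ["A1", "B2", "C3"]},
    index=pd.Index([1, 1, 3], name="VAERS_ID"),
)

# join matches on the index and is a left join by default.
joined = dead.join(vax)
print(joined)

# An inner join keeps only the IDs present in both tables.
inner = dead.join(vax, how="inner")
print(inner)
```

In the left join, the report with two vaccines appears twice and the report with no vaccine row is kept with missing values, so row counts after a join are not necessarily counts of reports.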

  9. Finally, let’s see the top 10 COVID vaccine lots that are overrepresented in terms of deaths and how many US states were affected by each lot:
    baddies = vax19_dead.groupby("VAX_LOT").size().sort_values(ascending=False)
    for i, (lot, cnt) in enumerate(baddies.items()):
        print(lot, cnt, len(vax19_dead[vax19_dead.VAX_LOT == lot].groupby("STATE")))
        if i == 10:
            break

The output is as follows:

Unknown 254 34
EN6201 120 30
EN5318 102 26
EN6200 101 22
EN6198 90 23
039K20A 89 13
EL3248 87 17
EL9261 86 21
EM9810 84 21
EL9269 76 18
EN6202 75 18
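The same per-lot summary can also be computed in a single grouped pass instead of filtering the table once per lot. This is an alternative sketch, not the recipe's code: the data is invented, and the output column names deaths and states are hypothetical.

```python
import pandas as pd

# Hypothetical joined table with one row per (report, vaccine) pair.
vax19_dead = pd.DataFrame(
    {
        "VAX_LOT": ["L1", "L1", "L1", "L2", "L2", "L3"],
        "STATE": ["TX", "TX", "CA", "WA", None, "TX"],
    }
)

summary = (
    vax19_dead.groupby("VAX_LOT")
    .agg(
        deaths=("STATE", "size"),     # rows per lot, missing states included
        states=("STATE", "nunique"),  # distinct states, missing values ignored
    )
    .sort_values("deaths", ascending=False)
    .head(10)
)
print(summary)
```

Counting distinct states with nunique ignores missing values, which matches the default behavior of grouping by STATE in the loop above.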

That concludes this recipe!

There’s more...

The preceding data about vaccines and lots is not completely correct; we will cover some data analysis pitfalls in the next recipe.

In the Introducing Matplotlib for chart generation recipe, we will introduce Matplotlib, a chart library that provides the backend for pandas plotting. It is a fundamental component of Python’s data analysis ecosystem.

See also

The following is some extra information that may be useful:


Key benefits

  • Perform complex bioinformatics analysis using the most essential Python libraries and applications
  • Implement next-generation sequencing, metagenomics, automating analysis, population genetics, and much more
  • Explore various statistical and machine learning techniques for bioinformatics data analysis

Description

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data, and this book will show you how to manage these tasks using Python. This updated third edition of the Bioinformatics with Python Cookbook begins with a quick overview of the various tools and libraries in the Python ecosystem that will help you convert, analyze, and visualize biological datasets. Next, you'll cover key techniques for next-generation sequencing, single-cell analysis, genomics, metagenomics, population genetics, phylogenetics, and proteomics with the help of real-world examples. You'll learn how to work with important pipeline systems, such as Galaxy servers and Snakemake, and understand the various modules in Python for functional and asynchronous programming. This book will also help you explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks, including Dask and Spark. In addition to this, you’ll explore the application of machine learning algorithms in bioinformatics. By the end of this bioinformatics Python book, you'll be equipped with the knowledge you need to implement the latest programming techniques and frameworks, empowering you to deal with bioinformatics data on every scale.

Who is this book for?

This book is for bioinformatics analysts, data scientists, computational biologists, researchers, and Python developers who want to address intermediate-to-advanced biological and bioinformatics problems. Working knowledge of the Python programming language is expected. Basic knowledge of biology will also be helpful.

What you will learn

  • Become well-versed with data processing libraries such as NumPy, pandas, Arrow, and Zarr in the context of bioinformatic analysis
  • Interact with genomic databases
  • Solve real-world problems in the fields of population genetics, phylogenetics, and proteomics
  • Build bioinformatics pipelines using a Galaxy server and Snakemake
  • Work with functools and itertools for functional programming
  • Perform parallel processing with Dask on biological data
  • Explore principal component analysis (PCA) techniques with scikit-learn

Product Details

Publication date : Sep 27, 2022
Length: 360 pages
Edition : 3rd
Language : English
ISBN-13 : 9781803247724

Table of Contents

14 Chapters
Chapter 1: Python and the Surrounding Software Ecology
Chapter 2: Getting to Know NumPy, pandas, Arrow, and Matplotlib
Chapter 3: Next-Generation Sequencing
Chapter 4: Advanced NGS Data Processing
Chapter 5: Working with Genomes
Chapter 6: Population Genetics
Chapter 7: Phylogenetics
Chapter 8: Using the Protein Data Bank
Chapter 9: Bioinformatics Pipelines
Chapter 10: Machine Learning for Bioinformatics
Chapter 11: Parallel Processing with Dask and Zarr
Chapter 12: Functional Programming for Bioinformatics
Index
Other Books You May Enjoy

Customer reviews

Rating distribution
4 out of 5 (8 Ratings)
5 star 62.5%
4 star 12.5%
3 star 0%
2 star 12.5%
1 star 12.5%

Top Reviews

Paul Darby Oct 16, 2022
5 stars
If you are proficient with python and need a good reference book for bioinformatics. This book covers many of the important applications one may come across in bioinformatics. From basic NCBI I/O applications to NGS this book covers many of the topics with excellent code examples. The book covers several important topics in scientific programming like Machine Learning, NUMPY, PANDAS and DOCKER which are some core tools used in the data sciences.
Amazon Verified review
Seth Oct 17, 2022
5 stars
Disclaimer: I was sent a copy of this book to review. I have been working in the bioinformatics industry for 5+ years on all manner of bioinformatics problems. It's a shame this book didn't enter my life sooner. It's both a cookbook, and more than that. Each sections recipes build upon themselves in a cohesive and useful manner. I've found myself just working through a 5-page recipe per day as a method of self study, and gaining exposure to some of the niche facets of bioinformatics that we don't tackle day-to-day. All the tools referenced are up to date, and while the methods for doing things may not always fall into "best practices" they are all good foundations that someone could start with an build upon. I specifically enjoyed the sections at the end about processing data with Dask / Zarr, and the section on population genomics, both of which I was able to make use of in my own work. Overall, I'd recommend this book both to those that have just started down the bioinformatics path and need sample code to get going on tasks (but have a least beginner Python knowledge), and to those who have already been at this a while and just want to see some new and updated ways of doing things.
Amazon Verified review
Qirui Cui Sep 28, 2022
5 stars
There are many books explaining the need for bioinformatics using Python, its methodology, and the myriad designs and implementation pathways that can be taken. The missing book is one that covers all of these from start to finish in a complete, detailed, and comprehensive fashion. From justifying the project, gathering requirements, developing the bioinformatic architectural framework, designing the proper approach for NGS data, integrating the data, generating advanced analytics, dealing with “shadow systems,” understanding and dealing with organizational relationships, managing the full project life cycle, and finally creating centers of excellence—this book covers the entire gambit of creating a sustainable bioinformatic system in Python environment. Mr. Tiago Antao’s deep understanding of technical implementations is only matched by his understanding of the underpinning rudiments behind many of the decision points in the development of the bioinformatic components. These rudiments will help you determine the best deployment options for your specific situation—so invaluable in today’s confusing and mixed messages bioinformatic world! I highly recommend this book to anyone just starting out in bioinformatics using Python particularly, who has a legacy environment that needs renovating or just wants to understand the entire implementation picture from start to finish. Mr. Tiago Antao’s mastery of all the critical implementation activities means you are receiving the best advice for creating a world-class python environment for bioinformatics that will last for long haul. Nicely done, Mr. Tiago Antao.
Amazon Verified review
Jun, D. Oct 14, 2022
5 stars
A good entry-level book, covers quite a bit of the most popular packages, such as Biopython, scikit-learn, qiime, etc, covers sequencing, phylogenetics, metagenomics, etc., A good book to get familiar with the bioinformatics, it is especially good for one want to practice both python and bioinformatics, since it provide relatively good coverage for both.
Amazon Verified review
LadyGator Nov 12, 2022
5 stars
I was very pleased to receive a review copy of Tiago Antao’s latest edition of the Bioinformatics with Python Cookbook. As an instructor and researcher who works in a bioinformatics core at a university in Boston, I can highly recommend this book as a resource for instruction as well as a practical guide to everyday problems in bioinformatics. Some highlights of the book are the excellent practical exercises which walkthrough common tasks, such as downloading data from NCBI and constructing meaningful plots using matplotlib. I was impressed with the more advanced materials, such as how to access the Galaxy platform using the API and running workflows with snakemake. The book assumes some familiarity with Python code, but even a beginner can follow the logic of the exercises and examples. Many helpful links are provided to freely available resources on the topics that are discussed. I would highly recommend this manual to put on your office bookshelf if you teach or use Python to analyze bioinformatics data.
Amazon Verified review