Packt+ | Advance your knowledge in tech

You're reading from Bioinformatics with Python Cookbook Learn how to use modern Python bioinformatics libraries and applications to do cutting-edge research in computational biology

Product type Paperback

Published in Nov 2018

Publisher Packt

ISBN-13 9781789344691

Length 360 pages

Edition 2nd Edition

Languages

Python

Tools

Biopython

Concepts

Bioinformatics

Author (1):

Tiago Antao

View More author details

Python can be run on top of different environments. For instance, you can use Python inside the Java Virtual Machine (JVM) (via Jython) or with .NET (with IronPython). However, here, we are concerned not only with Python, but also with the complete software ecology around it; therefore, we will use the standard (CPython) implementation, since the JVM and .NET versions exist mostly to interact with the native libraries of these platforms. A potentially viable alternative would be to use the PyPy implementation of Python (not to be confused with Python Package Index (PyPI).

Save for noted exceptions, we will be using Python 3 only. If you were starting with Python and bioinformatics, any operating system will work, but here, we are mostly concerned with intermediate to advanced usage. So, while you can probably use Windows and macOS, most heavy-duty analysis will be done on Linux (probably on a Linux cluster). Next-generation sequencing (NGS) data analysis and complex machine learning is mostly performed on Linux clusters.

If you are on Windows, you should consider upgrading to Linux for your bioinformatics work because most modern bioinformatics software will not run on Windows. macOS will be fine for almost all analyses, unless you plan to use a computer cluster, which will probably be Linux-based.

If you are on Windows or macOS and do not have easy access to Linux, don't worry. Modern virtualization software (such as VirtualBox and Docker) will come to your rescue, which will allow you to install a virtual Linux on your operating system. If you are working with Windows and decide that you want to go native and not use Anaconda, be careful with your choice of libraries; you are probably safer if you install the 32-bit version for everything (including Python itself).

If you are on Windows, many tools will be unavailable to you.

Bioinformatics and data science are moving at breakneck speed; this is not just hype, it's a reality. When installing software libraries, choosing a version might be tricky. Depending on the code that you have, it might not work with some old versions, or maybe not even work with a newer version. Hopefully, any code that you use will indicate the correct dependencies—though this is not guaranteed.

The software developed for this book is available at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition. To access it, you will need to install Git. Alternatively, you can download the ZIP file that GitHub makes available (indeed, getting used to Git may be a good idea because lots of scientific computing software is being developed with it).

Before you install the Python stack properly, you will need to install all the external non-Python software that you will be interoperating with. The list will vary from chapter to chapter, and all chapter-specific packages will be explained in their respective chapters. Some less common Python libraries may also be referred to in their specific chapters. Fortunately, since the first edition of this book, most bioinformatics software can be easily installed with conda using the Bioconda project.

If you are not interested in a specific chapter, you can skip the related packages and libraries. Of course, you will probably have many other bioinformatics applications around—such as Burrows-Wheeler Aligner (bwa) or Genome Analysis Toolkit (GATK) for NGS—but we will not discuss these because we do not interact with them directly (although we might interact with their outputs).

You will need to install some development compilers and libraries, all of which are free. On Ubuntu, consider installing the build-essential package (apt-get it), and on macOS, consider Xcode (https://developer.apple.com/xcode/).

In the following table, you will find a list of the most important Python software:

Name	Application	URL	Purpose
Project Jupyter	All chapters	https://jupyter.org/	Interactive computing
pandas	All chapters	https://pandas.pydata.org/	Data processing
NumPy	All chapters	http://www.numpy.org/	Array/matrix processing
SciPy	All chapters	https://www.scipy.org/	Scientific computing
Biopython	All chapters	https://biopython.org/	Bioinformatics library
PyVCF	NGS	https://pyvcf.readthedocs.io	VCF processing
Pysam	NGS	https://github.com/pysam-developers/pysam	SAM/BAM processing
HTSeq	NGS/Genomes	https://htseq.readthedocs.io	NGS processing
simuPOP	Population genetics	http://simupop.sourceforge.net/	Population genetics simulation
DendroPY	Phylogenetics	https://dendropy.org/	Phylogenetics
scikit-learn	Machine learning/population genetics	http://scikit-learn.org	Machine learning library
PyMol	Proteomics	https://pymol.org	Molecular visualization
rpy2	Introduction	https://rpy2.readthedocs.io	R interface
seaborn	All chapters	http://seaborn.pydata.org/	Statistical chart library
Cython	Big data	http://cython.org/	High performance
Numba	Big data	https://numba.pydata.org/	High performance
Dask	Big data	http://dask.pydata.org	Parallel processing

We have taken a somewhat conservative approach in most of the recipes with regard to the processing of tabled data. While we use pandas every now and then, most of the time, we use standard Python. As time advances and pandas becomes more pervasive, it will probably make sense to just process all tabular data with it (if it fits in-memory).

Take a look at the following steps to get started:

Start by downloading the Anaconda distribution from https://www.anaconda.com/download. Choose Python version 3. In any case, this is not fundamental, because Anaconda will let you use Python 2 if you need it. You can accept all the installation defaults, but you may want to make sure that the conda binaries are in your path (do not forget to open a new window so that the path is updated). If you have another Python distribution, be careful with your PYTHONPATH and existing Python libraries. It's probably better to unset your PYTHONPATH. As much as possible, uninstall all other Python versions and installed Python libraries.
Let's go ahead with the libraries. We will now create a new conda environment called bioinformatics with biopython=1.70, as shown in the following command:

conda create -n bioinformatics biopython biopython=1.70

Let's activate the environment, as follows:

source activate bioinformatics

Let's add the bioconda and conda-forge channel to our source list:

conda config --add channels bioconda
conda config --add channels conda-forge

Also, install the core packages:

conda install scipy matplotlib jupyter-notebook pip pandas cython numba scikit-learn seaborn pysam pyvcf simuPOP dendropy rpy2

Some of them will probably be installed with the core distribution anyway.

We can even install R from conda:

conda install r-essentials r-gridextra

r-essentials installs a lot of R packages, including ggplot2, which we will use later. We also install r-gridextra, since we will be using it in the Notebook.