Introduction
In this chapter, we will discuss high-performance computing techniques for large computational biology datasets. We will talk about efficient data storage, code parallelization, running software in clusters, and code optimization. We will try to avoid tying any solution to a specific proprietary technology (for example, Amazon EC2) and will instead design code that is applicable in a wide range of scenarios.
The previous edition of this book had some recipes that compared lazy and eager data structures. This made sense, as many Python 2 built-ins (such as range, map, and zip) were eager, whereas their Python 3 counterparts are lazy. As Python 2 is behind us, that content has been dropped. That being said, make sure that you understand the difference and that your code is mostly lazy. In particular, be sure to learn about generators in Python and use them.
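To make the distinction between eager and lazy evaluation concrete, here is a minimal sketch. The function names and the tiny list of sequences are hypothetical and only serve as illustration; in practice, the input would typically be a file handle or another large iterable.

```python
import itertools


def lengths_eager(lines):
    """Eager: builds the complete list in memory before returning."""
    return [len(line.strip()) for line in lines]


def lengths_lazy(lines):
    """Lazy: a generator that computes one length at a time, on demand."""
    for line in lines:
        yield len(line.strip())


# Hypothetical input; any iterable of sequence strings would work
sequences = ["ACGT\n", "GATTACA\n", "CC\n"]

eager = lengths_eager(sequences)  # all values are computed immediately
lazy = lengths_lazy(sequences)    # nothing is computed yet

# Consume only the first two values; the third is never computed here
first_two = list(itertools.islice(lazy, 2))

print(eager)
print(first_two)
```

With the eager version, memory usage grows with the size of the input, whereas the generator only ever holds one item at a time. This is why lazy code tends to scale better to large datasets, at the cost of the results being consumable only once.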
As dataset sizes are constantly increasing, in this edition we cannot avoid discussing the efficient storage of bioinformatics data, and so we will discuss the Hierarchical Data Format (HDF5) and...