Storing data with PyTables
Hierarchical Data Format (HDF) is a specification and technology for storing large numerical datasets. HDF was created in the supercomputing community and is now an open standard. The latest version, HDF5, is the one we will be using. HDF5 structures data in groups and datasets. Datasets are multidimensional homogeneous arrays. Groups can contain other groups or datasets, much like directories in a hierarchical filesystem.
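To make the group and dataset structure concrete, the following is a minimal sketch using PyTables. The file name demo.h5, the group name measurements, and the dataset name readings are hypothetical and chosen only for illustration; the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np
import tables


def build_demo_file(path):
    """Create a small HDF5 file with one group holding one dataset."""
    with tables.open_file(path, mode="w") as h5file:
        # A group plays the role of a directory (hypothetical name).
        group = h5file.create_group("/", "measurements", "Hypothetical group")
        # A dataset is a homogeneous multidimensional array (hypothetical name).
        data = np.arange(12, dtype=np.float64).reshape(3, 4)
        h5file.create_array(group, "readings", data, "Hypothetical dataset")


def read_demo_file(path):
    """Return the paths of all nodes in the file and the stored array."""
    with tables.open_file(path, mode="r") as h5file:
        node_paths = [node._v_pathname for node in h5file.walk_nodes("/")]
        readings = h5file.root.measurements.readings.read()
    return node_paths, readings


path = os.path.join(tempfile.mkdtemp(), "demo.h5")
build_demo_file(path)
node_paths, readings = read_demo_file(path)
print(node_paths)
print(readings.shape)
```

The node paths read like filesystem paths, with the dataset nested under its group, which is the directory analogy described above.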
The two main HDF5 Python libraries are as follows:
h5py
PyTables
In this example, we will be using PyTables. PyTables has a number of dependencies:
The NumPy package, which we installed in Chapter 1, Getting Started with Python Libraries
The numexpr package, which claims to evaluate multiple-operator array expressions many times faster than NumPy can
HDF5
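As a rough illustration of what a multiple-operator array expression looks like for the numexpr dependency listed above, the sketch below evaluates the same expression with numexpr and with plain NumPy and checks that the results agree. The array names and the expression are made up for this example, and no timing is shown because any speedup depends on the machine and the array sizes.

```python
import numpy as np
import numexpr as ne

# Two hypothetical input arrays.
a = np.arange(1_000_000, dtype=np.float64)
b = np.linspace(0.0, 1.0, 1_000_000)

# numexpr receives the whole expression as a string and evaluates it
# in one pass, passing the operands explicitly through local_dict.
result_ne = ne.evaluate("2 * a + 3 * b", local_dict={"a": a, "b": b})

# The same expression in plain NumPy, which builds intermediate arrays.
result_np = 2 * a + 3 * b

print(np.allclose(result_ne, result_np))
```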
Note
The parallel version of HDF5 also requires MPI. HDF5 can be installed by obtaining a distribution from http://www.hdfgroup.org/HDF5/release/obtain5.html and running the...