Summary
The most important thing to keep in mind when working with large datasets and matplotlib is to use data wisely and to take advantage of NumPy and tools such as PyTables. Moving to distributed data takes on a significant infrastructure burden compared to working with data on a single machine. As datasets approach terabytes and petabytes, the bulk of the work has less to do with plotting and visualization and more to do with deciding what to visualize and how to actually get there. An increasingly common aspect of big data is real-time analysis, where matplotlib might be used to generate hundreds or thousands of plots of fairly small sets of data points. Not all problems in big data visualization are about visualizing big data!
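One common pattern for using data wisely is to reduce a large array with NumPy before any of it reaches matplotlib. The sketch below uses a hypothetical `block_average` helper and randomly generated data as a stand-in for a real dataset; it is an illustration of the idea, not a prescription:

```python
import numpy as np

def block_average(data, block_size):
    """Average `data` in non-overlapping blocks of `block_size` samples."""
    n = len(data) // block_size * block_size  # drop the ragged tail
    return data[:n].reshape(-1, block_size).mean(axis=1)

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000_000)      # stand-in for a large dataset
reduced = block_average(samples, 10_000)   # 10,000,000 points -> 1,000 points

# Only the small, reduced array is handed to matplotlib:
# import matplotlib.pyplot as plt
# plt.plot(reduced)
# plt.savefig("summary.png")
```

Because the reduction happens entirely in NumPy, the plotting step stays fast and memory-light regardless of the raw dataset's size; the same shape of pipeline applies when the data lives on disk in PyTables rather than in memory.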
Finally, it cannot be overstated that knowing your data is the most crucial part of tackling large datasets. It is very rare that an entire raw dataset is what you want to present to your colleagues or end users. Gaining...