Learning about memory cartography
One draw of distributed systems such as Apache Spark is the ability to process very large datasets quickly. Sometimes, the dataset is so large that it can’t even fit entirely in memory on a single machine! In that case, a distributed system that breaks the data into chunks and processes them in parallel becomes a necessity, since no individual machine could load the whole dataset into memory at once to operate on it. But what if you could process a huge, multiple-GB file while using almost no RAM at all? That’s where memory mapping comes in.
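Before we bring Arrow into the picture, here’s the general idea as a minimal sketch using Python’s standard mmap module; the file name is just a placeholder for any large file on disk:

```python
import mmap

# The file name here is only a placeholder for a large file you have on disk.
with open("some_huge_file.csv", "rb") as f:
    # length=0 maps the entire file; ACCESS_READ keeps the mapping read-only.
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
        # Nothing has been copied into our process yet. Slicing only faults
        # in the OS pages that cover the requested byte range.
        first_kb = mm[:1024]
        print(first_kb[:80])
```

The operating system maps the file into our virtual address space and lazily pulls pages into physical RAM only when they are touched, which is what lets us work with a file far larger than the memory we actually use.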
Let’s look to our NYC Taxi dataset once again for help with demonstrating this concept. If we download the yellow taxi data for January 2015, the file we get is named yellow_tripdata_2015-01.parquet and is approximately 168 MB in size. Converted to a CSV file, it comes out to around 1.3 GB, which makes it a perfect example. For brevity, we’ll use the Python Arrow library...
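As a rough sketch of that setup, assuming pyarrow is installed and the Parquet file is in the current directory (the CSV file name is my own choice, and the exact code in the full example may differ), the conversion and the memory mapping could look like this:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as pacsv

# One-time conversion: read the Parquet file (~168 MB) and write it back out
# as a CSV file (~1.3 GB on disk). This step does hold the table in memory.
table = pq.read_table("yellow_tripdata_2015-01.parquet")
pacsv.write_csv(table, "yellow_tripdata_2015-01.csv")
del table  # release the in-memory table; we only need the file on disk now

# Memory-map the CSV: the OS maps the file into our virtual address space,
# but pages are only brought into physical RAM as they are actually touched.
mmapped = pa.memory_map("yellow_tripdata_2015-01.csv")
print(mmapped.size())              # size of the mapped file in bytes
print(pa.total_allocated_bytes())  # Arrow has allocated next to nothing
```

Because the mapping is lazy, even reading a few bytes from the middle of the 1.3 GB file only faults in the handful of pages that cover that range, which is exactly the behavior we want to exploit here.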