Processing Enron emails with serverless MapReduce
I've based our example application on the Enron email corpus, which is publicly available on Kaggle. The corpus consists of some 500,000 emails from the Enron corporation and totals approximately 1.5 GB. Our task is to count emails by From-To pair. That is, for each person who sent an email, we will count the number of times they sent to each particular recipient.
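To make the goal concrete before any serverless machinery is involved, here is a minimal single-process sketch of the From-To count. It assumes each message is available as raw text with standard From and To headers; the function names from_to_pairs and count_from_to are hypothetical and chosen only for illustration, and lowercasing the addresses is my own simplification.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def from_to_pairs(raw_message):
    """Yield one (sender, recipient) pair per From/To combination in a message."""
    msg = message_from_string(raw_message)
    # getaddresses splits comma-separated header values into (name, address) tuples.
    senders = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []))
    for _, sender in senders:
        for _, recipient in recipients:
            if sender and recipient:
                # Assumption: addresses are compared case-insensitively.
                yield sender.lower(), recipient.lower()


def count_from_to(raw_messages):
    """Count how many times each sender emailed each recipient."""
    counts = Counter()
    for raw in raw_messages:
        counts.update(from_to_pairs(raw))
    return counts


if __name__ == "__main__":
    sample = [
        "From: alice@enron.com\nTo: bob@enron.com, carol@enron.com\n\nHello",
        "From: alice@enron.com\nTo: bob@enron.com\n\nHello again",
    ]
    for (sender, recipient), n in count_from_to(sample).items():
        print(sender, "->", recipient, n)
```

An email addressed to several recipients contributes one count to each sender-recipient pair. The MapReduce version distributes this same counting across many workers and then merges their partial results.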
Note
Anyone may download and work with this dataset: https://www.kaggle.com/wcukierski/enron-email-dataset. The original data from Kaggle comes as a single file in CSV format. To make this data work with this example MapReduce program, I broke the single ~1.4 GB file into roughly 100 MB chunks. During this example, it's important to remember that we are starting from 14 separate files on S3.
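The chunking step can be approximated along the following lines. This is only a sketch and not necessarily how I produced the files: the function name split_csv, the output naming scheme, and the choice to repeat the header row in every chunk are all assumptions made for illustration. It uses the csv module instead of splitting on raw lines because a single email body can span many lines inside one CSV field.

```python
import csv


def split_csv(src_path, dest_prefix, max_bytes=100 * 1024 * 1024):
    """Split a CSV into chunk files of roughly max_bytes each; return their paths."""
    # Email bodies can exceed the csv module's default field size limit.
    # Note that this raises the limit for the whole process.
    csv.field_size_limit(2**31 - 1)
    paths = []
    out = None
    writer = None
    written = 0
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return paths  # empty input, nothing to split
        for row in reader:
            if out is None or written >= max_bytes:
                if out is not None:
                    out.close()
                path = f"{dest_prefix}-{len(paths):03d}.csv"
                out = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                # Assumption: every chunk repeats the header row.
                writer.writerow(header)
                paths.append(path)
                written = 0
            writer.writerow(row)
            # Character count is used as a rough stand-in for bytes on disk.
            written += sum(len(field) for field in row)
    if out is not None:
        out.close()
    return paths
```

Because a new chunk is started only between rows, no email is ever split across two files. Each resulting file would then be uploaded to S3 as a separate object.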
The data format in our dataset is a CSV with two columns, the first being the email message location (on the mail server, presumably) and the second...