Summary
In this chapter, we looked at ways of handling very large amounts of data with specialized tools such as Apache Hadoop MapReduce and Apache Spark. We saw how to use them to process Common Crawl, a publicly available copy of a large part of the Internet, and to calculate some useful statistics from it. Finally, we created a link prediction model for recommending coauthors and trained an XGBoost model in a distributed way.
In the next chapter, we will look at how data science models can be deployed to production systems.