References
- Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia, O'Reilly, provides a much more complete introduction to Spark than this chapter can provide. I thoroughly recommend it.
- If you are interested in learning more about information theory, I recommend David MacKay's book Information Theory, Inference, and Learning Algorithms.
- Information Retrieval, by Manning, Raghavan, and Schütze, describes how to analyze textual data (including lemmatization and stemming). An online
- On the Ling-Spam dataset, and how to analyze it: http://www.aueb.gr/users/ion/docs/ir_memory_based_antispam_filtering.pdf.
- This blog post delves into the Spark Web UI in more detail: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html.
- This blog post, by Sandy Ryza, is the first in a two-part series discussing Spark internals, and how to leverage them to improve performance: http://blog.cloudera.com/blog/2015/03/how-to-tune...