Recent advances in cluster computing, coupled with the rise of big data, have pushed machine learning to the forefront of computing. An interactive platform that enables data science at scale, long a goal of the field, is now a reality.
The following three areas together have enabled and accelerated interactive data science at scale:
- Apache Spark: A unified technology platform for data science that combines a fast compute engine and fault-tolerant data structures into a well-designed and integrated offering
- Machine learning: A field of artificial intelligence that enables machines to mimic some of the tasks originally reserved exclusively for the human brain
- Scala: A modern JVM-based language that builds on traditional languages, uniting functional and object-oriented concepts without their verbosity
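To make the last point concrete, here is a minimal sketch (the `Measurement` class and values are purely illustrative, not from the book) of how Scala blends the two styles: an immutable case class on the object-oriented side, processed by a pipeline of higher-order functions on the functional side.

```scala
// An immutable data type: one line of OO Scala, no getter/setter boilerplate
case class Measurement(label: String, value: Double)

object ScalaStyleDemo {
  def main(args: Array[String]): Unit = {
    val readings = List(
      Measurement("a", 1.5),
      Measurement("b", -2.0),
      Measurement("c", 3.5)
    )

    // A functional pipeline: filter, project, and fold without mutation
    val positiveSum = readings
      .filter(_.value > 0) // keep only positive readings
      .map(_.value)        // extract the numeric value
      .sum                 // reduce to a single total

    println(positiveSum)   // prints 5.0
  }
}
```

The same logic in a more traditional JVM language would typically require an explicit class with accessors and a mutable accumulator loop.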
First, we need to set up the development environment, which will consist of the following components:
- Spark
- IntelliJ community edition IDE
- Scala
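When the project is built with sbt inside IntelliJ, the three components above typically come together in a single build file. The following is a minimal sketch only; the project name and the Scala and Spark version numbers are assumptions and should match whatever versions you actually install.

```scala
// build.sbt -- a minimal sketch; versions shown are illustrative
name := "spark-ml-setup"

version := "0.1"

// Must be a Scala version that your chosen Spark release was built against
scalaVersion := "2.11.12"

// Spark core, SQL, and the ML library used by the recipes
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.4.8",
  "org.apache.spark" %% "spark-sql"   % "2.4.8",
  "org.apache.spark" %% "spark-mllib" % "2.4.8"
)
```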
The recipes in this chapter give you detailed instructions for installing and configuring the IntelliJ IDE, the Scala plugin, and Spark. Once the development environment is set up, we'll run one of the Spark ML code samples to verify the setup.
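Such a verification run can be as small as the sketch below. This is a hypothetical smoke test, not one of the book's recipes: it assumes Spark 2.x and its ML library are on the classpath, and the object name, app name, and toy data are all illustrative. It fits a trivial linear regression locally to confirm that Spark starts and MLlib is reachable from the IDE.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object MyFirstSpark extends App {
  // Run Spark locally, using all available cores, inside the IDE
  val spark = SparkSession.builder
    .appName("MyFirstSpark")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // Three points on the line y = 2x; the fitted slope should be close to 2
  val training = Seq(
    (2.0, Vectors.dense(1.0)),
    (4.0, Vectors.dense(2.0)),
    (6.0, Vectors.dense(3.0))
  ).toDF("label", "features")

  val model = new LinearRegression().fit(training)
  println(s"coefficients: ${model.coefficients}")

  spark.stop()
}
```

If the program prints a coefficient vector without errors, the IDE, Scala, and Spark are all wired up correctly and you are ready for the recipes that follow.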