Technologies for machine learning projects
In this section, we are going to learn about some of the most widely used technologies for machine learning and big data projects. It is important to understand how these technologies relate to each other: they are similar in some ways, but they differ in important ones.
We are going to look at the following:
- Apache Spark
- Databricks
- Azure Databricks
- MLlib
Apache Spark
Apache Spark and its resilient distributed dataset (RDD; more on this later) were introduced in 2012 to overcome the computing limitations of MapReduce. To understand what changed, we first have to look at how the MapReduce paradigm works and then compare it to Spark.
MapReduce essentially imposes a linear dataflow: it reads data from the disk, maps a function across the data (that is, the function is evaluated for each entry and a result is generated), reduces the mapped results, and stores the reduced data on the disk again. It is easy to...
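The linear dataflow described above can be sketched in a few lines of plain Python. This is only a single-process illustration, not the Hadoop or Spark API: the helper name run_linear_job, the file names, and the sum-of-squares task are all assumptions made for the example, and a real MapReduce job would also group results by key and distribute the work across machines.

```python
import os
import tempfile
from functools import reduce


def run_linear_job(input_path, output_path, map_fn, reduce_fn, initial):
    """One MapReduce-style pass (hypothetical helper): disk -> map -> reduce -> disk."""
    # 1. Read the input records from disk.
    with open(input_path) as f:
        records = [line.strip() for line in f if line.strip()]
    # 2. Map: evaluate the function once for each entry.
    mapped = [map_fn(record) for record in records]
    # 3. Reduce: fold the mapped results into a single value.
    result = reduce(reduce_fn, mapped, initial)
    # 4. Store the reduced result on disk again.
    with open(output_path, "w") as f:
        f.write(f"{result}\n")
    return result


# Illustrative input: four numbers, one per line, in a temporary directory.
workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.txt")
output_path = os.path.join(workdir, "output.txt")
with open(input_path, "w") as f:
    f.write("1\n2\n3\n4\n")

# Sum of squares: the map step squares each entry, the reduce step adds them up.
total = run_linear_job(
    input_path,
    output_path,
    map_fn=lambda record: int(record) ** 2,
    reduce_fn=lambda a, b: a + b,
    initial=0,
)
print(total)  # 30
```

In this sketch, the only thing left behind after the pass is the output file, so any follow-up pass over the result would have to begin by reading from disk again.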