Best Practices
Now that we have performed exercises to get you started with analytics, let's review some best practices for using Spark. While Spark provides significant performance improvements over Hadoop MapReduce, we need to follow a few guidelines to fully derive the value that Spark affords us:
- Use `collect` sparingly; `collect` will try to fetch all the elements of the dataset into the driver's memory. Validate that your dataset can fit into memory before you call `collect`; it is better to use `take(n)` so that you control the size of the result (see the first sketch after this list).
- `groupByKey` is not very efficient, as it involves shuffling a significant amount of data across the cluster; use `reduceByKey` instead, which aggregates values on each partition before the shuffle and so reduces the amount of data moved around (see the second sketch after this list).
- Use `filter` as a pre-processing step to clean up the dataset by dropping bad-quality records. `map` can be used as a pre-processing step to impute values for bad or missing data (see the third sketch after this list).
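
A minimal PySpark sketch of the first point, using a made-up RDD of numbers and an assumed application name; `take(n)` brings only a bounded sample back to the driver, whereas `collect` would pull every element:

```python
from pyspark.sql import SparkSession

# Hypothetical session and data, just to illustrate take() vs. collect().
spark = SparkSession.builder.appName("take-vs-collect").getOrCreate()
sc = spark.sparkContext

events = sc.parallelize(range(1_000_000))  # stand-in for a large dataset

# Risky on a large dataset: collect() ships every element to the driver.
# all_events = events.collect()

# Safer: pull a bounded sample and gauge the size first.
preview = events.take(10)   # only the first 10 elements reach the driver
total = events.count()      # cheap way to check whether collect() is sane
print(preview, total)
```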
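
The second point, sketched with a toy word-count-style RDD (the pairs are illustrative): `reduceByKey` combines values within each partition before the shuffle, while `groupByKey` ships every value across the network first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce-vs-group").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)])

# groupByKey shuffles every (key, value) pair before aggregating:
# counts = pairs.groupByKey().mapValues(len)

# reduceByKey pre-aggregates on each partition, so far less data is shuffled.
counts = pairs.reduceByKey(lambda x, y: x + y)
print(counts.collect())  # the result is tiny, so collect() is safe here
```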
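
Finally, a sketch of the third point, using hypothetical sensor readings where `None` marks a missing value and negative readings are assumed to be bad quality; `filter` drops such records, while `map` imputes an assumed placeholder instead:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean-and-impute").getOrCreate()
sc = spark.sparkContext

# Hypothetical (sensor_id, reading) records.
raw = sc.parallelize([("s1", 21.5), ("s2", None), ("s3", -999.0), ("s4", 19.8)])

def is_good(record):
    _, reading = record
    return reading is not None and reading >= 0

# filter: drop bad-quality records outright.
cleaned = raw.filter(is_good)

# map: keep every record but impute an assumed default for bad readings.
DEFAULT_READING = 20.0
imputed = raw.map(lambda r: r if is_good(r) else (r[0], DEFAULT_READING))

print(cleaned.collect())
print(imputed.collect())
```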