Improving the performance of the Spark job
In the previous recipe, we wrote a simple Spark job that filters out invalid geolocations and pushes the valid ones into a Kafka topic. In this recipe, we will look at ways to improve the performance of that job.
How to do it...
There are several ways in which you can improve the performance of your Spark job. Spark provides many configuration settings that can be tweaked to achieve the desired performance. For example, based on the volume of data that your topic receives, you could change the batch duration of your stream. Deploying your Spark job on a Mesos or YARN cluster also opens up many opportunities for performance improvement. In fact, running your Spark job in local standalone mode will not help you assess its performance; the real test for a Spark job is when it is executed on a cluster. Each Spark job requires a certain amount of resources for execution, be it CPU or memory.
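As a minimal sketch of the two tuning knobs mentioned above, the following Scala snippet shows how a batch duration and executor resources can be set when the streaming context is created. The application name and the specific values are placeholders, not settings from the original job; the batch interval in particular should be tuned to the rate at which your topic receives data:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "GeoLocationJob" and the values below are illustrative placeholders.
val conf = new SparkConf()
  .setAppName("GeoLocationJob")
  // Executor resource settings take effect when the job is submitted
  // to a YARN or Mesos cluster; they are ignored in local mode.
  .set("spark.executor.memory", "2g")
  .set("spark.executor.cores", "2")

// The batch duration controls how often the streaming job processes
// a new micro-batch of records from the Kafka topic.
val ssc = new StreamingContext(conf, Seconds(5))

A shorter batch duration lowers end-to-end latency but gives each micro-batch less time to finish; if processing time regularly exceeds the batch interval, batches queue up and the job falls behind, so the interval should be chosen so that each batch completes comfortably within it.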
Earlier in the book...