Summary
Deep learning models perform better when the training dataset is large (big data), but training on big data is computationally expensive. This problem can be handled with a divide-and-conquer approach: distribute the heavy computation across many machines in a cluster, in other words, distributed AI.
One way of achieving this is Google's distributed TensorFlow, an API that distributes model training among the worker machines in the cluster. However, you must specify the address of every worker machine and parameter server yourself, which makes scaling the model difficult and cumbersome.
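For concreteness, here is a minimal sketch of that manual configuration using the TensorFlow 1.x distributed API; the host:port addresses are placeholders, not real machines:

    # Sketch of manual cluster configuration in distributed TensorFlow (1.x).
    # Every worker and parameter server address must be listed by hand,
    # and updated whenever the cluster changes.
    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222",
                   "worker1.example.com:2222"],
    })

    # Each machine then starts its own server with its job name and task index.
    server = tf.train.Server(cluster, job_name="worker", task_index=0)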
The TensorFlowOnSpark API solves this problem. With minimal changes to existing TensorFlow code, we can run it on the cluster; the Spark framework handles distributing the work among the master and the executor machines, shielding the user from those details and scaling better.
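As a rough sketch of how little the driver code changes (the function name, application name, and executor counts below are illustrative assumptions, not values from this chapter), the Spark driver simply wraps the existing training function with TFCluster.run and lets Spark place the workers and parameter servers:

    # Sketch of a TensorFlowOnSpark driver; names and sizes are illustrative.
    from pyspark import SparkConf, SparkContext
    from tensorflowonspark import TFCluster

    def main_fun(args, ctx):
        # The preexisting TensorFlow training code goes here, nearly unchanged;
        # ctx supplies the cluster spec, job name, and task index automatically.
        pass

    sc = SparkContext(conf=SparkConf().setAppName("tfos_example"))

    # Spark assigns executors as workers and parameter servers; no addresses
    # are hard-coded, so scaling is a matter of changing num_executors.
    cluster = TFCluster.run(sc, main_fun, None, num_executors=4, num_ps=1,
                            tensorboard=False,
                            input_mode=TFCluster.InputMode.TENSORFLOW)
    cluster.shutdown()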
In this chapter, the TensorFlowOnSpark...