Placing tasks next to the data
The ability to run a task on a specific member becomes much more useful when combined with data affinity. If a task is going to interact with distributed data held within the cluster, it is best to execute the task close to where that data actually resides. This reduces the task's latency by avoiding the network cost of retrieving the required data from other nodes across the cluster before processing can begin. By making the task PartitionAware, we can return the key with which the task is going to interact. From this key, the partition it belongs to is determined, and hence the member node that holds the data. The task is then automatically submitted to that node, minimizing the network latency involved in obtaining or manipulating the data.
If we are just interacting with a single data item, another simpler way is to use...
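To make the PartitionAware routing described above more concrete, here is a minimal sketch of the chain from key to partition to owning member. It does not use the Hazelcast library: the PartitionAware interface below is a local stand-in that only mirrors the shape of the real one (a single getPartitionKey() method), and CustomerLookupTask, ToyRouter, the member names, and the hashing scheme are all hypothetical, chosen purely for illustration. A real cluster uses its own hashing and partition table, so treat this as a picture of the idea rather than of the actual implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;

// Local stand-in mirroring the shape of Hazelcast's PartitionAware interface.
// Declared here only so that the sketch compiles without the library.
interface PartitionAware<K> {
    K getPartitionKey();
}

// Hypothetical task that declares which key it is going to interact with.
class CustomerLookupTask implements Callable<String>, PartitionAware<String> {
    private final String customerId;

    CustomerLookupTask(String customerId) {
        this.customerId = customerId;
    }

    @Override
    public String getPartitionKey() {
        // The key whose owning member should run this task.
        return customerId;
    }

    @Override
    public String call() {
        // In a real cluster, this is where the locally held data would be read.
        return "processed " + customerId;
    }
}

// Toy router: key -> partition -> member. An assumption for illustration,
// not Hazelcast's real partitioning algorithm.
class ToyRouter {
    private final List<String> members;
    private final int partitionCount;

    ToyRouter(List<String> members, int partitionCount) {
        this.members = members;
        this.partitionCount = partitionCount;
    }

    int partitionOf(Object key) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    String ownerOf(int partitionId) {
        // Partitions are spread round-robin over the members in this toy model.
        return members.get(partitionId % members.size());
    }

    String route(PartitionAware<?> task) {
        return ownerOf(partitionOf(task.getPartitionKey()));
    }
}

class AffinityDemo {
    public static void main(String[] args) {
        ToyRouter router = new ToyRouter(
                Arrays.asList("member-A", "member-B", "member-C"), 271);
        CustomerLookupTask task = new CustomerLookupTask("customer-42");

        int partition = router.partitionOf(task.getPartitionKey());
        System.out.println("partition " + partition + " is owned by " + router.route(task));
        System.out.println(task.call());
    }
}
```

Two tasks that return the same key always resolve to the same partition, and therefore to the same member, which is what allows the task to be placed next to the data it needs. With the real library, the task class would implement Hazelcast's own PartitionAware interface and be submitted through the distributed executor service instead of a hand-written router.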