Performance optimizations
In addition to partitioning, there are other aspects of how an Apex application is deployed on the cluster that are under the control of the application writer which can be used to further improve performance.
For example, consider consecutive operators opA
and opB
in a DAG. If the former generates a large volume of data on its output port and the latter performs some sort of filtering or aggregation operation so that the volume of data leaving its output port is considerably diminished, it may make sense to co-locate them in the same node to conserve network bandwidth; this is called NODE_LOCAL
locality.
Additionally, tuple serialization and deserialization overhead (which can be considerable in some cases) can be eliminated if they could be co-located within the same container; this is called CONTAINER_LOCAL
locality. The following figure shows different options of co-location of two operators:
These co-locations can be achieved by setting the locality of the appropriate...