Handling skews in data
A data skew is an extreme, uneven distribution of data in a dataset. Consider the number of trips per month in our Imaginary Airport Cab (IAC) example. Let's assume the data is distributed as shown in the following graph:
As you can see from the graph, the trip numbers for November and December are much higher than those for the other months. Such an uneven distribution of data is referred to as a data skew. Now, if we were to distribute the monthly data to individual compute nodes, the nodes processing the data for November and December would take a lot more time than the ones processing the other months. And if we were generating an annual report, all the other stages would have to wait for the November and December stages to complete. Such wait times hurt job performance. To make the processing more efficient, we will have...
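To make the cost of the skew concrete, here is a minimal Python sketch of the situation described above. The monthly trip counts are invented for illustration (they are not the values from the graph), and the sketch assumes that each month is processed on its own node and that processing time is roughly proportional to the number of rows. Under those assumptions, the job finishes only when the largest partition finishes, so the ratio of the largest partition to the average partition gives a rough measure of how much time is lost to waiting.

```python
# Hypothetical trip counts per month for the IAC example.
# These numbers are made up; only the shape (high Nov/Dec) follows the text.
trips_per_month = {
    "Jan": 1000, "Feb": 1000, "Mar": 1000, "Apr": 1000,
    "May": 1000, "Jun": 1000, "Jul": 1000, "Aug": 1000,
    "Sep": 1000, "Oct": 1000, "Nov": 4000, "Dec": 6000,
}


def partition_stats(rows_per_partition):
    """Return (largest partition size, mean partition size, skew ratio).

    Assumes one partition per node and processing time proportional to rows,
    so the largest partition determines when the whole stage completes.
    """
    sizes = list(rows_per_partition.values())
    largest = max(sizes)
    mean = sum(sizes) / len(sizes)
    return largest, mean, largest / mean


largest, mean, skew_ratio = partition_stats(trips_per_month)
slowest_month = max(trips_per_month, key=trips_per_month.get)

print(f"Slowest partition: {slowest_month} with {largest} trips")
print(f"Average partition size: {mean:.0f} trips")
print(f"Skew ratio (largest / average): {skew_ratio:.1f}")
```

A skew ratio close to 1 means the partitions are balanced. The further the ratio rises above 1, the longer the other nodes sit idle waiting for the largest partition to finish.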