Understanding data skewing, indexing, and partitioning
As with any data processing system, even the best hardware will produce only mediocre results when the data is laid out poorly. There is no magic bullet for a poor data layout. The fastest disks, processors, and networks will not remove the need for well-thought-out indexing and partitioning strategies. Data skew can sneak into processing pipelines or queries and bring them to a crawl. All three aspects (skew, indexing, and partitioning) need to be planned for and monitored so that processing and query performance do not degrade. We’ll learn more about them in the following sections.
Data skew
Data skew is a common problem in distributed data systems such as Apache Spark. It shows up when some partitions are significantly larger than others: most tasks finish quickly and then sit idle while a few oversized tasks run to completion. This can result in under-utilized compute, long processing times, and out-of-memory errors. Joins...
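To make the idea of uneven partitions concrete, here is a minimal sketch in plain Python rather than Spark. It assumes simple hash partitioning and a made-up dataset in which one customer key owns most of the rows. The function names (partition_sizes, skew_ratio), the key names, and the choice of crc32 as the hash are illustrative assumptions, not how any particular engine assigns rows to partitions.

```python
import zlib
from statistics import mean


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition under simple hash partitioning."""
    sizes = [0] * num_partitions
    for key in keys:
        # Assumption: crc32 stands in for the engine's hash function so that
        # results are stable across runs (Python's built-in hash() is salted).
        sizes[zlib.crc32(str(key).encode()) % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean size; 1.0 means perfectly even."""
    return max(sizes) / mean(sizes)


# Hypothetical data: one "hot" key owns 9,000 of 10,000 rows.
skewed_keys = ["customer_0"] * 9_000 + [f"customer_{i}" for i in range(1, 1_001)]
# Hypothetical data: 10,000 rows, each with a distinct key.
uniform_keys = [f"customer_{i}" for i in range(10_000)]

skewed_sizes = partition_sizes(skewed_keys, 8)
uniform_sizes = partition_sizes(uniform_keys, 8)

print("skewed sizes: ", skewed_sizes, "ratio:", round(skew_ratio(skewed_sizes), 2))
print("uniform sizes:", uniform_sizes, "ratio:", round(skew_ratio(uniform_sizes), 2))
```

In the skewed case, every row for the hot key hashes to the same partition, so one task ends up with at least 9,000 of the 10,000 rows while the other seven share the remainder. A ratio well above 1 is a rough signal that a few tasks will dominate the runtime; what threshold counts as a problem depends on the workload.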