Many of us have worked on big data projects, using a wide range of frameworks and tools to solve customer problems. Bringing the data into distributed storage is the first step of data processing: in both Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT), the first step is to extract the data and bring it in for processing. A storage system has a cost associated with it, and we always want to store more data in less storage space. Big data processing happens over massive amounts of data, which can cause I/O and network bottlenecks. Shuffling data across the network is a painful, time-consuming process that burns a significant amount of processing time.
Here is how compression can help us in different ways:
- Less storage: A storage system comes with a significant cost, and compressed data occupies less space, so we can store more data for the same cost.
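As a rough illustration of the storage saving (a hypothetical sketch, not tied to any specific framework), the following snippet compresses a block of repetitive data, such as log lines or columnar records, with Python's standard `gzip` module and compares the sizes:

```python
import gzip

# Repetitive data, similar to log lines or columnar records,
# which typically compress very well.
raw = b"2023-01-01 INFO request served in 12ms\n" * 10_000

compressed = gzip.compress(raw)

# Fewer bytes stored on disk also means fewer bytes read during
# processing and fewer bytes shuffled across the network.
ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.1f}x")
```

The same reduction applies on the wire: when compressed data is shuffled between nodes, the network moves proportionally fewer bytes.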