Hadoop stores data in blocks, typically 64, 128, or 256 MB in size. It also recognizes many common file formats and handles them accordingly when they are stored. Compression is supported, but whether a compressed file can still be split and read with random seeks depends on the compression format; some formats are splittable while others are not. Hadoop ships with a number of default codecs for compression. They fall into the following categories:
- File-based: This is similar to how you compress files on your desktop: the codec compresses the entire file as a single stream, regardless of the file format. Some of these formats support splitting and some do not, but most of them can be persisted in Hadoop. A minimal usage sketch follows this list.
- Block-based: As we know, data in Hadoop is stored in blocks, and this codec compresses each block separately; because every block can be decompressed independently, the data remains splittable.
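To make the file-based case concrete, here is a minimal sketch of compressing a local file through Hadoop's `CompressionCodec` API. It assumes the Hadoop client libraries are on the classpath; `GzipCodec` is used only as an illustration of a whole-file (non-splittable) codec, and the same pattern applies to any other codec class you load.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class FileCompressor {
    public static void main(String[] args) throws Exception {
        // args[0] = path of the local file to compress (hypothetical input)
        String inputPath = args[0];

        // Instantiate a bundled codec; GzipCodec compresses the whole stream as one unit
        Configuration conf = new Configuration();
        Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

        // Copy the raw bytes through the codec's output stream, producing
        // <inputPath>.gz (the codec supplies its own default extension)
        try (InputStream in = new FileInputStream(inputPath);
             OutputStream fileOut =
                     new FileOutputStream(inputPath + codec.getDefaultExtension());
             CompressionOutputStream out = codec.createOutputStream(fileOut)) {
            IOUtils.copyBytes(in, out, 4096);
            out.finish(); // flush any buffered compressed data before closing
        }
    }
}
```

By contrast, a block-compressed SequenceFile compresses records in independent chunks rather than as one stream, which is what keeps the stored data splittable across tasks.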
However, compression increases CPU utilization and can degrade job performance, since data must be decompressed before it can be processed. Hadoop can store a variety of traditional file formats, but it also provides file formats designed specifically for its filesystem, as shown...