Read path
Hadoop design is based on the Sequence file format, which is used to append key/value pairs; this stems from the HDFS append-only capability.
This design is retrofitted by a concept of MapFiles and an extension of SequenceFile.
MapFile is nothing but a bundle of two Sequences Files in a directory. The first file is /data
and the second is /index
. This allows us to append key/value pairs and every N
key; we can configure N
as needed. This setup also allows us to store the key and the offset in the index. This gives us the flexibility to do extremely fast lookups as the data and the index have less entries. Once you are aware of the block, data file location can be done at a very fast pace.
MapFile is effective as we can look up keys and the values:
Row Length short |
Row Key Byte[] |
Family length byte |
Column Family byte[] |
Column Qualifier bytes[] |
Timestamp long |
Key Type byte |
Hbase key has the preceding structure: row key, column family, column qualifier, timestamp...