Data serialization in Hadoop
Though we see data in a structured form, its raw form is a sequence, or stream, of bits. This raw form is what travels over the network and what is held in RAM or stored on persistent media. Serialization is the process of converting structured data into this raw form. Deserialization is the reverse process of reconstructing the structured form from the raw bit stream.
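As a minimal sketch of this round trip, the following Java snippet uses only the standard java.io classes to turn a small structured record into bytes and back. The SamplePoint class and its fields are hypothetical names chosen for illustration, and this is not Hadoop's own serialization code; it only shows the general idea of writing fields out in a fixed order and reading them back in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type, used only to illustrate the round trip.
final class SamplePoint {
    final int x;
    final int y;
    final String label;

    SamplePoint(int x, int y, String label) {
        this.x = x;
        this.y = y;
        this.label = label;
    }

    // Serialization: structured fields -> raw byte stream.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(x);      // 4 bytes, big-endian
            out.writeInt(y);      // 4 bytes, big-endian
            out.writeUTF(label);  // 2-byte length prefix, then modified UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Deserialization: raw byte stream -> structured fields,
    // read in exactly the order they were written.
    static SamplePoint fromBytes(byte[] raw) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            int x = in.readInt();
            int y = in.readInt();
            String label = in.readUTF();
            return new SamplePoint(x, y, label);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling toBytes() on new SamplePoint(3, -4, "origin") yields a byte array that could be sent over a socket or written to a file, and passing that array to SamplePoint.fromBytes rebuilds an object with the same field values. The reader and writer must agree on field order and types, since the byte stream itself carries no description of its structure.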
In Hadoop, different components talk to each other via Remote Procedure Calls (RPCs). A caller process serializes the desired function name and its arguments as a byte stream before sending it to the called process. The called process deserializes this byte stream, interprets the function type, and executes it using the arguments that were supplied. The results are serialized and sent back to the caller. This workflow naturally calls for fast serialization and deserialization. Network bandwidth is at a premium and requires the serialized representation of the function...