We have discussed integrating Apache Kafka with various frameworks for real-time and near-real-time streaming. Apache Kafka stores data for a configurable retention period; the default is seven days.
Data is removed from Kafka once the retention period expires. Organizations do not want to lose this data, and in many cases they need it for batch processing to generate weekly, monthly, or yearly reports. We can store such historical records in a cheap and fault-tolerant storage system such as HDFS for further processing.
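As a quick illustration, the retention period can be changed per topic with the `kafka-configs` tool that ships with Kafka. The broker address, topic name, and retention value below are placeholders (on older Kafka versions you may need `--zookeeper` instead of `--bootstrap-server`):

```
# Extend retention on a topic to 30 days (2592000000 ms) so that records
# stay in Kafka long enough for a periodic batch job to copy them to HDFS.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name my-topic \
  --add-config retention.ms=2592000000

# Verify that the per-topic override is in place.
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --entity-type topics --entity-name my-topic
```

Longer retention only delays expiry; it does not turn Kafka into a permanent archive, which is why the rest of this section focuses on copying data out to HDFS.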
Once in HDFS, Kafka data can be used for many different purposes. We will talk about the following four ways of moving data from Kafka to HDFS:
- Using Camus
- Using Gobblin
- Using Kafka Connect (a sample sink configuration is sketched after this list)
- Using Flume
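As a preview of the Kafka Connect approach, the following is a minimal sketch of a sink configuration for Confluent's HDFS connector (a separately installed plugin, not part of core Kafka). The topic name, HDFS URL, and flush size are placeholder values:

```
# hdfs-sink.properties -- minimal sketch for the Confluent HDFS sink connector.
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1

# Topic to drain into HDFS (placeholder name).
topics=my-topic

# NameNode URL of the target HDFS cluster (placeholder address).
hdfs.url=hdfs://localhost:9000

# Number of records to buffer before committing a file to HDFS.
flush.size=1000
```

Given a worker configuration file, the connector can then be launched in standalone mode, for example with `bin/connect-standalone.sh worker.properties hdfs-sink.properties`.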