Apache Kafka, the open source distributed data streaming platform, has just hit version 2.0.0. With Kafka becoming a vital component in the (big) data architecture of many organizations, this major stable release is a significant milestone for the data architects and engineers who depend on it.
If you're not sure what Kafka is, let's just take a moment to revisit what it does before getting into the details of the 2.0.0 release.
Essentially, Kafka is a tool that lets you publish, store and process streams of data, a bit like a message queue system. It's used either to move data between systems and applications (i.e. to build data pipelines) or to build applications that react in specific ways to streams of data.
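To make that concrete, here's a minimal sketch of the "publish" side using Kafka's Java producer client. The broker address and topic name are placeholders for illustration, not anything specific to the 2.0.0 release.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; in practice this points at your cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is appended to the "page-views" topic, where any number
            // of downstream systems or applications can consume it independently.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/pricing"));
        }
    }
}
```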
Kafka is an important tool because it can process data in real time. Key to this is the fact that it's distributed: work is scaled horizontally across machines rather than funnelled through a central node. As the project website explains, Kafka is "run as a cluster on one or more servers that can span multiple datacenters."
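On the consuming side, an application simply points at one or more brokers in that cluster and subscribes to the topics it cares about. Here's a rough sketch, again with placeholder broker and topic names; note that the poll(Duration) overload it uses was introduced in 2.0.0 (KIP-266) as a replacement for the older poll(long).

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PageViewConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // List several brokers from the cluster; any one of them is enough
        // for the client to discover the rest.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("group.id", "page-view-readers");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // poll(Duration) puts an upper bound on how long the call can block.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```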
A wide range of changes and improvements have gone live with Kafka 2.0.0, all aimed at giving users more security, stability and reliability in their data architecture. It's Kafka doubling down on what it has always tried to do well.
Here are a few of the key changes:
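- Support for prefixed ACLs (KIP-290), which makes managing access control across large numbers of topics much simpler.
- Hostname verification is now enabled by default for SSL connections (KIP-294), closing off a common man-in-the-middle risk.
- Authentication with OAuth2 bearer tokens via the new SASL/OAUTHBEARER mechanism (KIP-255).
- A new KafkaConsumer.poll(Duration) API (KIP-266) that puts a clear upper bound on how long a consumer can block.
- Better error handling in Kafka Connect, including support for dead letter queues (KIP-298).
- Support for Java 7 has been dropped.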
You can read the full details in the official Apache Kafka 2.0.0 release notes.