Apache Kafka is an open source streaming platform. If you are reading this book, you probably already know that Kafka scales horizontally very well without compromising speed or efficiency.
The Kafka core is written in Scala, and Kafka Streams and KSQL are written in Java. A Kafka server can run on several operating systems: Unix, Linux, macOS, and even Windows. As it usually runs in production on Linux servers, the examples in this book are designed to run in Linux environments and assume a bash shell.
This chapter explains how to install, configure, and run Kafka. As this is a Quick Start Guide, it does not cover Kafka's theoretical details. For now, it is enough to mention these three points:
- Kafka is a service bus: To connect heterogeneous applications, we need a message publication mechanism to send and receive messages among them. A message router is known as a message broker. Kafka is a message broker, a solution for routing messages among clients quickly.
- Kafka architecture has two directives: The first is to not block the producers (in order to deal with back pressure). The second is to isolate producers and consumers. The producers should not know who their consumers are; hence, Kafka follows the dumb broker and smart clients model.
- Kafka is a real-time messaging system: Kafka is a software solution with a publish-subscribe model that is open source, distributed, partitioned, replicated, and commit-log-based.
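The dumb broker and smart clients idea can be sketched in a few lines. This is only an in-memory illustration, not Kafka's implementation: the `Broker` and `Consumer` classes and their methods are hypothetical names chosen for this example. The broker only appends and serves messages, and each consumer remembers its own position, so a slow consumer never holds back a producer or another consumer.

```python
class Broker:
    """Toy in-memory broker (hypothetical): it only appends and serves messages."""

    def __init__(self):
        self.log = []  # append-only list of messages

    def append(self, message):
        # Appending never waits for any consumer, so producers are not blocked.
        self.log.append(message)
        return len(self.log) - 1  # position of the stored message

    def read(self, position, max_records=10):
        # The broker serves the requested range and keeps no per-consumer state.
        return self.log[position:position + max_records]


class Consumer:
    """Smart client (hypothetical): it tracks its own position in the log."""

    def __init__(self, broker):
        self.broker = broker
        self.position = 0

    def poll(self, max_records=10):
        records = self.broker.read(self.position, max_records)
        self.position += len(records)  # advance only by what was actually read
        return records


broker = Broker()
for text in ["a", "b", "c"]:
    broker.append(text)

# Two independent consumers read the same log at their own pace.
fast, slow = Consumer(broker), Consumer(broker)
fast_records = fast.poll()
slow_records = slow.poll(max_records=1)
```

Here `fast` has read all three messages while `slow` has read only the first one; the producer side is unaffected by either of them.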
Apache Kafka uses the following concepts and nomenclature:
- Cluster: This is a set of Kafka brokers.
- ZooKeeper: This is a cluster coordinator, a tool from the Apache ecosystem that provides several coordination services.
- Broker: This is a Kafka server; the term also refers to the Kafka server process itself.
- Topic: This is a queue (backed by log partitions); a broker can host several topics.
- Offset: This is an identifier for each message within a partition.
- Partition: This is an immutable and ordered sequence of records continually appended to a structured commit log.
- Producer: This is the program that publishes data to topics.
- Consumer: This is the program that processes data from the topics.
- Retention period: This is the time to keep messages available for consumption.
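To make the relationship between topics, partitions, and offsets more concrete, here is a minimal in-memory sketch. The `Topic` class and its methods are hypothetical and exist only for this illustration. Choosing the partition by hashing the record key with `zlib.crc32` is an assumption made to keep the example deterministic; it is not presented as Kafka's own partitioning algorithm.

```python
import zlib


class Topic:
    """Toy topic (hypothetical): a named set of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Assumption for this sketch: records with the same key go to the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        offset = len(log)  # offsets grow by one within each partition
        log.append((offset, key, value))
        return partition, offset

    def consume(self, partition, from_offset):
        # A consumer reads one partition in order, starting at a chosen offset.
        return self.partitions[partition][from_offset:]


topic = Topic("orders", num_partitions=2)
p1, o1 = topic.publish("customer-1", "created")
p2, o2 = topic.publish("customer-1", "paid")
latest = topic.consume(p1, from_offset=1)
```

Both records share a key, so they land in the same partition with offsets 0 and 1, and reading from offset 1 returns only the second record.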
In Kafka, there are three types of clusters:
- Single node–single broker
- Single node–multiple brokers
- Multiple nodes–multiple brokers
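When several brokers share one node, each broker needs its own identity, listening port, and log directory so that they do not collide. The following sketch generates such per-broker settings. The helper `broker_settings` is hypothetical, and the base port and directory layout are assumptions; the property names `broker.id`, `listeners`, and `log.dirs` are the ones used in a broker's `server.properties` file.

```python
def broker_settings(broker_count, base_port=9092, log_root="/tmp"):
    """Return one settings dict per broker for a single-node, multiple-broker layout.

    Hypothetical helper: the port numbering and log directory names are
    illustrative choices, not required values.
    """
    settings = []
    for broker_id in range(broker_count):
        settings.append({
            "broker.id": str(broker_id),  # must be unique in the cluster
            "listeners": f"PLAINTEXT://:{base_port + broker_id}",  # one port per broker
            "log.dirs": f"{log_root}/kafka-logs-{broker_id}",  # one data directory per broker
        })
    return settings


configs = broker_settings(3)
for config in configs:
    print(config)
```

With a single broker, the same helper describes the single node, single broker case; in a multiple-node cluster each node could reuse the same port and directory, and only the broker id has to differ.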
In Kafka, there are three (and just three) ways to deliver messages:
- Never redelivered (at most once): The messages may be lost because, once sent, they are never sent again.
- May be redelivered (at least once): The messages are never lost because, if a message is not received, it can be sent again; as a consequence, the same message may arrive more than once.
- Delivered once (exactly once): The message is delivered exactly once. This is the most difficult form of delivery, because it requires both that no message is lost and that no message is processed twice.
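The difference between the three delivery modes can be illustrated with a small simulation. This is a sketch of the general idea only, not of how Kafka implements these guarantees: the `Receiver` class, the `send` function, and the scripted outcome labels (`"lost"`, `"ack_lost"`, `"ok"`) are all hypothetical. Exactly-once behavior is approximated here by retrying and letting the receiver discard duplicates by message id.

```python
class Receiver:
    """Hypothetical receiver that can optionally drop duplicate message ids."""

    def __init__(self, deduplicate=False):
        self.deduplicate = deduplicate
        self.seen_ids = set()
        self.delivered = []

    def receive(self, message_id, payload):
        if self.deduplicate and message_id in self.seen_ids:
            return  # duplicate of a message that was already processed
        self.seen_ids.add(message_id)
        self.delivered.append(payload)


def send(receiver, message_id, payload, outcomes, retry):
    """Send one message, following a scripted list of per-attempt outcomes.

    "lost"     -> the message never arrives
    "ack_lost" -> the message arrives, but the sender gets no acknowledgment
    "ok"       -> the message arrives and is acknowledged
    """
    attempts = 0
    for outcome in outcomes:
        attempts += 1
        if outcome != "lost":
            receiver.receive(message_id, payload)
        if outcome == "ok" or not retry:
            break  # acknowledged, or the sender never retries
    return attempts


# Never redelivered: a single attempt, so a lost message stays lost.
at_most_once = Receiver()
send(at_most_once, 1, "m", ["lost", "ok"], retry=False)

# May be redelivered: retrying after a lost acknowledgment creates a duplicate.
at_least_once = Receiver()
send(at_least_once, 1, "m", ["ack_lost", "ok"], retry=True)

# Delivered once: retry as above, but the receiver ignores the duplicate.
exactly_once = Receiver(deduplicate=True)
send(exactly_once, 1, "m", ["ack_lost", "ok"], retry=True)
```

After these calls, `at_most_once.delivered` is empty, `at_least_once.delivered` contains the message twice, and `exactly_once.delivered` contains it once.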
The message log can be compacted in two ways:
- Coarse-grained: Log compacted by time
- Fine-grained: Log compacted by message
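The two compaction styles can be contrasted with a short sketch. The function names and the record layout are hypothetical. In particular, interpreting compaction "by message" as keeping only the most recent record for each key is an assumption made for this example.

```python
def compact_by_time(log, now, max_age):
    """Coarse-grained: drop every record older than max_age, whatever its key."""
    return [record for record in log if now - record["timestamp"] <= max_age]


def compact_by_key(log):
    """Fine-grained (assumed meaning): keep only the latest record for each key."""
    latest_index = {}
    for index, record in enumerate(log):
        latest_index[record["key"]] = index  # later records overwrite earlier ones
    keep = set(latest_index.values())
    # Preserve the original order of the surviving records.
    return [record for index, record in enumerate(log) if index in keep]


log = [
    {"key": "a", "value": 1, "timestamp": 10},
    {"key": "b", "value": 2, "timestamp": 20},
    {"key": "a", "value": 3, "timestamp": 30},
]

by_time = compact_by_time(log, now=35, max_age=10)
by_key = compact_by_key(log)
```

With these inputs, `by_time` keeps only the record with timestamp 30, while `by_key` keeps the record for key `b` and the newest record for key `a`.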