Imagine processing the data stored in a traditional spreadsheet or text-based delimited files such as a CSV file. The type of processing that you will typically execute when using these types of data stores is referred to as batch processing. In batch processing, data is collated into some sort of group, in this case, the collection of lines in our spreadsheet or CSV file, and processed together as a group at some future time and date. Typically, these spreadsheets or CSV files will be refreshed with updated data at some juncture, at which point the same, or similar, processing will be undertaken, potentially all managed by some sort of schedule or timer. Traditionally, data processing systems would have been developed with batch processing in mind, including conventional data warehouses.
Today, however, batch processing alone is not enough. With the advent of the internet, social media, and more powerful technology, coupled with the demand for mass data consumption as soon as possible (ideally immediately), real-time data processing and analytics are no longer a luxury for many businesses but instead a necessity. Examples of use cases where real-time data processing is vital include processing financial transactions and real-time pricing, real-time fraud detection and combating serious organized crime, logistics, travel, robotics, and artificial intelligence.
Micro-batch processing extends standard batch processing by executing at smaller intervals (typically seconds or milliseconds) and/or on smaller batches of data. However, like batch processing, data is still processed a batch at a time.
Stream processing differs from micro-batch and batch processing in the fact that data processing is executed as and when individual data units arrive. Distributed streaming platforms, such as Apache Kafka, provide the ability to safely and securely move real-time data between systems and applications. Thereafter, distributed streaming engines, such as Apache Spark's Streaming library, Spark Streaming, and Apache Storm, allow us to process and analyze real-time data. In Chapter 8, Real-Time Machine Learning Using Apache Spark, we will discuss Spark Streaming in greater detail, where we will develop a real-time sentiment analysis model by combining Apache Kafka with Spark Streaming and Apache Spark's machine learning library, MLlib.
In the meantime, let's quickly take a look into how Apache Kafka works under the hood. Take a moment to think about what kind of things you would need to consider in order to engineer a real-time streaming platform:
- Fault tolerance: The platform must not lose real-time streams of data and have some way to store them in the event of partial system failure.
- Ordering: The platform must provide a means to guarantee that streams can be processed in the order that they are received, which is especially important to business applications where order is critical.
- Reliability: The platform must provide a means to reliably and efficiently move streams between various distinct applications and systems.
Apache Kafka provides all of these guarantees through its distributed streaming logical architecture, as illustrated in Figure 1.11:
Figure 1.11: Apache Kafka logical architecture
In Apache Kafka, a topic refers to a stream of records belonging to a particular category. Kafka's Producer API allows producer applications to publish streams of records to one or more Kafka topics, and its Consumer API allows consumer applications to subscribe to one or more topics and thereafter receive and process the stream of records belonging to those topics. Topics in Kafka are said to be multi-subscriber, meaning that a single topic may have zero, one, or more consumers subscribed to it. Physically, a Kafka topic is stored as a sequence of ordered and immutable records that can only be appended to and are partitioned and replicated across the Kafka cluster, thereby providing scalability and fault tolerance for large systems. Kafka guarantees that producer messages are appended to a topic partition in the order that they are sent, with the producer application responsible for identifying which partition to assign records to, and that consumer applications can access records in the order in which they are persisted.
Kafka has become synonymous with real-time data because of its logical architecture and the guarantees that it provides when moving real-time streams of data between systems and applications. But Kafka can also be used as a stream processing engine as well in its own right via its Streams API, not just as a messaging system. By means of its Streams API, Kafka allows us to consume continual streams of data from input topics, process that data in some manner or other, and thereafter produce continual streams of data to output topics. In other words, Kafka allows us to transform input streams of data into output streams of data, thereby facilitating the engineering of real-time data processing pipelines in competition with other stream processing engines such as Apache Spark and Apache Storm.
In Chapter 8, Real-Time Machine Learning Using Apache Spark, we will use Apache Kafka to reliably move real-time streams of data from their source systems to Apache Spark. Apache Spark will then act as our stream processing engine of choice in conjunction with its machine learning library. In the meantime, however, to learn more about Apache Kafka, please visit https://kafka.apache.org/.