Summary
In this chapter, we first walked through the steps needed to set up our environment to run the code located in this book's GitHub repository. We created a minikube cluster and ran Apache Kafka and Apache Flink on top of it. We then learned how to use the scripts in the repository to create topics in Kafka, publish messages to them, and consume data from those topics.
After we walked through the necessary infrastructure, we jumped directly into implementing various practical tasks. The first one was to calculate the K most frequent words in a stream of text lines. In order to accomplish this, we learned how to use the Count and Top transforms.
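For reference, a minimal sketch of such a pipeline follows. This is not the book's exact code: the in-memory input created with Create and the class name are illustrative stand-ins, and KV.OrderByValue orders the word-count pairs by their counts so that Top keeps the K most frequent words.

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.Top;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class TopKWordsSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    pipeline
        // Hypothetical in-memory input; the book reads lines from a Kafka topic instead.
        .apply(Create.of("the quick brown fox", "the lazy dog and the fox"))
        // Split each line into individual words.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via(line -> Arrays.asList(line.split("\\s+"))))
        // Count occurrences of each distinct word.
        .apply(Count.perElement())
        // Keep the K (here 3) most frequent words, comparing pairs by their count.
        .apply(Top.of(3, new KV.OrderByValue<String, Long>()));
    pipeline.run().waitUntilFinish();
  }
}

Note that Top.of here operates globally over the whole input; in a streaming setting, you would apply it per window.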
We also learned how to use the TestStream utility to create a simulated stream of input data and use this to write a test case that validates our pipeline implementation. Then, we learned how to deploy our pipeline to a real runner – Apache Flink.
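A condensed sketch of what such a test can look like follows, assuming JUnit 4 and fixed windowing; the element values and window size are illustrative, not the book's actual test.

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.TestStream;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;
import org.junit.Rule;
import org.junit.Test;

public class WordCountStreamTest {
  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void countsWordsFromSimulatedStream() {
    // Simulated unbounded input: timestamped elements, then a watermark
    // advance to infinity so that all windows close and fire.
    TestStream<String> input = TestStream.create(StringUtf8Coder.of())
        .addElements(
            TimestampedValue.of("foo", new Instant(0)),
            TimestampedValue.of("foo", new Instant(1)),
            TimestampedValue.of("bar", new Instant(2)))
        .advanceWatermarkToInfinity();

    PCollection<KV<String, Long>> counts = pipeline
        .apply(input)
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Count.perElement());

    // Assert on the full contents once the simulated stream is exhausted.
    PAssert.that(counts).containsInAnyOrder(KV.of("foo", 2L), KV.of("bar", 1L));
    pipeline.run().waitUntilFinish();
  }
}

Because TestStream controls the watermark explicitly, the test is deterministic even though the pipeline logic is written for unbounded input.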
We then got acquainted with another grouping transform – Max, which we...
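For illustration, a minimal sketch of Max applied per key; the input data and names here are hypothetical.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.values.KV;

public class MaxSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    pipeline
        // Hypothetical (key, value) input, e.g. readings per sensor.
        .apply(Create.of(KV.of("sensor-1", 3L), KV.of("sensor-1", 7L), KV.of("sensor-2", 5L)))
        // Max.longsPerKey() emits the largest value observed for each key.
        .apply(Max.longsPerKey());
    pipeline.run().waitUntilFinish();
  }
}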