Task 1 – Calculating the K most frequent words in a stream of lines of text
In the previous chapter, we wrote a very basic pipeline that implemented a simple (but surprisingly frequently used) piece of functionality: it counted the number of occurrences of each word in a text document. We then turned the input into a data stream of lines, generated by the TestStream utility.
In the first task of this chapter, we want to extend this simple pipeline to be able to calculate and output only the K most frequent words in a stream of lines. So, let's first define the problem.
Defining the problem
Given an input data stream of lines of text, calculate the K most frequent words within a fixed time window of T seconds.
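To make the definition concrete, the following is a minimal sketch of the same computation in plain Java, without any Beam classes. It is not the pipeline built in this chapter. The names TopKWordsSketch, TimedLine, and topKPerWindow are hypothetical, and the sketch assumes that each line carries a timestamp in seconds, that words are split on non-word characters and lowercased, and that ties are broken alphabetically.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopKWordsSketch {

  /** Hypothetical helper type: a line of text plus its timestamp in seconds. */
  static final class TimedLine {
    final long timestampSeconds;
    final String text;

    TimedLine(long timestampSeconds, String text) {
      this.timestampSeconds = timestampSeconds;
      this.text = text;
    }
  }

  /**
   * Returns, for each fixed window of windowSeconds (keyed by window start),
   * the k most frequent words, most frequent first.
   */
  static Map<Long, List<String>> topKPerWindow(
      List<TimedLine> lines, int k, long windowSeconds) {

    // Step 1: count words separately for each window (window start -> word -> count).
    Map<Long, Map<String, Long>> counts = new TreeMap<>();
    for (TimedLine line : lines) {
      long windowStart =
          line.timestampSeconds - Math.floorMod(line.timestampSeconds, windowSeconds);
      Map<String, Long> windowCounts =
          counts.computeIfAbsent(windowStart, w -> new HashMap<>());
      // Assumption: a "word" is a lowercased run of word characters.
      for (String word : line.text.toLowerCase().split("\\W+")) {
        if (!word.isEmpty()) {
          windowCounts.merge(word, 1L, Long::sum);
        }
      }
    }

    // Step 2: within each window, sort by count (descending) and keep the first k words.
    Map<Long, List<String>> result = new TreeMap<>();
    for (Map.Entry<Long, Map<String, Long>> window : counts.entrySet()) {
      List<Map.Entry<String, Long>> entries = new ArrayList<>(window.getValue().entrySet());
      entries.sort((a, b) -> {
        int byCount = Long.compare(b.getValue(), a.getValue());
        // Assumption: ties are broken alphabetically so the output is deterministic.
        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
      });
      List<String> top = new ArrayList<>();
      for (Map.Entry<String, Long> entry : entries.subList(0, Math.min(k, entries.size()))) {
        top.add(entry.getKey());
      }
      result.put(window.getKey(), top);
    }
    return result;
  }
}
```

With K = 2 and T = 10, the lines "a b a" at second 0 and "b a c" at second 5 fall into the same window and yield a and b, while a line arriving at second 12 is counted in the next window. Unlike this batch-style sketch, a streaming pipeline cannot see all lines up front and has to decide when each window is complete.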
Solving this problem has many practical applications. For example, if we had a store, we might want to compute daily statistics to find the most profitable products. However, we have chosen the example of counting words in a text stream...