Implementing our first streaming pipeline using SQL
We will follow the same path that we walked when we started playing with the Java SDK of Apache Beam. The very first pipeline we implemented, back in Chapter 1, Introducing Data Processing with Apache Beam, read its input from a resource file named `lorem.txt`. Our goal was to process this file and output the number of occurrences of each word within that file. So, let's see how our solution would differ if we used SQL to solve it!
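Before we walk through the differences one by one, the following is a minimal, hypothetical sketch of what such a word count can look like when expressed with Beam SQL. It is not the book's `FirstSQLPipeline`; the class name, the `TextIO` file path, the tokenization step, and the query text are assumptions made purely for illustration, while the single-field schema with a field named `s` mirrors the one defined below. The one Beam SQL convention relied on here is that a single input `PCollection` is exposed to `SqlTransform.query()` under the implicit table name `PCOLLECTION`.

```java
// A minimal, hypothetical sketch of a word count using Beam SQL.
// Not the book's FirstSQLPipeline: file path, tokenization, and query
// text are illustrative assumptions.
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.Field;
import org.apache.beam.sdk.schemas.Schema.FieldType;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SqlWordCountSketch {

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Single-field schema: each Row carries one word in a STRING field "s".
    Schema wordSchema = Schema.of(Field.of("s", FieldType.STRING));

    // Read raw lines (the path is an assumption for this sketch) and
    // split them into individual words.
    PCollection<String> words =
        pipeline
            .apply(TextIO.read().from("lorem.txt"))
            .apply(
                FlatMapElements.into(TypeDescriptors.strings())
                    .via((String line) -> Arrays.asList(line.split("\\W+"))));

    // Wrap each word in a Row and attach the schema so SQL can query it.
    PCollection<Row> rows =
        words
            .apply(
                MapElements.into(TypeDescriptor.of(Row.class))
                    .via((String word) ->
                        Row.withSchema(wordSchema).addValue(word).build()))
            .setRowSchema(wordSchema);

    // The single input PCollection is visible to the query as PCOLLECTION.
    PCollection<Row> counts =
        rows.apply(
            SqlTransform.query(
                "SELECT s, COUNT(*) AS c FROM PCOLLECTION GROUP BY s"));

    pipeline.run().waitUntilFinish();
  }
}
```

The `GROUP BY` aggregation in the query plays the role that a dedicated counting transform played in the pure Java version; everything before the `SqlTransform` is only there to give the data a schema that SQL can understand.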
We have implemented the equivalent of `com.packtpub.beam.chapter1.FirstPipeline` in `com.packtpub.beam.chapter5.FirstSQLPipeline`. The main differences are summarized here:
- First, we need to create a `Schema` that will represent our input. The input is raw lines of text as `String` objects, so a possible `Schema` representing it is a single-field `Schema` defined as follows:

  ```java
  Schema lineSchema = Schema.of(Field.of("s", FieldType.STRING));
  ```
- We then attach...