Summary
In this chapter, we learned how unbounded streams of data can be viewed as time-varying relations and, as such, are suitable to be queried using SQL. We saw how standard SQL needs to be adjusted to fit streaming needs – we introduced three special functions called TUMBLE
, HOP
, and SESSION
to be used in the GROUP BY
clauses of SQL to apply a windowing strategy within SQL statements.
We explored that the prerequisite of applying Apache Beam SQL to PCollection
is to create a PCollection<Row>
, where Row
represents the relational view of a stream, broken down to a structure with a given Schema
, which represents the individual (possibly nested) fields of data elements inside PCollection
. We also learned how to either automatically infer a schema from the given type using the @DefaultSchema
annotation with a SchemaProvider
such as JavaFieldSchema
or JavaBeanSchema
. When we cannot (or do not want to) use a @DefaultSchema
, we can set the schema to a PCollection
manually...