Feature engineering on a stream
Before diving into feature engineering on a stream, we want to clarify the difference between streaming pipelines and streaming data. If you have not used Spark Structured Streaming before, it is a stream processing engine built on the Spark SQL engine. It lets you write streaming calculations and transformations the same way you would write expressions for static data. Structured Streaming pipelines can process batch or streaming data, and they include elements such as checkpoints that track progress and automate the data flow. Streaming pipelines, however, are not necessarily always running; they run only when the developer chooses to run them. In contrast, streaming data (also known as real-time data) is continuously generated data that can be processed in real time or in batches. To simplify, think of streaming pipelines as a series of automated conveyor belts in a factory, set up to process items (data) as they arrive. These conveyor belts can be turned on or off...
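To make the distinction between pipeline and data concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark code: the class `ToyStreamingPipeline`, its `run_once` method, and the doubling transformation are hypothetical names chosen only for illustration. The list `source` plays the role of streaming data that keeps growing, while the pipeline is the conveyor belt that is switched on occasionally and uses a checkpoint to pick up where it left off.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStreamingPipeline:
    """Toy stand-in for a streaming pipeline (hypothetical, not a Spark API)."""

    checkpoint: int = 0  # offset of the next unprocessed record
    output: list = field(default_factory=list)

    def run_once(self, source: list) -> int:
        """Process every record that arrived since the last run, then stop."""
        new_records = source[self.checkpoint:]
        # Placeholder transformation: double each value.
        self.output.extend(record * 2 for record in new_records)
        # Save progress so the next run skips what was already handled.
        self.checkpoint = len(source)
        return len(new_records)


source = [1, 2, 3]  # "streaming data": records keep arriving over time
pipeline = ToyStreamingPipeline()

first_run = pipeline.run_once(source)   # belt switched on, then off again
source.extend([4, 5])                   # data keeps arriving while the belt is off
second_run = pipeline.run_once(source)  # only the two new records are processed
```

In Structured Streaming itself, the analogous pieces would be the `checkpointLocation` option on the stream writer and a trigger such as `trigger(availableNow=True)` (available in recent Spark versions), which processes the data that has accumulated and then stops. Treat that mapping as a rough analogy and not as a description of how Spark implements checkpoints.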