The Spark ecosystem
Apache Spark powers a number of tools, both as a library and as an execution engine.
Spark Streaming
Spark Streaming (documented at http://spark.apache.org/docs/latest/streaming-programming-guide.html) is an extension of the core Spark API that enables the ingestion of live data from sources such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets.
Spark Streaming receives live input data streams and divides the data into batches (time windows of a configurable size), which the Spark core engine then processes to generate the final stream of results, also in batches. This high-level abstraction is called a DStream (org.apache.spark.streaming.dstream.DStream) and is implemented as a sequence of RDDs. DStreams support two kinds of operations: transformations and output operations. Transformations operate on one or more DStreams to create new DStreams; at the end of a chain of transformations, output operations persist the data, either to a storage layer such as HDFS or to an output channel. Spark Streaming allows for transformations...
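The following minimal sketch illustrates this flow under stated assumptions: the socket host and port, the one-second batch interval, and the application name are illustrative choices, not taken from the text. It ingests lines from a TCP socket, applies DStream transformations (flatMap, map, reduceByKey), and ends with an output operation.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing (assumed local run)
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")

    // Batch interval of 1 second: each DStream batch covers one second of input
    val ssc = new StreamingContext(conf, Seconds(1))

    // Ingest a live stream from a TCP socket (host/port are illustrative)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations: each produces a new DStream from its parent
    val words = lines.flatMap(_.split(" "))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Output operation: materializes each batch's result (here, printed to stdout)
    counts.print()

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // block until the streaming job is stopped
  }
}

Feeding the stream with a tool such as netcat (nc -lk 9999) and typing lines into it would drive the job; each one-second batch is processed as an independent RDD, which is precisely the "sequence of RDDs" view of a DStream described above.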