Introducing data streams
Data streams are a way of storing a sequence of messages. They enable us to design systems where we think about state as a series of events instead of only entities and values, or rows and columns in a database. This shift in mindset and technology enables real-time analytics to extract the value from data by acting on it before it is stale. They also enable organizations to design and develop resilient software based on microservice architectures by helping them to decouple systems. We will begin with an overview of streaming data sources, why real-time data analysis is valuable, and how they can be used architecturally to decouple systems. We will then review the core challenges associated with distributed systems, and conclude with an overview of key messaging concepts and some high-level examples. Messages can contain a wide variety of information and come from different sources, so let's look at the primary sources and data formats.
Sources of data
The proliferation of data steadily increases from sources such as social media, IoT devices, web clickstreams, application logs, and video cameras. This data poses challenges to most systems, since it is typically high-velocity, intermittent, and bursty, making it difficult to adequately provision and design downstream systems. Payloads are generally small, except when containing audio or video data, and come in a variety of formats.
In this book, we will be focusing on three data formats. These formats include the following:
- JavaScript Object Notation (JSON)
- Log files
- Time-encoded binary files such as video
JSON streams
JSON has become the dominant format for message serialization over the past 10 years. It is a lightweight data interchange format that is easy for humans to read and write and is based on the JavaScript object syntax. It has two data structures – hash tables and lists. A hash table consists of key-value pairs, {"key":"value"}
, where the keys must be unique. A list is a set of values in a specific order, ["value 1"
, "value 2"
]. The following code sample shows a sample IoT JSON message:
{ "deviceid" : "device001", "eventTime": -192778200, "temp" : 68.4, "humidity" : 77.3, "coords" : { "latitude" : 32.779039, "longitude" : -96.808660 } }
Log file streams
Log files come in a variety of formats. Common ones include Apache Commons Logging, Apache Combined Log, Apache Error Log, and RFC3164 Syslog. They are plain text, and usually each line, delineated by a newline ('\n')
character, is a separate log entry. In the following sample log, we see an HTTP GET
request where the IP address is 10.13.37.01
, the datetime of the request, the HTTP verb, the URL fragment, the HTTP version, the response code, and the size of the result.
The sample log line in Apache Commons Logging format is as follows:
10.13.37.01 - - [03/Sep/2017:12:00:01 +0830] "GET /mailman/listinfo/test HTTP/1.1" 200 2457
Time-encoded binary streams
Time-encoded binary streams consist of a time series of records where each record is related to the adjacent records (prior and subsequent records). These can be used for a wide variety of sensor data, from audio streams and RADAR signals to video streams. Throughout this book, the primary focus will be video streams and their applications.
As shown in Figure 1.1, video streams are composed of fragments, where each fragment is a self-contained sequence of media frames. There are no dependencies between fragments. We will discuss video streams in more detail in Chapter 7, Kinesis Video Streams. Now that we've covered the types of data that we'll be processing, let's take a step back to understand the value of real-time data in analytics.