Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Scalable Data Streaming with Amazon Kinesis
Scalable Data Streaming with Amazon Kinesis

Scalable Data Streaming with Amazon Kinesis: Design and secure highly available, cost-effective data streaming applications with Amazon Kinesis

Arrow left icon
Profile Icon Makota Profile Icon Brian Maguire Profile Icon Gagne Profile Icon Chakrabarti
Arrow right icon
$48.99
Full star icon Full star icon Full star icon Full star icon Full star icon 5 (4 Ratings)
Paperback Mar 2021 314 pages 1st Edition
eBook
$24.99 $35.99
Paperback
$48.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Makota Profile Icon Brian Maguire Profile Icon Gagne Profile Icon Chakrabarti
Arrow right icon
$48.99
Full star icon Full star icon Full star icon Full star icon Full star icon 5 (4 Ratings)
Paperback Mar 2021 314 pages 1st Edition
eBook
$24.99 $35.99
Paperback
$48.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$24.99 $35.99
Paperback
$48.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Scalable Data Streaming with Amazon Kinesis

Chapter 1: What Are Data Streams?

A data stream is a system where data continuously flows from multiple sources, just like water flows through a stream. The data is often produced and collected simultaneously in a continuous flow of many small files or records. Data streams are utilized by a wide range of business, medical, government, social media, and mobile applications. These applications include financial applications for the stock market and e-commerce ordering systems that collect orders and cover fulfillment of delivery.

In the entertainment space, live data is produced by sensing devices embedded in player equipment, video game players generate large amounts of data at a massive scale, and there are new social media posts thousands of times per second. Governments also leverage streaming data and geospatial services to monitor land, wildlife, and other activities.

Data volume and velocity are increasing at faster rates, creating new challenges in data processing and analytics. This book will detail these challenges and demonstrate how Amazon Kinesis can be used to address them. We will begin by discussing key concepts related to messaging in a technology-agnostic form to provide a solid foundation for building your Kinesis knowledge.

Incorporating data streams into your application architecture will allow you to deliver high-performance solutions that are secure, scalable, and fast. In this chapter, we will cover core streaming concepts so that you will have a detailed understanding of their application to distributed systems. You will learn what a data stream is, how to leverage data streams to scale, and examine a number of high-level use cases.

This chapter covers the following topics:

  • Introducing data streams
  • Challenges associated with distributed systems
  • Overview of messaging concepts
  • Examples of data streaming

Introducing data streams

Data streams are a way of storing a sequence of messages. They enable us to design systems where we think about state as a series of events instead of only entities and values, or rows and columns in a database. This shift in mindset and technology enables real-time analytics to extract the value from data by acting on it before it is stale. They also enable organizations to design and develop resilient software based on microservice architectures by helping them to decouple systems. We will begin with an overview of streaming data sources, why real-time data analysis is valuable, and how they can be used architecturally to decouple systems. We will then review the core challenges associated with distributed systems, and conclude with an overview of key messaging concepts and some high-level examples. Messages can contain a wide variety of information and come from different sources, so let's look at the primary sources and data formats.

Sources of data

The proliferation of data steadily increases from sources such as social media, IoT devices, web clickstreams, application logs, and video cameras. This data poses challenges to most systems, since it is typically high-velocity, intermittent, and bursty, making it difficult to adequately provision and design downstream systems. Payloads are generally small, except when containing audio or video data, and come in a variety of formats.

In this book, we will be focusing on three data formats. These formats include the following:

  • JavaScript Object Notation (JSON)
  • Log files
  • Time-encoded binary files such as video

JSON streams

JSON has become the dominant format for message serialization over the past 10 years. It is a lightweight data interchange format that is easy for humans to read and write and is based on the JavaScript object syntax. It has two data structures – hash tables and lists. A hash table consists of key-value pairs, {"key":"value"}, where the keys must be unique. A list is a set of values in a specific order, ["value 1", "value 2"]. The following code sample shows a sample IoT JSON message:

{
    "deviceid" : "device001",
    "eventTime": -192778200,
    "temp" : 68.4,
    "humidity" : 77.3,
    "coords" : {
        "latitude" : 32.779039,
        "longitude" : -96.808660
    }
}

Log file streams

Log files come in a variety of formats. Common ones include Apache Commons Logging, Apache Combined Log, Apache Error Log, and RFC3164 Syslog. They are plain text, and usually each line, delineated by a newline ('\n') character, is a separate log entry. In the following sample log, we see an HTTP GET request where the IP address is 10.13.37.01, the datetime of the request, the HTTP verb, the URL fragment, the HTTP version, the response code, and the size of the result.

The sample log line in Apache Commons Logging format is as follows:

10.13.37.01 - - [03/Sep/2017:12:00:01 +0830] "GET /mailman/listinfo/test HTTP/1.1" 200 2457

Time-encoded binary streams

Time-encoded binary streams consist of a time series of records where each record is related to the adjacent records (prior and subsequent records). These can be used for a wide variety of sensor data, from audio streams and RADAR signals to video streams. Throughout this book, the primary focus will be video streams and their applications.

Figure 1.1 – Time-encoded video data

Figure 1.1 – Time-encoded video data

As shown in Figure 1.1, video streams are composed of fragments, where each fragment is a self-contained sequence of media frames. There are no dependencies between fragments. We will discuss video streams in more detail in Chapter 7, Kinesis Video Streams. Now that we've covered the types of data that we'll be processing, let's take a step back to understand the value of real-time data in analytics.

The value of real-time data in analytics

Analysis is done to support decision making by individuals, organizations, or computer programs. Traditionally, data analysis has been done on batches of data, usually in long-running jobs that occur overnight and that happen periodically at predetermined times: nightly, weekly, quarterly, and so on. This not only limits the scope of actions available to decisions makers, but it is also only providing them with a representation of the past environment. Information is now available seconds after it is produced, so we need to design systems that provide decision makers with the freshest data available to make timely decisions.

The OODAObserve, Orient, Decide, Act – loop is a decision-making, conceptual framework that describes how decisions are made when reacting to an event. By breaking it down into these four components, we can optimize each to reduce the overall cycle time. The key idea is that if we make better decisions quicker than our opponent, we can outmaneuver them and win. By moving from batch to real-time analytics, we are reducing the observed portion of this cycle.

John Boyd

John Boyd was a USAF colonel and military strategist. He developed the OODA loop to better understand pilot combat operations. It has since been expanded and is used at a more strategic level by the military, sports teams, and businesses.

By reducing the OODA loop cycle time, new actions become available. They can be taken while events are unfolding and not merely responding to them after the event has occurred. These time-critical decisions can range from responding to security log anomalies to providing customer recommendations based on a user's recently viewed items. These actions are extremely valuable because they allow us to quickly respond to changing events and are only possible because we can process the data in near real time. The following diagram, inspired by the Perishable Insights report by Mike Gualtieri, shows how time to action correlates to the data's perishability. Each insight has a corresponding action that can only be taken if the data is processed quickly enough – before the insight perishes:

Figure 1.2 – Perishable insights

Figure 1.2 – Perishable insights

The preceding diagram uses shopping as an example to highlight the key distinction between time-critical and historical analysis. Combining historical data and recent data is extremely valuable since it allows deeper insights and can be used to detect patterns and anomalies. The goal of stream analysis is to reduce the amount of time between an event occurring and the appropriate response.

Decoupling systems

A distributed system is composed of multiple networked servers that work together by sending messages between each other. They allow applications to be built that require more compute, storage, or resiliency than is available on a single instance. Some common distributed systems are the World Wide Web, distributed databases, and scientific computing clusters. Distributed systems are often fractal. For example, the three-tier web application, perhaps the most common architecture you will see in the wild, is often constructed of distributed databases, log analysis systems, and payment providers.

The need for distributed systems has increased dramatically over the past 10 years. There are three primary drivers for this: data scale, computational requirements, and organization design and coordination. At first, these systems were brittle and challenging to manage, but over time, certain key patterns emerged that have enabled them to scale by reducing complexity.

The first key in managing complexity was adopting standardized interfaces and common data formats and encodings. This allowed the development of microservice-based architectures where different teams could manage functionality and provide it as a service to the rest of the organization. This reduced the amount of coordination among teams and allowed them to iterate and release at their own appropriate speed, thereby acknowledging and leveraging Conway's Law.

Conway's Law

In 1967, Melvin Conway stated: "Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure." This is based on the observation that people need to communicate in order to design and develop systems. When this is applied to microservices, it allows the groups to own their services directly and explicitly model the organization/communication/software architecture correspondence.

The second was to separate the program into different fault domains by moving to a loosely coupled architecture. This is often achieved by having one system send another system a message. However, messages being sent from one fault domain to another made it difficult to reason and understand the complex failure modes of these systems. By introducing asynchronous message brokers, we can define clear boundaries between different fault domains, making it possible to reason about them. The message queue acts as an invariant in the system. It provides a clean interface where it can send messages and retrieve them. If another system is unavailable, the message broker will be able to cache the messages, called a backlog, and that system is responsible for handling them when it resumes service.

There are still many challenges to the design, deployment, and orchestration of these decoupled systems. However, the introduction of modern highly available message brokers has been key in reducing their complexity.

Now that we've seen how asynchronous messaging can separate fault domains, let's learn how they fit into distributed systems.

Challenges associated with distributed systems

The fundamental challenge of distributed systems is intra-system communication. When possible, a messaging system can provide a core decoupling function, allowing intermittent and transient failures not to cascade or cross fault boundaries. These systems must be highly available, scalable, and durable. The following core concepts are essential to understand and reason about these systems: transactions per second, scaling, latency, and high availability. They allow us to understand the system's key dynamics so that resources can be provisioned to support the workload.

Transactions per second

The most important metric for all messaging systems is Transactions Per Second (TPS). This metric is not as simple as it may seem initially, as the maximum TPS is constrained by either a discrete number of transactions or the maximum size of data that can be processed. This max TPS is called capacity. In general, messaging systems have different capacity for the inbound side and the outbound side, with the outbound side normally having a greater capacity to support multiple consumers and prevent large message backlogs.

Backpressure refers to a system state in which the producer TPS is higher than the consumer TPS. The input is coming in faster than it can be processed. There are multiple strategies for handling backpressure. The easiest is to reduce the number of messages being sent, for example, having a temperature sensor send data once a minute instead of once a second. The second is to scale the compute for the consumers to increase the consumer TPS. If the flow of messages is intermittent or bursty, a buffer can temporarily hold the messages and allow the consumers to catch up. Buffers are often used in conjunction with scaling to store messages while compute is scaled up. The last method is to drop messages. Depending on the message type, this can be unacceptable – you don't want to drop customer orders – but, in the case of sensor data, sampling, can be used to process a fixed percentage of data, for example, 5% of data.

Scaling

Messaging systems need to present an access point that hides the complexity of the internal system. In general, messaging systems consist of multiple independent channels and shards. A shard is an independent unit of capacity. This internal complexity cannot be completely hidden from users since the way data is distributed to the different shards needs to be understood by both senders and receivers. Scaling is used to increase the capacity of the messaging system. One way to think about scaling is to consider cables supporting a bridge. If it has 5 strands, it can support 50 tons, and with 50 strands it can support 500 tons. Thus, one unit of scaling for bridges would be the number of cable strands.

As a system scales, its ability to maintain the global order of records becomes limited. In general, order will only be maintained at a sub-global group level. This is an important design consideration that must be covered when designing real-time solutions. If global order is needed, there are fundamental limits on the system's maximum throughput, and if the throughput required is higher, the system will have to be rearchitected.

Etymology of the shard

In 1997, the game Ultima Online was released. In order to reduce latency and handle scale, there were multiple servers around the world that a player could log in to. Each server functioned independently and existed as its own universe in a multi-verse. This was explained in-game by the wizard Mondain capturing the world of Sosaria in the Gem of Immortality. This gem was then shattered by the Avatar into multiple shards. The player then selected which shard they wanted to play in. The term shard, or sharding, is another way to talk about the horizontal partitioning of data, that is, spreading data across multiple servers.

Latency

Latency is the amount of time between a cause and effect in a system. In the context of a messaging system, there are multiple measures of latency that are important in understanding its behavior. In general, it is the time between when a message enters the system and when the message leaves the system. For example, it can be thought of as the time between pressing the brakes in a car and the vehicle stopping. Some workloads, for example, real-time audio/video communication, are especially latency-sensitive and care must be taken to minimize this across all aspects of the system.

The two primary measures of latency in a messaging system are propagation delay and age of message.

The propagation delay is the amount of time from when a message is written to the message broker to when it is read by the consumer application. In most cases, propagation delay is a reflection of how often producers or consumers are polling the message broker. Network effects on the producer's connection to the message system and the acknowledgment of putting a message are known as producer latency, and correspondingly, the time it takes for a request to complete on the consumer side is consumer latency.

The last measure of latency that is extremely important is understanding how long a message has been in the system before it is retrieved, that is, the age of the message. If the average age of messages is increasing, that indicates a backlog and means that messages are being added faster than they are being retrieved.

Fault tolerance/high availability

Messaging systems are foundational to modern distributed systems and need to be designed in such a way to be highly available.

"Everything fails all the time."

– Werner Vogel, Amazon CTO

The preceding quote hints at the difficulty of building highly available systems. To avoid single points of failure, redundancy is required, and messages, once acknowledged, need to be durably stored. Even though messaging systems present a simple interface, to achieve this level of performance, they are actually comprised of many systems configured as a cluster.

Now that we have the vocabulary to talk about inter-system communication, let's introduce the components of messaging systems.

Overview of messaging concepts

In this section, we will review the concept of message brokers in a high-level, implementation-agnostic manner. First, we will go over the core components of all messaging systems and then we will review some key terminology and concepts related to their use.

Overview of core messaging components

There are four components in all messaging systems: producers, consumers, streams, and messages. The following diagram shows a logical breakdown of producers sending messages to a stream, the stream buffering them, and consumers receiving them:

Figure 1.3 – High-level view of messaging

Figure 1.3 – High-level view of messaging

Despite this design's relative simplicity, there is a substantial amount of configuration and optimization that is possible. Now, let's dive a little deeper into each component.

Streams

The stream is the system that stores the messages or records sent by the producers and retrieved by the consumers. They can be ordered in a First In First Out (FIFO) model. Messages in the stream that have been received, but not yet retrieved, are referred to as a backlog.

The retention period is the length of time that the records are accessible after they are added to the stream. This is the maximum size the backlog can be, and it is also the maximum time a new, slow, or intermittent consumer can access the records.

Messages (records)

A message consists of a payload and header information. The header information consists of information set by the producer, and it includes a unique identifier assigned by the message broker when it is inserted into the stream. In general, messages are relatively small, in the order of kilobytes, and messaging systems generally have a maximum payload size.

Producers

The producer is an application that is the source of data that will go into the message or record. It connects to the message broker and puts the data into the stream. There can often be multiple producers sending data to the same message broker.

Consumers

The consumer is an application that receives the messages that are sent by the producer. It connects to the message broker and retrieves the data from the stream. The responsibility for keeping track of the last read message, so that the consumer can retrieve the next message, can be handled either by the message broker (RabbitMQ or SQS) or by the consumer (Kinesis or Kafka). There can be multiple consumers for a message broker.

Real-time analytics

When thinking about real-time analytics, it can be useful to expand it from the producer, stream, consumer model to a five-stage model (Figure 1.3): 1) source of data; 2) data ingestion mechanisms; 3) stream storage; 4) real-time stream processing; and 5) destination, data sink, or action. This model helps us elevate our thinking from the structural communication level to the data processing level. For instance, filtering can be applied at every stage to reduce compute downstream.

The source of data refers to where the data is coming from. For example, it could be mobile devices, web clickstreams, log analytics, IoT devices, or smart devices. Once you have the source, the data needs to be ingested into the stream. This requires a solution that can capture data coming from hundreds of thousands of devices, in a scalable and reliable manner, into a stream for analysis. You then need a platform that can reliably and durably store the data while simultaneously reading from any point in the stream. This refers to the stream storage platform. The stored data is then processed by real-time applications to generate actionable insights, perform actions, and execute real-time extract-transform-load (ETL) operations that deliver the stream of data to an end destination, such as a data lake.

Next, let's see how systems can be designed in a resilient manner.

Messaging concepts

While relatively simple, the implementation of the four components can be nuanced. In all networked systems, failure is complicated. Every network call can have issues, and the systems need to be resilient to handle them. In the following sub-sections, we will review a few key concepts related to resilient systems and also a few advanced stream processing features.

Here are eight fallacies associated with distributed computing. In 1994, Peter Deutsch identified the fact that everyone who builds distributed systems initially gets into trouble by making the assumptions listed here:

  • The network is reliable.
  • Latency is zero.
  • Bandwidth is infinite.
  • The network is secure.
  • The topology doesn't change.
  • There is one administrator.
  • The cost of transport is zero.
  • The network is homogeneous (added by James Gosling in 1997).

    Note

    All systems should be designed with those fallacies in mind, and with special attention to the unreliability of the network. Systems that don't properly handle these issues will exhibit complicated and confusing behavior as well as error modes that are challenging to debug.

Timeouts

Timeouts allow for efficient allocation of resources and help prevent cascading failures. If an individual process has an error, it can fail to return a value and hang. In this case, the client may continue to wait indefinitely for a response. Timeouts help prevent server resource exhaustion by ending the connection after a maximum amount of time has passed. This allows the server to free up limited resources, for example, memory, connections, and ports, and use them to handle new requests. The client can retry the request again.

Retries

Many errors are ephemeral, and merely retrying the exact same request again will succeed. In order for retries to be safe, the system handling them must be idempotent, meaning that it is designed in such a way that the same input will cause the same side effects. At a more systemic level, to prevent a server from being inundated with retry requests, each client should implement back-off and jitter.

Back-off is the process of increasing the time between subsequent retries. Jitter is the process of adding a bit of random delay to retries. Together, these two mechanisms spread out message requests over time so that the server is able to handle the number of requests.

When a producer has to retry due to a timeout, it will send the request again. There is the possibility that a duplicate record could be created. If a record should only be processed once, it is important that the payload of the record has a unique ID that the final system can use to remove duplicates. When a consumer fails, it can fetch the same records again. Consumer retries tend to happen more often than producer retries. It is up to the final application to handle the message payload data properly and in an idempotent manner.

Backlogs

A backlog is the number of messages that the stream contains that have yet to be received by a consumer. Backlogs occur when the number of messages a producer sends into a stream is higher than the number of messages received by a consumer. This often happens when the system consuming the messages has an error and the messages keep being added to the stream. This can quickly go from a small backlog to a large backlog. Large backlogs increase the overall system latency by a large amount as the backlog must be processed before the recently arriving messages are processed. This typically results in a bimodal distribution of message latencies, where the latency is low when the system is working correctly and high when the system is having errors.

Large backlogs are a hidden risk that need to be considered when designing asynchronous applications because they can increase the recovery time following an outage – that is, instead of merely restarting the system and it being down for a brief period of time, the system has to work through the large backlog before it can function properly again.

Dead letter queues

Dead letter queues store messages that cannot be processed correctly by the message broker for some reason or another. It could be that it is an invalid message, it is too big, or, for some reason, it fails a certain number of retry attempts. It is important to periodically review dead letter queues because they represent errors in the system.

Replay

Replay is the ability to read, or replay, the same records in the same order multiple times. This means that a new consumer can be added and re-read messages that have already been consumed. Replay is limited to data in the stream. Data is aged out of the stream after it has existed for a specified period of time, for example, 1 hour, 1 day, or 7 days. This retention period affects the amount of storage required to support the stream.

Record processing

When processing records, there are multiple approaches depending on the type of data in the payload and the type of analysis required. In the simplest of systems, each record is processed one at a time, that is, record by record. A more complicated approach is to aggregate records by a sliding time window, where records are accessed by the consumer over a period of time, for instance, calculating the highest, lowest, and the average message value over the last 10 seconds.

Filtering

Filtering allows consumers to receive only the messages that they are interested in. This reduces the amount of data that is needed to be processed and transmitted, which helps the system scale. Messages can be filtered at multiple stages: source, ingestion, stream storage, stream processing, and in the consumer stage. In general, it is best to filter messages as early in the five-stage model as possible as it reduces compute and storage requirements in all subsequent stages. Filtering is determined by the message contents, the source, or the destination. For instance, the producer can send different types of messages to different streams.

Now that we've covered the core concepts, let's see them applied in some example use cases.

Examples of data streaming

Data streams are essential for supporting a wide variety of workloads. This section will go into detail on how data streams can be used for near real-time monitoring of applications through log aggregation, support bursty IoT workloads, be fast to insert recommendations into web applications, and enable machine learning on video. The following diagram shows the data flow of these workloads:

Figure 1.4 – Examples of data stream applications

Figure 1.4 – Examples of data stream applications

While these workloads have different performance requirements and scale, the fundamental architecture is the same – producing and consuming messages. Now, let's look at an example of real-time monitoring.

Application log processing

Near real-time monitoring of applications and systems can be used to identify usage patterns, troubleshoot operation events, detect and monitor security incidents, and ensure compliance. Log events are generated on multiple systems and are pushed to a centralized system for analysis. Messaging systems enable this by decoupling the log processing and the analysis systems. In general, for log analysis, there are two different systems consuming the messages: one for near real-time analysis and one for larger historical batch analysis. The near real-time analysis system, often Elasticsearch, contains only fresh data as specified by a data retention policy, and might only hold an hour, a day, or a week's worth of information. The historical system is often an Apache Spark cluster processing data in a data lake (data stored in S3).

Log events are generated in real time and are pushed to the messaging system. The two consumers access the data and perform ETL operations on the data to convert it into the appropriate format for further analysis. For instance, an Apache Commons Logging format can be converted to JSON for insertion into Elasticsearch. The message broker simplifies the system by providing a clear boundary between the log collection and log analysis systems. Since it's designed in a highly available manner, it can cache events if the log analysis system goes down.

There are many sources of log events; two common ones are CloudWatch Logs and agents that can be installed on a machine, for example, Kinesis Agent. CloudWatch is an AWS service that collects logs, metrics, and events from AWS resources and user applications. The logs are sent to streams based on subscriptions and subscription filters that define patterns to determine which log events should be sent. The events are Base64-encoded and compressed with gzip. Agents monitor sets of files and stream events normally delineated by a new line (\n) character.

By bringing all the logs together in near real time, proactive measures can be taken. For example, imagine an attacker is trying to use an automated tool, for example, SQLMAP, to perform a SQL injection attack via an HTTP query string. A query string is a set of key-value pairs separated from the base URL by a question mark (?) character, and each key-value pair is separated by the ampersand (&) character. For example, in the following URL, there are two keys, key1 and key2, and their corresponding values, value1 and value2:

https://example.com/mypage?key1=value1&key2=value2

The first thing that will be detected is a lot of query strings that are different, originating from a single IP address. Once the IP address is identified, it can be blocked to prevent further attacks. The analysis system can be used to determine all requests made by the client and detect whether they were able to exploit any vulnerabilities.

Internet of Things

IoT devices present unique challenges as they are often only connected to the internet intermittently to save bandwidth and conserve energy. This intermittent connectivity, combined with a large number of devices, can lead to extremely bursty workloads. For instance, a fleet of IoT devices with temperature sensors might send data back every hour. The messaging system provides a buffer that allows downstream systems to be provisioned for the average velocity of data and not the peak loads.

Real-time recommendations

Clickstream events are generated at extremely high volume and velocity as users navigate and use web applications and mobile applications. Clickstream analysis can be used for A/B testing, understanding user engagement, detecting system issues, and in this example, recommendations.

Simple recommendations can be pre-computed based on historic usage patterns, for instance, people who watched this movie also liked these movies. However, this fails to capture the user's intent – that is, personalized recommendations depending on the user's behavior in the given session. This requires clickstream data to be captured in real time, analyzed, and recommendations made, all in the time it takes for a page to load. In other words, the system needs to work in milliseconds. These performance constraints require highly scalable messaging systems to achieve extremely low latency so that page load performance is not degraded.

Video streams

Video streams can be used for both real-time workloads (chat, peer to peer) or batch (surveillance, machine learning). In the batch case, multiple cameras can be streaming the video to the messaging system and machine learning can be applied to detect faces. These faces can then be identified and checked against a set of known individuals. Any face that doesn't match a known individual can trigger an alert and send the relevant portion of the video to the appropriate person. Messaging frameworks simplify the architecture by providing a highly scalable system to handle large volumes of data from multiple devices. Much like in the IoT case, they also provide a buffer to provide time for downstream resources to be provisioned in response to demand as new devices connect.

Summary

In this chapter, we discussed the need for streams, the types of data they can handle, the core concepts of messaging services, and some examples of how messaging can be applied to support challenging use cases, such as near real-time monitoring and video processing. You should now have a detailed understanding of distributed systems as a solution for scale, what a data stream is, and its properties.

In the next section, we will take what we've learned here and review the messaging services available on Amazon Web Services and introduce Kinesis.

Further reading

Left arrow icon Right arrow icon

Key benefits

  • Get well versed with the capabilities of Amazon Kinesis
  • Explore the monitoring, scaling, security, and deployment patterns of various Amazon Kinesis services
  • Learn how other Amazon Web Services and third-party applications such as Splunk can be used as destinations for Kinesis data

Description

Amazon Kinesis is a collection of secure, serverless, durable, and highly available purpose-built data streaming services. This data streaming service provides APIs and client SDKs that enable you to produce and consume data at scale. Scalable Data Streaming with Amazon Kinesis begins with a quick overview of the core concepts of data streams, along with the essentials of the AWS Kinesis landscape. You'll then explore the requirements of the use case shown through the book to help you get started and cover the key pain points encountered in the data stream life cycle. As you advance, you'll get to grips with the architectural components of Kinesis, understand how they are configured to build data pipelines, and delve into the applications that connect to them for consumption and processing. You'll also build a Kinesis data pipeline from scratch and learn how to implement and apply practical solutions. Moving on, you'll learn how to configure Kinesis on a cloud platform. Finally, you’ll learn how other AWS services can be integrated into Kinesis. These services include Redshift, Dynamo Database, AWS S3, Elastic Search, and third-party applications such as Splunk. By the end of this AWS book, you’ll be able to build and deploy your own Kinesis data pipelines with Kinesis Data Streams (KDS), Kinesis Data Firehose (KFH), Kinesis Video Streams (KVS), and Kinesis Data Analytics (KDA).

Who is this book for?

This book is for solutions architects, developers, system administrators, data engineers, and data scientists looking to evaluate and choose the most performant, secure, scalable, and cost-effective data streaming technology to overcome their data ingestion and processing challenges on AWS. Prior knowledge of cloud architectures on AWS, data streaming technologies, and architectures is expected.

What you will learn

  • Get to grips with data streams, decoupled design, and real-time stream processing
  • Understand the properties of KFH that differentiate it from other Kinesis services
  • Monitor and scale KDS using CloudWatch metrics
  • Secure KDA with identity and access management (IAM)
  • Deploy KVS as infrastructure as code (IaC)
  • Integrate services such as Redshift, Dynamo Database, and Splunk into Kinesis
Estimated delivery fee Deliver to Colombia

Standard delivery 10 - 13 business days

$19.95

Premium delivery 3 - 6 business days

$40.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Mar 31, 2021
Length: 314 pages
Edition : 1st
Language : English
ISBN-13 : 9781800565401
Vendor :
Amazon
Category :
Languages :
Concepts :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Colombia

Standard delivery 10 - 13 business days

$19.95

Premium delivery 3 - 6 business days

$40.95
(Includes tracking information)

Product Details

Publication date : Mar 31, 2021
Length: 314 pages
Edition : 1st
Language : English
ISBN-13 : 9781800565401
Vendor :
Amazon
Category :
Languages :
Concepts :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 168.97
Scalable Data Streaming with Amazon Kinesis
$48.99
Serverless Analytics with Amazon Athena
$54.99
Data Engineering with AWS
$64.99
Total $ 168.97 Stars icon

Table of Contents

12 Chapters
Section 1: Introduction to Data Streaming and Amazon Kinesis Chevron down icon Chevron up icon
Chapter 1: What Are Data Streams? Chevron down icon Chevron up icon
Chapter 2: Messaging and Data Streaming in AWS Chevron down icon Chevron up icon
Chapter 3: The SmartCity Bike-Sharing Service Chevron down icon Chevron up icon
Section 2: Deep Dive into Kinesis Chevron down icon Chevron up icon
Chapter 4: Kinesis Data Streams Chevron down icon Chevron up icon
Chapter 5: Kinesis Firehose Chevron down icon Chevron up icon
Chapter 6: Kinesis Data Analytics Chevron down icon Chevron up icon
Chapter 7: Amazon Kinesis Video Streams Chevron down icon Chevron up icon
Section 3: Integrations Chevron down icon Chevron up icon
Chapter 8: Kinesis Integrations Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Full star icon 5
(4 Ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%
Andrea May 19, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Scalable Data Streaming with Amazon Kinesis really exceeded my expectations. This is one of those rare books that motivates you to read every chapter. Even if you are experienced in this area this book will re-enforce thing you know and bring to light new ways of thinking about solving data streaming problems.
Amazon Verified review Amazon
Bryan Hopkins Jun 13, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Whether you're experienced with high-throughput streaming concepts and just looking for a deeper understanding of how to build with Amazon Kinesis services, or if you're just getting started with modern decoupled microservice architectures with managed services, this is the book for you. The authors will walk you through Well Architected examples, provide sample code, branch out into related open source libraries, and even touch on related best practices for SDLC and monitoring.
Amazon Verified review Amazon
Larry Heathcote Jun 04, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book is an excellent resource for anyone developing streaming data pipelines. It's comprehensive with clear code examples to get you started and fine tune existing pipelines. A must-have resource.
Amazon Verified review Amazon
Molly M. Apr 07, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The book is very well written and the authors did a good job explaining the code samples and core data streaming concepts. Worth every penny to quickly learn all the AWS streaming services! Thank you!
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela