Machine Learning for Streaming Data with Python: Rapidly build practical online machine learning solutions using River and other top key frameworks

Joos Korstanje
4.2 (9 Ratings)
Paperback | Jul 2022 | 258 pages | 1st Edition


Chapter 1: An Introduction to Streaming Data

Streaming analytics is one of the hot new topics in data science. It proposes an alternative to the more standard batch processing framework: instead of processing datasets at fixed moments in time, we handle every individual data point directly upon reception.

This new paradigm has important consequences for data engineering, as it requires much more robust and, in particular, much faster data ingestion pipelines. It also imposes a big change on data analytics and machine learning.

Until recently, machine learning and data analytics methods and algorithms were mainly designed to work on entire datasets. Now that streaming has become a hot topic, it is becoming more and more common to see use cases in which an entire dataset simply does not exist anymore. When a continuous stream of data is being ingested into a data store, there is no natural moment to relaunch an analytics batch job.

Streaming analytics and streaming machine learning models are designed to work specifically with streaming data sources. Part of the solution, for example, lies in the updating: streaming analytics and machine learning need to update continuously as new data is received, and when updating, you may also want to gradually forget much older data.
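
As a minimal sketch of this idea (illustrative only, not code from this book's repository), the following function maintains an exponentially weighted running mean: every new observation nudges the estimate, and the influence of older data decays over time. The alpha parameter and variable names are arbitrary choices for illustration.

def update_mean(current_mean, new_value, alpha=0.1):
    # Exponentially weighted update: recent points count more,
    # and older points are gradually forgotten as new data arrives
    if current_mean is None:
        return new_value
    return (1 - alpha) * current_mean + alpha * new_value

running_mean = None
for value in [10, 11, 10, 12, 9]:
    running_mean = update_mean(running_mean, value)
    print(running_mean)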

These and other problems introduced by moving from batch analytics to streaming analytics require a different approach to analytics and machine learning. This book will lay the groundwork for getting you started with data analytics and machine learning on data that is received as a continuous stream.

In this first chapter, you'll get a more solid understanding of the differences between streaming and batch data. You'll see some example use cases that showcase the importance of working with streaming rather than converting back into batch. You'll also start working with a first Python example to get a feel for the type of work that you'll be doing throughout this book.

In later chapters, you'll see some more background on architecture, and then you'll go into a number of data science and analytics use cases and how they can be adapted to the new streaming paradigm.

In this chapter, you will discover the following topics:

  • A short history of data science
  • Working with streaming data
  • Real-time data formats and importing an example dataset in Python

Technical requirements

You can find all the code for this book on GitHub at the following link: https://github.com/PacktPublishing/Machine-Learning-for-Streaming-Data-with-Python. If you are not yet familiar with Git and GitHub, the easiest way to download the notebooks and code samples is the following:

  1. Go to the repository link.
  2. Click the green Code button.
  3. Select Download ZIP:
Figure 1.1 – GitHub interface example

Once you have downloaded the ZIP file, unzip it in your local environment, and you will be able to access the code through your preferred Python editor.

Setting up a Python environment

To follow along with this book, you can download the code in the repository and execute it using your preferred Python editor.

If you are not yet familiar with Python environments, I would advise you to check out Anaconda (https://www.anaconda.com/products/individual), which comes with Jupyter Notebook and JupyterLab, both of which are great for executing notebooks. It also comes with Spyder and VSCode for editing scripts and programs.

If you have difficulty installing Python or the associated programs on your machine, you can check out Google Colab (https://colab.research.google.com/) or Kaggle Notebooks (https://www.kaggle.com/code), both of which allow you to run Python code in online notebooks for free, without any setup.

Note

The code in this book was generally run on Colab and Kaggle Notebooks with Python version 3.7.13, and you can set up your own environment to mimic this.

A short history of data science

Over the last few years, new technology domains have quickly taken over large parts of the world. Machine learning, artificial intelligence, and data science are new fields that have entered our daily lives, both personal and professional.

The topics that data scientists work on today are not new. The absolute foundation of the field lies in mathematics and statistics, two fields that have existed for centuries. As an example, least squares regression was first published in 1805. Over time, mathematicians and statisticians have continued to develop new methods and models.

The following timeline shows how the recent boom in technology was able to take place. In the 1600s and 1700s, very smart people were already laying the foundations for what we still do in statistics and mathematics today. However, it was not until the invention and popularization of computers that the field really boomed.

Figure 1.2 – A timeline of the history of data

The accessibility of personal computers and the internet is an important reason for data science's popularity today. Almost everyone has a computer that is powerful enough for fairly complex machine learning. This strongly helps computer literacy, and the accessibility of online documentation is a big boost for learning.

The availability of big data tools such as Hadoop and Spark is also an important part of the popularization of data science, as they allow practitioners to work with datasets that are larger than anyone could have imagined before.

Lastly, cloud computing allows data scientists from all over the world to access very powerful hardware at low prices. For big data tools especially, the required hardware is still priced in a way that most students could not afford it for learning purposes. Cloud computing puts those use cases within reach for many.

In this book, you will learn how to work with streaming data. It is important to keep this short history of data science in mind, as streaming data is one of those technologies that has been held back by demanding hardware and setup requirements. Streaming data is currently gaining popularity quickly in many domains and has the potential to be a big hit in the coming years. Let's now have a deeper look at the definition of streaming data.

Working with streaming data

Streaming data is data that is streamed. You may know the term streaming from online video services on which you can stream video: while you are already watching the first part of a video, the streaming service keeps sending you the next parts.

The concept is the same when working with streaming data. The data is not necessarily video; it can be any data type that is useful for your use case. One of the most intuitive examples is an industrial production line, in which you have continuous measurements from sensors. As long as your production line doesn't pause, you will continue to generate measurements. The following figure gives an overview of the data streaming process:

Figure 1.3 – The data streaming process

The important notion is that you have a continuous flow of data that you need to treat in real time. You cannot wait until the production line stops to do your analysis, as you need to detect potential problems right away.
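
To make this concrete, here is a small, purely illustrative sketch of such a continuous source: a Python generator that yields a new simulated sensor reading every second and only stops when the production line does. The sensor name and the values are made up for illustration.

import random
import time

def sensor_stream():
    # Simulate a production-line sensor that keeps emitting readings
    while True:
        yield {'temperature': random.gauss(10.5, 1.0)}
        time.sleep(1)  # a new measurement arrives every second

for reading in sensor_stream():
    print(reading)  # every reading has to be treated as it arrives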

Streaming data versus batch data

Streaming data is generally not among the first use cases that new data scientists start with. The type of problem that is usually introduced first is the batch use case. Batch data is the opposite of streaming data, as it works in phases: you collect a batch of data, and then you process that batch.

If you see streaming data as streaming a video online, you could see batch data as downloading the entire video first and then watching it once the download is finished. For analytical purposes, this means that you get the analysis of a batch of data only once the data-generating process is finished, rather than whenever a problem occurs.

For some use cases, this is not a problem. Yet, you can understand that streaming can deliver great added value in those use cases where fast analytics can have an impact. It also adds value in use cases where data is ingested as a stream, which is becoming more and more common. In practice, many use cases that would gain added value through streaming are still solved with batch treatment, simply because those methods are better known and more widespread.

The following overview shows the batch treatment process:

Figure 1.4 – The batch process

Advantages of streaming data

Let's now look at some advantages of using streaming analytics rather than other approaches in the following subsections.

Data generating processes are in real time

The first advantage of building streaming data analytics rather than batch systems is that many data-generating processes actually happen in real time. You will discover a number of use cases later, but in general, it is rare that data collection is done in batches.

Although most of us are used to building batch systems around real-time data generating systems, it often makes more sense to build streaming analytics directly.

Of course, batch analytics and streaming analytics can co-exist. Yet, adding a batch treatment to a streaming analytics service is often much easier than adding streaming functionality into a system that is designed for batches. It simply makes the most sense to start with streaming.

Real-time insights have value

When designing data science solutions, streaming does not always come to mind first. However, when solutions or tools are built in real time, it is rare that the real-time functionality is not appreciated.

Many of today's analytical solutions are built in real time, and the tools to do so are available. In many problems, real-time information will be used at some point. Maybe it will not be used from the start, but the day that anomalies happen, you will find a great competitive advantage in having the analytics straight away, rather than waiting until the next hour or the next morning.

Examples of successful implementation of streaming analytics

Let's talk about some examples of companies that have implemented real-time analytics successfully. The first example is Shell, which has implemented real-time analytics on the security cameras at its gas stations. An automated, real-time machine learning pipeline is able to detect whether people are smoking.

Another example is the use of sensor data in connected sports equipment. By measuring heart rate and other KPIs in real time, such equipment is able to alert you when anything is wrong with your body.

Of course, big players such as Facebook and Twitter also analyze a lot of data in real time, for example, when detecting fake news or harmful content. There are many successful use cases of streaming analytics, yet at the same time, there are some common challenges that streaming data brings with it. Let's have a look at them now.

Challenges of streaming data

Streaming data analytics is currently less widespread than batch data analytics. Although this is slowly changing, it is good to understand where the challenges lie when working with streaming data.

Knowledge of streaming analytics

One simple reason for streaming analytics being less widespread is a question of knowledge and know-how. Setting up streaming analytics is often not taught in schools and is definitely not taught as the go-to method. There are also fewer resources available on the internet to get started with it. As there are many more resources on machine learning and analytics for batch treatment, and the batch methods do not apply to streaming data, people tend to start with batch applications for data science.

Understanding the architecture

A second difficulty when working on streaming data is architecture. Although some data science practitioners have knowledge of architecture, data engineering, and DevOps, this is not always the case. To set up a streaming analytics proof of concept or a minimum viable product (MVP), all those skills are needed. For batch treatment, it is often enough to work with scripts.

Architectural difficulties are inherent to streaming, as it is necessary to work with real-time processes that send individually collected records to an analytical treatment process that will update in real time. If there is no architecture that can handle this, it does not make much sense to start with streaming analytics.
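
As a highly simplified, illustrative sketch of this coupling (not a production architecture, which is covered in later chapters), one process can push individual records onto a queue while another consumes and analyzes them as they arrive. The record values are made up for illustration.

import queue
import threading

buffer = queue.Queue()

def producer():
    # Stands in for the real-time ingestion process
    for value in [10, 9, 11, 8, 12]:
        buffer.put(value)
    buffer.put(None)  # signal that the stream has ended

def consumer():
    # Stands in for the analytical process that updates in real time
    while True:
        value = buffer.get()
        if value is None:
            break
        print('analyzing', value)

threading.Thread(target=producer).start()
consumer()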

Financial hurdles

Another challenge when working with streaming data is the financial aspect. Although working with streaming is not necessarily more expensive in the long run, it can be more expensive to set up the infrastructure needed to get started. Working on a local developer PC for an MVP is unlikely to succeed as the data needs to be treated in real time.

Risks of runtime problems

Real-time processes also have a larger risk of runtime problems. When building software, bugs and failures happen. If you are on a daily batch process, you may be able to repair the process, rerun the failed batch, and solve the problem.

If a streaming tool is down, there is a risk of losing data. As the data should be ingested in real time, the data that is generated while your process is down may not be recoverable. If your process is very important, you will need to set up extensive monitoring day and night and have more quality checks before pushing your solutions to production. Of course, this is also important for batch processes, but even more so for streaming.

Smaller analytics (fewer methods easily available)

The last challenge of streaming analytics is that the common methods are generally developed for batch data first. There are currently many solutions out there for analytics on real-time and streaming data, but still not as many as for batch data.

Also, since streaming analysis has to be done very quickly to respect real-time delivery, streaming use cases tend to end up with less sophisticated analytical methodologies and stay at the level of descriptive or otherwise basic analyses.

How to get started with streaming data

For companies to get started with streaming data, the first step is often to put in place simple applications that collect real-time data and make it accessible in real time. Common use cases to start with are log data, website visit data, or sensor data.

A next step is often to build reporting tools on top of the real-time data source. You can think of KPI dashboards that update in real time, or small and simple alerting tools that apply high or low business-rule thresholds.

When such systems are in place, the way is paved to replace those business rules, or build on top of them, with more advanced analytics tools, including real-time machine learning for anomaly detection and more.

The most complex step is to add automated feedback loops between your real-time machine learning and your process. After all, there is no reason to stop at analytics for business insights if there is potential to automate and improve decision-making as well.

Common use cases for streaming data

Let's look at a few of the most common use cases for streaming data so that you can get a better feel for the kinds of problems that benefit from streaming techniques. We will cover three use cases that are relatively accessible to anyone, but of course, there are many more.

Sensor data and anomaly detection

A common use case for streaming data is the analysis of sensor data. Sensor data occurs in a multitude of settings, such as industrial production lines and IoT applications. When companies decide to collect sensor data, it is often treated in real time.

For a production line, there is great value in detecting anomalies in real time. When too many anomalies occur, the production line can be shut down or the problem can be solved before a number of faulty products are delivered.

A good example of streaming analytics for monitoring humidity for artwork can be found here: https://azure.github.io/iot-workshop-asset-tracking/step-003-anomaly-detection/.
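
As a toy illustration of the idea (not an example from this book or the linked workshop), the following sketch flags a sensor reading as anomalous when it lies far from a running estimate computed over a window of recent values. The window size and threshold are arbitrary choices for illustration.

from collections import deque
import statistics

window = deque(maxlen=50)  # keep only the most recent readings

def check_anomaly(value, threshold=3.0):
    # Flag a reading that is far away from the recent mean
    if len(window) >= 10:
        mean = statistics.mean(window)
        stdev = statistics.stdev(window)
        if stdev > 0 and abs(value - mean) > threshold * stdev:
            print(f'anomaly detected: {value}')
    window.append(value)

for reading in [10, 11, 10, 9, 11, 10, 10, 11, 9, 10, 25]:
    check_anomaly(reading)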

Finance and regression forecasting

Finance data is another great use case for streaming data. For example, in the world of stock trading, timing is important. The faster you can detect upward or downward trends in the stock market, the faster a trader (or algorithm) can react by selling or buying stocks and making money.

A great example is described in the following paper by K.S Umadevi et al (2018): https://ieeexplore.ieee.org/document/8554561.
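
A simplified sketch of such trend detection (not taken from the paper above) is to compare a fast and a slow exponentially weighted average of the price as each tick arrives: when the fast average sits above the slow one, the recent movement is upward. The smoothing factors and prices are purely illustrative.

def make_trend_detector(fast_alpha=0.3, slow_alpha=0.05):
    fast, slow = None, None
    def on_price(price):
        nonlocal fast, slow
        # Update both moving averages with the newly arrived tick
        fast = price if fast is None else (1 - fast_alpha) * fast + fast_alpha * price
        slow = price if slow is None else (1 - slow_alpha) * slow + slow_alpha * price
        return 'uptrend' if fast > slow else 'downtrend or flat'
    return on_price

detector = make_trend_detector()
for price in [100, 101, 99, 102, 104, 103, 106]:
    print(price, detector(price))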

Clickstream for websites and classification

Websites and apps are a third common use case for real-time insights. If you can track and analyze your visitors in real time, you can offer them a personalized experience on your website. By proposing products or services that match a visitor's interests, you can increase your online sales.

The following paper by Ramanna Hanamanthrao and S Thejaswini (2017) gives a great use case for this technology applied to clickstream data: https://ieeexplore.ieee.org/abstract/document/8256978.

Streaming versus big data

It is important to understand different definitions of streaming that you may encounter. One distinction to make is between streaming and big data. Some definitions will consider streaming mainly in a big data (Hadoop/Spark) context, whereas others do not.

Streaming solutions often involve a large volume of data, and big data solutions can be the appropriate choice. However, other technologies, combined with a well-chosen hardware architecture, may also be able to perform the analytics in real time, making it possible to build streaming solutions without big data technologies.

Streaming versus real-time inference

Real-time model inference is often built and made accessible via an API. As we define streaming as the analysis of data in real time, without batches, such real-time predictions can be considered streaming. You will see more about real-time architectures in a later chapter.
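
As a minimal illustration of this pattern (a sketch only; the book does not necessarily use Flask, and the prediction logic here is a stand-in for a real model), a small Flask endpoint can accept one observation at a time as JSON and return a prediction immediately:

from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_one(datapoint):
    # Stand-in for a trained model: a simple threshold rule,
    # similar to the alert example later in this chapter
    return 'alert' if datapoint['temperature'] < 10 else 'ok'

@app.route('/predict', methods=['POST'])
def predict():
    datapoint = request.get_json()  # a single observation, sent as JSON
    return jsonify({'prediction': predict_one(datapoint)})

if __name__ == '__main__':
    app.run()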

Real-time data formats and importing an example dataset in Python

To wrap up this chapter, let's have a look at how to represent streaming data in practice. After all, when building analytics, we will often have to implement test cases and example datasets.

The simplest way to represent streaming data in Python would be to create an iterable object that contains the data and to build your analytics function to work with an iterable.

The following code creates a DataFrame using pandas. There are two columns, temperature and pH:

Code block 1-1

import pandas as pd

# A small batch dataset with two columns: temperature and pH
data_batch = pd.DataFrame({
    'temperature': [10, 11, 10, 11, 12, 11, 10, 9, 10, 11, 12, 11, 9, 12, 11],
    'pH': [5, 5.5, 6, 5, 4.5, 5, 4.5, 5, 4.5, 5, 4, 4.5, 5, 4.5, 6]
})
print(data_batch)

When you print the DataFrame, it will look as follows. The pH is around 4.5 to 5 but is sometimes higher, and the temperature is generally around 10 or 11.

Figure 1.5 – The resulting DataFrame

This dataset is a batch dataset; after all, you have all the rows (observations) at the same time. Now, let's see how to convert this dataset to a streaming dataset by making it iterable.

You can do this by iterating through the data's rows. By doing so, you set up a code structure that allows you to add more building blocks one by one. When your development is done, you will be able to use your code on a real-time stream rather than on an iteration over a DataFrame.

The following code iterates through the rows of the DataFrame and converts each row to JSON format. This is a very common format for communication between different systems. The JSON for each observation contains a value for temperature and a value for pH. Those are printed out as follows:

Code block 1-2

# Iterating over the rows simulates receiving one data point at a time
data_iterable = data_batch.iterrows()
for i, new_datapoint in data_iterable:
  print(new_datapoint.to_json())

After running this code, you should obtain a print output that looks like the following:

Figure 1.6 – The resulting print output

Let's now define a super simple example of streaming data analytics. The function defined in the following code block will print an alert whenever the temperature drops below 10:

Code block 1-3

def super_simple_alert(datapoint):
  # Business rule: the temperature should not drop below 10
  if datapoint['temperature'] < 10:
    print('this is a real time alert. temp too low')

You can now add this alert to your simulated streaming process simply by calling the alert function on every data point. You can use the following code to do this:

Code block 1-4

data_iterable = data_batch.iterrows()
for i, new_datapoint in data_iterable:
  print(new_datapoint.to_json())
  super_simple_alert(new_datapoint)

When executing this code, you will notice that alerts will be given as soon as the temperature goes below 10:

Figure 1.7 – The resulting print output with alerts on temperature

This alert works only on the temperature, but you could easily add the same type of alert for pH. The alert function can be updated to include a second business rule as follows:

Code block 1-5

def super_simple_alert(datapoint):
  # Business rule 1: the temperature should not drop below 10
  if datapoint['temperature'] < 10:
    print('this is a real time alert. temp too low')
  # Business rule 2: the pH should not rise above 5.5
  if datapoint['pH'] > 5.5:
    print('this is a real time alert. pH too high')

Executing the function would still be done in exactly the same way:

Code block 1-6

data_iterable = data_batch.iterrows()
for i, new_datapoint in data_iterable:
  print(new_datapoint.to_json())
  super_simple_alert(new_datapoint)

You will see several alerts being raised throughout the run over the example streaming data, as follows:

Figure 1.8 – The resulting print output with alerts on temperature and pH

With streaming data, you have to decide without seeing the complete dataset, using only the data points that have been received so far. This means that a different approach is needed to redevelop algorithms that are similar to batch processing algorithms.

Throughout this book, you will discover methods that apply to streaming data. The difficulty, as you may understand, is that a statistical method is generally developed to compute things using all the data at once.
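
To make this concrete, consider the mean and variance, which a batch method would compute over the entire dataset in one pass. A streaming counterpart has to maintain them incrementally. The sketch below uses Welford's online algorithm and is meant as a general illustration, not as a method introduced by this book.

class RunningStats:
    # Welford's online algorithm: mean and variance without storing the data
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [10, 11, 10, 12, 9, 11]:
    stats.update(value)
print(stats.mean, stats.variance())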

Summary

In this introductory chapter on streaming data and streaming analytics, you have first seen some definitions of what streaming data is and how it contrasts with batch data processing. With streaming data, you need to work with a continuous stream of data, and more traditional (batch) data science solutions need to be adapted to work with this newer and more demanding method of data treatment.

You have seen a number of example use cases, and you should now understand that there can be much added value for businesses and advanced technology use cases in having data science and analytics computed on the fly rather than waiting for a fixed moment. Real-time insights can be a game-changer, and autonomous machine learning solutions often need real-time decision capabilities.

You have seen an example in which a data stream was created and a simple real-time alerting system was developed. In the next chapter, you will get a much deeper introduction to a number of streaming solutions. In practice, data scientists and analysts will generally not be responsible for putting streaming data ingestion in place, but they will be constrained by the limits of those systems. It is, therefore, important to have a good understanding of streaming and real-time architecture: this will be the goal of the next chapter.


Key benefits

  • Work on streaming use cases that are not taught in most data science courses
  • Gain experience with state-of-the-art tools for streaming data
  • Mitigate various challenges while handling streaming data

Description

Streaming data is the new top technology to watch out for in the field of data science and machine learning. As business needs become more demanding, many use cases require real-time analysis as well as real-time machine learning. This book will help you to get up to speed with data analytics for streaming data and focus strongly on adapting machine learning and other analytics to the case of streaming data. You will first learn about the architecture for streaming and real-time machine learning. Next, you will look at the state-of-the-art frameworks for streaming data like River. Later chapters will focus on various industrial use cases for streaming data like Online Anomaly Detection and others. As you progress, you will discover various challenges and learn how to mitigate them. In addition to this, you will learn best practices that will help you use streaming data to generate real-time insights. By the end of this book, you will have gained the confidence you need to stream data in your machine learning models.

Who is this book for?

This book is for data scientists and machine learning engineers who have a background in machine learning, are practice and technology-oriented, and want to learn how to apply machine learning to streaming data through practical examples with modern technologies. Although an understanding of basic Python and machine learning concepts is a must, no prior knowledge of streaming is required.

What you will learn

  • Understand the challenges and advantages of working with streaming data
  • Develop real-time insights from streaming data
  • Understand the implementation of streaming data with various use cases to boost your knowledge
  • Develop a PCA alternative that can work on real-time data
  • Explore best practices for handling streaming data that you absolutely need to remember
  • Develop an API for real-time machine learning inference

Product Details

Publication date: Jul 15, 2022
Length: 258 pages
Edition: 1st
Language: English
ISBN-13: 9781803248363



Table of Contents

16 Chapters
Part 1: Introduction and Core Concepts of Streaming Data
Chapter 1: An Introduction to Streaming Data
Chapter 2: Architectures for Streaming and Real-Time Machine Learning
Chapter 3: Data Analysis on Streaming Data
Part 2: Exploring Use Cases for Data Streaming
Chapter 4: Online Learning with River
Chapter 5: Online Anomaly Detection
Chapter 6: Online Classification
Chapter 7: Online Regression
Chapter 8: Reinforcement Learning
Part 3: Advanced Concepts and Best Practices around Streaming Data
Chapter 9: Drift and Drift Detection
Chapter 10: Feature Transformation and Scaling
Chapter 11: Catastrophic Forgetting
Chapter 12: Conclusion and Best Practices
Other Books You May Enjoy
