In this section, you will learn what data engineering is and why it is needed. You will also learn about the various categories of data engineering problems and some real-world scenarios where they are found. It is important to understand the varied nature of data engineering problems before you learn how to architect solutions for such real-world problems.
What is data engineering?
By definition, data engineering is the branch of software engineering that specializes in collecting, analyzing, transforming, and storing data in a usable and actionable form.
With the growth of social platforms, search engines, and online marketplaces, there has been an exponential increase in the rate of data generation. In 2020 alone, around 2,500 petabytes of data was generated by humans each day. It is estimated that this figure will go up to 468 exabytes per day by 2025. The high volume and availability of data have enabled rapid technological development in AI and data analytics. This has led businesses, corporations, and governments to gather insights like never before to give customers a better experience of their services.
However, raw data usually is seldom used. As a result, there is an increased demand for creating usable data, which is secure and reliable. Data engineering revolves around creating scalable solutions to collect the raw data and then analyze, validate, transform, and store it in a usable and actionable format. Optionally, in certain scenarios and organizations, in modern data engineering, businesses expect usable and actionable data to be published as a service.
Before we dive deeper, let’s explore a few practical use cases of data engineering:
- Use case 1: American Express (Amex) is a leading credit card provider, but it has a requirement to group customers with similar spending behavior together. This ensures that Amex can generate personalized offers and discounts for targeted customers. To do this, Amex needs to run a clustering algorithm on the data. However, the data is collected from various sources. A few data flows from MobileApp, a few flows from different Salesforce organizations such as sales and marketing, and a few data flows from logs and JSON events will be required. This data is known as raw data, and it can contain junk characters, missing fields, special characters, and sometimes unstructured data such as log files. Here, the data engineering team ingests that data from different sources, cleans it, transforms it, and stores it in a usable structured format. This ensures that the application that performs clustering can run on clean and sorted data.
- Use case 2: A health insurance provider receives data from multiple sources. This data comes from various consumer-facing applications, third-party vendors, Google Analytics, other marketing platforms, and mainframe batch jobs. However, the company wants a single data repository to be created that can serve different teams as the source of clean and sorted data. Such a requirement can be implemented with the help of data engineering.
Now that we understand data engineering, let’s look at a few of its basic concepts. We will start by looking at the dimensions of data.
Dimensions of data
Any discussion on data engineering is incomplete without talking about the dimensions of data. The dimensions of data are some basic characteristics by which the nature of data can be analyzed. The starting point of data engineering is analyzing and understanding the data.
To successfully analyze and build a data-oriented solution, the four Vs of modern data analysis are very important. These can be seen in the following diagram:
Figure 1.1 – Dimensions of data
Let’s take a look at each of these Vs in detail:
- Volume: This refers to the size of data. The size of the data can be as small as a few bytes to as big as a few hundred petabytes. Volume analysis usually involves understanding the size of the whole dataset or the size of a single data record or event. Understanding the size is essential in choosing the type of technologies and infrastructure sizing decisions to process and store the data.
- Velocity: This refers to the speed at which data is getting generated. High-velocity data requires distributed processing. Analyzing the speed of data generation is especially critical for scenarios where businesses require usable data to be made available in real-time or near-real-time.
- Variety: This refers to the various variations in the format in which the data source can generate the data. Usually, they can be one of the three following types:
- Structured: Structured data is where the number of columns, their data types, and their positions are fixed. All classical datasets that fit neatly in the relational data model are perfect examples of structured data.
- Unstructured: These datasets don’t conform to a specific structure. Each record in such a dataset can have any number of columns in any arbitrary format. Examples include audio and video files.
- Semi-structured: Semi-structured data has a structure, but the order of the columns and the presence of a column in each record is optional. A classical example of such a dataset is any hierarchical data source, such as a
.json
or a .xml
file.
- Veracity: This refers to the trustworthiness of the data. In simple terms, it is related to the quality of the data. Analyzing the noise of data is as important as analyzing any other aspect of the data. This is because this analysis helps create a robust processing rule that ultimately determines how successful a data engineering solution is. Many well-engineered and designed data engineering solutions fail in production due to a lack of understanding about the quality and noise of the source data.
Now that we have a fair idea of the characteristics by which the nature of data can be analyzed, let’s understand how they play a vital role in different types of data engineering problems.
Types of data engineering problems
Broadly speaking, the kinds of problems that data engineers solve can be classified into two basic types:
- Processing problems
- Publishing problems
Let’s take a look at these problems in more detail.
Processing problems
The problems that are related to collecting raw data or events, processing them, and storing them in a usable or actionable data format are broadly categorized as processing problems. Typical use cases can be a data ingestion problem such as Extract, Transform, Load (ETL) or a data analytics problem such as generating a year-on-year report.
Again, processing problems can be divided into three major categories, as follows:
- Batch processing
- Real-time processing
- Near real-time processing
This can be seen in the following diagram:
Figure 1.2 – Categories of processing problems
Let’s take a look at each one of these categories in detail.
Batch processing
If the SLA of processing is more than 1 hour (for example, if the processing needs to be done once in 2 hours, once daily, once weekly, or once biweekly), then such a problem is called a batch processing problem. This is because, when a system processes data at a longer time interval, it usually processes a batch of data records and not a single record/event. Hence, such processing is called batch processing:
Figure 1.3 – Batch processing problem
Usually, a batch processing solution depends on the volume of data. If the data volume is more than tens of terabytes, usually, it needs to be processed as big data. Also, since big data processes are schedule-driven, a workflow manager or schedular needs to run its jobs. We will discuss batch processing in more detail later in this book.
Real-time processing
A real-time processing problem is a use case where raw data/events are to be processed on the fly and the response or the processing outcome should be available within seconds, or at most within 2 to 5 minutes.
As shown in the following diagram, a real-time process receives data in the form of an event stream and immediately processes it. Then, it either sends the processed event to a sink or to another stream of events to be processed further. Since this kind of processing happens on a stream of events, this is known as real-time stream processing:
Figure 1.4 – Real-time stream processing
As shown in Figure 1.4, event E0 gets processed and sent out by the streaming application, while events E1, E2 and E3 are waiting to be processed in the queue. At t1, event E1 also gets processed, showing continuous processing of events by streaming application
An event can generate at any time (24/7), which creates a new kind of problem. If the producer application of an event directly sends the event to a consumer, there is a chance of event loss, unless the consumer application is running 24/7. Even bringing down the consumer application for maintenance or upgrades isn’t possible, which means there should be zero downtime for the consumer application. However, any application with zero downtime is not realistic. Such a model of communication between applications is called point-to-point communication.
Another challenge in point-to-point communication for real-time problems is the speed of processing as this should be always equal to or greater than that of a producer. Otherwise, there will be a loss of events or a possible memory overrun of the consumer. So, instead of directly sending events to the consumer application, they are sent asynchronously to an Event Bus or a Message Bus. An Event Bus is a high availability container that can hold events such as a queue or a topic. This pattern of sending and receiving data asynchronously by introducing a high availability Event Bus in between is called the Pub-Sub framework.
The following are some important terms related to real-time processing problems:
- Events: This can be defined as a data packet generated as a result of an action, a trigger, or an occurrence. They are also popularly known as messages in the Pub-Sub framework.
- Producer: A system or application that produces and sends events to a Message Bus is called a publisher or a producer.
- Consumer: A system or application that consumes events from a Message Bus to process is called a consumer or a subscriber.
- Queue: This has a single producer and a single consumer. Once a message/event is consumed by a consumer, that event is removed from the queue. As an analogy, it’s like an SMS or an email sent to you by one of your friends.
- Topic: Unlike a queue, a topic can have multiple consumers and producers. It’s a broadcasting channel. As an analogy, it’s like a TV channel such as HBO, where multiple producers are hosting their show, and if you have subscribed to that channel, you will be able to watch any of those shows.
A real-world example of a real-time problem is credit card fraud detection, where you might have experienced an automated confirmation call to verify the authenticity of a transaction from your bank, if any transaction seems suspicious while being executed.
Near-real-time processing
Near-real-time processing, as its name suggests, is a problem whose response or processing time doesn’t need to be as fast as real time but should be less than 1 hour. One of the features of near-real-time processing is that it processes events in micro batches. For example, a near-real-time process may process data in a batch interval of every 5 minutes, a batch size of every 100 records, or a combination of both (whichever condition is satisfied first).
At time tx, all events (E1, E2 and E3) that are generated between t0 and tx are processed together by near real-time processing job. Similarly all events (E4, E5 and E6) between time tx and tn are processed together.
Figure 1.5 – Near-real-time processing
Typical near-real-time use cases are recommendation problems such as product recommendations for services such as Amazon or video recommendations for services such as YouTube and Netflix.
Publishing problems
Publishing problems deal with publishing the processed data to different businesses and teams so that data is easily available with proper security and data governance. Since the main goal of the publishing problem is to expose the data to a downstream system or an external application, having extremely robust data security and governance is essential.
Usually, in modern data architectures, data is published in one of three ways:
- Sorted data repositories
- Web services
- Visualizations
Let’s take a closer look at each.
Sorted data repositories
Sorted data repositories is a common term used for various kinds of repositories that are used to store processed data. This is usable and actionable data and can be directly queried by businesses, analytics teams, and other downstream applications for their use cases. They are broadly divided into three types:
- Data warehouse
- Data lake
- Data hub
A data warehouse is a central repository of integrated and structured data that’s mainly used for reporting, data analysis, and Business Intelligence (BI). A data lake consists of structured and unstructured data, which is mainly used for data preparation, reporting, advanced analytics, data science, and Machine Learning (ML). A data hub is the central repository of trusted, governed, and shared data, which enables seamless data sharing between diverse endpoints and connects business applications to analytic structures such as data warehouses and data lakes.
Web services
Another publishing pattern is where data is published as a service, popularly known as Data as a Service. This data publishing pattern has many advantages as it enables security, immutability, and governance by design. Nowadays, as cloud technologies and GraphQL are becoming popular, Data-as-a-Service is getting a lot of traction in the industry.
The two popular mechanisms of publishing Data as a Service are as follows:
We will discuss these techniques in detail later in this book.
Visualization
There’s a popular saying: A picture is worth a thousand words. Visualization is a technique by which reports, analytics, and statistics about the data are captured visually in graphs and charts.
Visualization is helpful for businesses and leadership to understand, analyze, and get an overview of the data flowing in their business. This helps a lot in decision-making and business planning.
A few of the most common and popular visualization tools are as follows:
- Tableau is a proprietary data visualization tool. This tool comes with multiple source connectors to import data into it and create easy fast visualization using drag-and-drop visualization components such as graphs and charts. You can find out more about this product at https://www.tableau.com/.
- Microsoft Power BI is a proprietary tool from Microsoft that allows you to collect data from various data sources to connect and create powerful dashboards and visualizations for BI. While both Tableau and Power BI offer data visualization and BI, Tableau is more suited for seasoned data analysts, while Power BI is useful for non-technical or inexperienced users. Also, Tableau works better with huge volumes of data compared to Power BI. You can find out more about this product at https://powerbi.microsoft.com/.
- Elasticsearch-Kibana is an open source tool whose source code is open source and has free versions for on-premise installations and paid subscriptions for cloud installation. This tool helps you ingest data from any data source into Elasticsearch and create visualizations and dashboards using Kibana. Elasticsearch is a powerful text-based Lucene search engine that not only stores the data but enables various kinds of data aggregation and analysis (including ML analysis). Kibana is a dashboarding tool that works together with Elasticsearch to create very powerful and useful visualizations. You can find out more about these products at https://www.elastic.co/elastic-stack/.
Important note
A Lucene index is a full-text inverse index. This index is extremely powerful and fast for text-based searches and is the core indexing technology behind most search engines. A Lucene index takes all the documents, splits them into words or tokens, and then creates an index for each word.
- Apache Superset is a completely open source data visualization tool (developed by Airbnb). It is a powerful dashboarding tool and is completely free, but its data source connector support is limited, mostly to SQL databases. A few interesting features are its built-in role-based data access, an API for customization, and extendibility to support new visualization plugins. You can find out more about this product at https://superset.apache.org/.
While we have briefly discussed a few of the visualization tools available in the market, there are many visualizations and competitive alternatives available. Discussing data visualization in more depth is beyond the scope of this book.
So far, we have provided an overview of data engineering and the various types of data engineering problems. In the next section, we will explore what role a Java data architect plays in the data engineering landscape.