Data engineering is the process of converting raw data into analytics-ready data that is more accessible, usable, and consumable than its raw format. Modern companies are increasingly data-driven, meaning they use data to make business decisions and to gain better insights into their customers and business operations. These insights help them improve profitability, reduce costs, and gain a competitive edge in the market. Behind the scenes, a host of data personas performs a series of tasks and processes, building reliable pipelines to source, transform, and analyze data so that the whole flow is repeatable and mostly automated.
Different systems produce different datasets that function as individual units but must be brought together to provide a holistic view of the state of the business – for example, a customer buying merchandise through different channels such as the web, in-app, or in-store. Analyzing activity across all the channels helps predict the next customer purchase, and possibly the next channel type as well. In other words, having all the datasets in one place can answer questions that the individual systems couldn't. Data consolidation is therefore an industry trend that breaks down individual silos. However, each system may have been designed differently, with different requirements and service-level agreements (SLAs), and all of that data now needs to be normalized and consolidated in a single place to facilitate better analytics.
The following diagram compares the process of farming to that of processing and refining data. In both setups, there are different producers and consumers and a series of refining and packaging steps:
Figure 1.1 – Farming compared to a data pipeline
In this analogy, there is a farmer, and the process consists of growing crops, harvesting them, and making them available in a grocery store. This produce eventually becomes a ready-to-eat meal. Similarly, a data engineer is responsible for creating ready-to-consume data so that each consumer does not have to repeat the same heavy lifting. Each cook taps into a different point of the pipeline and makes different recipes based on the specific needs of their use case. However, it is the freshness and quality of the produce that make for a delightful meal, irrespective of the recipe used.
We are at the interesting confluence of big data, the cloud, and artificial intelligence (AI), all of which are fueling tremendous innovation in every conceivable industry vertical and generating data exponentially. Data engineering is increasingly important as data drives business use cases in every industry vertical. You may argue that data scientists and machine learning (ML) practitioners are the unicorns of the industry who can work their magic for the business, but without good data, that is a stretch of the imagination. Simple algorithms fed a lot of good, reliable data produce better insights than complicated algorithms fed inadequate data. Some examples of how pivotal data is to the very existence of some of these businesses are listed in the following section.
Use cases
In this section, we've taken popular use cases from a few industry verticals to highlight how data is being used as a driving force for their everyday operations and the scale of the data involved:
- Security Information and Event Management (SIEM) cybersecurity systems for threat detection and prevention.
This involves monitoring and auditing user activity for suspicious patterns, and entails collecting a large volume of logs across several devices and systems, analyzing them in real time, correlating events, and reporting on findings via alerts and dashboard refreshes. A minimal sketch of such a pipeline follows this list.
- Genomics and drug development in health and life sciences.
The Human Genome Project took about 13 years to complete. A single human genome requires about 100 gigabytes of storage, and it is estimated that by 2025, 40 exabytes of storage will be required to process and store all the sequenced genomes (a back-of-the-envelope check follows this list). This data helps researchers understand diseases and develop cures that are more targeted and precise.
- Autonomous vehicles in transportation.
Autonomous vehicles use a lot of unstructured image data generated from cameras on the body of the car to make safe driving decisions. It is estimated that an active vehicle generates about 5 TB every hour. Some of it will be thrown away after a decision has been made, but part of it will be saved locally as well as transmitted to a data center for long-term trend monitoring.
- IoT sensors in Industry 4.0 smart factories in manufacturing.
Smart manufacturing and the Industry 4.0 revolution, which are powered by advances in IoT, are enabling a lot of efficiencies in machine and human utilization on the shop floor. Data is at the forefront of scaling these smart factory initiatives with real-time monitoring, predictive maintenance, early alerting, and digital twin technology to create closed-loop operations.
- Personalized recommendations in retail.
In an omnichannel experience, personalization helps retailers engage better with their customers, irrespective of the channel customers choose, while picking up the relevant state from the previous channel they may have used. Retailers can address concerns before the customer churns to a competitor. Personalization at scale not only delivers a percentage lift in sales but also reduces marketing and sales costs.
- Recommendations in gaming and media streaming.
Games such as Fortnite and Minecraft have captivated children and adults alike, with players spending several hours in a multi-player online game session. It is estimated that Fortnite generates 100 MB of data per user, per hour. Music and video streaming services also rely heavily on recommendations for new playlists. Netflix receives more than a million new ratings every day and uses several parameters to bin users and understand similarities in their preferences.
- Precision agriculture.
The market for big data in North American agriculture is estimated to be worth 6.2 billion US dollars. Farms use big data to understand weather patterns for smart irrigation and crop planting, as well as to check soil conditions for the right fertilizer dose. John Deere uses computer vision to detect weeds and can localize the use of sprays, helping preserve the quality of both the environment and the produce.
- Fraud detection in the Fintech sector.
Detecting and preventing fraud is a constant effort as fraudsters find new ways to game the system. Because we are constantly transacting online, a lot of digital footprints are left behind. By some estimates, about 10% of insurance company payments are made against fraudulent claims. Techniques such as biometric verification and ML-based anomaly detection can flag unusual patterns, leading to better monitoring and risk assessment so that the user can be alerted before much damage is done; a toy example of such anomaly detection follows this list.
- Forecasting use cases across a wide variety of verticals.
Every business has some need for forecasting, whether to predict sales, stock inventory, or supply chain logistics. It is not as straightforward as a simple projection – the outcome is influenced by other patterns, such as seasonality, weather, and shifts in micro- and macroeconomic conditions. Data that's been augmented over several years with additional data feeds helps create more realistic and accurate forecasts; the final sketch after this list illustrates a seasonality-aware forecast.
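The following short sketches make a few of these use cases concrete. First, the SIEM pipeline: a minimal sketch in PySpark Structured Streaming that ingests authentication logs, correlates failed logins per user over a sliding window, and raises candidate alerts. The input path, log schema, and alerting threshold are illustrative assumptions, not a reference to any specific product:

```python
# A minimal SIEM-style streaming sketch (assumed path, schema, and threshold).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("siem-sketch").getOrCreate()

# Ingest JSON log events as they land (schema assumed for illustration).
logs = (spark.readStream
        .format("json")
        .schema("ts TIMESTAMP, user STRING, ip STRING, action STRING")
        .load("/data/auth_logs/"))

# Correlate events: count failed logins per user over a sliding window.
suspicious = (logs
    .filter(F.col("action") == "login_failed")
    .withWatermark("ts", "10 minutes")
    .groupBy(F.window("ts", "5 minutes", "1 minute"), "user")
    .count()
    .filter(F.col("count") >= 5))  # flag 5+ failures within 5 minutes

# A real system would write to an alerting sink; console is for demonstration.
(suspicious.writeStream
    .outputMode("update")
    .format("console")
    .start()
    .awaitTermination())
```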
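Second, a back-of-the-envelope check on the genomics numbers quoted earlier, using only the two figures from the text:

```python
# Rough arithmetic on the genomics storage estimates quoted above.
genome_bytes = 100 * 10**9   # ~100 GB per sequenced genome
total_bytes = 40 * 10**18    # ~40 exabytes projected by 2025
genomes = total_bytes / genome_bytes
print(f"~{genomes:,.0f} genomes")  # ~400,000,000, i.e., 400 million genomes
```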
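Third, the fraud use case: a toy example of ML-based anomaly detection using scikit-learn's IsolationForest on synthetic transaction amounts. The data, the single feature, and the contamination rate are all assumptions made for illustration:

```python
# A toy fraud-detection sketch on synthetic transaction amounts.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=50, scale=15, size=(980, 1))   # typical purchases
fraud = rng.uniform(low=500, high=2000, size=(20, 1))  # extreme outliers
X = np.vstack([normal, fraud])

# Isolation forests isolate anomalies quickly; -1 marks suspected fraud.
model = IsolationForest(contamination=0.02, random_state=0).fit(X)
flags = model.predict(X)
print((flags == -1).sum(), "transactions flagged for review")
```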
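Finally, the forecasting use case: a minimal sketch of a seasonality-aware forecast that fits a Holt-Winters model from statsmodels to a synthetic monthly sales series. The series and the additive trend/seasonality settings are illustrative assumptions:

```python
# A seasonality-aware forecasting sketch on synthetic monthly sales.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Four years of monthly sales with an upward trend and yearly seasonality.
months = pd.date_range("2018-01-01", periods=48, freq="MS")
sales = pd.Series(
    100 + 2 * np.arange(48) + 20 * np.sin(2 * np.pi * np.arange(48) / 12),
    index=months,
)

# Holt-Winters captures both the trend and the seasonal pattern.
model = ExponentialSmoothing(
    sales, trend="add", seasonal="add", seasonal_periods=12
).fit()
print(model.forecast(6))  # the next six months of predicted sales
```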
How big is big data?
Of all the data generated thus far, an estimated 90% was generated in the last two years alone. At the time of writing, an estimated 2.5 quintillion (18 zeros) bytes of data are produced every day. A typical commercial aircraft generates 20 terabytes of data per engine for every hour it's in flight.
We are just at the beginning stages of autonomous vehicles, which rely on a multitude of data points to operate. The world's population is about 7.7 billion, while the number of connected devices is about 10 billion – and with portions of the world not yet connected to the internet, this number will only grow as the influx of IoT sensors and other connected devices continues. People have an appetite for apps and services that generate data: search, social media, communication, services such as YouTube and Uber, photo and video services such as Snapchat and Facebook, and more. The following statistics give you a better idea of the data being generated all around us and why we need to swim effectively through the waves and turbulence it creates to digest the most useful nuggets of information.
Every minute, the following occurs (approximately):
- 16 million text messages sent
- 1 million Tinder swipes
- 160 million emails sent
- 4 million YouTube videos watched
- 0.5 million tweets posted
- 0.5 million photos shared on Snapchat
With so much data being generated, there is a need for robust data engineering tools and frameworks, and for reliable data and analytics platforms, to harness this data and make sense of it. This is where data engineering comes to the rescue. Data is as important an asset as code, so there should be governance around it. Structured data accounts for only 5-10% of enterprise data; semi-structured and unstructured data need to be added to complete the picture.
Data is the new oil and is at the heart of every business. However, raw data by itself is not going to make a dent in a business. It is the useful insights generated from curated data that are the refined, consumable oil that businesses aspire to. Data drives ML, which, in turn, gives businesses their competitive advantage. This is the age of digitization, where most successful businesses see themselves as tech companies first. Start-ups have the advantage of selecting the latest digital platforms, while traditional companies are undergoing digital transformations. Still, you may ask, "Why should I care so much about the underlying data? I have highly qualified ML practitioners – the unicorns of the industry – who can use sophisticated algorithms and their special skill sets to make magic!"
In this section, we established the importance of curating data since raw data by itself isn't going to make a dent in a business. In the next section, we will explore the influence that curated data has on the effectiveness of ML initiatives.
But aren't ML and AI all the rage today?
AI and ML are catchy buzzwords, and everybody wants to be on the bandwagon and use ML to differentiate their product. However, the hardest part about ML is not the ML itself – it is managing everything around its creation, as Google showed in its 2015 paper, Hidden Technical Debt in Machine Learning Systems (https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf). Garbage in, garbage out holds true. The magic wand of ML only works if the boxes surrounding the ML code in the following figure are well developed, and most of those boxes represent data engineering tasks. In short, high-quality curated data is the foundational layer of any ML application, and the data engineering practices that curate this data are the backbone that holds it all together:
Figure 1.2 – The hardest part about ML is not ML, but rather everything else around it
Technologies come and go, so understanding the core challenges around data is critical. As technologists, we create more impact when we align solutions with business challenges. Speed to insight is what all businesses demand, and the key to it is data. The data and IT functional areas within an organization, traditionally viewed as cost centers, are now being viewed as revenue-generating sources. Organizations where business and tech cooperate, instead of competing with each other, are the ones most likely to succeed with their data initiatives. Building data services and products involves several personas. In the next section, we will articulate the varying skill sets of these personas within an organization.