As data is increasingly recognized as a valuable corporate asset, and as new technologies make it possible to store and process vast amounts of it, the number of opportunities and roles in data-related careers has grown.
Let’s look at a sample use case where a sales manager for a consumer goods organization wants to better understand which alternate products a customer considers before purchasing one of the organization’s products. They also want a better way of predicting product demand by category based on external factors, such as the expected weather.
Achieving the outcomes that the sales manager wants will require bringing in data from multiple internal and external sources. Datasets relevant to this scenario may include the following:
- Customer, product, and order relational databases
- Web server logs from the consumer-facing storefront
- Third-party sales data from online marketplaces where relevant products are sold (such as Amazon.com)
- Other relevant third-party datasets that may influence sales (for example, weather-related data)
Multiple teams would need to be involved in the project, with the following three roles playing a primary part in implementing the required solution:
- Data engineer
- Data scientist
- Data analyst
Let’s take a look at how these three roles would contribute to this new project.
Understanding the role of the data engineer
The role of a data engineer is to do the following:
- Design, implement, and maintain the pipelines that enable the ingestion of raw data into a storage platform
- Transform that data to be optimized for analytics, based on data consumer requirements
- Make that data available for various data consumers using their tool of choice
A data engineer must also ensure that they comply with all applicable security and governance requirements while performing these tasks.
In our scenario, the data engineer will first need to design the pipelines that ingest raw data from various internal and external sources. To achieve this, they will use a variety of tools, depending on the source system and whether it will be scheduled batch ingestion or near real-time streaming ingestion (as discussed in Chapter 6, Ingesting Batch and Streaming Data).
The data engineer is also responsible for transforming the raw input datasets to optimize them for analytics, using various techniques (as discussed in Chapter 7, Transforming Data to Optimize for Analytics). The data engineer must also create processes to verify the quality of data, add metadata about the data to a data catalog, and manage the lifecycle of code related to data transformation.
Finally, the data engineer may need to assist in integrating various data consumption tools with the transformed data, enabling data analysts and data scientists to use their preferred tools to draw insights from the data.
The data engineer uses tools such as Apache Kafka, Apache Spark, and Presto, as well as other commercially available products, to build the data pipeline and optimize data for analytics.
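The ingest, validate, and transform responsibilities described above can be illustrated with a minimal sketch. It uses only the Python standard library rather than the tools named above, and the CSV layout, column names, and function names (`ingest`, `is_valid`, `transform`) are assumptions made up for this example, not part of any real pipeline.

```python
import csv
import io
from datetime import date

# Hypothetical raw export from an orders database. The column names and
# values are assumptions made up for this illustration.
RAW_ORDERS_CSV = """order_id,order_date,category,amount
1001,2023-07-01,ice cream,4.50
1002,2023-07-01,umbrellas,12.00
1003,not-a-date,ice cream,3.75
1004,2023-07-02,ice cream,
"""


def ingest(raw_text):
    """Ingestion step: read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def is_valid(row):
    """Quality check: the date must be ISO formatted and the amount numeric."""
    try:
        date.fromisoformat(row["order_date"])
        float(row["amount"])
    except ValueError:
        return False
    return True


def transform(rows):
    """Transformation step: convert valid rows to typed values, set aside the rest."""
    clean, rejected = [], []
    for row in rows:
        if is_valid(row):
            clean.append({
                "order_id": int(row["order_id"]),
                "order_date": date.fromisoformat(row["order_date"]),
                "category": row["category"],
                "amount": float(row["amount"]),
            })
        else:
            rejected.append(row)
    return clean, rejected


clean_rows, rejected_rows = transform(ingest(RAW_ORDERS_CSV))
print(f"{len(clean_rows)} clean rows, {len(rejected_rows)} rejected rows")
```

Keeping the rejected rows, rather than silently dropping them, is one possible design choice: it gives the data engineer something to inspect when a quality check starts failing.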
The data engineer is much like a civil engineer for a new residential development. The civil engineer is responsible for designing and building the roads, bridges, train stations, and so on, so that commuters can easily travel in and out of the development. In a similar way, the data engineer is responsible for designing and building the infrastructure required to bring data into a central source and for optimizing the data for use by various data consumers.
Understanding the role of the data scientist
The role of a data scientist is to draw complex insights and make predictions based on various datasets, using machine learning and artificial intelligence. The data scientist will combine a number of skills, including computer science, statistics, analytics, and math, in order to help an organization answer complex questions and make informed decisions using data.
Data scientists need to understand the raw data and know how to use that data to develop and train complex machine learning models that will help recognize patterns in the data and predict future trends. In our scenario, the data scientist may build a machine learning model that uses past sales data, correlated with weather information for each day in the reporting period.
They can then design and train this model to help business users get predictions on the likely top-selling categories for future dates based on the expected weather forecast (ice creams sell better on hot days, and umbrellas sell better on rainy days).
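As a rough illustration of this idea, the sketch below fits a simple least-squares line per category and ranks categories for a forecast temperature. It is far simpler than a production machine learning model, and the sales history, the category names, the use of temperature as the only weather feature, and the function names `fit_line` and `rank_categories` are all assumptions invented for this example.

```python
# Hypothetical history per category, as (daily high in Celsius, units sold).
# These numbers are made up purely for illustration.
HISTORY = {
    "ice cream": [(15, 20), (20, 35), (25, 50), (30, 65)],
    "hot drinks": [(15, 60), (20, 45), (25, 30), (30, 15)],
}


def fit_line(points):
    """Fit units = slope * temperature + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rank_categories(history, forecast_temp):
    """Return category names ordered by predicted units sold, highest first."""
    predictions = {}
    for category, points in history.items():
        slope, intercept = fit_line(points)
        predictions[category] = slope * forecast_temp + intercept
    return sorted(predictions, key=predictions.get, reverse=True)


hot_day_ranking = rank_categories(HISTORY, 30)
cold_day_ranking = rank_categories(HISTORY, 12)
print("Hot day:", hot_day_ranking)
print("Cold day:", cold_day_ranking)
```

A real model would likely draw on more features (rainfall, day of week, promotions) and far more history, which is why the data scientist depends on the datasets that the data engineer makes available.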
Where the data engineer is like a civil engineer building infrastructure for a new development, the data scientist is developing new and advanced products for the residents of the development, such as advanced types of transport in and out of the development. Data scientists create the machine learning models that enable data consumers and business analysts to draw new insights and predictions from data. However, much as the designer of a new airplane depends on having an airport where the plane can land and take off, the data scientist depends on data engineers creating data pipelines to bring in the data required to train new machine learning models.
Understanding the role of the data analyst
The role of a data analyst is to examine and combine multiple datasets in order to help a business understand trends in the data and make more informed business decisions. While a data scientist develops models that make future predictions or identify non-obvious patterns in data, the data analyst works with well-structured and modeled data to understand current conditions and to highlight recent patterns in the data.
A data analyst may answer questions such as which menu item sold best in different geographic regions over the past month, or which medical procedure had the best outcome for patients of different ages. These insights help an organization make better decisions for the future.
In our scenario, the data analyst may run complex queries against the different datasets that are available (such as an orders database or web server logs), joining together subsets of data from each source to gain new insights. For example, the data analyst may create a report highlighting which alternate products are most often browsed by a customer before a specific product is purchased. The data analyst may also make use of advanced machine learning models developed by the data scientists to gain further valuable insights.
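The alternate-products report could look something like the following sketch, which joins page views from web server logs to orders using Python's built-in `sqlite3` module. The table names, column names, product names, and sample rows are hypothetical, and a real analysis would run against the datasets prepared by the data engineer rather than an in-memory database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical orders and page views."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, product TEXT, ordered_at TEXT)")
    conn.execute("CREATE TABLE page_views (customer_id INTEGER, product TEXT, viewed_at TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
        (1, "Brand A cone", "2023-07-01T10:30"),
        (2, "Brand A cone", "2023-07-02T09:00"),
    ])
    conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", [
        (1, "Brand B cone", "2023-07-01T10:00"),
        (1, "Brand C tub", "2023-07-01T10:10"),
        (1, "Brand A cone", "2023-07-01T10:20"),
        (2, "Brand B cone", "2023-07-02T08:50"),
        (2, "Brand B cone", "2023-07-02T09:30"),  # viewed after the purchase
    ])
    return conn


def alternates_browsed_before(conn, product):
    """Count other products each buyer viewed before ordering `product`."""
    query = """
        SELECT v.product AS alternate, COUNT(*) AS views
        FROM orders AS o
        JOIN page_views AS v
          ON v.customer_id = o.customer_id
         AND v.viewed_at < o.ordered_at   -- ISO timestamps sort as text
         AND v.product <> o.product
        WHERE o.product = ?
        GROUP BY v.product
        ORDER BY views DESC, alternate
    """
    return conn.execute(query, (product,)).fetchall()


conn = build_demo_db()
alternates = alternates_browsed_before(conn, "Brand A cone")
for alternate, views in alternates:
    print(alternate, views)
```

With this sample data, the report lists Brand B cone (2 views) ahead of Brand C tub (1 view), and it ignores the view that happened after the purchase.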
Where the data engineer is like a civil engineer building infrastructure, and the data scientist is developing new, advanced forms of transportation, the data analyst is like a skilled pilot, using their expertise to get users to their end destination.
Understanding other common data-related roles
Organizations may have other role titles and job descriptions for data-related positions, but generally, these will be a subset of the roles described in the preceding sections.
For example, a big data architect or data platform architect could be a subset of the data engineer role, focused on designing the architecture for big data pipelines, but not building the data-specific pipelines. Or, a data visualization developer may be focused on building out visualizations using business intelligence tools, but this is effectively a subset of the data analyst role.
Larger organizations tend to have more focused job roles, while in a smaller organization, a single person may take on the roles of data engineer, data scientist, and data analyst.
In this book, we will focus on the role of the data engineer, and dive deep into how a data engineer is able to build complex data pipelines using the power of cloud computing services. Let’s now look at how cloud computing has simplified how organizations are able to build and scale out big data processing solutions.