
Exploring the Roles in Building Azure AI Solutions

  • 19 min read
  • 13 Sep 2023


This article is an excerpt from the book Azure Data and AI Architect Handbook, by Olivier Mertens and Breght Van Baelen. The book helps you master core data architecture design concepts and Azure Data & AI services, giving you a cloud data and AI architect's perspective on developing end-to-end solutions.

Introduction

Artificial Intelligence (AI) is rapidly transforming businesses across various industries. Especially with the surge in popularity of large language models such as ChatGPT, AI adoption is increasing exponentially. Microsoft Azure provides a wide range of AI services to help organizations build powerful AI solutions. In this chapter, we will explore the different AI services available on Azure, as well as the roles involved in building AI solutions, and the steps required to design, develop, and deploy AI models on Azure.

Specifically, we will cover the following:

  • The different roles involved in building AI solutions
  • The questions a data architect should ask when designing an AI solution

By the end of this article, you will have a good understanding of the role of the data architect in the world of data science. Additionally, you will have a high-level overview of what data scientists and machine learning engineers are responsible for.

Knowing the roles in data science

The Azure cloud offers an extensive range of services for use in advanced analytics and data science. Before we dive into these, it is crucial to understand the different roles in the data science ecosystem. In previous chapters, while always looking through the lens of a data architect, we saw workloads that are typically operationalized by data engineers, database administrators, and data analysts.

Up until now, the chapters followed the journey of data through a data platform, from ingestion to raw storage to transformation, data warehousing, and eventually, visualization and dashboarding. The advanced analytics component is more separated from the entire solution, in the sense that most data architectures can perform perfectly without it. This does not take away from the fact that adding advanced analytics such as machine learning predictions can be a valuable enhancement to a solution.

The environment for advanced analytics introduces some new roles. The most prominent are the data scientist and the machine learning engineer, which we will look at in a bit more detail, starting with the following figure. Other profiles include roles such as data labelers and citizen data scientists.


Figure 9.1 – An overview of the core components that each data role works with

Figure 9.1 shows a very simplified data solution with a machine learning component attached to it. This consists of a workspace to build and train machine learning models and virtual machine clusters to deploy them in production.

The data scientist is responsible for building and training the machine learning model. This is done through experimenting with data, most of the time stemming from the data lake. The data scientist will often use data from the bronze or silver tier in the data lake (i.e., the raw or semi-processed data). Data in the gold tier or the data warehouse is often transformed and aggregated in ways that make it convenient for business users to build reports with. However, the data scientist might want to perform different kinds of transformations, which focus more on the statistical relevance of certain features within the data to optimize the training performance of a machine learning model. Regardless, in some cases, data scientists will still interact with the gold layer and the data warehouse to pull clean data for experimentation.

Using this data, data scientists will perform exploratory data analysis (EDA) to get initial insights into the dataset. This is followed by data cleaning and feature engineering, where features are transformed or new features are derived to serve as input for the machine learning model. Next up, a model is trained and evaluated, resulting in a first prototype. The experimentation does not stop here, however, as machine learning models have hyperparameters that can be adjusted, which might lead to increased performance, while still using the same dataset. This last process is called hyperparameter tuning. Once this is completed, we will arrive at the cutoff point between the responsibilities of a data scientist and a machine learning engineer.
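
The experimentation loop described above (train, evaluate, tune hyperparameters) can be sketched as follows. scikit-learn, the synthetic dataset, and the tiny parameter grid are illustrative choices, not prescribed by the book:

```python
# A minimal sketch of the data scientist's experimentation loop, using
# scikit-learn and a synthetic dataset for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# A stand-in for cleaned, feature-engineered data from the data lake.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hyperparameter tuning: search a small grid with cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("held-out accuracy:", grid.best_estimator_.score(X_test, y_test))
```

In a real project, the EDA, data cleaning, and feature engineering steps would precede this, and the search space would be far larger.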

The machine learning engineer is responsible for the machine learning operations, often referred to as MLOps. Depending on the exact definition, this usually encompasses the later stages of the machine learning model life cycle. The machine learning engineer receives the finished model from the data scientist and creates a deployment for it. This will make the model available through an API so that it can be consumed by applications and users. In later stages, the model will need to be monitored and periodically retrained, until the end of its life cycle. This is a brief summary, but the MLOps process will be explained in more detail further in this chapter.
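
The hand-off point between the two roles can be sketched as follows: the data scientist's finished model is serialized, and the machine learning engineer wraps a scoring function around it that a web framework or managed endpoint would expose as an API. The class and function names are invented for illustration:

```python
# A hedged sketch of the data scientist / ML engineer hand-off.
import pickle

class ThresholdModel:
    """Stand-in for a trained model: predicts 1 when a value exceeds a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, values):
        return [int(v > self.threshold) for v in values]

# The data scientist hands over the finished model as a serialized artifact.
artifact = pickle.dumps(ThresholdModel(threshold=0.5))

# The machine learning engineer loads the artifact once at deployment time...
model = pickle.loads(artifact)

def score(payload: dict) -> dict:
    """...and this is the function an API route would call for each request."""
    return {"predictions": model.predict(payload["values"])}

print(score({"values": [0.2, 0.9]}))  # → {'predictions': [0, 1]}
```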

Next, Figure 9.2 provides an overview of the processes that take place in the MLOps cycle and who the primary contributor to each step is.


Figure 9.2 – The steps of the data science workflow and their executors

Finally, what we are most interested in is the role of the cloud data architect in this environment. First, the architect has to think about the overall AI approach, part of which is deciding whether to go for custom development or not. We will dive deeper into strategy soon.

If custom machine learning model development is involved, the architect will have to decide on a data science environment, or workspace, where the data scientists can experiment.

However, the architect will have more involvement in the work of a machine learning engineer. The optimal working of MLOps is considerably more dependent on good architectural design than the typical prototyping done by data scientists. Here, the architect is responsible for deciding on deployment infrastructure, choosing the right monitoring solutions, version control for models, datasets, code, retraining strategies, and so on.

A lot of the value that an architect brings to machine learning projects comes from design choices outside of the data science suite. The data architect can greatly facilitate the work of data scientists by envisioning efficient data storing structures at the data lake level, with a strong focus on silver (and bronze) tiers with good data quality. Often, extra pipelines are required to get labeled data ready to be picked up by the data scientists.

Designing AI solutions

In this part, we will talk about the design of AI solutions, including qualification, strategy, and the responsible use of AI. Infusing AI into architecture has to be the result of some strategic consideration. The data architect should ask themself a series of questions, and find a substantiated answer, to end up with an optimal architecture.

The first set of questions is regarding the qualification of a use case.

Is AI the right solution?

This can be further related to the necessity of an inductive solution, compared to a deductive one. Business rulesets are deductive; machine learning is inductive. Business rules will provide you with a solid answer if the condition for that rule is met. Machine learning models will provide you with answers that are highly probable, but never certain.

The big advantage of machine learning is its ability to cover cases in a much more granular manner, whereas business rules must group various cases within a single condition so as not to end up with an absurd or even impossible number of rules. Look at image recognition, for example. Trying to make a ruleset for every possible combination of pixels that might represent a human is simply impossible. Knowing this, evaluate the proposed use case and confirm that the usage (and correlating costs) of AI is justified for this solution.
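
The deductive/inductive contrast can be made concrete with a toy example: a business rule fires a hard yes/no, while a learned model returns a probability. Both "models" below are invented placeholders:

```python
# A tiny illustration of deductive rules versus inductive models.
def rule_based(transaction):
    # Deductive: a fixed condition, a certain answer when it fires.
    return transaction["amount"] > 10_000

def ml_model(transaction):
    # Inductive: a probability, never certainty.
    # (A stand-in for a trained fraud-detection model.)
    return min(transaction["amount"] / 20_000, 1.0)

t = {"amount": 12_000}
print(rule_based(t))  # → True
print(ml_model(t))    # → 0.6
```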

Do we opt for pre-trained models or a custom model?

Although this question is more focused on implementation than qualification, it is crucial to answer it first, as this will directly impact the following two questions. As with most things in the broader field of IT, it comes down to not reinventing the wheel. Does your use case sound like something generic or industry-agnostic? Then there are probably existing machine learning models, often with far superior performance (general knowledge-wise) than your own data could train a model to have. Companies such as Microsoft and partners such as OpenAI invest heavily in getting these pre-trained models to cutting-edge standards.

It may be that the solution you want to create is fairly generic, but there are certain aspects that make it a bit more niche. An example could be a text analytics model in the medical industry. Text analytics models are great at the general skill of language understanding, but they might have some issues with grasping the essence of industry-specific language out of the box. In this case, an organization can provide some of its own data to fine-tune the model to increase its performance on niche tasks, while maintaining most of the general knowledge from its initial training dataset. Most of the pre-trained AI models on Azure, which reside in Azure Cognitive Services and Azure OpenAI Service, are fine-tunable. When out-of-the-box models are not an option, then we need to look at custom development.

Is data available?

If we opt for custom development, we will need to bring our own data. The same goes for wanting to fine-tune an existing model, yet to a lesser extent. Is the data that we need available? Does an organization have a significant volume of historical data stored already in a central location? If this data is still spread across multiple platforms or sources, then this might indicate it is not the right time to implement AI. It would be more valuable to focus on increased data engineering efforts in this situation. In the case of machine learning on Azure, data is ideally stored in tiers in Azure Data Lake Storage.

Keep in mind that machine learning model training does not stop after putting it into production. The performance of the production model will be constantly monitored, and if it starts to drift over time, retraining will take place. Do the sources of our current historical data still generate an adequate volume of data to carry out retraining?


In terms of data volume, there is still a common misconception that large volumes of data are a necessity for any high-performing model. Even though the performance of a model still scales with the amount of training data, more and more techniques have been developed that allow valuable performance levels to be reached with a limited data volume.

Is the data of acceptable quality?

Just like the last question, this only applies to custom development or fine-tuning. Data quality between sources can differ immensely. There are different ways in which data can be of bad quality. Some issues can be solved easily; others can be astonishingly hard. Some examples of poor data quality are as follows:

  • Inaccurate data: This occurs when data is incorrect or contains errors, such as typos or missing values. This is not easy to solve and will often result in fixes required at the source.
  • Incomplete data: This occurs when data is missing important information or lacks the necessary details to be useful. In some cases, data scientists can use statistics to impute missing data. In other cases, it might depend on the specific model that is being developed. Certain algorithms can perform well with sparse data, while others are heavily affected by it. Knowing exactly which algorithms these are should not be in the scope of the architect but, rather, of the data scientists.
  •  Outdated data: This occurs when data is no longer relevant or useful due to changes in circumstances or the passage of time. If this data is statistically dissimilar to data generated in the present, it is better to remove this data from the training dataset.
  • Duplicated data: This occurs when the same data is entered multiple times in different places, leading to inconsistencies and confusion. Luckily, this is one of the easiest data quality issues to solve.
  •  Biased data: This occurs when data is influenced by personal biases or prejudices, leading to inaccurate or unfair conclusions. This can be notoriously hard to solve and is a well-known issue in the data science world. We will come back to this later when discussing responsible AI.
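
Two of the easier fixes above, imputing missing values and dropping duplicated rows, can be sketched in a few lines of pandas. The dataframe is invented for illustration:

```python
# A small pandas sketch of handling incomplete and duplicated data.
import pandas as pd

df = pd.DataFrame({
    "age":    [34, None, 51, 34],
    "income": [40_000, 52_000, 61_000, 40_000],
})

# Incomplete data: impute the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Duplicated data: drop rows that are exact copies of earlier ones.
df = df.drop_duplicates()

print(df)
```

Inaccurate, outdated, and biased data have no such one-liner fixes; they typically require changes at the source or careful dataset curation.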

This concludes the qualifying questions on whether to implement AI or not. There is one more important topic, namely the return on investment (ROI) of the addition, but to calculate the investment, we need to have more knowledge on the exact implementation. This will be the focus of the next set of questions.

Low code or code first?

The answer to which approach should be chosen depends on people, their skill sets, and the complexity of the use case. In the vast majority of cases, code-first solutions are preferred, as they come with considerably more flexibility and versatility. Low code simplifies development a lot, often by providing drag-and-drop interfaces to create workflows (or, in this case, machine learning pipelines). While low-code solutions often benefit from rapid development, this advantage in speed is slowly shrinking. Due to advancements in libraries and packages, generic code-first models are also being developed in a shorter amount of time than before.

While code-first solutions cover a much broader set of use cases, they are simply not possible for every organization. Data scientists tend to be an expensive resource and are often fought over, with competition due to a lack of them in the labor market. Luckily, low-code platforms are advancing fast to address this issue. This allows citizen data scientists (non-professionals) to create and train machine learning models easily, although it will still yield inferior performance compared to professional code-first development.

As a rule of thumb, if a professional data science team is present and it has already been decided that custom development is the way forward, choose a code-first solution.

What are the requirements for the AI model?

Now, we will dive deeper into the technicalities of machine learning models. Note that not all answers here must come from the data architect. It is certainly a plus if the architect can think about things such as model selection with the data scientists, but it is not expected of the role. Leave it to the data science and machine learning team to have a clear understanding of the technical requirements for the AI model and allow them to leverage their expertise.

The minimum accepted performance is probably the most straightforward. This is a defined threshold on the primary metric of a model, based on what is justifiable for the use case to progress. For instance, a model might need to have a minimum accuracy of 95% to be economically viable and continue toward production.

Next, latency is an important requirement when the model is used to make real-time predictions. The larger the model and the more calculations that need to happen (not counting parallelism), the longer it will take to make a prediction. Some use cases will require a prediction latency within milliseconds, which can be solved with lightweight model selection and specialized infrastructure.
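
The two requirements discussed so far, a minimum accepted performance and a latency budget, can be combined into a simple deployment gate. The thresholds and the dummy model below are invented placeholders:

```python
# A sketch of a deployment gate on accuracy and prediction latency.
import time

def predict(x):
    return x * 2  # stand-in for a real model's inference call

def measure_latency_ms(fn, arg, runs=100):
    """Average wall-clock time per call, in milliseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        fn(arg)
    return (time.perf_counter() - start) / runs * 1000

def meets_requirements(accuracy, latency_ms,
                       min_accuracy=0.95, max_latency_ms=50):
    # Only progress toward production if both thresholds are cleared.
    return accuracy >= min_accuracy and latency_ms <= max_latency_ms

latency = measure_latency_ms(predict, 3)
print(meets_requirements(accuracy=0.97, latency_ms=latency))
```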

Another requirement is the size of the model, which directly relates to the hosting costs when deployed into production, as the model will have to be loaded in RAM while the deployment runs. This is mostly a very binding requirement for IoT Edge use cases, where AI models are deployed on a small IoT device and make predictions locally before sending their results to the cloud. These devices often have very limited memory, and the data science team will have to figure out what the most efficient model is to fit on the device.

With the recently growing adoption of large language models (LLMs), such as the GPT-model family, power consumption has started to become an increasingly important topic as well. Years ago, this was a negligible topic in most use cases, but with the massive size of today’s cutting-edge models, it is unavoidable. Whether these models are hosted privately or in the cloud, power consumption will be an incurred cost directly or indirectly. For natural language use cases specifically, consider whether the traditional (and significantly cheaper) text analytics models in Azure Cognitive Services can do the job at an acceptable level before heading straight for LLMs.

Batch or real-time inferencing?

When a model is finished and ready for deployment, the architect will have to decide on the type of deployment. At a high level, we should decide whether the model will be used for batch scoring or for real-time predictions.

Typically, when machine learning predictions are used to enrich data, which is already being batch processed in an OLAP scenario, the machine learning model can do periodical inferencing on large batches. The model will then be incorporated as an extra transformation step in the ETL pipeline. When using machine learning models in applications, for example, where users expect an instant prediction, real-time endpoints are required.
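
The contrast between the two modes can be sketched as follows: batch scoring runs as a transformation step over an accumulated dataset, while a real-time call scores one record on demand. The model and record shapes are invented for illustration:

```python
# A sketch contrasting batch and real-time inferencing.
def model(record):
    return record["amount"] > 100  # e.g., flag large transactions

# Batch: score a whole day's records as an extra step in the ETL pipeline.
def batch_score(records):
    return [dict(r, flagged=model(r)) for r in records]

# Real-time: score a single record the moment an application asks.
def realtime_score(record):
    return {"flagged": model(record)}

day = [{"amount": 50}, {"amount": 250}]
print(batch_score(day))
print(realtime_score({"amount": 250}))  # → {'flagged': True}
```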

When deploying our model to an endpoint, the architecture might differ based on the type of inferencing, which we will look into in more depth later in this chapter.

Is explainability required?

Explainable AI, often referred to as XAI, has been on the rise for quite a while now. For traditional machine learning models, it was straightforward to figure out why a model came to a certain conclusion, through statistical methods such as feature importance. With the rise of deep learning models, which are essentially black-box models, we come across more and more predictions that cannot be explained.

Techniques have been developed to make an approximation of the decision-making process of a black-box model. For instance, in the case of the mimic explainer, a traditional (and by nature interpretable) machine learning model is trained to mimic the black-box model, and things such as feature importance are extracted from the mimic model. However, this remains an approximation, not a guarantee.
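
The mimic-explainer idea can be sketched by training a shallow, interpretable model on the predictions of a black box and reading feature importances from the surrogate. scikit-learn and the synthetic black box are illustrative choices, not the specific Azure implementation:

```python
# A hedged sketch of a mimic (surrogate) explainer.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 3))

def black_box(X):
    # Pretend this is an opaque deep model; feature 0 drives the output.
    return (X[:, 0] > 0.5).astype(int)

# Train an interpretable model on the black box's *outputs*, not the
# true labels, so it approximates the black box's decision process.
mimic = DecisionTreeClassifier(max_depth=2, random_state=0)
mimic.fit(X, black_box(X))

print("approximate feature importances:", mimic.feature_importances_)
```

Because the surrogate only approximates the black box, its importances describe the mimic's behavior, which may diverge from the original model on unseen inputs.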

Therefore, it is key to figure out how crucial explainability is for the use case. In cases that (heavily) affect humans, such as predicting credit scoring using AI, interpretability is a must. In cases with minimal or no impact on human lives, interpretability is more of a nice-to-have. In this instance, we can opt for a black-box model if this provides increased predictive performance.

What is the expected ROI?

When the qualifying questions have been answered and decisions have been made to fulfill technical requirements, we should have sufficient information to calculate an estimated ROI. This will be the final exercise before giving the green light to start implementation, or at least the development of a proof of concept.

If we know what approach to use, what kind of models to train, and which type of deployment to leverage, we can start mapping it to the right Azure service and perform a cost calculation. This is compared to the expected added value of a machine learning model.
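
A back-of-the-envelope version of that comparison looks as follows. All figures are invented placeholders that a real Azure cost calculation and value estimate would replace:

```python
# A sketch of the ROI comparison: expected value versus total cost.
monthly_azure_cost = 2_000   # compute, storage, endpoint hosting
monthly_team_cost = 8_000    # data science / MLOps effort
monthly_value = 15_000       # expected added business value

total_cost = monthly_azure_cost + monthly_team_cost
monthly_roi = (monthly_value - total_cost) / total_cost

print(f"estimated monthly ROI: {monthly_roi:.0%}")  # → 50%
```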

Optimal performance of a machine learning model

As a side note to calculating the ROI, we need to have an idea of what the optimal performance level of a machine learning model is. This is where the academic and corporate worlds tend to differ. Academics focus on reaching the highest performance levels possible, whereas businesses will focus on the most efficient ratio between costs and performance. It might not make sense for a business to invest heavily in a few percent increase in performance if this marginal increase does not bring adequate value to compensate.

Conclusion

This article is focused on data science and AI on Azure. We started by outlining the different roles involved in a data science team, including the responsibilities of data architects, engineers, scientists, and machine learning engineers, and how the collaboration between these roles is key to building successful AI solutions.

We then focused on the role of the data architect when designing an AI solution, outlining the questions they should ask themselves for a well-architected design.

Author Bio

Olivier Mertens is a cloud solution architect for Azure data and AI at Microsoft, based in Dublin, Ireland. In this role, he assists organizations in designing their enterprise-scale data platforms and analytical workloads. Next to his role as an architect, Olivier leads the technical AI expertise for Microsoft EMEA in the corporate market. This includes leading knowledge sharing and internal upskilling, as well as solving highly complex or strategic customer AI cases. Before his time at Microsoft, he worked as a data scientist at a Microsoft partner in Belgium.
Olivier is a lecturer for generative AI and AI solution architectures, a keynote speaker for AI, and holds a master’s degree in information management, a postgraduate degree as an AI business architect, and a bachelor’s degree in business management.

Breght Van Baelen is a Microsoft employee based in Dublin, Ireland, and works as a cloud solution architect for the data and AI pillar in Azure. He provides guidance to organizations building large-scale analytical platforms and data solutions. In addition, Breght was chosen as an advanced cloud expert for Power BI and is responsible for providing technical expertise in Europe, the Middle East, and Africa. Before his time at Microsoft, he worked as a data consultant at Microsoft Gold Partners in Belgium.
Breght led a team of eight data and AI consultants as a data science lead. Breght holds a master’s degree in computer science from KU Leuven, specializing in AI. He also holds a bachelor’s degree in computer science from the University of Hasselt.