All ML projects are unique in some way: the organization, the data, the people, and the tools and techniques employed will never be exactly the same for any two projects. This is good, as it signifies progress as well as the natural variety that makes this such a fun space to work in.
That said, no matter the details, broadly speaking, all successful ML projects actually have a good deal in common. They require the translation of a business problem into a technical problem, a lot of research and understanding, proofs of concept, analyses, iterations, the consolidation of work, the construction of the final product, and its deployment to an appropriate environment. That is ML engineering in a nutshell!
Developing this a bit further, you can start to bucket these activities into rough categories or stages, the results of each being necessary inputs for later stages. This is shown in Figure 2.6:
Figure 2.6: The stages that any ML project goes through as part of the ML development process.
Each category of work has a slightly different flavor, but taken together, they provide the backbone of any good ML project. The next few sections will develop the details of each of these categories and begin to show you how they can be used to build your ML engineering solutions. As we will discuss later, it is also not necessary for you to tackle your entire project in four steps like this; you can actually work through each of these steps for a specific feature or part of your overall project. This will be covered in the Selecting a software development methodology section.
Let’s make this a bit more real. The main focus and outputs of every stage can be summarized as shown in Table 2.1:
| Stage | Outputs |
| --- | --- |
| Discover | Clarity on the business question. Clear arguments for ML over another approach. Definition of the KPIs and metrics you want to optimize. A sketch of the route to value. |
| Play | Detailed understanding of the data. Working proof of concept. Agreement on the model/algorithm/logic that will solve the problem. Evidence that a solution is doable within realistic resource scenarios. Evidence that good ROI can be achieved. |
| Develop | A working solution that can be hosted on appropriate and available infrastructure. Thorough test results and performance metrics (for algorithms and software). An agreed retraining and model deployment strategy. Unit tests, integration tests, and regression tests. Solution packaging and pipelines. |
| Deploy | A working and tested deployment process. Provisioned infrastructure with appropriate security and performance characteristics. Model retraining and management processes. An end-to-end working solution! |
Table 2.1: The outputs of the different stages of the ML development process.
IMPORTANT NOTE
You may think that an ML engineer only really needs to consider the latter two stages, develop and deploy, and that earlier stages are owned by the data scientist or even a business analyst. We will indeed focus mainly on these stages throughout this book, and this division of labor can work very well. It is, however, crucially important that if you are going to build an ML solution, you understand all of the motivations and development steps that have gone before. You wouldn’t build a new type of rocket without first understanding where you want to go, would you?
Comparing this to CRISP-DM
The high-level categorization of project steps that we will outline in the rest of this chapter has many similarities to, and some differences from, an important methodology known as CRISP-DM. This methodology was published in 1999 and has since gathered a large following as a way to understand how to build any data project. In CRISP-DM, there are six different phases of activity, covering similar ground to that outlined in the four steps described in the previous section:
Business understanding: This is all about getting to know the business problem and domain area. This becomes part of the Discover phase in the four-step model.
Data understanding: Extending the knowledge of the business domain to include the state of the data, its location, and how it is relevant to the problem. Also included in the Discover phase.
Data preparation: Starting to take the data and transform it for downstream use. This will often have to be iterative. Captured in the Play stage.
Modeling: Taking the prepared data and then developing analytics on top of it; this could now include ML of various levels of sophistication. This is an activity that occurs both in the Play and Develop phases of the four-step methodology.
Evaluation: This stage is concerned with confirming whether the solution will meet the business requirements and performing a holistic review of the work that has gone before. This helps confirm if anything was overlooked or could be improved upon. This is very much part of the Develop and Deploy phases; in the methodology we will describe in this chapter, these tasks are baked in across the whole project rather than confined to a single stage.
Deployment: In CRISP-DM, this was originally focused on deploying simple analytics solutions like dashboards or scheduled ETL pipelines that would run the decided-upon analytics models. In the world of modern ML engineering, this stage can represent, well, anything talked about in this book! CRISP-DM suggests sub-stages around planning and then reviewing the deployment.
As you can see from the list, many steps in CRISP-DM cover similar topics to those outlined in the four steps I propose. CRISP-DM is extremely popular across the data science community and so its merits are definitely appreciated by a huge number of data professionals across the world. Given this, you might be wondering, “Why bother developing something else then?” Let me convince you of why this is a good idea.
The CRISP-DM methodology is just another way to group the important activities of any data project in order to give them some structure. As you can perhaps see from the brief description of the stages I gave above and if you do further research, CRISP-DM has some potential drawbacks for use in a modern ML engineering project:
The process outlined in CRISP-DM is relatively rigid and quite linear. This can be beneficial for providing structure but might inhibit moving fast in a project.
The methodology is very big on documentation. Most steps detail writing some kind of report, review, or summary. Writing and maintaining good documentation is absolutely critical in a project but there can be a danger of doing too much.
CRISP-DM was written in a world before “big data” and large-scale ML. It is unclear to me whether its details still apply in such a different world, where classic extract-transform-load patterns are only one of so many.
CRISP-DM definitely comes from the data world and then tries to move toward the idea of a deployable solution in the last stage. This is laudable, but in my opinion, this is not enough. ML engineering is a different discipline in the sense that it is far closer to classic software engineering than not. This is a point that this book will argue time and again. It is therefore important to have a methodology where the concepts of deployment and development are aligned with software and modern ML techniques all the way through.
The four-step methodology attempts to alleviate some of these challenges and does so in a way that constantly makes reference to software engineering and ML skills and techniques. This does not mean that you should never use CRISP-DM in your projects; it might just be the perfect thing! As with many of the concepts introduced in this book, the important thing is to have many tools in your toolkit so that you can select the one most appropriate for the job at hand.
Given this, let’s now go through the four steps in detail.
Discover
Before you start working to build any solution, it is vitally important that you understand the problem you are trying to solve. This activity is often termed discovery in business analysis and is crucial if your ML project is going to be a success.
The key things to do during the discovery phase are the following:
Speak to the customer! And then speak to them again: You must understand the end user requirements in detail if you are to design and build the right system.
Document everything: You will be judged on how well you deliver against the requirements, so make sure that all of the key points from your discussion are documented and signed off by members of your team and the customer or their appropriate representative.
Define the metrics that matter: It is very easy at the beginning of a project to get carried away and to feel like you can solve any and every problem with the amazing new tool you are going to build. Fight this tendency as aggressively as you can, as it can easily cause major headaches later on. Instead, steer your conversations toward defining a single or very small number of metrics that define what success will look like.
Start finding out where the data lives: If you can start working out what kind of systems you will have to access to get the data you need, this saves you time later and can help you find any major issues before they derail your project.
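As a minimal sketch of what "a single or very small number of metrics" can look like in practice, the following computes precision and recall for a hypothetical anomaly-flagging task similar to the taxi ride example. The function name, the label convention, and the sample data are all illustrative assumptions; the right metrics for your project come out of the conversations with your customer.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 marks an anomaly."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against division by zero when nothing is flagged or nothing is anomalous.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical labels: 1 = anomalous ride, 0 = normal ride.
actual = [0, 0, 1, 1, 0, 1, 0, 0]
flagged = [0, 1, 1, 0, 0, 1, 0, 0]
precision, recall = precision_recall(actual, flagged)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Agreeing up front on which of these two numbers matters more (missing an anomaly versus raising a false alarm) is the kind of decision the discovery phase should settle.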
Using user stories
Once you have spoken to the customer (a few times), you can start to define some user stories. User stories are concise and consistently formatted expressions of what the user or customer wants to see and the acceptance criteria for that feature or unit of work. For example, we may want to define a user story based on the taxi ride example from Chapter 1, Introduction to ML Engineering: “As a user of our internal web service, I want to see anomalous taxi rides and be able to investigate them further.”
Let’s begin!
To add this in Jira, select the Create button.
Next, select Story .
Then, fill in the details as you deem appropriate.
You have now added a user story to your work management tool! This allows you to do things such as create new tasks and link them to this user story or update its status as your project progresses:
Figure 2.7: An example user story in Jira.
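If you want to keep user stories alongside your code as well as in a work management tool, the consistent format lends itself to a small data structure. The following is only a sketch: the UserStory class and its field names are hypothetical, and the acceptance criteria shown are invented for illustration rather than taken from the taxi ride project.

```python
from dataclasses import dataclass, field


@dataclass
class UserStory:
    """A user story: who wants what, plus the criteria for calling it done."""
    role: str
    goal: str
    acceptance_criteria: list = field(default_factory=list)

    def summary(self) -> str:
        # Render the story in the conventional "As ..., I want ..." form.
        return f"As {self.role}, I want {self.goal}."


story = UserStory(
    role="a user of our internal web service",
    goal="to see anomalous taxi rides and be able to investigate them further",
    # Hypothetical acceptance criteria, for illustration only.
    acceptance_criteria=[
        "Anomalous rides are listed with their ride ID.",
        "Each listed ride links to its underlying data.",
    ],
)
print(story.summary())
```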
The data sources you use are particularly crucial to understand. As you know, garbage in, garbage out, or even worse, no data, no go! The particular questions you have to answer about the data are mainly centered around access, technology, quality, and relevance.
For access and technology, you are trying to pre-empt how much work the data engineers have to do to start their pipeline of work and how much this will hold up the rest of the project. It is therefore crucial that you get this one right.
A good example would be if you find out quite quickly that the main bulk of data you will need lives in a legacy internal financial system with no real modern APIs and no access request mechanism for non-finance team members. If its main backend is on-premises and you need to migrate locked-down financial data to the cloud, but this makes your business nervous, then you know you have a lot of work to do before you type a line of code. If the data already lives in an enterprise data lake that your team has access to, then you are obviously in a better position. Any challenge is surmountable if the value proposition is strong enough, but finding all this out early will save you time, energy, and money later on.
Relevance is a bit harder to find out before you kick off, but you can begin to get an idea. For example, if you want to perform the inventory forecast we discussed in Chapter 1, Introduction to ML Engineering, do you need to pull in customer account information? If you want to create the classifier of premium or non-premium customers as marketing targets, also mentioned in Chapter 1, Introduction to ML Engineering, do you need to have data on social media feeds? The question as to what is relevant will often be less clear-cut than for these examples, but an important thing to remember is that you can always come back to it if you really missed something important. You are trying to capture the most important design decisions early, so common sense and lots of stakeholder and subject-matter expert engagement will go a long way.
Data quality is something that you can try to anticipate a little before moving forward in your project with some questions to current users or consumers of the data or those involved in its entry processes. To get a more quantitative understanding though, you will often just need to get your data scientists working with the data in a hands-on manner.
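A first quantitative look at data quality can be very simple. The sketch below, which assumes the data has already been pulled into a list of dictionaries, reports the fraction of missing values per column. The column names (ride_id, ride_dist, ride_time) and the sample records are hypothetical, and treating None and empty strings as "missing" is an assumption you would adapt to your own sources.

```python
def missing_rate_by_column(rows):
    """Return the fraction of missing (None or empty string) values per column."""
    if not rows:
        return {}
    # Collect every column name seen in any record.
    columns = sorted({key for row in rows for key in row})
    rates = {}
    for col in columns:
        # A column absent from a record counts as missing for that record.
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        rates[col] = missing / len(rows)
    return rates


# Hypothetical sample of ride records with some gaps.
rides = [
    {"ride_id": 1, "ride_dist": 2.4, "ride_time": 610},
    {"ride_id": 2, "ride_dist": None, "ride_time": 480},
    {"ride_id": 3, "ride_dist": 5.1, "ride_time": ""},
    {"ride_id": 4, "ride_dist": 1.2},
]
report = missing_rate_by_column(rides)
for column, rate in report.items():
    print(f"{column}: {rate:.0%} missing")
```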
In the next section, we will look at how we develop proof-of-concept ML solutions in the most research-intensive phase, Play.
Play
In the play stage of the project, your aim is to work out whether solving the task even at the proof-of-concept level is feasible. To do this, you might employ the usual data science bread-and-butter techniques of exploratory data analysis and explanatory modeling we mentioned in the last chapter before moving on to creating an ML model that does what you need.
In this part of the process, you are not overly concerned with details of implementation, but with exploring the realms of possibility and gaining an in-depth understanding of the data and the problem, which goes beyond initial discovery work. Since the goal here is not to create production-ready code or to build reusable tools, you should not worry about whether or not the code you are writing is of the highest quality, or using sophisticated patterns. For example, it will not be uncommon to see code that looks something like the following examples (taken, in fact, from the repo for this book):
Figure 2.8: Some example prototype code that will be created during the play stage.
Even a quick glance at these screenshots tells you a few things:
The code is in a Jupyter notebook, which is run by a user interactively in a web browser.
The code sporadically calls methods to simply check or explore elements of the data (for example, df.head() and df.dtypes).
There is ad hoc code for plotting (and it’s not very intuitive!).
There is a variable called tmp, which is not very descriptive.
All of this is absolutely fine in this more exploratory phase, but one of the aims of this book is to help you understand what is required to take code like this and make it into something suitable for your production ML pipelines. The next section starts us along this path.
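To give a flavor of that transition, here is a small hypothetical contrast. It is not taken from the book's repository: the data values, the tmp variable's contents, and the flag_long_rides function are invented for illustration. The first part mimics quick play-stage exploration; the second expresses the same idea as a named, documented function that can be tested and reused.

```python
import statistics

# Play-stage style: quick, loosely named, and perfectly fine for exploration.
tmp = [2.4, 3.1, 2.9, 48.0, 3.3]
print(statistics.mean(tmp), max(tmp))


# Develop-stage style: the same idea as a named, documented, testable function.
def flag_long_rides(distances, threshold):
    """Return the indices of rides whose distance exceeds the threshold."""
    return [i for i, distance in enumerate(distances) if distance > threshold]


long_ride_indices = flag_long_rides(tmp, threshold=10.0)
print(long_ride_indices)
```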
Develop
As we have mentioned a few times already, one of the aims of this book is to get you thinking about the fact that you are building software products that just happen to have ML in them. This means a steep learning curve for some of us who have come from more mathematical and algorithmic backgrounds. This may seem intimidating but do not despair! The good news is that we can reuse a lot of the best practices and techniques honed through the software engineering community over several decades. There is nothing new under the sun.
This section explores several of those methodologies, processes, and considerations that can be employed in the development phase of our ML engineering projects.
Selecting a software development methodology
One of the first things we could and should shamelessly replicate as ML engineers is the software development methodologies that are utilized in projects across the globe. One category of these, often referred to as Waterfall, covers project workflows that fit quite naturally with the idea of building something complex (think a building or a car). In Waterfall methodologies, there are distinct and sequential phases of work, each with a clear set of outputs that are needed before moving on to the next phase. For example, a typical Waterfall project may have phases that broadly cover requirements-gathering, analysis, design, development, testing, and deployment (sound familiar?). The key thing is that in a Waterfall-flavored project, when you are in the requirements-gathering phase, you should only be working on gathering requirements, when in the testing phase, you should only be working on testing, and so on. We will discuss the pros and cons of this for ML in the next few paragraphs after introducing another set of methodologies.
The other set of methodologies, termed Agile, began its life after the introduction of the Agile Manifesto in 2001 (https://agilemanifesto.org/). At the heart of Agile development are the ideas of flexibility, iteration, incremental updates, failing fast, and adapting to changing requirements. If you are from a research or scientific background, this concept of flexibility and adaptability based on results and new findings may sound familiar.
What may not be so familiar to you if you have this type of scientific or academic background is that you can still embrace these concepts within a relatively strict framework that is centered around delivery outcomes. Agile software development methodologies are all about finding the balance between experimentation and delivery. This is often done by introducing the concepts of ceremonies (such as Scrums and Sprint Retrospectives) and roles (such as Scrum Master and Product Owner).
Further to this, within Agile development, there are two variants that are extremely popular: Scrum and Kanban. Scrum projects are centered around short units of work called Sprints where the idea is to make additions to the product from ideation through to deployment in that small timeframe. In Kanban, the main idea is to achieve a steady flow of tasks from an organized backlog into work in progress through to completed work.
All of these methodologies (and many more besides) have their merits and their drawbacks. You do not have to be married to any of them; you can chop and change between them. For example, in an ML project, it may make sense to do the main delivery of your core body of work in Sprints with very clear outcomes, and then to do post-deployment work that focuses on maintaining an already existing service (sometimes termed a business-as-usual activity), such as further model improvements or software optimizations, in a Kanban framework. See what fits best for your use cases, your team, and your organization.
But what makes applying these types of workflows to ML projects different? What do we need to think about in this world of ML that we didn’t before? Well, some of the key points are the following:
You don’t know what you don’t know: You cannot know whether you will be able to solve the problem until you have seen the data. Traditional software engineering is not as critically dependent on the data that will flow through the system as ML engineering is. We can know how to solve a problem in principle, but if the appropriate data does not exist in sufficient quantity or is of poor quality, then we can’t solve the problem in practice.
Your system is alive: If you build a classic website, with its backend database, shiny frontend, amazing load-balancing, and other features, then realistically, if the resource is there, it can just run forever. Nothing fundamental changes about the website and how it runs over time. Clicks still get translated into actions and page navigation still happens the same way. Now, consider putting some ML-generated advertising content based on typical user profiles in there. What is a typical user profile and does that change with time? With more traffic and more users, do behaviors that we never saw before become the new normal? Your system is learning all the time and that leads to the problems of model drift and distributional shift, as well as more complex update and rollback scenarios.
Nothing is certain: When building a system that uses rule-based logic, you know what you are going to get each and every time. If X, then Y means just that, always. With ML models, it is often much harder to know what the answer is when you ask the question, which is in fact why these algorithms are so powerful. But it does mean that you can have unpredictable behavior. This can happen for the reasons discussed previously, or because the algorithm has learned something about the data that is not obvious to a human observer, or because ML algorithms can be based on probabilistic and statistical concepts, so that results come attached to some uncertainty or fuzziness. A classic example is when you apply logistic regression and receive the probability of the data point belonging to one of the classes. It is a probability, so you cannot say with certainty that the data point belongs to that class, only how likely it is! This is particularly important to consider when the outputs of your ML system will be leveraged by users or other systems to make decisions.
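The logistic regression point can be made concrete with a few lines of plain Python. This is a minimal sketch, not a trained model: the weights, bias, and feature values are made-up numbers, and the function names are hypothetical. It shows that the model's raw output is a probability, and that turning it into a decision requires a threshold that someone has to choose.

```python
import math


def predict_proba(features, weights, bias):
    """Logistic regression score: probability that the point is in class 1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Turn a probability into a hard 0/1 decision using a chosen threshold."""
    return 1 if probability >= threshold else 0


# Made-up parameters and a single made-up data point.
weights = [0.8, -0.4]
bias = -0.2
p = predict_proba([1.0, 0.5], weights, bias)

print(f"probability of class 1: {p:.3f}")
print("decision at threshold 0.5:", classify(p))
print("decision at threshold 0.7:", classify(p, threshold=0.7))
```

The same data point is assigned to different classes depending on the threshold, which is why downstream users of the system need to know they are consuming a likelihood rather than a certainty.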
Given these issues, it is worth understanding which development methodologies can help us when we build our ML solutions. Table 2.2 lists some advantages and disadvantages of the Agile and Waterfall methodologies for ML engineering projects:
| Methodology | Pros | Cons |
| --- | --- | --- |
| Agile | Flexibility is expected. Faster dev to deploy cycles. | If not well managed, can easily have scope drift. Kanban or Sprints may not work well for some projects. |
| Waterfall | Clearer path to deployment. Clear staging and ownership of tasks. | Lack of flexibility. Higher admin overheads. |
Table 2.2: Agile versus Waterfall for ML development.
Let’s move on to the next section!
Package management (conda and pip)
If I told you to write a program that did anything in data science or ML without using any libraries or packages and just pure Python, you would probably find this quite difficult to achieve in any reasonable amount of time, and incredibly boring! This is a good thing. One of the really powerful features of developing software in Python is that you can leverage an extensive ecosystem of tools and capabilities relatively easily. The flip side is that managing the dependencies of your code base can very easily become a complicated and hard-to-replicate task. This is where package and environment managers such as pip and conda come in.
pip is the standard package manager in Python and the one recommended for use by the Python Packaging Authority. It retrieves and installs Python packages from PyPI, the Python Package Index. pip is super easy to use and is often the suggested way to install packages in tutorials and books.
conda is the package and environment manager that comes with the Anaconda and Miniconda Python distributions. A key strength of conda is that although it comes from the Python ecosystem, and it has excellent capabilities there, it is actually a more general package manager. As such, if your project requires dependencies outside Python (the NumPy and SciPy libraries being good examples), then although pip can install these, it can’t track all the non-Python dependencies, nor manage their versions. With conda, this is solved.
You can also use pip within conda environments, so you can get the best of both worlds or use whatever you need for your project. The typical workflow that I use is to use conda to manage the environments I create and then use that to install any packages I think may require non-Python dependencies that perhaps are not captured well within pip, and then I can use pip most of the time within the created conda environment. Given this, throughout the book, you may see pip or conda installation commands used interchangeably. This is perfectly fine.
To get started with Conda, if you haven’t already, you can download the Individual distribution installer from the Anaconda website (https://www.anaconda.com/products/individual). Anaconda comes with some Python packages already installed, but if you want to start from a completely empty environment, you can download Miniconda from the same website instead (they have the exact same functionality; you just start from a different base).
The Anaconda documentation is very helpful for getting you up to speed with the appropriate commands, but here is a quick tour of some of the key ones.
First, if we want to create a conda environment called mleng with Python version 3.10 installed, we simply execute the following in our terminal:
conda create --name mleng python=3.10
We can then activate the conda environment by running the following:
conda activate mleng
This means that any new conda or pip commands will install packages in this environment and not system-wide.
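If you are ever unsure which environment your code is actually running in, you can check from inside Python. The sketch below uses only the standard library; the describe_environment function name is hypothetical. It relies on the CONDA_DEFAULT_ENV environment variable, which conda sets on activation, so the value will be None if no conda environment is active.

```python
import os
import sys


def describe_environment():
    """Report which interpreter and environment this process is using."""
    return {
        "python_executable": sys.executable,  # path of the running interpreter
        "prefix": sys.prefix,  # root directory of the active environment
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),  # None outside conda
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

With the mleng environment active, you would expect the interpreter path and prefix to point inside that environment's directory rather than at a system-wide installation.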
We often want to share the details of our environment with others working on the same project, so it can be useful to export all the package configurations to a .yml file:
conda env export > environment.yml
The GitHub repository for this book contains a file called mleng-environment.yml for you to create your own instance of the mleng environment. The following command creates an environment with this configuration using this file:
conda env create --file mleng-environment.yml
This pattern of creating a conda environment from an environment file is a nice way to get your environments set up for running the examples in each of the chapters in the book. So, the Technical requirements section in each chapter will point to the name of the correct environment YAML file contained in the book’s repository.
These commands, coupled with your classic conda install or pip install command, will set you up for your project quite nicely!
conda install <package-name>
Or
pip install <package-name>
I think it’s always good engineering practice to have more than one option for doing something. So, now that we have covered the classic Python environment and package managers in conda and pip, we will cover one more package manager. This is a tool that I like for its ease of use and versatility. I think it provides a nice extension of the capabilities of conda and pip and can be used to complement them nicely. This tool is called Poetry and it is what we turn to now.
Poetry
Poetry is another package manager that has become very popular in recent years. It allows you to manage your project’s dependencies and package information in a single configuration file, in a similar way to the environment YAML file we discussed in the section on Conda. Poetry’s strength lies in its far superior ability to help you manage complex dependencies and ensure “deterministic” builds, meaning that you don’t have to worry about the dependency of a package updating in the background and breaking your solution. It does this via the use of “lock files” as a core feature, as well as in-depth dependency checking. This means that reproducibility can often be easier in Poetry. It is important to call out that Poetry is focused on Python package management specifically, while Conda can also install and manage other packages, for example, C++ libraries. One way to think of Poetry is that it is like an upgrade of the pip Python installation package, but one that also has some environment management capability. The next steps will explain how to set up and use Poetry for a very basic use case.
We will build on this with some later examples in the book. First, follow these steps:
First, as usual, we will install Poetry:
pip install poetry
After Poetry is installed, you can create a new project using the poetry new command, followed by the name of your project:
poetry new mleng-with-python
This will create a new directory named mleng-with-python with the necessary files and directories for a Python project. To manage your project’s dependencies, you can add them to the pyproject.toml file in the root directory of your project. This file contains all of the configuration information for your project, including its dependencies and package metadata.
For example, if you are building an ML project and want to use the scikit-learn library, you would add the following to your pyproject.toml file:
[tool.poetry.dependencies]
scikit-learn = "*"
You can then install the dependencies for your project by running the following command. This will install the scikit-learn library and any other dependencies specified in your pyproject.toml file:
poetry install
To use a dependency in your project, you can simply import it in your Python code like so:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
As you can see, getting started with Poetry is very easy. We will return to using Poetry throughout the book in order to give you examples that complement the knowledge of Conda that we will develop. Chapter 4, Packaging Up, will discuss this in detail and will show you how to get the most out of Poetry.
Code version control
If you are going to write code for real systems, you are almost certainly going to do it as part of a team. You are also going to make your life easier if you can have a clean audit trail of changes, edits, and updates so that you can see how the solution has developed. Finally, you are going to want to cleanly and safely separate out the stable versions of the solution that you are building and that can be deployed versus more transient developmental versions. All of this, thankfully, is taken care of by source code version control systems, the most popular of which is Git.
We will not go into how Git works under the hood here (there are whole books on the topic!) but we will focus on understanding the key practical elements of using it:
You already have a GitHub account from earlier in the chapter, so the first thing to do is to create a repository with Python as the language and initialize README.md and .gitignore files. The next thing to do is to get a local copy of this repository by running the following command in Bash, Git Bash, or another terminal, using the URL of your repository:
git clone <repo-url>
Now that you have done this, go into the README.md file and make some edits (anything will do). Then, run the following commands to tell Git to stage this file and to save your changes locally with a message briefly explaining what these are:
git add README.md
git commit -m "I've made a nice change …"
This now means that your local Git instance has stored what you’ve changed and is ready to share that with the remote repo.
You can then incorporate these changes into the main branch by doing the following:
git push origin main
If you now go back to the GitHub site, you will see that the changes have taken place in your remote repository and that the commit message you added has accompanied the change.
Other people in your team can then get the updated changes by running the following:
git pull origin main
These steps are the absolute basics of Git and there is a ton more you can learn online. What we will do now, though, is start setting up our repo and workflow in a way that is relevant to ML engineering.
Git strategies
The presence of a strategy for using version control systems can often be a key differentiator between the data science and ML engineering aspects of a project. It can sometimes be overkill to define a strict Git strategy for exploratory and basic modeling stages (Discover and Play) but if you want to engineer something for deployment (and you are reading this book, so this is likely where your head is at), then it is fundamentally important.
Great, but what do we mean by a Git strategy?
Well, let’s imagine that we just try to develop our solution without a shared direction on how to organize the versioning and code.
ML engineer A wants to start building some of the data science code into a Spark ML pipeline (more on this later) so creates a branch from main called pipeline1spark:
git checkout -b pipeline1spark
They then get to work on the branch and write some nice code in a new file called pipeline.py:
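# Imports for the Spark ML components used in the pipeline
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(),
                      outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])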
Great, they’ve made some excellent progress in translating some previous sklearn code into Spark, which was deemed more appropriate for the use case. They then keep working in this branch because it has all of their additions, and they think it’s better to do everything in one place. When they want to push the branch to the remote repository, they run the following command:
git push origin pipeline1spark
ML engineer B comes along, and they want to use ML engineer A’s pipeline code and build some extra steps around it. They know that engineer A’s work lives on its own branch, and they know enough about Git to create another branch from it, which B calls pipeline:
git fetch origin pipeline1spark
git checkout pipeline1spark
git checkout -b pipeline
They then add some code to read the parameters for the model from a variable:
lr = LogisticRegression(maxIter=model_config["maxIter"],
                        regParam=model_config["regParam"])
Cool, engineer B has made an update that is starting to abstract away some of the parameters. They then push their new branch to the remote repository:
git push origin pipeline
Finally, ML engineer C joins the team and wants to get started on the code. Opening up Git and looking at the branches, they see there are three:
main
pipeline1spark
pipeline
So, which one should be taken as the most up to date? If they want to make new edits, where should they branch from? It isn’t clear, but more dangerous than that is if they are tasked with pushing deployment code to the execution environment, they may think that main has all the relevant changes. On a far busier project that’s been going on for a while, they may even branch off from main and duplicate some of A and B’s work! In a small project, you would waste time going on this wild goose chase; in a large project with many different lines of work, you would have very little chance of maintaining a good workflow. Consider the two competing versions of the same lines, first engineer A’s and then engineer B’s:
# Engineer A's version (branch pipeline1spark)
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# Engineer B's version (branch pipeline)
lr = LogisticRegression(maxIter=model_config["maxIter"],
                        regParam=model_config["regParam"])
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
If both sets of changes are merged into the main branch, then we will get what is called a merge conflict, and the engineer performing the merge will have to choose which piece of code to keep, the current or the incoming version. This would look something like this if engineer A’s changes reached main first:
<<<<<<< HEAD
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
=======
lr = LogisticRegression(maxIter=model_config["maxIter"],
                        regParam=model_config["regParam"])
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
>>>>>>> pipeline
The delimiters in the code show that there has been a merge conflict and that it is up to the developer to select which of the two versions of the code they want to keep.
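To make the structure of these delimiters concrete, the following is a minimal Python sketch that splits a conflicted block into its two candidate versions. The split_conflict helper is a hypothetical name introduced here for illustration; it is not part of Git, and it only handles the first conflict block in a string:

```python
from typing import List, Optional, Tuple


def split_conflict(text: str) -> Tuple[List[str], List[str]]:
    """Return (current, incoming) lines from the first conflict block."""
    current: List[str] = []
    incoming: List[str] = []
    target: Optional[List[str]] = None  # side being collected, if any
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            # Start of the version already on the branch being merged into.
            target = current
        elif line.startswith("=======") and target is current:
            # Separator: what follows is the incoming version.
            target = incoming
        elif line.startswith(">>>>>>>") and target is incoming:
            break
        elif target is not None:
            target.append(line)
    return current, incoming


conflict = """<<<<<<< HEAD
lr = LogisticRegression(maxIter=10, regParam=0.001)
=======
lr = LogisticRegression(maxIter=model_config["maxIter"],
                        regParam=model_config["regParam"])
>>>>>>> pipeline
"""
ours, theirs = split_conflict(conflict)
print("current version:", ours)
print("incoming version:", theirs)
```

Resolving the conflict means keeping one of these two lists of lines (or a hand-written combination) and deleting the delimiter lines before committing.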
IMPORTANT NOTE
Although, in this simple case, we could potentially trust the engineers to select the better code, allowing situations like this to occur very frequently is a huge risk to your project. This not only wastes a huge amount of precious development time but it could also mean that you actually end up with worse code!
The way to avoid confusion and extra work like this is to have a very clear strategy for the use of the version control system in place, such as the one we will now explore.
The Gitflow workflow
The biggest problem with the previous example was that all of our hypothetical engineers were actually working on the same piece of code in different places. To stop situations like this, you have to create a process that your team can all follow – in other words, a version control strategy or workflow.
One of the most popular of these strategies is the Gitflow workflow . This builds on the basic idea of having branches that are dedicated to features and extends it to incorporate the concept of releases and hotfixes, which are particularly relevant to projects with a continuous deployment element.
The main idea is we have several types of branches, each with clear and specific reasons for existing:
Main contains your official releases and should only contain the stable version of your code.
Dev acts as the main point for branching from and merging to for most work in the repository; it contains the ongoing development of the code base and acts as a staging area before main.
Feature branches should not be merged straight into the main branch; everything should branch off from dev and then be merged back into dev.
Release branches are created from dev to kick off a build or release process before being merged into main and dev and then deleted.
Hotfix branches are for removing bugs in deployed or production software. You can branch this from main before merging into main and dev when done.
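One way to internalize these rules is to write them down as data. The sketch below is purely illustrative: the "type/name" branch naming convention, the GITFLOW_RULES table, and the merge_allowed helper are assumptions made for this example, not part of Git or of any Gitflow tooling:

```python
from typing import Dict, List, NamedTuple


class BranchRule(NamedTuple):
    source: str         # branch this type is created from
    targets: List[str]  # branches this type may be merged into


# Hypothetical "<type>/<name>" prefixes mapped to the rules listed above.
GITFLOW_RULES: Dict[str, BranchRule] = {
    "feature": BranchRule(source="dev", targets=["dev"]),
    "release": BranchRule(source="dev", targets=["main", "dev"]),
    "hotfix": BranchRule(source="main", targets=["main", "dev"]),
}


def merge_allowed(branch: str, target: str) -> bool:
    """Check whether merging `branch` into `target` follows the rules."""
    prefix = branch.split("/", 1)[0]
    rule = GITFLOW_RULES.get(prefix)
    return rule is not None and target in rule.targets


print(merge_allowed("feature/actions", "dev"))
print(merge_allowed("feature/actions", "main"))
print(merge_allowed("hotfix/fix-typo", "main"))
```

A check like this could, for instance, be run by a team as a lightweight guard before opening a pull request, though in practice most teams rely on branch protection settings in their Git hosting platform instead.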
This can all be summarized diagrammatically as in Figure 2.9 , which shows how the different branches contribute to the evolution of your code base in the Gitflow workflow:
Figure 2.9: The Gitflow workflow.
This diagram is taken from https://lucamezzalira.com/2014/03/10/git-flow-vs-github-flow/ . More details can be found at https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow .
If your ML project can follow this sort of strategy (and you don’t need to be completely strict about this if you want to adapt it), you will likely see a drastic improvement in productivity, code quality, and even documentation:
Figure 2.10: Example code changes upon a pull request in GitHub.
One important aspect we haven’t discussed yet is the concept of code reviews. These are triggered in this process by what is known as a pull request , where you make known your intention to merge into another branch and allow another team member to review your code before this executes. This is the natural way to introduce code review to your workflow. You do this whenever you want to merge your changes and update them into dev or main branches. The proposed changes can then be made visible to the rest of the team, where they can be debated and iterated on with further commits before completing the merge.
This enforces code review to improve quality, as well as creating an audit trail and safeguards for updates. Figure 2.10 shows an example of how changes to code are made visible for debate during a pull request in GitHub.
Now that we have discussed some of the best practices for applying version control to your code, let’s explore how to version control the models you produce during your ML project.
Model version control
In any ML engineering project, it is not only code changes that you have to track clearly but also changes in your models. You want to track changes not only in the modeling approach but also in performance when new or different data is fed into your chosen algorithms. One of the best tools for tracking these kinds of changes and providing version control of ML models is MLflow , an open-source platform from Databricks under the stewardship of the Linux Foundation.
To install MLflow, run the following command in your chosen Python environment:
pip install mlflow
The main aim of MLflow is to provide a platform via which you can log model experiments, artifacts, and performance metrics. It does this through some very simple APIs provided by the Python mlflow library, interfaced to selected storage solutions through a series of centrally developed and community plugins. It also comes with functionality for querying, analyzing, and importing/exporting data via a Graphical User Interface (GUI), which will look something like Figure 2.11:
Figure 2.11: The MLflow tracking server UI with some forecasting runs.
The library is extremely easy to use. In the following example, we will take the sales forecasting example from Chapter 1 , Introduction to ML Engineering , and add some basic MLflow functionality for tracking performance metrics and saving the trained Prophet model:
First, we make the relevant imports, including MLflow’s pyfunc module, which acts as a general interface for saving and loading models that can be written as Python functions. This facilitates working with libraries and tools not natively supported in MLflow (such as the fbprophet library):
import pandas as pd
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
from fbprophet.diagnostics import performance_metrics
import mlflow
import mlflow.pyfunc
To create a more seamless integration with the forecasting models from fbprophet, we define a small wrapper class that inherits from the mlflow.pyfunc.PythonModel object:
class FbProphetWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model
        super().__init__()

    def load_context(self, context):
        from fbprophet import Prophet
        return

    def predict(self, context, model_input):
        future = self.model.make_future_dataframe(
            periods=model_input["periods"][0])
        return self.model.predict(future)
We now wrap the functionality for training and prediction into a single helper function called train_predict() to make running multiple times simpler. We will not define all of the details inside this function here but let’s run through the main pieces of MLflow functionality contained within it.
First, we need to let MLflow know that we are now starting a training run we wish to track:
with mlflow.start_run():
Inside this with block, we then define and train the model, using parameters defined elsewhere in the code:
    model = Prophet(
        yearly_seasonality=seasonality_params['yearly'],
        weekly_seasonality=seasonality_params['weekly'],
        daily_seasonality=seasonality_params['daily']
    )
    model.fit(df_train)
We then perform some cross-validation to calculate some metrics we would like to log:
    df_cv = cross_validation(model, initial="730 days",
                             period="180 days", horizon="365 days")
    df_p = performance_metrics(df_cv)
We can log these metrics, for example, the Root Mean Squared Error (RMSE ) here, to our MLflow server:
    mlflow.log_metric("rmse", df_p.loc[0, "rmse"])
Then finally, we can use our model wrapper class to log the model and print some information about the run:
    mlflow.pyfunc.log_model("model", python_model=FbProphetWrapper(model))
    print(
        "Logged model with URI: runs:/{run_id}/model".format(
            run_id=mlflow.active_run().info.run_id
        )
    )
With only a few extra lines, we have started to perform version control on our models and track the statistics of different runs!
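As a reminder of what the logged metric represents, here is a minimal pure-Python sketch of the RMSE calculation. It is an illustration of the formula only; the rmse helper and the sample numbers are made up for this example, and Prophet’s performance_metrics function aggregates errors over forecast horizons in its own way:

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Squared Error between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("Inputs must be non-empty and of equal length.")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only, not taken from the forecasting example.
actuals = [100.0, 120.0, 90.0]
forecasts = [110.0, 115.0, 95.0]
print(round(rmse(actuals, forecasts), 3))
```

Because RMSE is expressed in the same units as the target variable, comparing it across tracked runs gives a direct sense of how much a change to the model has improved or degraded forecast accuracy.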
There are many different ways to save the ML model you have built to MLflow (and in general), which is particularly important when tracking model versions. Some of the main options are as follows:
joblib: joblib is a general-purpose pipelining library in Python that is very powerful but lightweight. It has a lot of really useful capabilities centered around caching, parallelizing, and compression that make it a very versatile tool for saving and reading in your ML pipelines. It is also particularly fast for storing large NumPy arrays, so is useful for data storage. We will use joblib more in later chapters. It is important to note that joblib suffers from the same security issues as pickle, so knowing the lineage of your joblib files is incredibly important.
JSON: If pickle and joblib aren’t appropriate, you can serialize your model and its parameters in JSON format. This is good because JSON is a standardized text serialization format that is commonly used across a variety of solutions and platforms. The caveat to using JSON serialization of your models is that you often have to manually define the JSON structure with the relevant parameters you want to store. So, it can create a lot of extra work. Several ML libraries in Python have their own export to JSON functionality, for example, the deep learning package Keras, but they can all result in quite different formats.
MLeap : MLeap is a serialization format and execution engine based on the Java Virtual Machine (JVM ). It has integrations with Scala, PySpark, and Scikit-Learn but you will often see it used in examples and tutorials for saving Spark pipelines, especially for models built with Spark ML. This focus means it is not the most flexible of formats but is very useful if you are working in the Spark ecosystem .
ONNX: The Open Neural Network Exchange (ONNX) format is aimed at being completely cross-platform and allowing the exchange of models between the main ML frameworks and ecosystems. The main downside of ONNX is that (as you can guess from the name) it is mainly aimed at neural network-based models, with the exception of its scikit-learn API. It is an excellent option if you are building a neural network though.
In Chapter 3 , From Model to Model Factory , we will export our models to MLflow using some of these formats, but they are all compatible with MLflow and so you should feel comfortable using them as part of your ML engineering workflow.
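To illustrate the point about manually defining a JSON structure, here is a small sketch using only the standard library. The LinearModel class is a toy stand-in invented for this example (it is not a class from any ML library), and the choice of fields to store is exactly the kind of decision you would have to make yourself:

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class LinearModel:
    """Toy stand-in for a trained model; not a real library class."""
    coefficients: List[float]
    intercept: float
    version: str = "0.1.0"

    def predict(self, features: List[float]) -> float:
        return self.intercept + sum(
            c * x for c, x in zip(self.coefficients, features)
        )


def to_json(model: LinearModel) -> str:
    # We decide the structure: every dataclass field becomes a JSON key.
    return json.dumps(asdict(model), sort_keys=True)


def from_json(payload: str) -> LinearModel:
    return LinearModel(**json.loads(payload))


model = LinearModel(coefficients=[0.5, -1.0], intercept=2.0)
payload = to_json(model)
restored = from_json(payload)
print(payload)
print(restored == model)
print(restored.predict([4.0, 1.0]))
```

Unlike pickle or joblib files, the resulting payload is plain text that can be inspected and diffed, but it only works because this model is fully described by a handful of numbers; more complex models need a correspondingly richer, hand-designed schema.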
The final section of this chapter will introduce some important concepts for planning how you wish to deploy your solution, prefacing more detailed discussions later in the book.
Deploy
The final stage of the ML development process is the one that really matters: how do you get the amazing solution you have built out into the real world and solve your original problem? The answer has multiple parts, some of which will occupy us more thoroughly later in this book but will be outlined in this section. If we are to successfully deploy our solution, first of all, we need to know our deployment options: what infrastructure is available and is appropriate for the task? We then need to get the solution from our development environment onto this production infrastructure so that, subject to appropriate orchestration and controls, it can execute the tasks we need it to and surface the results where it has to. This is where the concepts of DevOps and MLOps come into play.
Let’s elaborate on these two core concepts, laying the groundwork for later chapters and exploring how to begin deploying our work.
Knowing your deployment options
In Chapter 5 , Deployment Patterns and Tools , we will cover in detail what you need to get your ML engineering project from the develop to deploy stage, but to pre-empt that and provide a taster of what is to come, let’s explore the different types of deployment options we have at our disposal:
On-premises deployment : The first option we have is to ignore the public cloud altogether and deploy our solutions in-house on owned infrastructure. This option is particularly popular and necessary for a lot of large institutions with a lot of legacy software and strong regulatory constraints on data location and processing. The basic steps for deploying on-premises are the same as deploying on the cloud but often require a lot more involvement from other teams with particular specialties. For example, if you are in the cloud, you often do not need to spend a lot of time configuring networking or implementing load balancers, whereas on-premises solutions will require these.
The big advantage of on-premises deployment is security and peace of mind that none of your data is going to traverse your company firewall. The big downsides are that it requires a larger investment upfront for hardware and that you have to expend a lot of effort to successfully configure and manage that hardware effectively. We will not be discussing on-premises deployment in detail in this book, but all of the concepts we will employ around software development, packaging, environment management, and training and prediction systems still apply.
Infrastructure-as-a-Service (IaaS ): If you are going to use the cloud, one of the lowest levels of abstraction you have access to for deployment is IaaS solutions. These are typically based on the concept of virtualization, such that servers with a variety of specifications can be spun up at the user’s will. These solutions often abstract away the need for maintenance and operations as part of the service. Most importantly, they allow extreme scalability of your infrastructure as you need it. Have to run 100 more servers next week? No problem, just scale up your IaaS request and there it is. Although IaaS solutions are a big step up from fully managed on-premises infrastructure, there are still several things you need to think about and configure. The balance in cloud computing is always over how easy you want things to be versus what level of control you want to have. IaaS maximizes control but minimizes (relative) ease compared to some other solutions. In AWS , Simple Storage Service (S3 ) and Elastic Compute Cloud (EC2 ) are good examples of IaaS offerings.
Platform-as-a-Service (PaaS ): PaaS solutions are the next level up in terms of abstraction and usually provide you with a lot of capabilities without needing to know exactly what is going on under the hood. This means you can focus solely on the development tasks that the platform is geared up to support, without worrying about underlying infrastructure at all. One good example is AWS Lambda functions, which are serverless functions that can scale almost without limit.
All you are required to do is enter the main piece of code you want to execute inside the function. Another good example is Databricks , which provides a very intuitive UI on top of the Spark cluster infrastructure, with the ability to provision, configure, and scale up these clusters almost seamlessly.
Being aware of these different options and their capabilities can help you design your ML solution and ensure that you focus your team’s engineering effort where it is most needed and will be most valuable. If your ML engineer is working on configuring routers, for example, you have definitely gone wrong somewhere.
But once you have selected the components you’ll use and provisioned the infrastructure, how do you integrate these together and manage your deployment and update cycles? This is what we will explore now.
Understanding DevOps and MLOps
A very powerful idea in modern software development is that your team should be able to continuously update your code base as needed, while testing, integrating, building, packaging, and deploying your solution should be as automated as possible. This then means these processes can happen on an almost continual basis without big pre-planned buckets of time being assigned to update cycles. This is the main idea behind CI/CD . CI/CD is a core part of DevOps and its ML-focused cousin MLOps , which both aim to bring together software development and post-deployment operations. Several of the concepts and solutions we will develop in this book will be built up so that they naturally fit within an MLOps framework.
The CI part is mainly focused on the stable incorporation of ongoing changes to the code base while ensuring functionality remains stable. The CD part is all about taking the resultant stable version of the solution and pushing it to the appropriate infrastructure.
Figure 2.12 shows a high-level view of this process:
Figure 2.12: A high-level view of CI/CD processes.
In order to make CI/CD a reality, you need to incorporate tools that help automate tasks that you would traditionally perform manually in your development and deployment process. For example, if you can automate the running of tests upon merging of code, or the pushing of your code artifacts/models to the appropriate environment, then you are well on your way to CI/CD.
We can break this out further and think of the different types of tasks that fall into the DevOps or MLOps lifecycles for a solution. Development tasks will typically cover all of the activities that take you from a blank screen on your computer to a working piece of software. This means that development is where you spend most of your time in a DevOps or MLOps project. This covers everything from writing the code to formatting it correctly and testing it.
Table 2.3 splits out these typical tasks and provides some details on how they build on each other, as well as typical tools you could use in your Python stack for enabling them.
| Lifecycle Stage | Activity | Details | Tools |
| --- | --- | --- | --- |
| Dev | Testing | Unit tests: tests aimed at testing the functionality of the smallest pieces of code. | pytest or unittest |
| | | Integration tests: ensure that interfaces within the code and to other solutions work. | Selenium |
| | | Acceptance tests: business-focused tests. | Behave |
| | | UI tests: ensuring any frontends behave as expected. | |
| | Linting | Raise minor stylistic errors and bugs. | flake8 or bandit |
| | Formatting | Enforce well-formatted code automatically. | black or isort |
| | Building | The final stage of bringing the solution together. | Docker, twine, or pip |

Table 2.3: Details of the development activities carried out in any DevOps or MLOps project.
Next, we can think about the ML activities within MLOps, which this book will be very concerned with. This covers all of the tasks that a classic Python software engineer would not have to worry about, but that are crucially important to get right for ML engineers like us. This includes the development of capabilities to automatically train the ML models, to run the predictions or inferences the model should generate, and to bring that together inside code pipelines. It also covers the staging and management of the versions of your models, which heavily complements the idea of versioning your application code, as we do using tools like Git. Finally, an ML engineer also has to consider that they have to build out specific monitoring capabilities for the operational mode of their solution, which is not covered in traditional DevOps workflows. For an ML solution, you may have to consider monitoring things like precision, recall, the f1-score, population stability, entropy, and data drift in order to know if the model component of your solution is behaving within a tolerable range. This is very different from classic software engineering as it requires a knowledge of how ML models work, how they can go wrong, and a real appreciation of the importance of data quality to all of this. This is why ML engineering is such an exciting place to be! See Table 2.4 for some more details on these types of activities.
| Lifecycle Stage | Activity | Details | Tools |
| --- | --- | --- | --- |
| ML | Training | Train the model. | Any ML package. |
| | Predicting | Run the predictions or inference steps. | Any ML package. |
| | Building | Creating the pipelines and application logic in which the model is embedded. | sklearn pipelines, Spark ML pipelines, ZenML. |
| | Staging | Tag and release the appropriate version of your models and pipelines. | MLflow or Comet.ml. |
| | Monitoring | Track the solution performance and raise alerts when necessary. | Seldon, Neptune.ai, Evidently.ai, or Arthur.ai. |

Table 2.4: Details on the ML-centered activities carried out during an MLOps project.
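To give a flavor of the monitoring calculations mentioned above, the following is a rough sketch of one common formulation of the population stability index, which compares how a feature or score is distributed across bins in a reference sample versus a newer sample. The function names, bin edges, and sample values are all assumptions made for this illustration; the monitoring tools listed in Table 2.4 have their own implementations and conventions:

```python
import math
from typing import List, Sequence


def bin_proportions(values: Sequence[float], edges: Sequence[float],
                    floor: float = 1e-6) -> List[float]:
    """Share of values in each bin [edges[i], edges[i+1]); last bin is closed.

    Values outside the edges are not assigned to any bin.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    # The floor avoids taking log(0) when a bin is empty.
    return [max(c / len(values), floor) for c in counts]


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               edges: Sequence[float]) -> float:
    """Sum over bins of (actual - expected) * ln(actual / expected)."""
    e = bin_proportions(expected, edges)
    a = bin_proportions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Made-up samples: the second one is shifted toward higher values.
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted = [0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 0.9]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]

print(population_stability_index(reference, reference, edges))
print(population_stability_index(reference, shifted, edges))
```

Identical distributions give a value of zero, and the value grows as the two distributions diverge; what counts as a tolerable level is something you would need to agree for your own use case.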
Finally, in either DevOps or MLOps, there is the Ops piece, which refers to Operations. This is all about how the solution will actually run, how it will alert you if there is an issue, and if it can recover successfully. Naturally then, operations will cover activities relating to the final packaging, build, and release of your solution. It also has to cover another type of monitoring, which is different from the performance monitoring of ML models. This monitoring has more of a focus on infrastructure utilization, stability, and scalability, on solution latency, and on the general running of the wider solution. This part of the DevOps and MLOps lifecycle is quite mature in terms of tooling, so there are many options available. Some information to get you started is presented in Table 2.5 .
| Lifecycle Stage | Activity | Details | Tools |
| --- | --- | --- | --- |
| Ops | Releasing | Taking the software you have built and storing it somewhere central for reuse. | Twine, pip, GitHub, or BitBucket. |
| | Deploying | Pushing the software you have built to the appropriate target location and environment. | Docker, GitHub Actions, Jenkins, TravisCI, or CircleCI. |
| | Monitoring | Tracking the performance and utilization of the underlying infrastructure and general software performance, alerting where necessary. | DataDog, Dynatrace, or Prometheus. |

Table 2.5: Details of the activities carried out in order to make a solution operational in a DevOps or MLOps project.
Now that we have elucidated the core concepts needed across the MLOps lifecycle, in the next section, we will discuss how to implement CI/CD practices so that we can start making this a reality in our ML engineering projects. We will also extend this to cover automated testing of the performance of your ML models and pipelines, and to perform automated retraining of your ML models.
Building our first CI/CD example with GitHub Actions
We will use GitHub Actions as our CI/CD tool in this book, but there are several other tools available that do the same job. GitHub Actions is available to anyone with a GitHub account, has a very useful set of documentation, https://docs.github.com/en/actions , and is extremely easy to start using, as we will show now.
When using GitHub Actions, you have to create a .yml file that tells GitHub when to perform the required actions and, of course, what actions to perform. This .yml file should be put in a folder called .github/workflows in the root directory of your repository. You will have to create this if it doesn’t already exist. We will do this in a new branch called feature/actions. Create this branch by running:
git checkout -b feature/actions
Then, create a .yml file called github-actions-basic.yml. In the following steps, we will build up this example .yml file for a Python project where we automatically install dependencies, run a linter (a solution to check for bugs, syntax errors, and other issues), and then run some unit tests. This example comes from the GitHub Starter Workflows repository (https://github.com/actions/starter-workflows/blob/main/ci/python-package-conda.yml). Open up github-actions-basic.yml and then add the following:
First, you define the name of the GitHub Actions workflow and what Git event will trigger it:
name: Python package
on: [push]
You then list the jobs you want to execute as part of the workflow, as well as their configuration. For example, here we have one job called build, which we want to run on the latest Ubuntu distribution, and we want to attempt the build using several different versions of Python:
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Quote the versions so YAML does not read 3.10 as the number 3.1
        python-version: ["3.9", "3.10"]
You then define the steps that execute as part of the job. Each step begins with a hyphen and is executed in sequence. It is important to note that the uses keyword grabs standard GitHub Actions; for example, in the first step, the workflow uses the v3 version of the checkout action, and the second step sets up the Python versions we want to run in the workflow:
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
The next step installs the relevant dependencies for the solution using pip and a requirements.txt file (but you can use conda of course!):
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install flake8 pytest
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
We then run some linting:
      - name: Lint with flake8
        run: |
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
Finally, we run our tests using our favorite Python testing library. For this step, we do not want to run through the entire repository, as it is quite complex, so for this example, we use the working-directory keyword to only run pytest in the Chapter02 directory. Since it contains a simple test function in test_basic.py, this will automatically pass:
      - name: Test with pytest
        run: pytest
        working-directory: Chapter02
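The contents of test_basic.py are not shown here, so the following is only a hypothetical sketch of what such a trivially passing test file could look like. The add_one function and the test name are placeholders invented for this example:

```python
def add_one(value: int) -> int:
    """Trivial function under test; a hypothetical placeholder."""
    return value + 1


def test_add_one() -> None:
    # pytest collects functions named test_* and reports a failure
    # if any assert statement inside them does not hold.
    assert add_one(1) == 2
```

When pytest runs in the working directory, it discovers files named test_*.py, runs each test function, and returns a non-zero exit code if any test fails, which is what causes the workflow step to fail.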
We have now built up the GitHub Actions workflow; the next stage is to show it running. This is taken care of automatically by GitHub; all you have to do is push to the remote repository. So, add the edited .yml file, commit it, and then push it:
git add .github/workflows/github-actions-basic.yml
git commit -m "Basic CI run with dummy test"
git push origin feature/actions
After you have run these commands in the terminal, you can navigate to the GitHub UI and then click on Actions in the top menu bar. You will then be presented with a view of all action runs for the repository like that in Figure 2.13.
Figure 2.13: The GitHub Actions run as viewed from the GitHub UI.
If you then click on the run, you will be presented with details of all jobs that ran within the Actions run, as shown in Figure 2.14 .
Figure 2.14: GitHub Actions run details from the GitHub UI.
Finally, you can go into each job and see the steps that were executed, as shown in Figure 2.15 . Clicking on these will also show the outputs from each of the steps. This is extremely useful for analyzing any failures in the run.
Figure 2.15: The GitHub Actions run steps as shown on the GitHub UI.
What we have shown so far is an example of CI. For this to be extended to cover CD, we need to include steps that push the produced solution to its target host destination. Examples are building a Python package and publishing it to PyPI, or creating a pipeline and pushing it to another system for it to be picked up and run. This latter example will be covered with an Airflow DAG in Chapter 5, Deployment Patterns and Tools. And that, in a nutshell, is how you start building your CI/CD pipelines. As mentioned, later in the book, we will build workflows specific to our ML solutions.
Now we will look at how we take CI/CD concepts to the next level for ML engineering and build some tests for our model performance, which can then also be triggered as part of continuous processes.
Continuous model performance testing
As ML engineers, we not only care about the core functional behavior of the code we are writing; we also have to care about the models that we are building. This is an easy thing to forget, as traditional software projects do not have to consider this component.
The process I will now walk you through shows how you can take some base reference data and start to build up some different flavors of tests to give confidence that your model will perform as expected when you deploy it.
We have already introduced how to test automatically with pytest and GitHub Actions; the good news is that we can just extend this concept to include the testing of some model performance metrics. To do this, you need a few things in place:
Within the action or tests, you need to retrieve the reference data for performing the model validation. This can be done by pulling from a remote data store like an object store or a database, as long as you provide the appropriate credentials. I would suggest storing these as secrets in GitHub. Here, we will use a dataset loaded in place using the sklearn library as a simple example.
You need to retrieve the model or models you wish to test from some location as well. This could be a full-fledged model registry or some other storage mechanism. The same points around access and secrets management as in point 1 apply. Here we will pull a model from the Hugging Face Hub (more on Hugging Face in Chapter 3), but this could equally have been an MLflow Tracking instance or some other tool.
You need to define the tests you want to run and that you are confident will achieve the desired outcome. You do not want to write tests that are far too sensitive and trigger failed builds for spurious reasons, and you also want to try and define tests that are useful for capturing the types of failures you would want to flag.
For point 1, here we grab some data from the sklearn library and make it available to the tests through a pytest fixture:
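from typing import Tuple

import numpy as np
import pytest
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

@pytest.fixture
def test_dataset() -> Tuple[np.ndarray, np.ndarray]:
    X, y = load_wine(return_X_y=True)
    # Turn the three-class problem into a binary one: is the class 2 or not?
    y = y == 2
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    return X_test, y_test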
For point 2, I will use the Hugging Face Hub package to retrieve the stored model. As mentioned in the list above, you will need to adapt this to whatever model storage mechanism you are accessing. The repository in this case is public, so there is no need to store any secrets; if you did need to do this, please use the GitHub Secrets store.
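import joblib
from huggingface_hub import hf_hub_download
from sklearn.ensemble import RandomForestClassifier

@pytest.fixture
def model() -> RandomForestClassifier:
    REPO_ID = "electricweegie/mlewp-sklearn-wine"
    FILENAME = "rfc.joblib"
    # Download the serialized model from the Hub, then load it with joblib.
    model = joblib.load(hf_hub_download(REPO_ID, FILENAME))
    return model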
Now, we just need to write the tests. Let’s start simple with a test that confirms that the predictions of the model produce the correct object types:
def test_model_inference_types(model, test_dataset):
    assert isinstance(model.predict(test_dataset[0]), np.ndarray)
    assert isinstance(test_dataset[0], np.ndarray)
    assert isinstance(test_dataset[1], np.ndarray)
We can then write a test asserting that specific conditions on the model's performance on the test dataset are met:
from sklearn.metrics import classification_report

def test_model_performance(model, test_dataset):
    # The report is keyed by the string form of each label: 'False' and 'True'.
    metrics = classification_report(
        y_true=test_dataset[1],
        y_pred=model.predict(test_dataset[0]),
        output_dict=True,
    )
    assert metrics['False']['f1-score'] > 0.95
    assert metrics['False']['precision'] > 0.9
    assert metrics['True']['f1-score'] > 0.8
    assert metrics['True']['precision'] > 0.8
The previous test can be thought of as something like a data-driven unit test and will make sure that if you change something in the model (perhaps you change some feature engineering step in the pipeline or you change a hyperparameter), you will not breach the desired performance criteria. Once these tests have been successfully added to the repo, on the next push, the GitHub action will be triggered and you will see that the model performance test runs successfully.
This means we are performing some continuous model validation as part of our CI/CD process!
Figure 2.16: Successfully executing model validation tests as part of a CI/CD process using GitHub Actions.
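As the number of metrics grows, it can help to keep the thresholds from the performance test above in one place and report every breach at once, rather than stopping at the first failed assertion. The following is a minimal sketch of that idea; the names THRESHOLDS and find_breaches are my own for illustration and do not appear in the example repository.

```python
from typing import Dict, List

# Minimum acceptable values, mirroring the assertions in test_model_performance.
THRESHOLDS: Dict[str, Dict[str, float]] = {
    "False": {"f1-score": 0.95, "precision": 0.9},
    "True": {"f1-score": 0.8, "precision": 0.8},
}


def find_breaches(
    metrics: Dict[str, Dict[str, float]],
    thresholds: Dict[str, Dict[str, float]] = THRESHOLDS,
) -> List[str]:
    """List every metric that does not strictly exceed its threshold."""
    breaches = []
    for label, minimums in thresholds.items():
        for name, minimum in minimums.items():
            value = metrics[label][name]
            if not value > minimum:
                breaches.append(f"{label}/{name}: {value:.3f} <= {minimum}")
    return breaches
```

A test could then pass the dictionary returned by classification_report(..., output_dict=True) to find_breaches and assert that the result is an empty list, so that a failed build shows all of the breached criteria in one message.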
More sophisticated tests can be built upon this simple concept, and you can adapt the environment and packages used to suit your needs.
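One example of a slightly more sophisticated check is to compare the model against a trivial baseline instead of a fixed threshold. The sketch below assumes you are happy to train a small model inside the test rather than pull one from a registry, and it casts the target to integers instead of the booleans used earlier; the function names are hypothetical.

```python
from typing import Tuple

from sklearn.datasets import load_wine
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def candidate_and_baseline_f1(random_state: int = 42) -> Tuple[float, float]:
    """Return F1 scores on held-out data for a random forest and a dummy model."""
    X, y = load_wine(return_X_y=True)
    y = (y == 2).astype(int)  # 1 if the wine belongs to class 2, else 0
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )
    candidate = RandomForestClassifier(random_state=random_state)
    candidate.fit(X_train, y_train)
    # The baseline always predicts the most frequent class in the training data.
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(X_train, y_train)
    candidate_f1 = f1_score(y_test, candidate.predict(X_test), zero_division=0)
    baseline_f1 = f1_score(y_test, baseline.predict(X_test), zero_division=0)
    return candidate_f1, baseline_f1


def test_model_beats_baseline():
    candidate_f1, baseline_f1 = candidate_and_baseline_f1()
    assert candidate_f1 > baseline_f1
```

A relative check like this is less likely to fail for spurious reasons when the reference data changes, because both scores move with the data.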
Continuous model training
An important extension of the “continuous” concept in ML engineering is to perform continuous training. The previous section showed how to trigger some ML processes for testing purposes when pushing code; now, we will discuss how to extend this for the case where you want to trigger retraining of the model based on a code change. Later in this book, we will learn a lot about training and retraining ML models based on a variety of triggers, like data or model drift, in Chapter 3, From Model to Model Factory, and about how to deploy ML models in general in Chapter 5, Deployment Patterns and Tools. Given this, we will not cover the details of deploying to different targets here but instead show you how to build continuous training steps into your CI/CD pipelines.
This is actually simpler than you probably think. As you have hopefully noticed by now, CI/CD is really all about automating a series of steps, which are triggered upon particular events occurring during the development process. Each of these steps can be very simple or more complex, but fundamentally it is always just other programs we are executing in the specified order upon activating the trigger event.
In this case, since we are concerned with continuous training, we should ask ourselves, when would we want to retrain during code development? Remember that we are ignoring the most obvious cases of retraining on a schedule or upon a drift in model performance or data quality, as these are touched on in later chapters. If we only consider that the code is changing for now, the natural answer is to train only when there is a substantial change to the code.
For example, if a trigger was fired every time we committed our code to version control, this would likely result in a lot of costly compute cycles being used for not much gain, as the ML model will likely not perform very differently in each case. We could instead limit the triggering of retraining to only occur when a pull request is merged into the main branch. In a project, this is an event that signifies a new software feature or functionality has been added and has now been incorporated into the core of the solution.
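If you want to be stricter still about what counts as a substantial change, one option is to retrain only when files that can affect the model have changed. Here is a minimal sketch of that check; the path patterns and the should_retrain helper are hypothetical and assume a project layout that you would need to adapt.

```python
from fnmatch import fnmatch
from typing import Iterable, Sequence

# Hypothetical layout: only changes under these paths can affect the model.
TRAINING_PATHS = ("src/features/*", "src/training/*", "params.yaml")


def should_retrain(
    changed_files: Iterable[str],
    patterns: Sequence[str] = TRAINING_PATHS,
) -> bool:
    """Return True if any changed file matches a training-related pattern."""
    return any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in patterns
    )
```

The list of changed files could come from a command such as git diff --name-only. GitHub Actions also supports path filters on workflow triggers, which may cover simple cases without any custom code.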
As a reminder, when building CI/CD in GitHub Actions, you create or edit YAML files contained in the .github/workflows folder of your Git repository. If we want to trigger a training process upon a pull request, then we can add something like:
name: Continuous Training Example
on: [pull_request]
Note that, by default, the pull_request event fires when a pull request is opened, reopened, or updated, not when it is merged; if you want to retrain only on merges, you could trigger on pushes to the main branch instead. We then need to define the steps for pushing the appropriate training script to the target system and running it. First, this would likely require some fetching of access tokens. Let’s assume this is for AWS and that you have loaded your appropriate AWS credentials as GitHub Secrets; for more information, see Chapter 5, Deployment Patterns and Tools. We would then be able to retrieve these in the first steps of a deploy-trainer job:
jobs:
  deploy-trainer:
    runs-on: [ubuntu-latest]
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
          role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
          role-external-id: ${{ secrets.AWS_ROLE_EXTERNAL_ID }}
          role-duration-seconds: 1200
          role-session-name: TrainingSession
You may then want to copy your repository files to a target S3 destination; perhaps they contain modules that the main training script needs to run. You could then do something like this:
      - name: Copy files to target destination
        run: aws s3 sync . s3://<S3-BUCKET-NAME>
And finally, you would want to run some sort of process that uses these files to perform the training. There are so many ways to do this that I have left the specifics out for this example. Many ways of deploying ML processes are covered in Chapter 5, Deployment Patterns and Tools:
      - name: Run training job
        run: |
And with that, you have all the key pieces you need to run continuous ML model training to complement the other section on continuous model performance testing. This is how you bring the DevOps concept of CI/CD to the world of MLOps!