Getting started with MLflow
Next, we will install MLflow on your machine and prepare it for use in this chapter. You will have two options when it comes to installing MLflow. The first option is through a Docker container-based recipe provided in the repository of the book: https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Mlflow.git.
To install it, follow these instructions:
- Use the following commands to install the software:
$ git clone https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Mlflow.git
$ cd Machine-Learning-Engineering-with-Mlflow
$ cd Chapter01
- The Docker image is very simple at this stage: it contains only MLflow and sklearn, the main tools used in this chapter of the book. For illustrative purposes, you can look at the content of the Dockerfile:
FROM jupyter/scipy-notebook
RUN pip install mlflow
RUN pip install sklearn
- To build the image, you should now run the following command:
docker build -t chapter_1_homlflow .
- Right after building the image, you can run the ./run.sh command:
./run.sh
Important note
It is important to ensure that you have the latest version of Docker installed on your machine.
- Open your browser to http://localhost:8888 and you should be able to navigate to the Chapter01 folder.
In the following section, we will be developing our first model with MLflow in the Jupyter environment created in the previous set of steps.
Developing your first model with MLflow
For the sake of simplicity, in this section we will use the built-in sample datasets in sklearn, the ML library that we will use initially to explore MLflow features. We will choose the famous Iris dataset to train a multi-class classifier using MLflow.
The Iris dataset (one of sklearn's built-in datasets, available from https://scikit-learn.org/stable/datasets/toy_dataset.html) contains the following features: sepal length, sepal width, petal length, and petal width. The target variable is the class of the iris: Iris Setosa, Iris Versicolor, or Iris Virginica:
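If you want to confirm those feature and class names for yourself, the dataset object exposes them directly; the following quick check is optional and not part of the book's listing:
from sklearn import datasets

dataset = datasets.load_iris()
# Four numeric features, measured in centimeters
print(dataset.feature_names)
# Three target classes: setosa, versicolor, virginica
print(dataset.target_names)
# 150 samples with 4 features each
print(dataset.data.shape)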
- Load the sample dataset:
from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.4)
- Next, let's train your model.
Training a simple machine learning model with a framework such as scikit-learn involves instantiating an estimator such as LogisticRegression and calling the fit command to execute training over the Iris dataset built into scikit-learn:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
The preceding lines of code are just a small portion of the ML engineering process. As will be demonstrated, a non-trivial amount of code needs to be created in order to productionize the preceding training code and make sure that it is usable and reliable. One of the main objectives of MLflow is to aid in the process of setting up ML systems and projects. In the following sections, we will demonstrate how MLflow can be used to make your solutions robust and reliable.
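Although the chapter's listing stops at fit, a quick way to sanity-check the trained classifier is to score it on the held-out split created earlier; this optional one-liner is just an illustration, not part of the book's workflow:
# Mean accuracy of the classifier on the 40% held-out test split
print(clf.score(X_test, y_test))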
- Then, we will add MLflow.
With a few more lines of code, you should be able to start your first MLflow interaction. In the following code listing, we start by importing the mlflow module, followed by the LogisticRegression class in scikit-learn. You can use the accompanying Jupyter notebook to run the next section:
import mlflow
from sklearn.linear_model import LogisticRegression

mlflow.sklearn.autolog()
with mlflow.start_run():
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
The mlflow.sklearn.autolog() instruction enables you to automatically log the experiment in the local directory. It captures the metrics produced by the underlying ML library in use. MLflow Tracking is the module responsible for handling metrics and logs. By default, the metadata of an MLflow run is stored in the local filesystem.
- If you run the ls -l command shown next at the root of the accompanying notebook's directory, you should now see the following files in your home directory:
$ ls -l
total 24
-rw-r--r-- 1 jovyan users 12970 Oct 14 16:30 chapther_01_introducing_ml_flow.ipynb
-rw-r--r-- 1 jovyan users    53 Sep 30 20:41 Dockerfile
drwxr-xr-x 4 jovyan users   128 Oct 14 16:32 mlruns
-rwxr-xr-x 1 jovyan users    97 Oct 14 13:20 run.sh
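The mlruns directory in this listing is MLflow Tracking's default local store. If you wanted the run metadata written somewhere else, or grouped under a named experiment instead of the default one, the tracking API lets you configure that before starting the run. The following is an optional sketch; the folder path and experiment name are illustrative choices, not values used in this chapter:
import mlflow

# Optional: store runs in a different local folder instead of ./mlruns
mlflow.set_tracking_uri("file:///tmp/custom_mlruns")
# Optional: group runs under a named experiment rather than the default (ID 0)
mlflow.set_experiment("iris_logistic_regression")
For the remainder of this chapter, we keep the defaults.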
The mlruns folder is generated alongside your notebook folder and contains all the experiments executed by your code in the current context. The mlruns folder will contain a folder with a sequential number identifying your experiment. The outline of the folder will appear as follows:
├── 46dc6db17fb5471a9a23d45407da680f
│   ├── artifacts
│   │   └── model
│   │       ├── MLmodel
│   │       ├── conda.yaml
│   │       ├── input_example.json
│   │       └── model.pkl
│   ├── meta.yaml
│   ├── metrics
│   │   └── training_score
│   ├── params
│   │   ├── C
│   │   …
│   └── tags
│       ├── mlflow.source.type
│       └── mlflow.user
└── meta.yaml
So, with very little effort, we have a lot of traceability available to us, and a good foundation to improve upon.
Your experiment is identified by a UUID, 46dc6db17fb5471a9a23d45407da680f in the preceding sample. At the root of the directory, you have a yaml file named meta.yaml, which contains the following content:
artifact_uri: file:///home/jovyan/mlruns/0/518d3162be7347298abe4c88567ca3e7/artifacts
end_time: 1602693152677
entry_point_name: ''
experiment_id: '0'
lifecycle_stage: active
name: ''
run_id: 518d3162be7347298abe4c88567ca3e7
run_uuid: 518d3162be7347298abe4c88567ca3e7
source_name: ''
source_type: 4
source_version: ''
start_time: 1602693152313
status: 3
tags: []
user_id: jovyan
This is the basic metadata of your experiment, with information including the start time, end time, identification of the run (run_id and run_uuid), the life cycle stage, and the user who executed the experiment. The settings are based on a default run, but they provide valuable and readable information regarding your experiment:
├── 46dc6db17fb5471a9a23d45407da680f
│   ├── artifacts
│   │   └── model
│   │       ├── MLmodel
│   │       ├── conda.yaml
│   │       ├── input_example.json
│   │       └── model.pkl
The model.pkl file contains a serialized version of the model. For a scikit-learn model, this is a binary (pickled) version of the trained model object. With autologging, the metrics are captured from the underlying machine learning library in use. The default packaging strategy is based on a conda.yaml file, which lists the dependencies needed to deserialize and run the model.
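Because autologging stored the model under the model artifact path, it can be loaded back and used for inference. Here is a minimal sketch, assuming the run ID from the listing above and the test split created earlier:
import mlflow.sklearn

# Run ID taken from the mlruns folder listing shown earlier
run_id = "46dc6db17fb5471a9a23d45407da680f"
# Load the serialized scikit-learn model from the run's artifacts
model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
# Predict the classes of the first two held-out samples
print(model.predict(X_test[:2]))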
The MLmodel file is the main definition of the model from MLflow's perspective, with information related to how to run inference on the current model.
The metrics folder contains the training score value of this particular run of the training process, which can be used to benchmark the model with further model improvements down the line.
The params folder, shown in the first listing of folders, contains the default parameters of the logistic regression model, with the different default possibilities listed transparently and stored automatically.
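All of this metadata can also be read back programmatically instead of browsing the filesystem. The following is a minimal sketch, assuming the default local store and the default experiment (ID 0); mlflow.search_runs returns a pandas DataFrame with one row per run:
import mlflow

# List every run logged to the default experiment in the local ./mlruns store
runs = mlflow.search_runs(experiment_ids=["0"])
# Parameters and metrics appear as params.* and metrics.* columns
print(runs[["run_id", "status", "metrics.training_score"]])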