Machine Learning Infrastructure and Best Practices for Software Engineers

Take your machine learning software from a prototype to a fully fledged software system

Product type: Paperback
Published in: Jan 2024
Publisher: Packt
ISBN-13: 9781837634064
Length: 346 pages
Edition: 1st Edition
Author: Miroslaw Staron

Testing and evaluation – the same but different

Every machine learning model needs to be validated, which means that the model must provide correct inferences for data that it did not see during training. The goal is to assess whether the model has learned generalizable patterns, merely memorized the training data, or learned nothing useful at all. The typical measures of correctness in classification problems are accuracy (the ratio of correctly classified instances to all classified instances), the area under the receiver operating characteristic curve (AUROC), and the true positive rate (TPR) and false positive rate (FPR).

For prediction (regression) problems, the quality of the model is measured by the size of its mispredictions, using metrics such as the mean squared error (MSE). These measures quantify the errors in predictions – the smaller the values, the better the model. Figure 1.5 shows the process for the most common form of supervised learning:

Figure 1.5 – Model evaluation process for supervised learning

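As a minimal illustration of the metrics above, accuracy and MSE can be computed directly from predicted and true values (a plain-Python sketch; no ML library is assumed):

```python
def accuracy(y_true, y_pred):
    """Ratio of correctly classified instances to all classified instances."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between predictions and true values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Classification example: 3 of 4 labels match.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))        # 0.75

# Regression example: two errors of 0.5 each -> MSE = 0.25.
print(mean_squared_error([2.0, 4.0], [2.5, 3.5]))  # 0.25
```

In practice, libraries such as scikit-learn provide these metrics (along with AUROC) ready-made; the point here is only how little the definitions involve.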

In this process, the model is subjected to different data in every iteration of training, after which it is used to make inferences (classification or regression) on the same test data. The test data is set aside before training, and it is used as input to the model only during validation, never during training.
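The hold-out procedure can be sketched in a few lines. The toy majority-class "model" below stands in for a real learner, and the 80/20 split ratio is an illustrative choice, not a rule:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Set aside part of the dataset before training; the test portion
    is never shown to the model during training."""
    shuffled = data[:]                       # copy so the original is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]    # (train set, test set)

# Toy dataset of (features, label) pairs; the "model" simply learns the
# majority label from the training set -- a stand-in for a real learner.
data = [((i,), i % 3 != 0) for i in range(100)]
train, test = train_test_split(data)

labels = [label for _, label in train]
majority = max(set(labels), key=labels.count)   # "training"

# Validation: the model sees the test set only now, never during training.
accuracy = sum(1 for _, label in test if label == majority) / len(test)
print(f"hold-out accuracy: {accuracy:.2f}")
```

The essential property is the separation: every instance lands in exactly one of the two sets, so the reported accuracy reflects data the model never saw.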

Finally, some models are reinforcement learning models, whose quality is assessed by the model's ability to optimize its output according to a predefined function (the reward function). Such measures allow the algorithm to optimize its operations and find the optimal solution – for example, in genetic algorithms, self-driving cars, or energy grid operations. The challenge with these models is that there is no single metric that can measure performance – it depends on the scenario, the reward function, and the amount of training the model has received. One famous example of such training is the algorithm from the movie WarGames (1983), where the main supercomputer plays millions of tic-tac-toe games to learn that there is no winning strategy – every perfectly played game ends in a draw.

Figure 1.6 presents the process of training a reinforcement system graphically:

Figure 1.6 – Reinforcement learning training process

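The reward-driven loop in Figure 1.6 can be sketched in miniature. Here a hypothetical agent adjusts a single parameter by trial and error to maximize a predefined reward function; the quadratic reward and the crude hill-climbing strategy are purely illustrative:

```python
import random

def reward(x):
    """Predefined reward function; the agent never sees its formula,
    only the reward values the environment returns. Peak at x = 3."""
    return -(x - 3.0) ** 2

# Trial-and-error loop: propose a small random change and keep it only
# if the environment returns a higher reward (a crude hill climber).
rng = random.Random(0)
x = 0.0
for _ in range(5000):
    candidate = x + rng.uniform(-0.1, 0.1)
    if reward(candidate) > reward(x):
        x = candidate

print(f"learned parameter: {x:.2f}")  # converges close to the optimum at 3.0
```

Note that there is no accuracy or MSE here – the only quality signal is the reward itself, which is why performance of reinforcement learning models is so scenario-dependent.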

We could get the impression that training, testing, and validating machine learning models is all we need when developing machine learning software. This is far from true. The models are parts of larger systems, which means that they need to be integrated with other components; those components are not covered by the validation process described in Figure 1.5 and Figure 1.6.

Every software system needs to undergo rigorous testing before it can be released. The goal of this testing is to find and remove as many defects as possible so that the users of the software experience the best possible quality. Typically, software testing comprises multiple phases that follow and align with the software development process. In the beginning, software engineers (or testers) use unit tests to verify the correctness of their components.

Figure 1.7 presents how the three types of software testing – unit, integration, and system/acceptance – are related to one another. In unit testing, the focus is on algorithms; software engineers test individual functions and modules. Integration testing focuses on the connections between modules and how they conduct tasks together. Finally, system testing and acceptance testing focus on the entire software product: the testers imitate real users to check that the software fulfills the users' requirements:

Figure 1.7 – Three types of software testing – unit testing (left), integration testing (middle), and system and acceptance testing (right)


The software testing process is very different from the process of model validation. Although we could mistake unit testing for model validation, the two are not the same. The output of the model validation process is a metric (for example, accuracy), whereas the output of a unit test is binary – the software either produces the expected output or it does not. For a software company, no known defects (that is, failing tests) are acceptable.
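The contrast can be made concrete: a unit test asserts an exact expected output and simply passes or fails, while model validation returns a graded metric to be judged against a threshold. A minimal sketch (the function under test and the 0.9 threshold are illustrative choices):

```python
def sort_descending(values):
    """Ordinary software component under unit test."""
    return sorted(values, reverse=True)

# Unit test: the outcome is binary -- either the assertion holds or the
# test fails. No known failing test is acceptable for release.
def test_sort_descending():
    assert sort_descending([1, 3, 2]) == [3, 2, 1]

# Model validation: the outcome is a metric, compared to a threshold
# chosen by the team (0.9 is an illustrative value, not a standard).
def validate_model(y_true, y_pred, threshold=0.9):
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, accuracy >= threshold

test_sort_descending()                      # passes silently or raises
score, ok = validate_model([1, 1, 0, 1], [1, 1, 0, 0])
print(score, ok)  # 0.75 False -- a graded result, not a pass/fail assert
```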

In traditional software testing, software engineers prepare a set of test cases to check whether their software works according to the specification. In machine learning software, the process of testing is based on setting aside part of the dataset (the test set) and checking how well the model trained on the remaining data (the train set) performs on it.

Therefore, here is my fourth best practice for testing machine learning systems.

Best practice #4

Test the machine learning software in addition to the typical train-validation-evaluation process of machine learning model development.

Testing the entire system is very important because the complete software system contains mechanisms to cope with the probabilistic nature of machine learning components. One such mechanism is the safety cage, where we monitor the behavior of the machine learning components and prevent them from providing low-quality signals to the rest of the system (for example, for corner cases close to the decision boundary in the inference process).
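A safety cage can be sketched as a wrapper that monitors the model's output confidence and withholds low-quality signals from the rest of the system. The confidence threshold, the toy model, and the fallback behavior below are illustrative assumptions, not a fixed recipe:

```python
class SafetyCage:
    """Wraps a probabilistic model and blocks low-confidence inferences
    (inputs close to the decision boundary) from reaching the system."""

    def __init__(self, model, min_confidence=0.8):
        self.model = model                  # callable returning (label, confidence)
        self.min_confidence = min_confidence
        self.rejected = 0                   # monitored for deployment statistics

    def infer(self, x):
        label, confidence = self.model(x)
        if confidence < self.min_confidence:
            self.rejected += 1
            return None                     # fallback: no signal instead of a bad one
        return label

# Hypothetical model: confident only far from the decision boundary at 0.5.
def toy_model(x):
    confidence = min(1.0, 0.5 + abs(x - 0.5))
    return (x > 0.5, confidence)

cage = SafetyCage(toy_model)
print(cage.infer(0.9))   # True  -- confident, passed through
print(cage.infer(0.52))  # None  -- corner case near the boundary, blocked
```

The `rejected` counter is the monitoring side of the mechanism: tracking how often the cage intervenes tells us whether the operational environment matches what the model was trained for.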

When we test the software, we also learn about the limitations of the machine learning components and our ability to handle corner cases. Such knowledge is important when deploying the system, as we need to specify the operational environment for the software. We need to understand the limitations related to the requirements and the specification of the software – the use cases for our software. Even more importantly, we need to understand the implications of using the software in terms of ethics and trustworthiness.

We’ll discuss ethics in Chapter 15 and Chapter 16, but it is important to understand that we need to consider ethics from the very beginning. If we don’t, we risk our system making potentially harmful mistakes, such as those made by large artificial intelligence hiring systems, face recognition systems, or self-driving vehicles. These harmful mistakes entail monetary costs but, more importantly, a loss of trust in the product and even missed opportunities.
