You're reading from The Machine Learning Solutions Architect Handbook Practical strategies and best practices on the ML lifecycle, system design, MLOps, and generative AI

Product type Paperback

Published in Apr 2024

Publisher Packt

ISBN-13 9781805122500

Length 602 pages

Edition 2nd Edition

Author (1):

David Ping

Understanding the scikit-learn ML library

scikit-learn (https://scikit-learn.org/) is an open-source ML library for Python. It began in 2007 as a Google Summer of Code project, with its first public release following in 2010, and it has become one of the most popular libraries for ML tasks such as classification, regression, clustering, and dimensionality reduction. scikit-learn is widely used in industry and academia to solve real-world business problems such as churn prediction, customer segmentation, recommendations, and fraud detection.

scikit-learn is built mainly on top of three foundational libraries, namely NumPy, SciPy, and Matplotlib:

NumPy is a Python-based library for managing large, multidimensional arrays and matrices, with additional mathematical functions to operate on the arrays and matrices.
SciPy provides scientific computing functionality, such as optimization, linear algebra, and Fourier transforms.
Matplotlib is used for plotting and data visualization.

In all, scikit-learn is a capable and effective tool for a range of common data processing and model-building tasks.

Installing scikit-learn

You can install the scikit-learn package on operating systems such as macOS, Windows, and Linux. The package is hosted on the Python Package Index site (https://pypi.org/) and the Anaconda package repository (https://anaconda.org/anaconda/repo). To install it in your environment, you can use either the pip package manager or the Conda package manager. A package manager installs library packages and manages their versions and dependencies in your environment.

To install scikit-learn from the PyPI index, run pip install -U scikit-learn. To install it in a Conda environment, run conda install scikit-learn instead. You can learn more about pip at https://pip.pypa.io/ and Conda at http://docs.conda.io.
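
After installation, a quick check such as the following can confirm that the package is importable and show which version is active. This is a minimal sketch that assumes one of the install commands above has already succeeded in the current Python environment; the variable name installed_version is only illustrative.

```python
# Confirm that scikit-learn can be imported and report the installed version.
import sklearn

installed_version = sklearn.__version__
print(f"scikit-learn version: {installed_version}")
```

Note that the package is installed under the name scikit-learn but imported as sklearn.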

Core components of scikit-learn

The scikit-learn library provides a wide range of Python classes and functionalities for the various stages of the ML lifecycle. It consists of several main components, as depicted in the following diagram. By utilizing these components, you can construct ML pipelines and perform tasks such as classification, regression, and clustering.

Figure 5.1: scikit-learn components

Now, let’s delve deeper into how these components support the different stages of the ML lifecycle:

Preparing data: For data manipulation and processing, the pandas library is commonly used alongside scikit-learn. It provides core data loading and saving functions, as well as utilities for data manipulation such as selection, rearrangement, and statistical summaries. pandas is built on top of NumPy and also comes with some visualization features such as pie charts, scatter plots, and box plots.
scikit-learn provides a list of transformers for data processing and transformation, such as imputing missing values, encoding categorical values, normalization, and feature extraction for text and images. You can find the full list of transformers at https://scikit-learn.org/stable/data_transforms.html. Furthermore, you have the flexibility to create custom transformers.
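
As a rough illustration of the transformers described above, the following sketch imputes a missing value, scales a numeric column, one-hot encodes a categorical column, and applies a small custom transformer. The toy DataFrame, its column names, and the Log1pTransformer class are hypothetical and exist only for this example; they are not taken from the book.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset with one missing numeric value and one categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [40000.0, 52000.0, 61000.0, 58000.0],
    "plan": ["basic", "premium", "basic", "standard"],
})


class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Illustrative custom transformer: applies log(1 + x) to every value."""

    def fit(self, X, y=None):
        return self  # stateless, so there is nothing to learn

    def transform(self, X):
        return np.log1p(X)


preprocessor = ColumnTransformer([
    # Fill the missing age with the column mean, then standardize.
    ("age", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), ["age"]),
    # Apply the custom transformer to the income column.
    ("income", Log1pTransformer(), ["income"]),
    # One-hot encode the categorical column (three distinct values here).
    ("plan", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

features = preprocessor.fit_transform(df)
print(features.shape)
```

A custom transformer only needs fit and transform methods; inheriting from TransformerMixin supplies fit_transform, and BaseEstimator supplies parameter handling so the object can be cloned inside pipelines.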

Model training: scikit-learn provides a long list of ML algorithms (also known as estimators) for classification and regression (for example, logistic regression, k-nearest neighbors, and random forest), as well as clustering (for example, k-means). You can find the full list of algorithms at https://scikit-learn.org/stable/index.html. The following sample code shows the syntax for using the RandomForestClassifier algorithm to train a model using a labeled training dataset:
```
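from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are keyword arguments; the values below are illustrative only.
model = RandomForestClassifier(
    n_estimators=100, max_depth=5, max_features="sqrt"
)
# train_X holds the training features and train_y the matching labels.
model.fit(train_X, train_y)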
```
Model evaluation: scikit-learn has utilities for hyperparameter tuning and cross-validation, as well as metric functions for model evaluation. You can find the full list of model selection and evaluation utilities at https://scikit-learn.org/stable/model_selection.html. The following sample code uses the accuracy_score function to evaluate the accuracy of a classification model:
```
from sklearn.metrics import accuracy_score
acc = accuracy_score(true_label, predicted_label)
```
Hyperparameter tuning involves optimizing the configuration settings (hyperparameters) of an ML model to enhance its performance and achieve better results on a given task or dataset. Cross-validation is a statistical technique used to assess the performance and generalizability of an ML model by dividing the dataset into multiple subsets, training the model on different combinations, and evaluating its performance across each subset.
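
The two techniques just described can be sketched with cross_val_score and GridSearchCV from sklearn.model_selection. The synthetic dataset, the small parameter grid, and the fold counts below are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data stands in for a real labeled dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Cross-validation: 5 folds, each used once as the held-out evaluation set.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X, y, cv=5, scoring="accuracy",
)
print("mean CV accuracy:", scores.mean())

# Hyperparameter tuning: try every combination in the grid,
# scoring each one with 3-fold cross-validation.
param_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

After fitting, search.best_estimator_ holds a model refit on the full dataset with the best combination found.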

Model saving: scikit-learn model artifacts can be saved using Python object serialization (pickle or joblib). The serialized file can later be loaded back into memory for predictions. The following sample code shows the syntax for saving a model using the joblib library:
```
import joblib
joblib.dump(model, "saved_model_name.joblib")
```
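
Loading works the same way in reverse. The following sketch saves a small model to a temporary directory, restores it with joblib.load, and checks that the restored model makes the same predictions. The synthetic data and the temporary path are assumptions made so that the example runs on its own.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small model on synthetic data so there is something to save.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "saved_model_name.joblib")
    joblib.dump(model, model_path)      # serialize the fitted model to disk
    restored = joblib.load(model_path)  # deserialize it back into memory

# The restored model should reproduce the original model's predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Because pickle-based formats can execute arbitrary code when loaded, only load model files from sources you trust, and prefer loading them with the same scikit-learn version that saved them.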
Pipeline: scikit-learn also provides a pipeline utility for chaining transformers and estimators into a single processing pipeline that can be reused as one unit. This is especially useful when you need to preprocess data for both model training and model prediction, as both require the data to be processed in the same way:
```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([('scaler', StandardScaler()), ('RF', RandomForestClassifier())])
pipe.fit(X_train, y_train)
```
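
To make the reuse concrete, here is a minimal end-to-end sketch in which one pipeline object handles both training and prediction. The synthetic dataset, the train/test split, and the step names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real labeled dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
# fit() learns the scaling statistics from the training data only,
# then trains the classifier on the scaled features.
pipe.fit(X_train, y_train)

# predict() reapplies the already fitted scaler before calling the classifier.
predictions = pipe.predict(X_test)
acc = accuracy_score(y_test, predictions)
print("test accuracy:", acc)
```

Because the scaler is fitted inside the pipeline, the test data never influences the preprocessing statistics, which helps avoid data leakage.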

As demonstrated, getting started with scikit-learn for experimenting with and constructing ML models is straightforward. scikit-learn is particularly suitable for typical regression, classification, and clustering tasks performed on a single machine. However, if you are working with datasets too large to fit in memory, or you require distributed training across multiple machines, scikit-learn may not be the optimal choice. A partial exception is an estimator that supports incremental training, such as SGDRegressor, which can learn from a large dataset one batch at a time through its partial_fit method. Moving on, let’s explore alternative ML libraries that excel in large-scale model training scenarios.

Incremental training is an ML approach where a model is updated and refined continuously as new data becomes available, allowing the model to adapt to evolving patterns and improve its performance over time.
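
Incremental training can be sketched with the partial_fit method of SGDRegressor, which updates the model one mini-batch at a time instead of requiring the full dataset in memory. The simulated data stream, the batch size, and the true_coef values below are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0, 0.5])  # hypothetical ground-truth weights

model = SGDRegressor(random_state=0)
batches_seen = 0
for _ in range(50):
    # Each iteration simulates a new mini-batch of data arriving.
    X_batch = rng.normal(size=(100, 3))
    y_batch = X_batch @ true_coef + rng.normal(scale=0.1, size=100)
    # partial_fit updates the existing weights rather than retraining from scratch.
    model.partial_fit(X_batch, y_batch)
    batches_seen += 1

print("learned coefficients:", model.coef_)
```

In practice, each batch would be read from disk or a stream, and any feature scaling would also need to be applied consistently across batches.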