The Machine Learning Solutions Architect Handbook

Practical strategies and best practices on the ML lifecycle, system design, MLOps, and generative AI

Product type: Paperback
Published: Apr 2024
Publisher: Packt
ISBN-13: 9781805122500
Length: 602 pages
Edition: 2nd Edition
Author: David Ping

Understanding the scikit-learn ML library

scikit-learn (https://scikit-learn.org/) is an open-source ML library for Python. The project began in 2007, with its first public release following in 2010, and it has become one of the most popular libraries for ML tasks such as classification, regression, clustering, and dimensionality reduction. It is widely used in industry and academia for real-world use cases such as churn prediction, customer segmentation, recommendations, and fraud detection.

scikit-learn is built mainly on top of three foundational libraries:

  • NumPy is a Python library for working with large, multidimensional arrays and matrices, along with mathematical functions that operate on them.
  • SciPy provides scientific computing functionality, such as optimization, linear algebra, and Fourier transforms.
  • Matplotlib is used for plotting and data visualization.

In all, scikit-learn is a capable and effective tool for a range of common data processing and model-building tasks.

Installing scikit-learn

You can install the scikit-learn package on macOS, Windows, and Linux. The package is hosted on the Python Package Index (https://pypi.org/) and in the Anaconda package repository (https://anaconda.org/anaconda/repo), so you can install it with either the pip or the Conda package manager. A package manager installs library packages and their dependencies into your environment and keeps track of the installed versions.

To install from PyPI, run pip install -U scikit-learn. If you use a Conda environment, run conda install scikit-learn instead. You can learn more about pip at https://pip.pypa.io/ and Conda at http://docs.conda.io.
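
Once the install finishes, a quick check from Python can confirm that the package is importable and show which version was installed. This is a minimal sketch; it assumes scikit-learn was installed into the same interpreter that runs the script.

```python
# Sanity check after installation: import the package and report its version.
from importlib.metadata import version

import sklearn  # the import name differs from the PyPI package name "scikit-learn"

installed_version = version("scikit-learn")
print(f"scikit-learn {installed_version} is installed")
```

If the import fails, the package was most likely installed into a different Python environment than the one running the script.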

Core components of scikit-learn

The scikit-learn library provides a wide range of Python classes and functionalities for the various stages of the ML lifecycle. It consists of several main components, as depicted in the following diagram. By utilizing these components, you can construct ML pipelines and perform tasks such as classification, regression, and clustering.

Figure 5.1 – scikit-learn components


Now, let’s delve deeper into how these components support the different stages of the ML lifecycle:

  • Preparing data: For data manipulation and processing, the pandas library is commonly used. It provides core functions for loading and saving data, as well as utilities for selecting, rearranging, and statistically summarizing data. pandas is built on top of NumPy and also comes with some visualization features, such as pie charts, scatter plots, and box plots.

    scikit-learn provides a set of transformers for data processing and transformation, such as imputing missing values, encoding categorical values, normalization, and feature extraction for text and images. You can find the full list of transformers at https://scikit-learn.org/stable/data_transforms.html. You can also create custom transformers.

  • Model training: scikit-learn provides a long list of ML algorithms (also known as estimators) for classification and regression (for example, logistic regression, k-nearest neighbors, and random forest), as well as clustering (for example, k-means). You can find the full list of algorithms at https://scikit-learn.org/stable/index.html. The following sample code shows the syntax for using the RandomForestClassifier algorithm to train a model on a labeled training dataset:
    from sklearn.ensemble import RandomForestClassifier

    # Hyperparameters are passed as keyword arguments; these values are illustrative only
    model = RandomForestClassifier(
        n_estimators=100, max_depth=5, max_features="sqrt"
    )
    # train_X holds the features and train_y the labels
    model.fit(train_X, train_y)
    
  • Model evaluation: scikit-learn has utilities for hyperparameter tuning and cross-validation, as well as metric functions for model evaluation. You can find the full list of model selection and evaluation utilities at https://scikit-learn.org/stable/model_selection.html. The following sample code uses the accuracy_score function to evaluate the accuracy of a classification model:
    from sklearn.metrics import accuracy_score
    # true_label: ground-truth labels; predicted_label: output of model.predict()
    acc = accuracy_score(true_label, predicted_label)

    Hyperparameter tuning involves optimizing the configuration settings (hyperparameters) of an ML model to improve its performance on a given task or dataset. Cross-validation is a statistical technique for assessing the performance and generalizability of an ML model. In its common k-fold form, the dataset is divided into multiple subsets (folds), the model is trained on all but one fold and evaluated on the held-out fold, and the process repeats until every fold has served as the evaluation set.

  • Model saving: scikit-learn model artifacts can be saved using Python object serialization (pickle or joblib). The serialized file can later be loaded back into memory, for example with joblib.load, to make predictions. The following sample code shows the syntax for saving a model using the joblib library:
    import joblib
    joblib.dump(model, "saved_model_name.joblib")
    
  • Pipeline: scikit-learn also provides a pipeline utility for chaining transformers and estimators into a single processing pipeline that can be reused as one unit. This is especially useful when you need to preprocess data for both model training and model prediction, as both require the data to be processed in the same way:
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier
    # Each step is a (name, object) pair; the step name must be a string
    pipe = Pipeline([('scaler', StandardScaler()), ('RF', RandomForestClassifier())])
    pipe.fit(X_train, y_train)
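
The stages described in this list can be sketched end to end in one short script. The example below is an illustration under stated assumptions rather than a reference implementation: the bundled iris dataset, the step names, the median imputation strategy, the parameter grid, and the temporary file location are all choices made here for demonstration and do not come from the book.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): the iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Transformers and the estimator are chained so that training and prediction
# apply exactly the same preprocessing.
pipe = Pipeline([
    # iris has no missing values; the imputer is included only to show a transformer
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Hyperparameter tuning with 5-fold cross-validation.
# Pipeline parameters are addressed as "<step name>__<parameter name>".
param_grid = {"rf__n_estimators": [50, 100], "rf__max_depth": [3, None]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the best pipeline (refit on the whole training split) on held-out data.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("best params:", search.best_params_)
print("test accuracy:", round(test_accuracy, 3))

# Save the fitted pipeline, then load it back and predict with the restored copy.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "saved_model_name.joblib")
    joblib.dump(search.best_estimator_, model_path)
    restored = joblib.load(model_path)

same_predictions = np.array_equal(restored.predict(X_test), search.predict(X_test))
print("restored model matches:", same_predictions)
```

Because the whole pipeline is serialized, the restored object carries its fitted preprocessing steps with it. Files produced by pickle or joblib should only be loaded from trusted sources, since loading them can execute arbitrary code.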
    

As demonstrated, it is straightforward to get started with scikit-learn for experimenting with and building ML models. scikit-learn is particularly well suited to typical regression, classification, and clustering tasks that run on a single machine. However, if you are working with very large datasets or need distributed training across multiple machines, scikit-learn may not be the best choice unless the algorithm supports incremental training, as SGDRegressor does. Next, let’s explore alternative ML libraries that excel at large-scale model training.

Incremental training is an ML approach where a model is updated and refined continuously as new data becomes available, allowing the model to adapt to evolving patterns and improve its performance over time.
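
In scikit-learn, estimators that support incremental training expose a partial_fit method. The following minimal sketch shows SGDRegressor being updated one mini-batch at a time. The synthetic data, the batch size, the number of batches, and the true_coef values are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0, 0.5])  # synthetic ground truth, illustration only

model = SGDRegressor(random_state=0)

# Feed the data in mini-batches, as if each batch arrived from disk or a stream.
for _ in range(200):
    X_batch = rng.normal(size=(100, 3))
    y_batch = X_batch @ true_coef + rng.normal(scale=0.1, size=100)
    # partial_fit updates the existing weights instead of retraining from scratch
    model.partial_fit(X_batch, y_batch)

print("learned coefficients:", np.round(model.coef_, 2))
```

Only one batch needs to be in memory at a time, which is what makes this approach workable for datasets that do not fit on a single machine's RAM.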
