Understanding the scikit-learn ML library
scikit-learn (https://scikit-learn.org/) is an open-source ML library for Python. Started in 2007 as a Google Summer of Code project, with its first public release in 2010, it is one of the most popular ML libraries for tasks such as classification, regression, clustering, and dimensionality reduction. scikit-learn is widely used in industry and academia for solving real-world business cases such as churn prediction, customer segmentation, recommendations, and fraud detection.
scikit-learn is built mainly on top of three foundational libraries, NumPy, SciPy, and Matplotlib:
- NumPy is a Python-based library for managing large, multidimensional arrays and matrices, with additional mathematical functions to operate on the arrays and matrices.
- SciPy provides scientific computing functionality, such as optimization, linear algebra, and Fourier transforms.
- Matplotlib is used for plotting data for data visualization.
Overall, scikit-learn is an effective tool that covers a range of common data processing and model-building tasks.
Installing scikit-learn
You can install the scikit-learn package on different operating systems such as macOS, Windows, and Linux. The package is hosted on the Python Package Index site (https://pypi.org/) and the Anaconda package repository (https://anaconda.org/anaconda/repo). To install it in your environment, you can use either the pip package manager or the Conda package manager. A package manager allows you to install, upgrade, and remove library packages in your environment.
To install the scikit-learn library, run pip install -U scikit-learn to install it from the PyPI index, or run conda install scikit-learn if you want to use a Conda environment. You can learn more about pip at https://pip.pypa.io/ and Conda at http://docs.conda.io.
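A quick way to confirm that the installation worked is to import the package and print its version. This is a minimal sketch that assumes the install command above succeeded in the currently active Python environment; the version printed will depend on what your package manager resolved.

```python
import sklearn

# If the import succeeds, scikit-learn is available in this environment.
# The version string tells you which release pip or Conda installed.
print(sklearn.__version__)
```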
Core components of scikit-learn
The scikit-learn library provides a wide range of Python classes and functionalities for the various stages of the ML lifecycle. It consists of several main components, as depicted in the following diagram. By utilizing these components, you can construct ML pipelines and perform tasks such as classification, regression, and clustering.
Figure 5.1: scikit-learn components
Now, let’s delve deeper into how these components support the different stages of the ML lifecycle:
- Preparing data: For data manipulation and processing, the pandas library is commonly used. It provides core data loading and saving functions, as well as utilities for data manipulations such as data selection, data arrangement, and data statistical summaries. pandas is built on top of NumPy. The pandas library also comes with some visualization features such as pie charts, scatter plots, and box plots.
  scikit-learn provides a list of transformers for data processing and transformation, such as imputing missing values, encoding categorical values, normalization, and feature extraction for text and images. You can find the full list of transformers at https://scikit-learn.org/stable/data_transforms.html. Furthermore, you have the flexibility to create custom transformers.
- Model training: scikit-learn provides a long list of ML algorithms (also known as estimators) for classification and regression (for example, logistic regression, k-nearest neighbors, and random forest), as well as clustering (for example, k-means). You can find the full list of algorithms at https://scikit-learn.org/stable/index.html. The following sample code shows the syntax for using the RandomForestClassifier algorithm to train a model using a labeled training dataset:

    from sklearn.ensemble import RandomForestClassifier

    # Hyperparameters are passed as keyword arguments; the values here are placeholders.
    model = RandomForestClassifier(n_estimators=100, max_depth=5, max_features="sqrt")
    # train_X is the feature matrix and train_y holds the matching labels.
    model.fit(train_X, train_y)
- Model evaluation: scikit-learn has utilities for hyperparameter tuning and cross-validation, as well as a metrics module for model evaluation. You can find the full list of model selection and evaluation utilities at https://scikit-learn.org/stable/model_selection.html. The following sample code shows the accuracy_score function for evaluating the accuracy of classification models:

    from sklearn.metrics import accuracy_score

    # Fraction of predictions that match the true labels.
    acc = accuracy_score(true_label, predicted_label)
Hyperparameter tuning involves optimizing the configuration settings (hyperparameters) of an ML model to improve its performance on a given task or dataset. Cross-validation is a statistical technique used to assess the performance and generalizability of an ML model. In its common k-fold form, the dataset is divided into multiple subsets (folds), the model is trained on all but one fold and evaluated on the held-out fold, and the process is repeated so that each fold serves once as the evaluation set.
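The two techniques described above can be sketched with cross_val_score and GridSearchCV from sklearn.model_selection. The bundled Iris dataset, the five folds, and the small parameter grid are assumptions chosen only to keep the example short and fast; they are not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Small built-in dataset, used here purely for illustration.
X, y = load_iris(return_X_y=True)

# Cross-validation: train and score the model on 5 different train/held-out splits.
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())

# Hyperparameter tuning: cross-validate every combination in the grid
# and keep the best-scoring one. The grid values are illustrative.
param_grid = {"max_depth": [2, 4, None], "n_estimators": [25, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

After fitting, search.best_estimator_ holds a model refit on the full dataset with the best parameters found.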
- Model saving: scikit-learn models can be saved using Python object serialization (pickle or joblib). The serialized file can be loaded back into memory for predictions. The following sample code shows the syntax for saving a model using the joblib library:

    import joblib

    # Write the fitted model to disk; joblib.load() restores it later.
    joblib.dump(model, "saved_model_name.joblib")
- Pipeline: scikit-learn also provides a pipeline utility for stringing together different transformers and estimators as a single processing pipeline that can be reused as a single unit. This is especially useful when you need to preprocess data for both model training and model prediction, as both require the data to be processed in the same way:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier

    # Each step is a (name, object) pair; the step name must be a string.
    pipe = Pipeline([('scaler', StandardScaler()), ('RF', RandomForestClassifier())])
    # Fits the scaler, transforms the data, then fits the classifier.
    pipe.fit(X_train, y_train)
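The stages listed above can be combined into one short end-to-end run. The following is a sketch under stated assumptions: the Iris dataset, the 75/25 split, the hyperparameter values, and the temporary directory are illustrative choices, not part of the original example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Prepare data: hold out 25% of the rows for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train: the pipeline applies the same scaling at fit and predict time.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipe.fit(X_train, y_train)

# Evaluate: compare predictions on the held-out rows with the true labels.
acc = accuracy_score(y_test, pipe.predict(X_test))
print("Test accuracy:", acc)

# Save and reload: the whole pipeline is serialized as one artifact.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "saved_model_name.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = (restored.predict(X_test) == pipe.predict(X_test)).all()
print("Predictions match after reload:", same)
```

Saving the pipeline rather than the bare classifier keeps the fitted scaler with the model, so the preprocessing does not have to be recreated at prediction time.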
As demonstrated, getting started with scikit-learn for experimenting with and constructing ML models is straightforward. scikit-learn is particularly suitable for typical regression, classification, and clustering tasks performed on a single machine. However, if you're working with extensive datasets or require distributed training across multiple machines, scikit-learn may not be the optimal choice unless the algorithm supports incremental training, such as SGDRegressor. Next, let's explore alternative ML libraries that excel in large-scale model training scenarios.
Incremental training is an ML approach where a model is updated and refined continuously as new data becomes available, allowing the model to adapt to evolving patterns and improve its performance over time.
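In scikit-learn, estimators that support incremental training expose a partial_fit method, which updates the model with one batch at a time instead of requiring the full dataset in memory. The sketch below uses SGDRegressor with synthetic data; the batch size, the number of batches, and the true_coef values are hypothetical and stand in for data arriving in chunks from disk or a stream.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Hypothetical "true" relationship used to generate synthetic batches.
true_coef = np.array([2.0, -1.0, 0.5])

model = SGDRegressor(random_state=0)

# Feed the model 200 batches of 100 rows; each call updates the
# existing coefficients rather than retraining from scratch.
for _ in range(200):
    X_batch = rng.normal(size=(100, 3))
    y_batch = X_batch @ true_coef + rng.normal(scale=0.1, size=100)
    model.partial_fit(X_batch, y_batch)

# The learned coefficients should end up close to true_coef.
print(model.coef_)
```

Because only one batch is held in memory at a time, this pattern lets a single machine train on datasets larger than its RAM, although it does not distribute the work across machines.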