You're reading from Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits: A practical guide to implementing supervised and unsupervised machine learning algorithms in Python

Product type Paperback

Published in Jul 2020

Publisher Packt

ISBN-13 9781838826048

Length 384 pages

Edition 1st Edition

Languages

Python

Tools

Scikit-learn

Concepts

Machine Learning

Author (1):

Tarek Amr

Introduction to scikit-learn

Since you have already picked up this book, you probably don't need me to convince you why machine learning is important. However, you may still have doubts about why you should use scikit-learn in particular. You may encounter names such as TensorFlow, PyTorch, and Spark more often during your daily news consumption than scikit-learn. So, let me convince you of my preference for the latter.

It plays well with the Python data ecosystem

scikit-learn is a Python toolkit built on top of NumPy, SciPy, and Matplotlib. These choices mean that it fits well into your daily data pipeline. As a data scientist, you most likely use Python as your language of choice since it is good for both offline analysis and real-time implementations. You will also be using tools such as pandas to load data from your database and to perform a vast number of transformations on that data. Since both pandas and scikit-learn are built on top of NumPy, they play very well with each other. Matplotlib is the de facto data visualization tool for Python, which means you can use its sophisticated data visualization capabilities to explore your data and unravel your model's ins and outs.
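To make the interplay between pandas and scikit-learn concrete, here is a minimal sketch. The column names and values are made up for illustration, and the DataFrame is built by hand rather than loaded from a database; the point is only that a DataFrame can be handed to an estimator as is, and that the predictions come back as a NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; in practice this might come from a database query
df = pd.DataFrame({
    "size_sqm": [50, 70, 90, 110, 130],
    "rooms": [1, 2, 3, 3, 4],
    "price": [150, 200, 250, 300, 350],
})

# A pandas transformation: derive a new feature before modelling
df["sqm_per_room"] = df["size_sqm"] / df["rooms"]

# DataFrames and Series can be passed to scikit-learn directly
X = df[["size_sqm", "rooms", "sqm_per_room"]]
y = df["price"]

model = LinearRegression()
model.fit(X, y)

# The predictions are returned as a NumPy array ...
predictions = model.predict(X)

# ... which slots straight back into the DataFrame as a new column
df["predicted_price"] = predictions
```

From here, `df` could be passed on to Matplotlib or any other NumPy-aware tool without conversion.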

Because it is an open source tool that is heavily used in the community, other data tools often expose an interface that is almost identical to scikit-learn's. Many of these tools are built on top of the same scientific Python libraries, and they are collectively known as SciKits (short for SciPy Toolkits), hence the scikit prefix in scikit-learn. For example, scikit-image is a library for image processing, while categorical-encoding and imbalanced-learn are separate libraries for data preprocessing that are built as add-ons to scikit-learn.

We are going to use some of these tools in this book, and you will notice how easy it is to integrate these different tools into your workflow when using scikit-learn.

Being a key player in the Python data ecosystem is what makes scikit-learn the de facto toolset for machine learning. This is the tool that you will most likely use to complete your job application assignments, to compete on Kaggle, and to solve most of your day-to-day machine learning problems at work.

Practical level of abstraction

scikit-learn implements a vast number of machine learning, data processing, and model selection algorithms. These implementations are abstract enough that you only need to apply minor changes when switching from one algorithm to another. This is a key feature since you will need to quickly iterate between different algorithms when developing a model to pick the best one for your problem. That said, this abstraction doesn't shield you from the algorithms' configurations. In other words, you are still in full control of your hyperparameters and settings.
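The following sketch illustrates this level of abstraction. The dataset, the three classifiers, and the hyperparameter values are arbitrary choices made for the example, not recommendations. Notice that switching algorithms only changes the line where the estimator is created, while each estimator still exposes its own hyperparameters.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Each estimator keeps its own hyperparameters (C, max_depth, n_neighbors)
models = {
    "logistic_regression": LogisticRegression(C=1.0, max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}

# The fit/score calls are identical regardless of the algorithm
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```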

When not to use scikit-learn

Most likely, the reasons not to use scikit-learn will involve deep learning, scale, or a combination of the two. scikit-learn's implementation of neural networks is limited. In contrast, TensorFlow and PyTorch allow you to use custom architectures, and they support GPUs for training at massive scale. All of scikit-learn's implementations run in memory on a single machine. I'd say that well over 90% of businesses operate at a scale where these constraints are fine. Data scientists can still fit their data in memory on large enough machines thanks to the cloud options available. They can also cleverly engineer workarounds to deal with scaling issues, but if these limitations become something that they can no longer work around, then they will need other tools.
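One workaround of this kind, offered here as an illustration rather than as the approach this chapter has in mind, is incremental learning: some scikit-learn estimators provide a `partial_fit` method, so a model can be trained one batch at a time instead of loading the whole dataset at once. In the sketch below, the `batches` generator is a hypothetical stand-in for chunks read from disk or a database, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# partial_fit needs to know every class label up front,
# since a single batch may not contain all of them
classes = np.array([0, 1])


def batches(n_batches=20, batch_size=100):
    """Hypothetical stand-in for reading a large dataset chunk by chunk."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 2))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)

# Only one batch is held in memory at any point in this loop
for X_batch, y_batch in batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on a fresh batch that the model has not seen
X_check, y_check = next(batches(n_batches=1))
accuracy = clf.score(X_check, y_check)
print(f"Accuracy on a held-out batch: {accuracy:.2f}")
```

Only a subset of estimators support `partial_fit`, so this does not remove the single-machine constraint in general.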

Solutions such as Dask are being developed to allow scikit-learn to scale to multiple machines. Many scikit-learn algorithms allow parallel execution using joblib, which natively provides thread-based and process-based parallelism. Dask can scale these joblib-backed algorithms out to a cluster of machines by providing an alternative joblib backend.
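As a rough sketch of what selecting a joblib backend looks like, the snippet below runs cross-validation under joblib's thread-based backend. The dataset and model are arbitrary choices for the example. Swapping in a Dask backend is only indicated in a comment, since it assumes that Dask's distributed package is installed and that a cluster client has been created, neither of which is shown here.

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Run the joblib-backed cross-validation using two threads.
# "loky" would select joblib's process-based backend instead.
# With Dask installed and a client connected, "dask" is the
# backend name that would send the work to a cluster (assumption:
# not exercised in this sketch).
with parallel_backend("threading", n_jobs=2):
    thread_scores = cross_val_score(forest, X, y, cv=4)

print(thread_scores.mean())
```

The modelling code inside the `with` block stays the same whichever backend is chosen; only the backend name changes.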