You're reading from Machine Learning with BigQuery ML Create, execute, and improve machine learning models in BigQuery using standard SQL queries

Product type Paperback

Published in Jun 2021

Publisher Packt

ISBN-13 9781800560307

Length 344 pages

Edition 1st Edition

Languages

Python

Tools

BigQuery

Concepts

Machine Learning

Author (1):

Alessandro Marrandino

Discovering BigQuery ML

Developing a new ML model can be demanding and time-consuming. It usually calls for several different skills and is a complex undertaking, especially in large enterprises. The typical journey of an ML model can be summarized with the following flow:

Figure 1.13 – An ML model's typical development life cycle

The first two steps involve preliminary raw data analyses and operations:

In the Data Exploration and Understanding phase, the data engineer or data scientist takes a first look at the data, tries to understand the meaning of all the columns in the dataset, and then selects the fields to take into consideration for the new use case.
During Data Preparation, the data engineer filters, aggregates, and cleans up the datasets, making them available and ready to use for the subsequent training phase.

After these first two stages, the actual ML development process starts:

Leveraging ML frameworks such as TensorFlow and programming languages such as Python, the data scientist will engage in the Design the ML model step, experimenting with different algorithms on the training dataset.
When the right ML algorithm is selected, the data scientist performs the Tuning of the ML model step, applying feature engineering techniques and hyperparameter tuning to get better performance out of the ML model.
When the model is ready, a final Evaluation step is executed on the evaluation dataset. This phase measures the effectiveness of the ML model on a new dataset that's different from the training one and may lead to further refinements of the model.
After the development process, the ML model is generally deployed and used in a production environment with scalability and robustness requirements.
The ML model may also be updated at a later stage, either because the incoming data has changed or to apply further improvements.

All of these steps require different skills and are based on the collaboration of different stakeholders, such as business analysts for data exploration and understanding, data engineers for data preparation, data scientists for the development of the ML model, and finally the IT department to make the model usable in a safe, robust, and scalable production environment.

BigQuery ML simplifies and accelerates the entire development process of a new ML model, allowing you to do the following:

Design, train, evaluate, and serve the ML model, leveraging SQL and the existing skills in your company.
Automate most of the tuning activities that are usually highly time-consuming to get an effective model.
Ensure that you have a robust, scalable, and easy-to-use ML model, leveraging all the native features of BigQuery that we've already discussed in the "BigQuery's advantages over traditional data warehouses" section of this chapter.
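
As a rough illustration of the first point, the train, evaluate, and serve steps can each be written as a single SQL statement. The following is a minimal sketch in Python that only builds and prints the statements, so it runs without a Google Cloud project. The dataset, table, model, and column names are hypothetical placeholders, and linear regression is chosen only as an example model type:

```python
# All names below are hypothetical placeholders, not taken from the book.
MODEL = "my_dataset.my_model"
TRAINING_TABLE = "my_dataset.training_table"
EVALUATION_TABLE = "my_dataset.evaluation_table"
NEW_DATA_TABLE = "my_dataset.new_data"
LABEL_COLUMN = "label"


def lifecycle_statements() -> dict:
    """Build one BigQuery ML statement per life cycle step."""
    # Train: CREATE MODEL fits the model on the rows returned by the query.
    train = (
        f"CREATE OR REPLACE MODEL `{MODEL}`\n"
        f"OPTIONS (model_type = 'linear_reg', "
        f"input_label_cols = ['{LABEL_COLUMN}']) AS\n"
        f"SELECT * FROM `{TRAINING_TABLE}`"
    )
    # Evaluate: ML.EVALUATE returns metrics computed on a separate dataset.
    evaluate = (
        f"SELECT * FROM ML.EVALUATE(MODEL `{MODEL}`,\n"
        f"  (SELECT * FROM `{EVALUATION_TABLE}`))"
    )
    # Serve: ML.PREDICT applies the trained model to new rows.
    predict = (
        f"SELECT * FROM ML.PREDICT(MODEL `{MODEL}`,\n"
        f"  (SELECT * FROM `{NEW_DATA_TABLE}`))"
    )
    return {"train": train, "evaluate": evaluate, "predict": predict}


statements = lifecycle_statements()
for step, sql in statements.items():
    print(f"-- {step}\n{sql};\n")
```

The printed statements can be pasted into the BigQuery console once the placeholder names are replaced with real ones. Alternatively, they can be submitted from Python with the `google-cloud-bigquery` client, which this sketch deliberately leaves out.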

In the following diagram, you can see the life cycle of an ML model that uses BigQuery ML:

Figure 1.14 – An ML model's development life cycle with BigQuery ML

Now that we've learned the basics of BigQuery ML, let's take a look at the main benefits that it can bring.

BigQuery ML benefits

BigQuery ML can bring both business and technical benefits during the life cycle of an ML model:

Business users and data analysts can evolve from a traditional descriptive and reporting approach to a new predictive approach, making better decisions with their existing SQL skills.
Technical users can benefit from the automation of BigQuery ML during the tuning phase of the model, using a single, centralized tool that can accelerate the entire development process of an ML model.
The development process is further sped up because the datasets required to build the ML model are already available to the right users and don't need to be moved from one data repository to another, which carries compliance and data duplication risks.
The IT department does not need to manage the infrastructure to serve and use the ML model in a production environment because the BigQuery serverless architecture natively supports the model in a scalable, safe, and robust manner.

After our analysis of the benefits that BigQuery ML can bring, let's now see what the supported ML algorithms are.

BigQuery ML algorithms

The list of ML algorithms supported by BigQuery ML is growing quickly. The following supervised ML techniques are currently supported:

Linear regression: To forecast numerical values with a linear model
Binary logistic regression: For classification use cases when the choice is between only two different options (Yes or No, 1 or 0, True or False)
Multiclass logistic regression: For classification scenarios when the choice is between multiple options
Matrix factorization: For developing recommendation engines based on past information
Time series: To forecast business KPIs leveraging time series data from the past
Boosted tree: For classification and regression use cases with XGBoost
AutoML Tables: To leverage AutoML capabilities from the BigQuery SQL interface
Deep Neural Network (DNN): For developing TensorFlow models for classification or regression scenarios without writing any TensorFlow code
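
In SQL, each of these techniques is selected through the `model_type` option of the `CREATE MODEL` statement. The lookup table below is a sketch of how the list above maps to `model_type` values. The identifiers are recalled from the BigQuery ML documentation rather than taken from this chapter, so treat them as assumptions and check the current `CREATE MODEL` reference before relying on them. In particular, the time series type was named `ARIMA` in earlier releases:

```python
# Assumed mapping from the techniques listed above to BigQuery ML
# model_type values; verify against the current CREATE MODEL reference.
MODEL_TYPES = {
    "Linear regression": ["LINEAR_REG"],
    # Binary and multiclass logistic regression share one model type.
    "Binary logistic regression": ["LOGISTIC_REG"],
    "Multiclass logistic regression": ["LOGISTIC_REG"],
    "Matrix factorization": ["MATRIX_FACTORIZATION"],
    # Older releases used 'ARIMA' for time series models.
    "Time series": ["ARIMA_PLUS"],
    "Boosted tree": ["BOOSTED_TREE_CLASSIFIER", "BOOSTED_TREE_REGRESSOR"],
    "AutoML Tables": ["AUTOML_CLASSIFIER", "AUTOML_REGRESSOR"],
    "Deep Neural Network (DNN)": ["DNN_CLASSIFIER", "DNN_REGRESSOR"],
}


def options_clause(model_type: str) -> str:
    """Render the OPTIONS clause that selects a model type."""
    return f"OPTIONS (model_type = '{model_type}')"


for technique, model_types in MODEL_TYPES.items():
    clauses = " or ".join(options_clause(t) for t in model_types)
    print(f"{technique}: {clauses}")
```

Techniques that cover both classification and regression map to two model types, one for each kind of task.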