Discovering BigQuery ML
Developing a new ML model is often a complex, time-consuming activity that calls for several different skills, especially in large enterprises. The typical journey of an ML model can be summarized with the following flow:
The first two steps involve preliminary raw data analyses and operations:
- In the Data Exploration and Understanding phase, the data engineer or data scientist takes a first look at the data, tries to understand the meaning of all the columns in the dataset, and then selects the fields to take into consideration for the new use case.
- During Data Preparation, the data engineer filters, aggregates, and cleans up the datasets, making them available and ready to use for the subsequent training phase.
After these first two stages, the actual ML development process starts:
- Leveraging ML frameworks such as TensorFlow and programming languages such as Python, the data scientist will engage in the Design the ML model step, experimenting with different algorithms on the training dataset.
- When the right ML algorithm is selected, the data scientist performs the Tuning of the ML model step, applying feature engineering techniques and hyperparameter tuning to get better performance out of the ML model.
- When the model is ready, a final Evaluation step is executed on the evaluation dataset. This phase measures the effectiveness of the ML model on a dataset that's different from the training one and may lead to further refinements of the model.
- After the development process, the ML model is generally deployed and used in a production environment with scalability and robustness requirements.
- The ML model may also be updated at a later stage, either because the incoming data has changed or to apply further improvements.
All of these steps require different skills and are based on the collaboration of different stakeholders, such as business analysts for data exploration and understanding, data engineers for data preparation, data scientists for the development of the ML model, and finally the IT department to make the model usable in a safe, robust, and scalable production environment.
BigQuery ML simplifies and accelerates the entire development process of a new ML model, allowing you to do the following:
- Design, train, evaluate, and serve the ML model, leveraging SQL and the existing skills in your company.
- Automate most of the tuning activities that are usually highly time-consuming to get an effective model.
- Ensure that you have a robust, scalable, and easy-to-use ML model, leveraging all the native features of BigQuery that we've already discussed in the BigQuery's advantages over traditional data warehouses section of this chapter.
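The train, evaluate, and serve steps listed above each map to a SQL statement. The following is a minimal sketch in Python that only builds those statements as strings, so it runs without a Google Cloud project. The dataset, table, column, and model names are hypothetical placeholders, and the helper functions are our own illustration, not part of any BigQuery library:

```python
def create_model_sql(model, model_type, label, training_query):
    """Build a CREATE MODEL statement that trains on the rows of training_query."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label}']) AS\n"
        f"{training_query}"
    )


def evaluate_sql(model, evaluation_query):
    """Build an ML.EVALUATE query that scores the model on held-out rows."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model}`, ({evaluation_query}))"


def predict_sql(model, input_query):
    """Build an ML.PREDICT query that applies the trained model to new rows."""
    return f"SELECT * FROM ML.PREDICT(MODEL `{model}`, ({input_query}))"


# Hypothetical names, used for illustration only.
MODEL = "my_dataset.trip_duration_model"
FEATURES = "start_station, hour_of_day"

statements = [
    create_model_sql(
        MODEL,
        "LINEAR_REG",
        "duration_minutes",
        f"SELECT {FEATURES}, duration_minutes FROM `my_dataset.trips_train`",
    ),
    evaluate_sql(
        MODEL,
        f"SELECT {FEATURES}, duration_minutes FROM `my_dataset.trips_eval`",
    ),
    predict_sql(MODEL, f"SELECT {FEATURES} FROM `my_dataset.trips_new`"),
]

for statement in statements:
    print(statement, end=";\n\n")
```

The printed statements can be pasted into the BigQuery console or submitted through a client library. The point of the sketch is that the whole cycle stays in SQL, with no separate training or serving infrastructure.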
In the following diagram, you can see the life cycle of an ML model that uses BigQuery ML:
Now that we've learned the basics of BigQuery ML, let's take a look at the main benefits that it can bring.
BigQuery ML benefits
BigQuery ML can bring both business and technical benefits during the life cycle of an ML model:
- Business users and data analysts can move from a traditional descriptive and reporting approach to a predictive one and make better decisions, using their existing SQL skills.
- Technical users can benefit from the automation of BigQuery ML during the tuning phase of the model, using a single, centralized tool that can accelerate the entire development process of an ML model.
- The development process is further sped up because the datasets required to build the ML model are already available to the right users and don't need to be moved from one data repository to another, a step that carries compliance and data duplication risks.
- The IT department does not need to manage the infrastructure to serve and use the ML model in a production environment because the BigQuery serverless architecture natively supports the model in a scalable, safe, and robust manner.
After our analysis of the benefits that BigQuery ML can bring, let's now see what the supported ML algorithms are.
BigQuery ML algorithms
The list of ML algorithms supported by BigQuery ML is growing quickly. The following supervised ML techniques are currently supported:
- Linear regression: To forecast numerical values with a linear model
- Binary logistic regression: For classification use cases when the choice is between only two different options (Yes or No, 1 or 0, True or False)
- Multiclass logistic regression: For classification scenarios when the choice is between multiple options
- Matrix factorization: For developing recommendation engines based on past information
- Time series: To forecast business KPIs leveraging time series data from the past
- Boosted tree: For classification and regression use cases with XGBoost
- AutoML Tables: To leverage AutoML capabilities from the BigQuery SQL interface
- Deep Neural Network (DNN): For developing TensorFlow models for classification or regression scenarios without writing any code other than SQL
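In SQL, each of these techniques is selected through the model_type option of the CREATE MODEL statement. The lookup below is a sketch of that correspondence as we understand it from the BigQuery ML documentation. The identifiers change as the product evolves (for example, time series models were originally exposed as ARIMA), so treat the values as assumptions to check against the current reference. The function name is hypothetical:

```python
from typing import Dict, List

# Technique name -> model_type value(s). These are assumptions based on the
# BigQuery ML CREATE MODEL documentation; verify them before relying on them.
MODEL_TYPES: Dict[str, List[str]] = {
    "linear regression": ["LINEAR_REG"],
    # Binary and multiclass logistic regression share one model type;
    # the number of distinct label values decides which one is trained.
    "binary logistic regression": ["LOGISTIC_REG"],
    "multiclass logistic regression": ["LOGISTIC_REG"],
    "matrix factorization": ["MATRIX_FACTORIZATION"],
    "time series": ["ARIMA_PLUS"],
    "boosted tree": ["BOOSTED_TREE_CLASSIFIER", "BOOSTED_TREE_REGRESSOR"],
    "automl tables": ["AUTOML_CLASSIFIER", "AUTOML_REGRESSOR"],
    "deep neural network": ["DNN_CLASSIFIER", "DNN_REGRESSOR"],
}


def model_types_for(technique: str) -> List[str]:
    """Return the model_type values for a technique name (case-insensitive)."""
    key = technique.strip().lower()
    if key not in MODEL_TYPES:
        raise KeyError(f"No model_type recorded for technique: {technique!r}")
    return MODEL_TYPES[key]


print(model_types_for("Boosted tree"))
```

Techniques that cover both classification and regression come in two variants, so the choice between them depends on whether the label column is categorical or numerical.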
When the training dataset doesn't contain labeled data, the learning is defined as unsupervised. BigQuery ML currently supports the following unsupervised technique:
- K-means clustering: For data segmentation of similar objects (people, facts, events)
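Because clustering is unsupervised, the corresponding CREATE MODEL statement needs no label column, only the features and the desired number of clusters. A minimal sketch, assuming a hypothetical customers table and a helper function of our own:

```python
def kmeans_model_sql(model, num_clusters, feature_query):
    """Build a CREATE MODEL statement for k-means; note there is no label option."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'KMEANS', num_clusters = {int(num_clusters)}) AS\n"
        f"{feature_query}"
    )


# Hypothetical dataset, table, and column names, used for illustration only.
sql = kmeans_model_sql(
    "my_dataset.customer_segments",
    4,
    "SELECT total_spend, visits_per_month FROM `my_dataset.customers`",
)
print(sql)
```

Once such a model is trained, running ML.PREDICT against it assigns each input row to one of the clusters.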
In addition to what is listed, BigQuery ML allows you to import and use pre-trained TensorFlow models using SQL statements.
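Importing a pre-trained TensorFlow model also goes through CREATE MODEL, this time pointing at the exported model files in Cloud Storage instead of a training query. The sketch below builds such a statement. The bucket path and model name are hypothetical, and the helper function is our own illustration:

```python
def import_tensorflow_model_sql(model, model_path):
    """Build a CREATE MODEL statement that imports a saved TensorFlow model."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'TENSORFLOW', model_path = '{model_path}')"
    )


# Hypothetical model name and Cloud Storage location, for illustration only.
sql = import_tensorflow_model_sql(
    "my_dataset.imported_tf_model",
    "gs://my-bucket/exported_model/*",
)
print(sql)
```

There is no AS SELECT clause here because no training takes place. The imported model can then be queried with ML.PREDICT like a model trained inside BigQuery.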