Google BigQuery is a highly scalable, serverless, distributed data warehouse technology built internally by Google in 2006 and then released for public use on GCP in 2010. Thanks to its architecture, it can store petabytes of data and query them with high performance and on-demand scaling. Due to its serverless nature, users who store and query data on BigQuery don't have to manage the underlying infrastructure and can focus on implementing the logic that brings the business value, saving time and resources.
BigQuery is currently used by many large enterprises that leverage it to make data-driven decisions, including Twitter, The Home Depot, and Dow Jones.
BigQuery architecture
BigQuery has a distributed architecture running on thousands of nodes across Google's data centers. Your datasets are not stored on a single server; they are chunked and replicated across different regions to guarantee maximum performance and availability.
The storage and compute layers are fully decoupled in BigQuery. This means that the query engine runs on different servers from the servers where the data is stored. This feature enables BigQuery to provide great scalability both in terms of data volume and query execution. This decoupled paradigm is only possible thanks to Google's Petabit network, which moves data very quickly from one server to another, leveraging Google's proprietary fiber cables across the globe.
Now let's look deeper into how BigQuery manages storage and the compute engine.
Storage layer
Unlike traditional data warehouses, BigQuery stores data in columnar format rather than in row format. This approach enables you to do the following:
- Achieve a better compression ratio for each column, because the data in a column is typically homogeneous and simpler to compress.
- Reduce the amount of data to read, achieving the best possible performance for typical data warehouse use cases, which are usually based on selecting a small subset of columns in a table and applying aggregation operations such as sum, average, and maximum.
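The compression benefit of the columnar layout described above can be demonstrated locally. The following sketch (with an invented three-column table) compares compressing the same records serialized row by row versus column by column; because each column holds homogeneous, repetitive values, the columnar layout compresses better:

```python
import zlib

# Hypothetical table of 10,000 rows: (id, country, amount) -- invented data
rows = [(i, "US" if i % 2 else "DE", i % 100) for i in range(10_000)]

# Row-oriented layout: serialize record by record, mixing column types
row_bytes = b"".join(f"{i},{c},{a};".encode() for i, c, a in rows)

# Column-oriented layout: serialize each column contiguously
ids = ",".join(str(i) for i, _, _ in rows).encode()
countries = ",".join(c for _, c, _ in rows).encode()
amounts = ",".join(str(a) for _, _, a in rows).encode()

row_compressed = len(zlib.compress(row_bytes))
col_compressed = sum(len(zlib.compress(col)) for col in (ids, countries, amounts))

# Homogeneous columns compress better than interleaved rows
print(col_compressed < row_compressed)
```

This toy comparison uses zlib rather than BigQuery's internal encodings, but the effect is the same one the columnar format exploits: homogeneous data within a column yields higher compression ratios.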
All the data is stored in Google's proprietary distributed filesystem, named Google File System (codename Colossus). Distributing the data in this way guarantees faster I/O performance and better availability of data in the case of failures. Google File System is based on two different server types:
- Master servers: Nodes that don't store data but are responsible for managing the metadata of each file, such as the location and the number of available replicas of the chunks that compose a file.
- Chunk servers: Nodes that actually store the chunks of files that are replicated across different servers.
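The master/chunk-server split above can be sketched in a few lines. The following toy model (class and method names are invented for illustration, not a real API) shows a master tracking chunk metadata while chunk servers hold replicated data, so a read still succeeds after one server loses its copy:

```python
import random

class ChunkServer:
    """Stores actual chunk data, replicated across several servers."""
    def __init__(self, name):
        self.name = name
        self.chunks = {}  # (file, chunk_index) -> bytes

class Master:
    """Stores only metadata: which servers hold each chunk."""
    def __init__(self, servers, replicas=3):
        self.servers = servers
        self.replicas = replicas
        self.metadata = {}  # (file, chunk_index) -> list of server names

    def write_chunk(self, file, index, data):
        targets = random.sample(self.servers, self.replicas)
        for s in targets:
            s.chunks[(file, index)] = data
        self.metadata[(file, index)] = [s.name for s in targets]

    def read_chunk(self, file, index):
        # Any surviving replica can serve the read
        for s in self.servers:
            if s.name in self.metadata[(file, index)] and (file, index) in s.chunks:
                return s.chunks[(file, index)]
        raise IOError("all replicas lost")

servers = [ChunkServer(f"cs{i}") for i in range(5)]
master = Master(servers)
master.write_chunk("table.dat", 0, b"column-data")
servers[0].chunks.pop(("table.dat", 0), None)  # simulate one server failing
print(master.read_chunk("table.dat", 0))       # still readable from a replica
```

With three replicas on five servers, losing any single server leaves at least two copies, which is the availability property the real system relies on at much larger scale.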
In the following diagram, you can see how Google File System manages data:
Figure 1.8 – Google File System (Colossus) storage strategy
Now that we've learned how BigQuery handles large volumes of data, let's see how this data can be accessed by the compute layer.
Compute (query) layer
Fully decoupled from storage, the compute layer is responsible for receiving query statements from BigQuery users and executing them in the fastest way. The query engine is based on Dremel, a technology developed by Google and then published in a paper in 2010. This engine leverages a multi-level tree architecture:
- The root node of the tree receives the query to execute.
- The root node splits and distributes the query to other intermediate nodes named mixers.
- Mixer nodes have the task of rewriting queries before passing them to the leaf nodes or to other intermediate mixer nodes.
- Leaf nodes are responsible for parallelizing the reading of the chunks of data from Google File System.
- When the right chunks of data have been extracted from the filesystem, leaf nodes perform computations on the data and, if needed, shuffle intermediate results across other leaf nodes.
- At the end of the computation, each leaf node produces a result that is returned to the parent node.
- When all the results are returned to the root node, the outcome of the query is sent to the user or application that requested the execution.
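The steps above can be sketched as a toy aggregation tree. This illustrative example (the query, data chunks, and fan-out are invented) computes a SUM by having leaves scan chunks, mixers combine partial results, and the root merge everything, producing the same answer as a single-node computation:

```python
# Chunks of a column as they might be stored across the filesystem (invented)
DATA_CHUNKS = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def leaf(chunk):
    # Leaf nodes read their chunk and compute a partial aggregate
    return sum(chunk)

def mixer(chunks):
    # Mixer nodes combine the partial results of the leaves below them
    return sum(leaf(c) for c in chunks)

def root(all_chunks, fan_out=2):
    # The root splits the work across mixers and merges their results
    groups = [all_chunks[i:i + fan_out] for i in range(0, len(all_chunks), fan_out)]
    return sum(mixer(g) for g in groups)

print(root(DATA_CHUNKS))  # → 45, identical to sum(1..9) on one node
```

The key property shown here is that an aggregation such as SUM decomposes cleanly across the tree, which is what lets Dremel parallelize the scan-heavy part of a query over thousands of leaf nodes.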
The execution process of a query on BigQuery based on the multi-level tree is represented in the following diagram:
Figure 1.9 – The BigQuery engine is a multi-level tree
Each node provides a number of processing units called BigQuery slots to execute the business logic of the query. A BigQuery slot can be considered a virtual CPU on a Dremel node. BigQuery automatically calculates the number of slots needed to perform a specific query, depending on the query's complexity and the volume of data involved.
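BigQuery manages slot assignment automatically, but the effect of slot parallelism can be sketched with a simple back-of-the-envelope model (entirely illustrative, not BigQuery's actual scheduler): if a query stage has a number of independent work units, spreading them across more slots shortens the stage proportionally.

```python
import math

def estimated_stage_time(work_units, slots, unit_time=1.0):
    """Idealized time for one query stage if work parallelizes
    perfectly across the available slots (illustrative model only)."""
    return math.ceil(work_units / slots) * unit_time

# 1,000 work units over 500 slots take two "waves" of execution
print(estimated_stage_time(1000, 500))  # → 2.0
```

Real slot scheduling also accounts for shuffle stages, data skew, and fair sharing between queries, which is why the service computes this dynamically rather than exposing a simple formula.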
BigQuery's advantages over traditional data warehouses
Now that we've learned about the technical architecture underneath BigQuery, let's take a look at how this architecture translates into benefits for the enterprises that use it to become data-driven companies compared to other traditional on-premises data warehouses.
Serverless
As we have mentioned, BigQuery has a serverless architecture that saves users from having to manage not only the provisioning and maintenance of the servers but also all the maintenance operations related to upgrading and patching the operating system and the database software that supports the data warehouse. Thanks to the serverless approach, the user can easily start ingesting data into BigQuery and using it without having to perform capacity planning or any hardware and software provisioning upfront. This is particularly important for prototyping and for enabling a fail-fast approach that favors a culture of innovation and experimentation.
Scalability
It doesn't matter whether you need to store megabytes of data or petabytes; BigQuery provides maximum flexibility and scalability in terms of both data storage and processing. Thanks to its multi-tenant architecture, a small or medium-sized business can leverage the same innovative capabilities as the largest enterprises, or it can start with a small use case and scale later, according to business needs. Traditional data warehouse technologies use the same servers for storage and compute. For this reason, they are not well suited to unbalanced use cases, such as when large volumes of data storage are needed but high computing performance is not required, or vice versa. Thanks to its decoupled architecture, as we saw in the previous section, BigQuery is designed to scale storage and compute power independently, according to the user's actual requirements, reducing the total cost of ownership of the solution.
Availability
Thanks to its resilient, distributed architecture, BigQuery is able to offer a Service Level Agreement (SLA) with a monthly uptime percentage of at least 99.99%. This very high availability standard is guaranteed by Google without any extra effort from BigQuery users, who don't need to design their own high availability or disaster recovery strategies.
Performance
The BigQuery engine offers the ability to query terabytes of data in seconds and petabytes in minutes. This kind of performance is very hard to achieve with a traditional on-premises data warehouse. Higher performance means getting insights faster, as well as processing large volumes of data that would very often be impossible to manage on-premises without huge hardware and software investments. To further improve performance, BigQuery offers the option of enabling BigQuery BI Engine, an in-memory analytics layer that can be activated on top of BigQuery to execute queries faster, with sub-second latencies.
Real-time
Traditional data warehouses are designed for long batch operations and are often unable to manage real-time workloads. BigQuery, however, provides a specific interface to ingest data in real time, making it immediately available for analysis. This feature opens up new possibilities to companies that want to accelerate their analytics and overcome the typical approach of data warehouses, which usually involves dealing with the business situation of the day before.
Format flexibility
BigQuery stores files in a compressed and optimized format in Google File System but provides the option to load data in various formats typically used in data lake technologies. Users can load data into BigQuery using the Avro, ORC, Parquet, CSV, and JSON (newline-delimited) formats.
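As a small illustration of two of these formats, the snippet below renders the same two invented records as newline-delimited JSON (the JSON variant BigQuery expects, one object per line) and as CSV with a header row:

```python
import csv
import io
import json

# Two invented records to serialize in both formats
records = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

# Newline-delimited JSON: one complete JSON object per line
ndjson = "\n".join(json.dumps(r) for r in records)

# CSV with a header row
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

print(ndjson.splitlines()[0])    # → {"id": 1, "name": "alpha"}
print(csv_text.splitlines()[0])  # → id,name
```

Files in either shape (along with Avro, ORC, and Parquet) can then be loaded into a BigQuery table, typically after being staged in Google Cloud Storage.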
Innovative capabilities
BigQuery offers two SQL extensions that are not available in traditional data warehouses:
- BigQuery GIS: Provides the ability to easily manage geospatial data with SQL statements to execute geographic calculations and operations
- BigQuery ML: Allows users to train, evaluate, and run ML models leveraging SQL language with no programming experience, accelerating the development of innovative use cases with ML
Security
By default, BigQuery automatically encrypts and decrypts customers' data before storing it in Google File System. BigQuery is also responsible for managing and rotating the encryption keys. To further improve security, BigQuery provides the option to use Customer-Managed Encryption Keys (CMEKs). In this case, the keys are managed directly by the customer in Cloud Key Management Service (Cloud KMS).
Unauthorized access to and use of data can be prevented by setting the right roles in Google Cloud Identity and Access Management (IAM).
Integration with other GCP services
One of the great benefits of using BigQuery is the native integration with a lot of other GCP services:
Figure 1.10 – BigQuery integration with other GCP services
As we can see in the preceding diagram, Google BigQuery can read data from Google Bigtable, Google Cloud Storage, Google Cloud SQL, and Google Drive using external tables. This feature can simplify data ingestion into BigQuery from other databases, which can also be performed using ETL/ELT tools such as Google Dataflow and Data Fusion. Once datasets are stored in BigQuery, they can be accessed from other GCP components, such as Dataproc and Dataprep for data processing and preparation, and Data Studio, Looker, and Google Sheets for data visualization. BigQuery is also integrated with AI Platform Notebooks to allow data scientists and engineers to easily access data from their Jupyter environments.
To summarize, bringing data into BigQuery opens up a wide range of options that can be used according to the user's needs.
Rich ecosystem of partners
Beyond the integration with other GCP services, Google's partners provide connectors and integrations with BigQuery, creating a rich data management ecosystem. Some examples include Informatica, Tableau, Fivetran, Talend, and Confluent.
Public datasets
If you want to start using BigQuery from scratch, you can leverage the existing public tables available through the BigQuery Public Datasets Program. The program contains interesting datasets from different industries, countries, and domains. Some examples that we'll use in the next chapters to train our ML models include transactions from bike sharing services, open data on New York City, and records of taxi trips.
Interacting with BigQuery
There are different ways to interact with BigQuery:
- The BigQuery web UI in the Google Cloud Console, the graphical user interface accessible from web browsers, represents the easiest way to interact with BigQuery.
- The bq command-line tool, available with the installation of the Google Cloud SDK. It can be used to automate jobs and commands by including them in scripts.
- BigQuery REST APIs. The API layer natively provided by BigQuery can be used to integrate this service with other applications.
- Client libraries to favor the use of the most common programming languages, such as C#, Go, Java, Node.js, PHP, Python, and Ruby.
- JDBC/ODBC drivers, developed by Google's partner Magnitude Simba, available for Windows, macOS, and Linux systems.
- Connectors developed by third parties and Google's partners for their applications, such as Tableau for business intelligence, and Informatica and Talend for data ingestion and integration.
For our purposes, we'll take a look at the BigQuery web UI available in Google Cloud Console that will be used in the coming chapters to develop with BigQuery ML.
In the following screenshot, you can see how the BigQuery UI appears in the GCP console:
Figure 1.11 – BigQuery web and graphical user interface available in the GCP console
In the left column, the main BigQuery features are available and the datasets are listed and visible to the user. In this case, there is only one dataset.
The remaining part of the screen is occupied by the development canvas with the results and outcomes at the bottom. We'll learn how to use the BigQuery web UI in Chapter 2, Setting Up Your GCP and BigQuery Environment, when we'll create a GCP project and start using BigQuery.
BigQuery data structures
BigQuery structures, such as tables, views, and ML models, are organized in datasets. Each dataset is a container for different structures and can be used to control access to underlying data structures. A dataset is directly linked to the following:
- A GCP project that hosts the dataset itself and is usually linked to the billing account where the storage cost is billed
- A geographic location (regional or multi-regional) that is defined at creation time and cannot be changed later
- A specific name assigned to the dataset that must be unique within the GCP project
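The three constraints above can be captured in a small toy model (class names and dataset/project names are invented for illustration): a dataset belongs to one project, its location is fixed at creation time, and its name must be unique within that project.

```python
class Dataset:
    """A container for tables, views, and ML models."""
    def __init__(self, project_id, name, location):
        self.project_id = project_id  # hosting (and storage-billing) project
        self.name = name
        self.location = location      # immutable after creation

class Project:
    """A GCP project hosting zero or more datasets."""
    def __init__(self, project_id):
        self.project_id = project_id
        self.datasets = {}

    def create_dataset(self, name, location):
        # Dataset names must be unique within the project
        if name in self.datasets:
            raise ValueError(f"dataset {name!r} already exists in {self.project_id}")
        ds = Dataset(self.project_id, name, location)
        self.datasets[name] = ds
        return ds

p = Project("my-project")
p.create_dataset("sales_ds", "EU")
try:
    p.create_dataset("sales_ds", "US")  # duplicate name is rejected
except ValueError as e:
    print("rejected:", e)
```

In real BigQuery, a table is then referenced by its fully qualified name, `project.dataset.table`, which is how a query running in one project can read a dataset hosted in another.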
In the following diagram, you can see an example of a hierarchy composed of projects, datasets, tables, and BigQuery ML models:
Figure 1.12 – BigQuery hierarchy: datasets, tables, views, and ML models
A dataset hosted in GCP, Project A, can also be queried by users linked to another GCP project, Project B, if the right permissions are set. In this case, the storage costs are charged to Project A, which hosts the dataset structures, while the compute costs are billed on the billing account related to Project B. This is exactly what will happen when we use BigQuery public datasets in future chapters for hands-on exercises.
Tip
Remember, your queries can only include tables that reside in the same location. In BigQuery, you cannot query tables that are stored in different geographic locations together. If you want to execute queries on tables located in different regions, you first need to move the data into datasets in the same region, exporting and re-importing it through Google Cloud Storage.
Now that we've learned the main characteristics of BigQuery, let's focus more specifically on the core of this book: BigQuery ML.