You're reading from SQL for Data Analytics Harness the power of SQL to extract insights from data

Product type Paperback

Published in Aug 2022

Publisher Packt

ISBN-13 9781801812870

Length 540 pages

Edition 3rd Edition

Languages

SQL

Concepts

Data Analysis

Authors (4):

Benjamin Johnston

Matt Goldwasser

Jun Shan

Upom Malik

View More author details

Table of Contents (11) Chapters

Preface

1. Understanding and Describing Data

2. The Basics of SQL for Analytics FREE CHAPTER

3. SQL for Data Preparation

4. Aggregate Functions for Data Analysis

5. Window Functions for Data Analysis

6. Importing and Exporting Data

7. Analytics Using Complex Data Types

8. Performant SQL

9. Using SQL to Uncover the Truth: A Case Study

Appendix

Relational Databases and SQL

A relational database is a database that utilizes the relational model of data. The relational model, invented by Dr. Edgar F. Codd in 1970, organizes data as relations, or sets of tuple. Tuple is the mathematical term for a series of attributes grouped together in a particular order. A more common (and more practical) name for a tuple is a record. Each record consists of a series of attributes that generally describe the record.

For instance, a fast-moving consumer goods company wants to track its customers. They can save the customer information in a relation called customer_info. Each record in this relation contains details about one customer. The attributes in each record include information such as the customer's last name, first name, age, date of signup, and delivery address. This relationship and its first two records will look like this:

Figure 2.2: An example customer_info relation

As you can see, each relation is indeed a two-dimensional table that looks like an Excel spreadsheet. Thus, when implemented in a relational database, these relations are called tables. Each table is made up of rows and columns. Each row of the table is a record, and the attributes are represented as columns of the table. There cannot be duplicate columns and the columns must follow the same order in all the rows. Every column also has a data type that describes the type of data in the column.

While not technically required, most tables in a relational database have a column (sometimes a group of columns) referred to as the primary key, which uniquely identifies a row of the database. In the example shown in Figure 2.2, each row contains a column called ID. This record, as the name suggests, is an attribute that can be used to uniquely identify this record. It is known as a relational key. In all other columns, you can have data duplicated across different rows. But in the primary key column(s), the data must be unique.

Most of the operations in a relational database, and in all data management systems, are organized around tables and the data inside them. They generally can be categorized into four groups—create, read, update, and delete. To utilize any data, you must create the definition of the dataset first, then create the individual data records one by one and put them into the dataset. Once a dataset is created, you can read all aspects of information from it. If there is any change to the data, you need to update the affected records.

Finally, when you do not need the data anymore, you will want to delete the records to save storage costs and increase performance. If you do not need this dataset, you can even delete the whole dataset by removing its definition from the database. These operations, by the order of each operation's position in a dataset's lifecycle, are generally called CRUD. CRUD stands for create, read, update, and delete.

In relational databases, all these operations are carried out using SQL. You will learn all the related SQL statements in this and the upcoming chapters.

Note

Virtually all relational databases that use SQL deviate from the relational model in some basic ways. For example, not every table has a specified relational key. Additionally, a relational model does not technically allow duplicate rows, but you can have duplicate rows in a relational database. These differences are minor and will not matter to most readers of this book.

Advantages and Disadvantages of SQL Databases

As discussed in the previous sections, since relations are collections of records that have clearly defined attributes in a defined order, they are considered structured data. Relational databases are the main tool used for storing and processing structured data.

Since the release of Oracle Database in 1979, SQL has become an industry standard for structured data in nearly all computer applications—and for good reasons. SQL databases provide a range of advantages that make them the first choice for many applications:

Intuitive: Relations represented as tables serve as a common data structure that almost everyone understands. As such, working with and reasoning about relational databases is much easier than doing so with other models.
Efficient: Using a technique known as normalization, relational databases allow the representation of data without unnecessarily repeating it. As such, relational databases can represent large amounts of information while utilizing less space. This reduced storage footprint also allows the database to reduce operation costs, making well-designed relational databases quick to process.
Declarative: SQL is a declarative language, meaning that when you write code, you only need to tell the computer what data you want, and the database takes care of determining how to execute the SQL code. You never have to worry about telling the computer how to access and pull data from the table.
Robust: Most popular SQL databases have a property known as atomicity, consistency, isolation, and durability (ACID) compliance, which guarantees the validity of the data, even if the hardware fails.

That said, there are still some downsides to SQL databases, which are as follows:

Relatively lower specificity: While SQL is declarative, its functionality can often be limited to what has already been programmed into it. Although most popular relational database software is updated constantly with new functionality being built all the time, it can be difficult to process and work with data structures and algorithms that are not programmed into a relational database.
Limited scalability: SQL databases are incredibly robust, but this robustness comes at a cost. As the amount of information you have doubles, the cost of resources increases even more than double. When very large volumes of information are involved, other data stores such as NoSQL databases may be efficient.
Sacrificing performance for consistency: Relational databases are generally designed for consistency, which means they will take extra steps to make sure multiple users will see the same data when they try to access/modify the data at the same time. To achieve this, relational databases implement some complex checking and data locking mechanisms into their operational logic. For usage scenarios that do not require consistency, especially for high-performance operations like search engines or social network sites, this is an unnecessary burden and will hurt the performance of the application.
Lack of semi-structured and unstructured data processing ability: The fundamental theory that SQL is built on is the relational theory, which, by definition, handles only structured data. Relational databases can store and fetch semi-structured and unstructured data. But processing this data requires processing power and functionalities that are beyond standard SQL. Later chapters of this book will cover some examples of this type of processing.

You're reading from SQL for Data Analytics Harness the power of SQL to extract insights from data

Table of Contents (11) Chapters

Relational Databases and SQL

Advantages and Disadvantages of SQL Databases

Authors (4)

Personalised recommendations for you