Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases now! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
SQL for Data Analytics

You're reading from   SQL for Data Analytics Harness the power of SQL to extract insights from data

Arrow left icon
Product type Paperback
Published in Aug 2022
Publisher Packt
ISBN-13 9781801812870
Length 540 pages
Edition 3rd Edition
Languages
Arrow right icon
Authors (4):
Arrow left icon
Benjamin Johnston Benjamin Johnston
Author Profile Icon Benjamin Johnston
Benjamin Johnston
Matt Goldwasser Matt Goldwasser
Author Profile Icon Matt Goldwasser
Matt Goldwasser
Jun Shan Jun Shan
Author Profile Icon Jun Shan
Jun Shan
Upom Malik Upom Malik
Author Profile Icon Upom Malik
Upom Malik
Arrow right icon
View More author details
Toc

Table of Contents (11) Chapters Close

Preface 1. Understanding and Describing Data 2. The Basics of SQL for Analytics FREE CHAPTER 3. SQL for Data Preparation 4. Aggregate Functions for Data Analysis 5. Window Functions for Data Analysis 6. Importing and Exporting Data 7. Analytics Using Complex Data Types 8. Performant SQL 9. Using SQL to Uncover the Truth: A Case Study Appendix

Relational Databases and SQL

A relational database is a database that utilizes the relational model of data. The relational model, invented by Dr. Edgar F. Codd in 1970, organizes data as relations, or sets of tuple. Tuple is the mathematical term for a series of attributes grouped together in a particular order. A more common (and more practical) name for a tuple is a record. Each record consists of a series of attributes that generally describe the record.

For instance, a fast-moving consumer goods company wants to track its customers. They can save the customer information in a relation called customer_info. Each record in this relation contains details about one customer. The attributes in each record include information such as the customer's last name, first name, age, date of signup, and delivery address. This relationship and its first two records will look like this:

Figure 2.2: An example customer_info relation

Figure 2.2: An example customer_info relation

As you can see, each relation is indeed a two-dimensional table that looks like an Excel spreadsheet. Thus, when implemented in a relational database, these relations are called tables. Each table is made up of rows and columns. Each row of the table is a record, and the attributes are represented as columns of the table. There cannot be duplicate columns and the columns must follow the same order in all the rows. Every column also has a data type that describes the type of data in the column.

While not technically required, most tables in a relational database have a column (sometimes a group of columns) referred to as the primary key, which uniquely identifies a row of the database. In the example shown in Figure 2.2, each row contains a column called ID. This record, as the name suggests, is an attribute that can be used to uniquely identify this record. It is known as a relational key. In all other columns, you can have data duplicated across different rows. But in the primary key column(s), the data must be unique.

Most of the operations in a relational database, and in all data management systems, are organized around tables and the data inside them. They generally can be categorized into four groups—create, read, update, and delete. To utilize any data, you must create the definition of the dataset first, then create the individual data records one by one and put them into the dataset. Once a dataset is created, you can read all aspects of information from it. If there is any change to the data, you need to update the affected records.

Finally, when you do not need the data anymore, you will want to delete the records to save storage costs and increase performance. If you do not need this dataset, you can even delete the whole dataset by removing its definition from the database. These operations, by the order of each operation's position in a dataset's lifecycle, are generally called CRUD. CRUD stands for create, read, update, and delete.

In relational databases, all these operations are carried out using SQL. You will learn all the related SQL statements in this and the upcoming chapters.

Note

Virtually all relational databases that use SQL deviate from the relational model in some basic ways. For example, not every table has a specified relational key. Additionally, a relational model does not technically allow duplicate rows, but you can have duplicate rows in a relational database. These differences are minor and will not matter to most readers of this book.

Advantages and Disadvantages of SQL Databases

As discussed in the previous sections, since relations are collections of records that have clearly defined attributes in a defined order, they are considered structured data. Relational databases are the main tool used for storing and processing structured data.

Since the release of Oracle Database in 1979, SQL has become an industry standard for structured data in nearly all computer applications—and for good reasons. SQL databases provide a range of advantages that make them the first choice for many applications:

  • Intuitive: Relations represented as tables serve as a common data structure that almost everyone understands. As such, working with and reasoning about relational databases is much easier than doing so with other models.
  • Efficient: Using a technique known as normalization, relational databases allow the representation of data without unnecessarily repeating it. As such, relational databases can represent large amounts of information while utilizing less space. This reduced storage footprint also allows the database to reduce operation costs, making well-designed relational databases quick to process.
  • Declarative: SQL is a declarative language, meaning that when you write code, you only need to tell the computer what data you want, and the database takes care of determining how to execute the SQL code. You never have to worry about telling the computer how to access and pull data from the table.
  • Robust: Most popular SQL databases have a property known as atomicity, consistency, isolation, and durability (ACID) compliance, which guarantees the validity of the data, even if the hardware fails.

That said, there are still some downsides to SQL databases, which are as follows:

  • Relatively lower specificity: While SQL is declarative, its functionality can often be limited to what has already been programmed into it. Although most popular relational database software is updated constantly with new functionality being built all the time, it can be difficult to process and work with data structures and algorithms that are not programmed into a relational database.
  • Limited scalability: SQL databases are incredibly robust, but this robustness comes at a cost. As the amount of information you have doubles, the cost of resources increases even more than double. When very large volumes of information are involved, other data stores such as NoSQL databases may be efficient.
  • Sacrificing performance for consistency: Relational databases are generally designed for consistency, which means they will take extra steps to make sure multiple users will see the same data when they try to access/modify the data at the same time. To achieve this, relational databases implement some complex checking and data locking mechanisms into their operational logic. For usage scenarios that do not require consistency, especially for high-performance operations like search engines or social network sites, this is an unnecessary burden and will hurt the performance of the application.
  • Lack of semi-structured and unstructured data processing ability: The fundamental theory that SQL is built on is the relational theory, which, by definition, handles only structured data. Relational databases can store and fetch semi-structured and unstructured data. But processing this data requires processing power and functionalities that are beyond standard SQL. Later chapters of this book will cover some examples of this type of processing.
You have been reading a chapter from
SQL for Data Analytics - Third Edition
Published in: Aug 2022
Publisher: Packt
ISBN-13: 9781801812870
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime