In this section, we will learn about some of the core database concepts, including cardinality, database models, and various processing models.
Cardinality
Before we discuss database models, it is important to know about cardinality. Cardinality refers to the relationship between two entities or tables. The most popular ones include one-to-many, many-to-one, and many-to-many.
One-to-one relationship
In the case of a one-to-one relationship, a row or entry in one entity or table can be related to only one row in another entity or table. For example, in a Department of Motor Vehicles database, let's say there are two tables called License Info
and Driver Info
, as shown in the following diagram:
Figure 1.1 – An example of a one-to-one relationship
Here, Driver ID
can only be assigned to one driver as it has to uniquely identify a driver. Also, a driver can only be assigned one Driver ID
. So, here, any row in the License Info
table will be associated with a specific row in the Driver Info
table.
One-to-many relationship
In a one-to-many relationship, a single row from one entity or table can be associated with multiple rows in another entity or table.
For example, let's consider the Driver Info
and City Info
tables shown in the following diagram::
Figure 1.2 – An example of a one-to-many relationship
Here, for every row in City Info
, there will be multiple rows in Driver Info
, as there can be many drivers that live in a particular city.
Many-to-many relationship
In a many-to-many relationship, a single row in one entity or table can be associated with multiple rows in another entity or table and vice versa.
For example, let's consider two tables: Vehicle Ownership History
, where we are maintaining the history of ownership of a given vehicle, and Driver Ownership History
, where we are maintaining the history of vehicles owned by a given driver:
Figure 1.3 – An example of a many-to-many relationship
Here, a driver can own multiple vehicles and a vehicle can have multiple owners over time. So, a given row in the Vehicle Ownership History
table can be associated with multiple rows in the Driver Ownership History
table. Similarly, a given row in the Driver Ownership History
table can be associated with multiple rows in the Vehicle Ownership History
table.
Now, let's take a look at some of the most important database models.
Overview of database models
A database model determines how the data is stored, organized, and modified. Databases are typically implemented based on a specific data model. It is also possible to borrow concepts from multiple database models when you are designing a new database. The relational database model happens to be the most widely known and has been popularized by databases such as Oracle, IBM DB2, and MySQL.
Hierarchical database model
In the hierarchical database model, the data is organized in the form of a tree. There is a root at the first level and multiple children at the subsequent levels. Since a single parent can have multiple children, one-to-many relationships can easily be represented here. A child cannot have multiple parents, so this results in the advantage of not being able to model many-to-many relationships.
IBM's Information Management System (IMS) was the first database that implemented this data model.
The following diagram shows an example of a hierarchical database model:
Figure 1.4 – An example of a hierarchical database model
Typically, the tree starts with a single root and the data is organized into this tree. Any node except the leaves can have multiple children, but a child can have only one parent.
Network model
The network model was developed as an enhancement of the hierarchical database model to accommodate many-to-many relationships. The network model relies on a graph structure to organize its data. So, there is no concept of a single root, and a child can have multiple parents and a parent can have multiple children. Integrated Data Store (IDS), Integrated Database Management Systems (IDMS), and Raima Database Manager (RDM) are some of the popular databases that use the network model.
As shown in the following diagram, there is no single root and a given child (for example, Object 2 can have multiple parents; that is, Object 1 and Object 3):
Figure 1.5 – An example of a network model
Relational model
Although the network model was an improvement over the hierarchical model, it was still a little restrictive when it came to representing data. In the relational model, any record can have a relationship with any other with the help of a common field. This drastically reduced the design's complexity and made it easier to independently add, update, and access records, without having to walk down the tree or traverse the graph. SQL was combined with the relational database model to provide a simple query interface to add and retrieve data.
All the popular traditional databases such as Oracle database, IBM DB2, MySQL, MariaDB, and Microsoft SQL Server implement relational data models.
Let's look at two tables called Employee
and Employee Info
:
Figure 1.6 – Employee tables showing the column names
Here, Employee ID
is the common field or column between the Employee
and Employee Info
tables. The Employee
table is responsible for ensuring that a given Employee ID
is unique, while Employee Info
is responsible for more detailed information about a given employee.
Object-relational model
The object-relational model, as the name suggests, combines the best of the relational and object data models. The concept of objects, classes, and inheritance are directly supported as first-class citizens as part of the database and in queries. SQL:1999, the fourth revision of SQL, introduced several features for embedding object concepts into the relational database. One of the main features was to create structured user-defined types with CREATE TYPE
to define an object's structure.
Over time, relational databases have added more support for objects. There is a varying degree of support for object concepts in Oracle database, IBM DB2, PostgreSQL, and Microsoft SQL Server.
Given the scope of this book, we will not discuss the entity-relational model, object model, document model, star schema, snowflake schema, and many other less well-known models.
Now, let's look at how databases can be classified based on what kinds of workload they can be used for.
Processing models
Based on how you want to consume and process data, databases can be categorized into four different processing systems. Let's take a look.
Online transaction processing (OLTP)
OLTP systems support the concept of transactions. A transaction refers to the ability to atomically apply changes (insert, update, delete, and read) to a given system. One popular example is a bank, where withdrawing or depositing money to a given bank account must be done atomically to ensure data is not lost or incorrect. So, the main purpose here is to maintain data integrity and consistency. Also, these systems are generally suited for fast-running queries.
Online analytical processing (OLAP)
OLAP focuses mostly on running queries to analyze multi-dimensional data and to extract some intelligence or patterns from it. Typically, such systems support generating some sort of report that can be used for marketing, sales, financing, budgeting, management, and many more. Data mining and data analytics applications would typically have to have an OLAP system in some form. OLAP doesn't deal with transactions, and the emphasis is more on analyzing large amounts of data from different sources to extract business intelligence. Some databases also provide built-in support for MapReduce to run queries across a large set of data.
A data warehouse is a piece of software that's used for reporting and data analysis. Warehouses are typically developed for OLAP. It is also very common to retrieve the data from OLTP in batches or bulk, run it through an Extract, Load, and Transform (ELT) or Extract, Transform, and Load (ETL) data transformation pipeline, and store it in an OLAP system.
Online event processing (OLEP)
OLEP guarantees strong consistency without the traditional atomic commit protocols or distributed locking. OLEP also focuses on high performance, larger scales, and fault tolerance.
Hybrid transaction/analytical processing (HTAP)
As the name suggests, this system tries to provide the best of both transactions and analytical processing. Most of the NoSQL and NewSQL databases provide support for managing both transactional and analytical workloads. Vitess is a database clustering system that can be used to scale and shard MySQL instances. Vitess provides HTAP features on top of MySQL by allowing a given MySQL instance to be configured as master or read-only, where read-only can be used for analytical queries and MapReduce. It is possible to use CockroachDB as HTAP by propagating changes with the help of change data capture (CDC) in the OLTP cluster or primary cluster to a separate cluster, which is solely used for analytical processing.
Now, let's learn a bit about embedded and mobile databases, including why they exist and some of the most popular ones in this space.
Embedded and mobile databases
Embedded databases usually refer to databases that can be tightly integrated into an application, without needing separate hardware to support them. Also, they don't have to be managed separately. Some of the most popular embedded databases include SQLite, Berkeley DB from Oracle Corporation, and SQL Server Compact from Microsoft Corporation. Embedded databases are also very useful for testing purposes as they can be started within test suites.
Mobile database refers to the class of databases that work with very limited memory footprint and compute and can be deployed within a mobile device. They are typically used for storing user data for apps running on mobile devices. SQLite, SQL Server Compact, Oracle database Lite, Couchbase Lite, SQL Anywhere, SQL Server Express, and DB2 Everyplace belong to this category,
Database storage engines
A database storage engine is a component within a database management system that is responsible for Create, Read, Update, Delete (CRUD) operations and transferring data between disk and memory, without compromising data integrity. Some of the most popular ones include Apache Derby, HSQLDB, InfinityDB, LevelDB, RocksDB, and SQLite. CockroachDB initially started with RocksDB as its database engine, but from release 20.2 onward, Pebble will be the database engine by default. Pebble, as per Cockroach Labs, is a RocksDB-inspired and RocksDB-compatible key-value store focused on the needs of CockroachDB. RocksDB was implemented in C++, whereas Pebble was implemented in Golang. This makes it easier to manage and maintain as CockroachDB itself was written in Golang. This means that we only have to deal with one language now.