Database management systems
Different database management systems support diverse application scenarios, use cases, and requirements. Database management systems have a long history. First we will quickly take a look at the recent history, and then explore the market-dominant database management system categories.
A brief history
Broadly, the term database can be used to present a collection of things. Moreover, this term brings to mind many other terms including data, information, data structure, and management. A database can be defined as a collection or a repository of data, which has a certain structure, managed by a database management system (DBMS). Data can be structured as tabular data, semi-structured as XML documents, or unstructured data that does not fit a predefined data model.
In early days, databases were mainly aimed at supporting business applications; this led us to the well-defined relational algebra and relational database systems. With the introduction of object-oriented languages, new paradigms of database management systems appeared such as object-relational databases and object-oriented databases. Also, many businesses as well as scientific applications use arrays, images, and spatial data; thus, new models such as raster, map, and array algebra are supported. Graph databases are used to support graph queries such as the shortest path from one node to another along with supporting traversal queries easily.
With the advent of web applications such as social portals, it is now necessary to support a huge number of requests in a distributed manner. This has led to another new paradigm of databases called NoSQL (Not Only SQL) which has different requirements such as performance over fault tolerance and horizontal scaling capabilities.
In general, the timeline of database evolution was greatly affected by many factors such as:
- Functional requirements: The nature of the applications using a DBMS has led to the development of extensions on top of relational databases such as PostGIS (for spatial data) or even dedicated DBMS such as SCI-DB (for scientific data analytics).
- Nonfunctional requirements: The success of object-oriented programming languages has created new trends such as object-oriented databases. Object relational database management systems have appeared to bridge the gap between relational databases and the object-oriented programming languages. Data explosion and the necessity to handle terabytes of data on commodity hardware have led to columnar databases, which can easily scale up horizontally.
Database categories
Many database models have appeared and vanished such as the network model and hierarchal model. The predominant categories now in the market are relational, object-relational databases, and NoSQL databases. One should not think of NoSQL and SQL databases as rivals; they are complementary to each other. By utilizing different database systems, one can overcome many limitations, and get the best of different technologies.
The NoSQL databases
The NoSQL databases are affected by the CAP theorem, also known as Brewer's theorem. In 2002, S. Gilbert and N. Lynch published a formal proof of the CAP theorem in their article: "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services". In 2009, the NoSQL movement began. Currently, there are over 150 NoSQL databases (nosql-database.org).
The CAP theorem
The CAP theorem states that it is impossible for a distributed computing system to simultaneously provide all three of the following guarantees:
- Consistency: All clients see (immediately) the latest data even in the case of updates.
- Availability: All clients can find a replica of some data even in the case of a node failure. That means even if some part of the system goes down, the clients can still access the data.
- Partition tolerance: The system continues to work regardless of arbitrary message loss or failure of part of the system.
The choice of which feature to discard determines the nature of the system. For example, one could sacrifice consistency to get a scalable, simple, and high-performance database management system.
Often, the main difference between a relational database and a NoSQL database is consistency. A relational database enforces ACID. ACID is the acronym for the following properties: Atomicity, Consistency, Isolation, and Durability. In contrast, many NoSQL databases adopt the basically available soft-state, eventual-consistency (BASE) model.
NoSQL motivation
A NoSQL database provides a means for data storage, manipulation, and retrieval for non-relational data. The NoSQL databases are distributed, open source and horizontally scalable. NoSQL often adopts the BASE model, which prizes availability over consistency, and informally guarantees that if no new updates are made on a data item, eventually all access to that data item will return the latest version of that data item. The advantages of this approach include the following:
- Simplicity of design
- Horizontal scaling and easy replication
- Schema free
- Huge amount of data support
We will now explore a few types of NoSQL databases.
Key value databases
The key value store is the simplest database store. In this database model, the storage, as its name suggests, is based on maps or hash tables. Some key-value databases allow complex values to be stored as lists and hash tables. Key-value pairs are extremely fast for certain scenarios, but lack the support for complex queries and aggregation. Some of the existing open source key-value databases are Riak, Redis, Memebase, and MemcacheDB.
Columnar databases
Columnar or column-oriented databases are based on columns. Data in a certain column in a two dimensional relation is stored together. Unlike relational databases, adding columns is inexpensive, and is done on a row-by-row basis. Rows can have a different set of columns. Tables can benefit from this structure by eliminating the storage cost of the null values. This model is best suited for distributed databases. HBase is one of the most famous columnar databases. It is based on the Google big table storage system. Column-oriented databases are designed for huge data scenarios, so they scale up easily. For small datasets, HBase is not a suitable architecture. First, the recommended hardware topology for HBase is a five-node or server deployment. Also, it needs a lot of administration, and is difficult to master and learn.
Document databases
A document-oriented database is suitable for documents and semi-structured data. The central concept of a document-oriented database is the notion of a document. Documents encapsulate and encode data (or information) in some standard formats or encodings such as XML, JSON, and BSON. Documents do not adhere to a standard schema or have the same structure; so, they provide a high degree of flexibility. Unlike relational databases, changing the structure of the document is simple, and does not lock the clients from accessing the data.
Document databases merge the power of relational databases and column-oriented databases. They provide support for ad-hoc queries, and can be scaled up easily. Depending on the design of the document database, MongoDB is designed to handle a huge amount of data efficiently. On the other hand, CouchDB provides high availability even in the case of hardware failure.
Graph databases
Graph databases are based on the graph theory, where a database consists of nodes and edges. The nodes as well as the edges can be assigned data. Graph databases allow traversing between the nodes using edges. Since a graph is a generic data structure, graph databases are capable of representing different data. A famous implementation of an open source commercially supported graph databases is Neo4j.
Relational and object relational databases
Relational database management systems are one of the most-used DBMSs in the world. It is highly unlikely that any organization, institution, or personal computer today does not have or use a piece of software that does not rely on RBDMS. Software applications can use relational databases via dedicated database servers or via lightweight RDBMS engines, embedded in the software applications as shared libraries.
The capabilities of a relational database management system vary from one vendor to another, but most of them adhere to the ANSI SQL standards. A relational database is formally described by relational algebra, and is modeled on the relational model. Object-relational database (ORD) are similar to relational databases. They support object-oriented model concepts such as:
- User defined and complex data types
- Inheritance
ACID properties
In a relational database, a single logical operation is called a transaction. The technical translation of a transaction is a set of database operations, which are create, read, update, and delete (CRUD). The simplest example for explaining a transaction is money transfer from one bank account to another, which normally involves debiting one account and crediting another. The ACID properties in this context could be described as follows:
- Atomicity: All or nothing, which means that if a part of a transaction fails, then the transaction fails as a whole.
- Consistency: Any transaction gets the database from one valid state to another valid state. Database consistency is governed normally by data constraints and the relation between data and any combination thereof. For example, imagine if one would like to completely purge his account on a shopping service. In order to purge his account, his account details, such as list of addresses, will also need to be purged. This is governed by foreign key constraints, which will be explained in detail in the next chapter.
- Isolation: Concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially.
- Durability: The transactions which are committed, that is executed successfully, are persistent even with power loss or some server crashes. This is done normally by a technique called write-ahead log (WAL).
The SQL Language
Relational databases are often linked to the Structured Query Language (SQL). SQL is a declarative programming language, and is the standard relational database language. American National Standard Institute (ANSI) and International standard organization (ISO) published the SQL standard for the first time in 1986, followed by many versions such as SQL:1999, SQL:2003, SQL:2006, SQL:2008, and so on.
The SQL language has several parts:
- Data definition language (DDL): It defines and amends the relational structure.
- Data manipulation language (DML): It retrieves and extracts information from the relations.
- Data control language (DCL): It controls the access rights to relations.
Basic concepts
A relational model is a first-order predicate logic, which was first introduced by Edgar F. Codd. A database is represented as a collection of relations. The state of the whole database is defined by the state of all the relations in the database. Different information can be extracted from the relations by joining and aggregating data from different relations, and by applying filters on the data.
In this section, the basic concepts of the relational model are introduced using the top-down approach by first describing the relation, tuple, attribute, and domain.
Note
The terms relation, tuple, attribute, and unknown, which are used in the formal relational model, are equivalent to table, row, column, and null in the SQL language.
Relation
Think of a relation as a table with a header, columns, and rows. The table name and the header help in interpreting the data in the rows. Each row represents a group of related data, which points to a certain object.
A relation is represented by a set of tuples. Tuples should have the same set of ordered attributes. Attributes have a domain, that is, a type and a name.
Customer relation | |||||
---|---|---|---|---|---|
|
|
|
|
| |
Tuple → |
|
|
|
| |
Tuple → |
|
|
|
|
|
Tuple → |
|
|
|
| |
Tuple → |
|
|
|
| |
↑ Attribute |
↑ Attribute |
↑ Attribute |
↑ Attribute |
↑ Attribute |
The relation schema is denoted by the relation name and the relation attributes. For example, customer (customer_id
, first_name
, last_name
, and Email
) is the relation schema for the customer relation. Relation state is defined by the set of relation tuples; thus, adding, deleting, and amending a tuple will change the relation to another state.
Tuple order or position in the relation is not important, and the relation is not sensitive to tuple order. The tuples in the relation could be ordered by a single attribute or a set of attributes. Also, a relation cannot have duplicate tuples.
A relation can represent entities in the real world, such as a customer, or can be used to represent an association between relations. For example, the customer could have several services, and a service can be offered to several customers. This could be modeled by three relations: customer
, service
, and customer_service
. The customer_service
relation associates the customer and the service relations. Separating the data in different relations is a key concept in relational database modeling. This concept called normalization is the process of organizing relation columns and relations to reduce data redundancy. For example, let us assume a collection of services is stored in the customer relation. If a service is assigned to multiple customers, that would result in data redundancy. Also, updating a certain service would require updating all its copies in the customer table.
Tuple
A tuple is a set of ordered attributes. They are written by listing the elements within parentheses ()
and separated by commas, such as (john, smith, 1971). Tuple elements are identified via the attribute name. Tuples have the following properties:
(a1,a2, a3, …an) = (b1, b2,b3,…,bn ) if and only if a1 = ba , a2=b2, … an= bn
- A tuple is not a set, the order of attributes matters.
(a1, a2) ≠(a2, a1)
(a1, a1) ≠(a1)
- A tuple has a finite set of attributes
In the formal relational model, multi-valued attributes as well as composite attributes are not allowed. This is important to reduce data redundancy and increasing data consistency. This isn't strictly true in modern relational database systems because of the utilization of complex data types such as JSON and key-value stores. There is a lot of debate regarding the application of normalization; the rule of thumb is to apply normalization unless there is a good reason not to do so.
Another important concept is that of the unknown values, that is, NULL
values. For example, in the customer relation, the phone number of a customer might be unknown. Predicates in relational databases uses three-valued logic (3VL), where there are three truth values: true, false, and unknown. In a relational database, the third value, unknown, can be interpreted in many ways, such as unknown data, missing data, or not applicable. The three-valued logic is used to remove ambiguity. Imagine two tuples in the customer relation with missing phone values; does that mean both have the same phone, that is, NULL=NULL
? The evaluation of the expression NULL=NULL
is also NULL
.
Attribute
Each attribute has a name and a domain, and the name should be distinct within the relation. The domain defines the possible set of values that the attribute can have. One way to define the domain is to define the data type and a constraint on this data type. For example, hourly wage should be a positive real number and bigger than five if we assume the minimum hourly wage is five dollars. The domain could be continuous, such as salary which is any positive real number, or discrete, such as gender.
The formal relational model puts a constraint on the domain: the value should be atomic. Atomic means that each value in the domain is indivisible. For instance, the name
attribute domain is not atomic, because it can be divided into first name and last name. Some examples of domains are as follows:
- Phone number: Numeric text with a certain length.
- Country code: Defined by ISO 3166 as a list of two letter codes (ISO alpha-2) and three letter codes (ISO alpha-3). The country codes for Germany are DE and DEU for alpha-2 and alpha-3 respectively.
Tip
It is good practice if you have lookup tables such as country code, currency code, and languages to use the already defined codes in ISO standards, instead of inventing your own codes.
Constraint
The relational model defines many constraints in order to control data integrity, redundancy, and validity.
- Redundancy: Duplicate tuples are not allowed in the relation.
- Validity: Domain constraints control data validity.
- Integrity: The relations within a single database are linked to each other. An action on a relation such as updating or deleting a tuple might leave the other relations in an invalid state.
We could classify the constraints in a relational database roughly into two categories:
- Inherited constraints from the relational model: Domain integrity, entity integrity, and referential integrity constraints.
- Semantic constraint, business rules, and application specific constraints: These constraints cannot be expressed explicitly by the relational model. However, with the introduction of procedural SQL languages such as PL/pgsql for PostgreSQL, relational databases can also be used to model these constraints.
Domain integrity constraint
The domain integrity constraint ensures data validity. The first step in defining the domain integrity constraint is to determine the appropriate data type. The domain data types could be integer
, real
, boolean
, character
, text
, inet
, and so on. For example, the data type of first name and e-mail address is text. After specifying the data type, check constraints, such as the mail address pattern, need to be defined.
- Check constraint: A check constraint can be applied to a single attribute or a combination of many attributes in a tuple. Let us assume that
customer_service
schema is defined as (customr_id
,service_id
,start_date
,end _date
,order_date
). For this relation, we can have a check constraint to make sure thatstart_date
andend_date
are entered correctly by applying the following check (start_date<end_date
). - Default constraint: The attribute can have a default value. The default value could be a fixed value such as the default hourly wage of the employees , for example, $10. It may also have a dynamic value based on a function such as random, current time, and date. For example, in the
customer_service
relation,order_date
can have a default value which is the current date. - Unique constraint: A unique constraint guarantees that the attribute has a distinct value in each tuple. It allows null values. For example, let us assume we have a relation player defined as player (
player_id
,player_nickname
). The player uses his ID to play with others; he can also pick up a nickname which is not used by someone else. - Not null constraint: By default, the attribute value can be null. The not null constraint restricts an attribute from having a null value. For example, each person in the birth registry record should have a name.
Entity integrity constraint
In the relational model, a relation is defined as a set of tuples. By definition, the element of the set is distinct. This means that all the tuples in a relation must be distinct. The entity integrity constraint is enforced by having a primary key which is an attribute/set of attributes having the following characteristics:
- The attribute should be unique
- The attributes should be not null
Each relation must have only one primary key, but can have many unique keys. A candidate key is a minimal set of attributes which can identify a tuple. All unique, not null
attributes can be candidate keys. The set of all attributes form a super key. In practice, we often pick up a single attribute to be a primary key instead of a compound key (key that consists of two or more attributes that uniquely identify a tuple) to reduce data redundancy, and to ease the joining of the relations with each other.
If the primary key is generated by the DBMS, then it is called a surrogate key. Otherwise, it is called a natural key. The surrogate key candidates can be sequences and universal unique identifiers (UUID). A surrogate key has many advantages such as performance, requirement change tolerance, agility, and compatibility with object relational mappers. More on surrogate keys will be covered in the following chapters.
Referential integrity constraints
Relations are associated with each other via common attributes. Referential integrity constraints govern the association between two relations, and ensure data consistency between tuples. If a tuple in one relation references a tuple in another relation, then the referenced tuple must exist. In the customer service example, if a service is assigned to a customer, then the service and the customer must exist as shown in the following example. For instance, in the customer_service
relation, we cannot have a tuple with values (5
, 1
,01-01-2014
, NULL
), because we do not have a customer with customer_id
equal to 5.
The lack of referential integrity constraints can lead to many problems such as:
- Invalid data in the common attributes
- Invalid information during joining of data from different relations.
Referential integrity constraints are achieved via foreign keys. A foreign key is an attribute or a set of attributes that can identify a tuple in the referenced relation. Since the purpose of a foreign key is to identify a tuple in the referenced relation, foreign keys are generally primary keys in the referenced relation. Unlike a primary key, a foreign key can have a null value. It can also reference a unique attribute in the referenced relation. Allowing a foreign key to have a null value enables us to model different cardinality constraints. Cardinality constraints define the participation between two different relations. For example, a parent can have more than one child; this relation is called one-to-many relationship, because one tuple in the referenced relation is associated with many tuples in the referencing relation. Also, a relation could reference itself. This foreign key is called a self-referencing or recursive foreign key. For example, a company acquired by another company:
|
|
|
|
| |
|
|
|
Primary key |
Foreign key |
Recursive foreign key
To ensure data integrity, foreign keys can be used to define several behaviors when a tuple in the referenced relation is updated or deleted. The following behaviors are called referential actions:
- Cascade: When a tuple is deleted or updated in the referenced relation, the tuples in the referencing relation are also updated or deleted.
- Restrict: The tuple cannot be deleted or the referenced attribute cannot be updated if it is referenced by another relation.
- No action: Similar to restrict, but it is deferred to the end of the transaction.
- Set default: When a tuple in the referenced relation is deleted or the referenced attribute is updated, then the foreign key value is assigned the default value.
- Set null: The foreign key attribute value is set to null when the referenced tuple is deleted.
Semantic constraints
Semantic integrity constraints or business logic constraints describe the database application constraints in general. Those constraints are either enforced by the business logic tier of the application program or by SQL procedural languages. Trigger and rule systems can also be used for this purpose. For example, the customer should have at most one active service at a time. Based on the nature of the application, one could favor using an SQL procedural language or a high-level programming language to meet the semantic constraints. The advantages of using the SQL programming language are:
- Performance: RDBMSs often have complex analyzers to generate efficient execution plans. Also, in some cases such as data mining, the amount of data that needs to be manipulated is very large. Manipulating the data using procedural SQL language eliminates the network data transfer. Finally, some procedural SQL languages utilize clever caching algorithms.
- Last minute change: For the SQL procedural languages, one could deploy bug fixes without service disruption.