Database management systems

Different database management systems (DBMS) support diverse application scenarios, use cases, and requirements. DBMS have a long history. First, we will quickly take a look at the recent history, and then explore the market-dominant database management system categories.

A brief history

Broadly, the term database can be used to present a collection of things. Moreover, this term brings to mind many other terms including data, information, data structure, and management. A database can be defined as a collection or repository of data, which has a certain structure, managed by a DBMS. Data can be structured as tabular data, semi-structured as XML documents, or unstructured data that does not fit a predefined data model.

In the early days, databases were mainly aimed at supporting business applications; this led us to the well-defined relational algebra and relational database systems. With the introduction of object-oriented languages, new paradigms of DBMS appeared, such as object-relational databases and object-oriented databases. Also, many businesses, as well as scientific applications, use arrays, images, and spatial data; thus, new models such as raster, map, and array algebra are supported. Graph databases are used to support graph queries such as the shortest path from one node to another, along with supporting traversal queries easily.

With the advent of web applications such as social portals, it is now necessary to support a huge number of requests in a distributed manner. This has led to another new paradigm of databases called Not Only SQL (NoSQL), which has different requirements, such as performance over fault tolerance and horizontal scaling capabilities. In general, the timeline of database evolution was greatly affected by many factors, such as the following:

Functional requirements: The nature of the applications using a DBMS led to the development of extensions on top of relational databases such as PostGIS (for spatial data), or even a dedicated DBMS such as SciDB (for scientific data analytics).
Nonfunctional requirements: The success of object-oriented programming languages has created new trends such as object-oriented databases. Object-relational DBMS have appeared to bridge the gap between relational databases and the object-oriented programming languages. Data explosion and the necessity to handle terabytes of data on commodity hardware have led to database systems that can easily scale up horizontally.

Database categories

Many database models have appeared and vanished, such as the network model and the hierarchical model. The predominant categories now in the market are relational, object-relational, and NoSQL databases. One should not think of NoSQL and SQL databases as rivals; they are complementary to each other. By utilizing different database systems, one can overcome many limitations and get the best of different technologies.

NoSQL databases can provide great benefits such as availability, schema-free, and horizontal scaling, but they also have limitations such as performance, data retrieval constraints, and learning time. Relational databases often adhere to SQL as defined by ISO. SQL is a very expressive and extremely powerful tool for retrieving data in different forms. Many NoSQL databases such as Cassandra lack the capability to retrieve data as in relational databases.

NoSQL databases

NoSQL databases are affected by the CAP theorem, also known as Brewer's theorem. In 2002, S. Gilbert and N. Lynch published a formal proof of the CAP theorem in their article Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. In 2009, the NoSQL movement began. Currently, there are over 150 NoSQL databases (nosql-database.org).

The CAP theorem

The CAP theorem states that it is impossible for a distributed computing system to simultaneously provide all three of the following guarantees:

Consistent: All clients see (immediately) the latest data even in the case of updates.
Available: All clients can find a replica of some data even in the case of a node failure. This means that even if some part of the system goes down, the clients can still access consistent and valid data.
Partition tolerance: The system continues to work regardless of arbitrary message loss or failure of part of the system.

The choice of which features to discard determines the nature of the system. For example, one could sacrifice consistency to get a scalable, simple, and high-performance database management system. Often, the main difference between a relational database and a NoSQL database is consistency. A relational database enforces atomicity, consistency, isolation, and durability (ACID) properties. In contrast, many NoSQL databases adopt the basically available, soft-state, eventual consistency (BASE) model.

NoSQL motivation

A NoSQL database does not entity relation model for data storage, manipulation, and retrieval. NoSQL databases are often distributed, open source, and horizontally scalable. NoSQL often adopts the base model, which prizes availability over consistency, and informally guarantees that if no new updates are made on a data item, eventually all access to that data item will return the latest version of that data item. The advantages of this approach include the following:

Simplicity of design
Horizontal scaling and easy replication

Schema-free
A huge amount of data support

We will now explore a few types of NoSQL databases.

Key-value databases

The key-value store is the simplest database store. In this database model, the storage, as its name suggests, is based on maps or hash tables. Some key-value databases allow complex values to be stored as lists and hash tables. Key-value pairs are extremely fast for certain scenarios but lack the support for complex queries and aggregation. Some of the existing open source key-value databases are Riak, Redis, Couchbase Server, and MemcacheDB.

Columnar databases

Columnar or column-oriented databases are based on columns. Data in a certain column in a two-dimensional relation is stored together.

Unlike relational databases, adding columns is inexpensive and is done on a row-by-row basis. Rows can have a different set of columns. Tables can benefit from this structure by eliminating the storage cost of the null values. This model is best suited for distributed databases.

HBase is one of the most famous columnar databases. It is based on the Google Bigtable storage system. Column-oriented databases are designed for huge data scenarios, so they scale up easily. For example, Facebook uses HBase to power their message infrastructure. For small datasets, HBase is not a suitable architecture. First, the recommended hardware topology for HBase is a five-node server deployment. Also, it needs a lot of administration and is difficult to learn and master.

Document databases

A document-oriented database is suitable for documents and semi-structured data. The central concept of a document-oriented database is the notion of a document. Documents encapsulate and encode data (or information) in some standard formats or encodings such as XML, JSON, and BSON. Documents do not adhere to a standard schema or have the same structure, so they provide a high degree of flexibility. Unlike relational databases, changing the structure of the document is simple and does not lock the clients from accessing the data.

Document databases merge the power of relational databases and column-oriented databases. They provide support for ad hoc queries and can be scaled up easily. Depending on the design of the document database, MongoDB is designed to handle a huge amount of data efficiently. On the other hand, CouchDB provides high availability even in the case of hardware failure.

Graph databases

Graph databases are based on graph theory, where a database consists of nodes and edges. The nodes, as well as the edges, can be assigned data. Graph databases allow traversing between the nodes using edges. As a graph is a generic data structure, graph databases are capable of representing different data. A famous implementation of an open source, commercially supported graph database is Neo4j.