The evolution of SQL and NoSQL
Structured Query Language (SQL) existed even before the World Wide Web (WWW). Dr. E. F. Codd published the paper A Relational Model of Data for Large Shared Data Banks in June 1970 in Communications of the ACM, the journal of the Association for Computing Machinery (ACM). SQL was initially developed at IBM by Chamberlin and Boyce in 1974. Relational Software (now known as Oracle Corporation) was the first to develop a commercially available implementation of SQL, which was targeted at United States governmental agencies.
The first American National Standards Institute (ANSI) SQL standard came out in 1986. Since then, there have been eight revisions, with the most recent being published in 2016 (SQL:2016).
SQL was not particularly popular at the start of the WWW. Static content could just be hardcoded into the HTML page without much fuss. However, as the functionality of websites grew, webmasters wanted web page content to be driven by offline data sources, so that it could change over time without redeploying code.
Common Gateway Interface (CGI) scripts, written in Perl or as Unix shell scripts, drove the early database-driven websites of Web 1.0. With Web 2.0, the web evolved from directly injecting SQL results into the browser to using two-tier and three-tier architectures that separated views from the business and model logic, allowing SQL queries to be modular and isolated from the rest of the web application.
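The separation of views from model logic can be sketched in a few lines. The following is a minimal illustration using Python's built-in sqlite3 module; the articles table, the function names, and the sample rows are assumptions made up for this example and do not come from any specific framework:

```python
import sqlite3
from html import escape


def fetch_articles(conn, limit=10):
    """Model layer: the only place that knows about SQL."""
    rows = conn.execute(
        "SELECT title, body FROM articles ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    return [{"title": title, "body": body} for title, body in rows]


def render_articles(articles):
    """View layer: turns plain dictionaries into HTML and contains no SQL."""
    items = "".join(
        f"<li><h2>{escape(a['title'])}</h2><p>{escape(a['body'])}</p></li>"
        for a in articles
    )
    return f"<ul>{items}</ul>"


def demo():
    """Wire the two layers together against a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles (title, body) VALUES (?, ?)",
        [("Hello", "First post"), ("Update", "Changed without a redeploy")],
    )
    conn.commit()
    html = render_articles(fetch_articles(conn))
    conn.close()
    return html
```

Because the query lives in a single function, it can be changed or tested without touching the code that produces the HTML, which is the modularity that the multi-tier architectures aimed for.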
On the other hand, Not only SQL (NoSQL) is much more modern and followed the evolution of the web, rising at the same time as Web 2.0 technologies. The term was first coined by Carlo Strozzi, in 1998, for his open source database that did not follow the SQL standard but was still relational.
This is not what we currently expect from a NoSQL database. Johan Oskarsson, a developer at Last.fm, reintroduced the term in early 2009, in order to group a set of distributed, non-relational data stores that were being developed. Many of them were based on Google’s Bigtable and MapReduce papers or on Amazon’s Dynamo paper, which describes a highly available key-value storage system.
NoSQL’s foundations grew upon relaxing the atomicity, consistency, isolation, and durability (ACID) properties in exchange for performance, scalability, flexibility, and reduced complexity. Most NoSQL databases have gone one way or the other in providing as many of these qualities as possible, even offering adjustable guarantees to the developer. The following diagram describes the evolution of SQL and NoSQL:
Figure 1.1 – Database evolution
In the next section, we will learn more about how MongoDB has evolved over time, from a basic object store to a full-fledged general-purpose database system.
The evolution of MongoDB
MongoDB Inc., formerly known as 10gen Inc., started to develop a cloud computing stack in 2007 and soon realized that the most important innovation was centered around the document-oriented database that they built to power it, which was MongoDB. The company shifted from a Platform as a Service (PaaS) to an open source model and released MongoDB version 1.0 on August 27, 2009.
Version 1 of MongoDB was pretty basic in terms of features, authorization, and ACID guarantees, but it made up for these shortcomings with performance and flexibility.
In the following sections, we will highlight the major features of MongoDB, along with the version numbers with which they were introduced.
The major feature set for versions 1.0 and 1.2
The major new features of versions 1.0 and 1.2 are listed as follows:
- A document-based model
- A global lock (process level)
- Indexes on collections
- CRUD operations on documents
- No authentication (authentication was handled at the server level)
- Primary and secondary replication: Back then, they were named master and slave, respectively, and were changed to their current names with the SERVER-20608 ticket, in version 4.9.0
- MapReduce (introduced in v1.2)
- Stored JavaScript functions (introduced in v1.2)
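As a rough illustration of the document model, indexes, and CRUD operations listed above, the following sketch uses the method names of the modern PyMongo driver (insert_one, find_one, update_one, delete_one, and create_index), not the API that shipped with version 1.0. The collection argument is assumed to be a PyMongo collection object, and the document fields are made up for the example:

```python
def crud_roundtrip(collection):
    """Run one create/read/update/delete cycle on a single document.

    `collection` is assumed to be a pymongo Collection; the document
    shape below is purely illustrative.
    """
    # A document can nest arrays and sub-documents without a table schema.
    post = {
        "title": "Hello MongoDB",
        "tags": ["intro", "nosql"],
        "author": {"name": "Ada", "country": "UK"},
    }

    # An index on a field of the collection, here a nested one.
    collection.create_index("author.name")

    # Create
    post_id = collection.insert_one(post).inserted_id
    # Read
    found = collection.find_one({"_id": post_id})
    # Update: change a single field in place
    collection.update_one({"_id": post_id}, {"$set": {"title": "Hello again"}})
    # Delete
    collection.delete_one({"_id": post_id})
    return found
```

With a running server, this could be called as crud_roundtrip(MongoClient()["blog"]["posts"]), where the blog database and posts collection names are again hypothetical.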
Version 2
The major new features of version 2 are listed as follows:
- Background index creation (since v1.4)
- Sharding (since v1.6)
- More query operators (since v1.6)
- Journaling (since v1.8)
- Sparse and covered indexes (since v1.8)
- Compact commands to reduce disk usage
- More efficient memory usage
- Concurrency improvements
- Index performance enhancements
- Replica sets are now more configurable and data center-aware
- MapReduce improvements
- Authentication (since 2.0, for sharding and most database commands)
- Geospatial features introduced
- The aggregation framework (since v2.2) and enhancements (since v2.6)
- Time-to-Live (TTL) collections (since v2.2)
- Concurrency improvements, among which there is DB-level locking (since v2.2)
- Text searching (since v2.4) and integration (since v2.6)
- Hashed indexes (since v2.4)
- Security enhancements and role-based access (since v2.4)
- A V8 JavaScript engine instead of SpiderMonkey (since v2.4)
- Query engine improvements (since v2.6)
Version 3
The major new features of version 3 are listed as follows:
- A pluggable storage engine API (since v3.0)
- The WiredTiger storage engine (since v3.0), with document-level locking, while the previous storage engine (now called MMAPv1) supports collection-level locking
- Replication and sharding enhancements (since v3.2)
- Document validation (since v3.2)
- The aggregation framework’s enhanced operations (since v3.2)
- Multiple storage engines (since v3.2, only in Enterprise Edition)
- Query language and indexes collation (since v3.4)
- Read-only database views (since v3.4)
- Linearizable read concerns (since v3.4)
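Two of the items above, document validation and read-only views, can be sketched with PyMongo's create_collection method, which forwards extra keyword arguments such as validator, viewOn, and pipeline to the server. The db argument is assumed to be a PyMongo database object, and the users and adult_users names and their fields are hypothetical:

```python
def create_validated_users(db):
    """Create a collection whose writes are checked against a validator.

    The validator is an ordinary query expression: a document is accepted
    only if it would match this filter.
    """
    validator = {
        "email": {"$type": "string"},  # email must be present and a string
        "age": {"$gte": 0},            # age must be present and non-negative
    }
    db.create_collection("users", validator=validator)


def create_adults_view(db):
    """Create a read-only view defined by an aggregation pipeline."""
    pipeline = [
        {"$match": {"age": {"$gte": 18}}},  # keep adults only
        {"$project": {"email": 0}},         # hide the email field
    ]
    db.create_collection("adult_users", viewOn="users", pipeline=pipeline)
```

The view stores no data of its own; the pipeline is evaluated against the users collection each time the view is queried.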
Version 4
The major new features of version 4 are listed as follows:
- Multi-document ACID transactions (since v4.0)
- Change streams (since v4.0)
- MongoDB tools (Stitch, Mobile, Sync, and Kubernetes Operator) (since v4.0)
- Retryable writes (since v4.0)
- Distributed transactions (since v4.2)
- Removing the outdated MMAPv1 storage engine (since v4.2)
- Updating the shard key (since v4.2)
- On-demand materialized views using aggregation pipelines (since v4.2)
- Wildcard indexes (since v4.2)
- Streaming replication in replica sets (since v4.4)
- Hidden indexes (since v4.4)
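To make the first item in this list more concrete, the following is a minimal sketch of a multi-document transaction written against PyMongo's session API (start_session and start_transaction). The client argument is assumed to be a MongoClient connected to a replica set or sharded cluster, and the bank database, the accounts collection, and the balance field are hypothetical names chosen for the example:

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two account documents atomically.

    `client` is assumed to be a pymongo MongoClient; database, collection,
    and field names are illustrative only.
    """
    accounts = client["bank"]["accounts"]
    with client.start_session() as session:
        # The transaction commits when the block exits normally and is
        # aborted if an exception escapes it.
        with session.start_transaction():
            accounts.update_one(
                {"_id": from_id},
                {"$inc": {"balance": -amount}},
                session=session,
            )
            accounts.update_one(
                {"_id": to_id},
                {"$inc": {"balance": amount}},
                session=session,
            )
```

Both updates must pass the same session object; an operation issued without it would run outside the transaction.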
Version 5
The major new features of version 5 are listed as follows:
- A quarterly MongoDB release schedule going forward
- Window operators using aggregation pipelines (since v5.0)
- A new MongoDB shell – mongosh (since v5.0)
- Native time series collections (since v5.0)
- Live resharding (since v5.0)
- Versioned APIs (since v5.0)
- Multi-cloud client-side field level encryption (since v5.0)
- Cross-Shard Joins and Graph Traversals (since v5.1)
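As an illustration of two of the items above, the sketch below creates a time series collection and builds a $setWindowFields pipeline that computes a moving average per sensor. The db argument is assumed to be a PyMongo database object, and the readings collection and the ts, sensor_id, and temperature fields are invented for the example:

```python
def create_readings_collection(db):
    """Create a native time series collection (requires MongoDB 5.0+)."""
    db.create_collection(
        "readings",
        timeseries={
            "timeField": "ts",           # timestamp of each measurement
            "metaField": "sensor_id",    # identifies the series
            "granularity": "minutes",
        },
    )


def moving_average_pipeline(window_size=3):
    """Build a pipeline averaging the current and previous documents.

    The window spans the current document and the `window_size - 1`
    documents before it, per sensor, ordered by time.
    """
    return [
        {
            "$setWindowFields": {
                "partitionBy": "$sensor_id",
                "sortBy": {"ts": 1},
                "output": {
                    "moving_avg": {
                        "$avg": "$temperature",
                        "window": {"documents": [-(window_size - 1), 0]},
                    }
                },
            }
        }
    ]
```

The pipeline is plain data, so it could be passed as is to db["readings"].aggregate(moving_average_pipeline()).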
The following diagram shows MongoDB’s evolution over time:
Figure 1.2 – MongoDB’s evolution
As you can see, version 1 was pretty basic, whereas version 2 introduced most of the features present in the current version, such as sharding, sparse and covered indexes, geospatial features, and memory and concurrency improvements.
On the way from version 2 to version 3, the aggregation framework was introduced, mainly as a supplement to the aging MapReduce framework that didn’t keep up to speed with dedicated frameworks, such as Hadoop. Then, text search was added, and slowly but surely, the performance, stability, and security of the database improved, adapting to the increasing enterprise loads of customers using MongoDB.
With WiredTiger’s introduction in version 3, locking became much less of an issue for MongoDB, as it was brought down from the process (global lock) to the document level, which is almost the most granular level possible.
Version 4 marked a major transition, bridging the SQL and NoSQL world with the introduction of multi-document ACID transactions. This allowed for a wider range of applications to use MongoDB, especially applications that require a strong real-time consistency guarantee. Further, the introduction of change streams allowed for a faster time to market for real-time applications using MongoDB. Additionally, a series of tools have been introduced to facilitate serverless, mobile, and Internet of Things (IoT) development.
With version 5, MongoDB is now a cloud-first database, with MongoDB Atlas offering full customer support for all major and minor releases going forward. In comparison, non-cloud users only get official support for major releases (for example, version 5 and then version 6). This is complemented by the newly released versioned API approach, which future-proofs applications. Live resharding addresses the major risk of choosing the wrong shard key, whereas native time series collections and cross-shard lookups using $lookup and $graphLookup greatly improve analytics capabilities and unlock new use cases. End-to-end encryption and multi-cloud support can help implement systems in industries that have unique regulatory needs and also avoid vendor lock-in. The new mongosh shell is a major improvement over the legacy mongo shell.
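The two aggregation stages mentioned above can be sketched as plain pipeline definitions. The collection names (orders, customers, and employees) and the field names are assumptions made for this example; only the stage names and their options come from the MongoDB aggregation language:

```python
def orders_with_customer_pipeline():
    """Join each order with its customer document, similar to a SQL join."""
    return [
        {
            "$lookup": {
                "from": "customers",          # collection to join with
                "localField": "customer_id",  # field in the orders documents
                "foreignField": "_id",        # field in the customers documents
                "as": "customer",             # output array field
            }
        },
        # $lookup produces an array; unwind it to one customer per order.
        {"$unwind": "$customer"},
    ]


def reporting_chain_pipeline(employee_id):
    """Walk the manager hierarchy upward from one employee."""
    return [
        {"$match": {"_id": employee_id}},
        {
            "$graphLookup": {
                "from": "employees",
                "startWith": "$manager_id",         # first hop: direct manager
                "connectFromField": "manager_id",   # follow each manager's manager
                "connectToField": "_id",
                "as": "reporting_chain",
            }
        },
    ]
```

Either list could be passed to a PyMongo collection's aggregate method, for example db["orders"].aggregate(orders_with_customer_pipeline()).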
Version 6 brings many incremental improvements. Time series collections now support sharding, compression, an extended range of secondary indexes, and updates and deletes (with limitations), making them useful for production use. The new slot-based query execution engine can be used for eligible queries, such as those using $group and $lookup, improving execution time by optimizing query calculations. Finally, queryable encryption and cluster-to-cluster syncing improve the operational and management aspects of MongoDB.
In its current state, MongoDB is a database that can handle heterogeneous workloads, ranging from startup Minimum Viable Product (MVP) and Proof of Concept (PoC) projects to enterprise applications with hundreds of servers.