What is DocumentDB?
The short answer to this question is that DocumentDB is a managed JSON document database service. But what is the impact on our programming paradigms? How can we use it? Why should we use it? Can it really make our life easier? The answers to these kinds of questions are a bit more involved and need additional clarification.
This section describes the fundamentals of DocumentDB and can help you decide whether or not it will be a good fit for your solution.
Microsoft built DocumentDB from the ground up because the feedback they got from customers was that they "…need a database that can keep pace with their rapidly evolving applications…." Schema-free databases are increasingly popular, but running them on-premises can be expensive and difficult to scale. Because customers also still needed rich querying and transactions, Microsoft decided to build DocumentDB.
This brings us to the longer version of our answer, which is that DocumentDB is "…a massively scalable, schema-free database with rich query and transaction processing using the most ubiquitous programming language, JavaScript, data model (JSON), and transport protocol (HTTP)…" (http://blogs.msdn.com/b/documentdb/archive/2014/08/22/introducing-azure-documentdb-microsoft-s-fully-managed-nosql-document-database-service.aspx).
The characteristics of a schema
As stated before, NoSQL databases are gaining popularity and are slowly replacing traditional relational databases. The main characteristics of a NoSQL database are listed next:
- Schema-less, with the ability to store everything
- Non-relational
- Extremely scalable
Note
Besides document databases such as DocumentDB, there are other types of NoSQL databases available, such as graph and key-value databases. We will compare them later in this chapter.
Having no schema (or predefined structure like tables and columns) allows us to store everything. This also includes attachments, user-defined functions, stored procedures, triggers, and more. The only restriction is that the information has to be in valid JSON.
Having JavaScript at the core
The SQL language that can be used to query and manipulate DocumentDB is based on JavaScript. Having JavaScript at the core means that we do not need to learn new techniques or languages, and our current knowledge of JavaScript can be applied immediately. Using JavaScript is a natural way of working with JSON. JSON parsers are perfectly capable of converting query results into variables, manipulating them, and writing them back to the database. Besides working as a client with JavaScript, the internals are also based on JavaScript. The following entities are written in JavaScript as well:
- Stored procedures (SPs): These are executed by issuing an HTTP POST request. Inside the SP, the elements of the designated document(s) are copied to ordinary JavaScript variables. The logic inside the SP then manipulates the data and when the SP finishes, the values are persisted in the document(s) again.
- User-defined functions (UDFs): The difference between UDFs and SPs is that UDFs do not manipulate databases or documents themselves. A UDF encapsulates logic or business rules that can be called from SPs or queries and can help extend the query language. A good example of a UDF is a function called calculateAge() that takes the date of birth of a person and returns their age as a value. The calculateAge() function can be used from a query returning only those persons that are older than 40 years. The query is as follows: SELECT * FROM people p WHERE calculateAge(p.dob) > 40
- Triggers: A trigger is a piece of JavaScript code, comparable to UDFs and SPs, that is invoked only in response to an event inside your database, such as a document being created or deleted. Triggers can be executed before or after the actual event happens. When a trigger fails or raises an exception, the actual operation is aborted and the transaction is not committed but rolled back. This is useful when we need to validate the incoming data to keep our documents consistent.
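As a quick illustration of the UDF described above, the logic of calculateAge() could look roughly like the following plain JavaScript function. This is only a sketch: it assumes the date of birth is stored as a DD-MM-YYYY string, the optional second parameter is added here purely so the function can be tried with a fixed reference date, and registering the function with DocumentDB is not shown. The filter at the end is a local stand-in for the query, not the way DocumentDB evaluates it.

```javascript
// Sketch of the calculateAge() logic as a plain JavaScript function.
// Assumption: dob is a "DD-MM-YYYY" string. The `today` parameter is
// hypothetical and only exists to make the function easy to try out.
function calculateAge(dob, today) {
  const now = today || new Date();
  const parts = dob.split("-").map(Number); // [day, month, year]
  const day = parts[0];
  const month = parts[1];
  const year = parts[2];

  let age = now.getFullYear() - year;

  // Subtract one year if this year's birthday has not happened yet.
  const currentMonth = now.getMonth() + 1; // getMonth() is zero-based
  const beforeBirthday =
    currentMonth < month ||
    (currentMonth === month && now.getDate() < day);
  if (beforeBirthday) {
    age -= 1;
  }
  return age;
}

// Local stand-in for:
//   SELECT * FROM people p WHERE calculateAge(p.dob) > 40
const people = [
  { firstname: "John", lastname: "Doe", dob: "01-01-1960" },
  { firstname: "Jane", lastname: "Doe", dob: "15-06-1990" }
];
const referenceDate = new Date(2015, 0, 1); // January 1, 2015
const olderThan40 = people.filter(
  (p) => calculateAge(p.dob, referenceDate) > 40
);
console.log(olderThan40.map((p) => p.firstname));
```

Because a UDF only computes a value and never touches documents itself, the same function can be exercised locally like this before it is used from a query.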
We will provide extensive examples of SPs, user-defined functions and triggers later in this book.
Indexing a document
In traditional relational databases, the DBA or developer needs to choose the (clustered) indexes. Choosing the right indexing strategy is vital for the performance and consistency of the database.
In DocumentDB, we do not need to choose the index ourselves. In fact, all information inside a document is indexed. This means that we can query on any attribute that is available inside the document. We can choose different indexing policies, but for most applications the default indexing policy will be the best trade-off between performance and storage efficiency. We can reduce storage space by excluding certain paths within the document from indexing.
The indexing process inside DocumentDB treats the documents as trees. There needs to be a top node that is the entry point for all the fields inside the document. Imagine a document containing information about a person in the following JSON representation:
{ "firstname": "John", "lastname": "Doe", "dob": "01-01-1960", "hobbies": [ { "type":"sports", "description":"soccer"}, { "type":"reading", "preferences": [ { "type":"scifi"}, { "type":"thriller"} ] } ] }
This JSON snippet describes a person, John Doe, who was born on January 1, 1960, and has two hobbies, sports and reading. His reading hobby focuses on the sci-fi and thriller genres.
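To make the tree view of a document more concrete, the following sketch walks the person document and prints one path per leaf value. It is only an illustration of treating a document as a tree: the listPaths() helper and the path notation are invented for this example and are not the format DocumentDB uses internally for its index.

```javascript
// The person document from the text, as a JavaScript object.
const person = {
  firstname: "John",
  lastname: "Doe",
  dob: "01-01-1960",
  hobbies: [
    { type: "sports", description: "soccer" },
    { type: "reading", preferences: [{ type: "scifi" }, { type: "thriller" }] }
  ]
};

// Walk the document as a tree and collect one path per leaf value.
// Objects and arrays are inner nodes; array positions become path segments.
function listPaths(node, prefix) {
  prefix = prefix || "";
  if (node === null || typeof node !== "object") {
    return [prefix + " = " + JSON.stringify(node)];
  }
  const paths = [];
  for (const key of Object.keys(node)) {
    paths.push(...listPaths(node[key], prefix + "/" + key));
  }
  return paths;
}

const personPaths = listPaths(person);
personPaths.forEach((path) => console.log(path));
```

Note that the two hobby nodes produce differently shaped paths (one has a description, the other a nested preferences array), which is exactly the kind of variation a schema-free document allows.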
A JSON document can be depicted like this:
The blue squares are nodes that are implicitly added by the system and do not influence our data model. The figure shows that documents are internally represented as trees. As you can see, the nodes that describe a hobby do not necessarily have to share the same schema. Go ahead and try to build this model in a traditional relational database system!
DocumentDB as a service
Microsoft offers DocumentDB as part of their online offerings on the Microsoft Azure platform. Their as-a-service approach enables developers to start using new technologies immediately.
Understanding performance
The performance of our DocumentDB system is determined by a performance level. Performance levels are set on a collection and not on a database. This enables fine-tuning of your environment, giving the appropriate performance boost to the right resources. Setting the performance level determines the number of so-called request units that are available. A request unit is a measure of the resources (CPU, memory) needed to perform a certain operation.
There are three performance levels:
- S1: Allows up to 250 request units per second
- S2: Allows up to 1,000 request units per second
- S3: Allows up to 2,500 request units per second
We need to choose the performance level carefully, since it comes with a price impact. We will discuss the pricing of DocumentDB later in this chapter.
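The arithmetic behind choosing a level can be sketched as follows. The request unit limits are the ones listed above; the cost of 10 request units per operation used in the example is purely an assumed number for illustration, since the real cost depends on the operation, and both helper functions are hypothetical.

```javascript
// Request unit limits per second for each performance level (from the list above).
const REQUEST_UNITS_PER_SECOND = { S1: 250, S2: 1000, S3: 2500 };

// How many operations of a given cost fit into one second at a given level.
function maxOperationsPerSecond(level, requestUnitsPerOperation) {
  const budget = REQUEST_UNITS_PER_SECOND[level];
  if (budget === undefined) {
    throw new Error("Unknown performance level: " + level);
  }
  if (!(requestUnitsPerOperation > 0)) {
    throw new Error("requestUnitsPerOperation must be a positive number");
  }
  return Math.floor(budget / requestUnitsPerOperation);
}

// Smallest level whose budget covers the expected load, or null if none does.
function smallestLevelFor(operationsPerSecond, requestUnitsPerOperation) {
  const needed = operationsPerSecond * requestUnitsPerOperation;
  for (const level of ["S1", "S2", "S3"]) {
    if (REQUEST_UNITS_PER_SECOND[level] >= needed) {
      return level;
    }
  }
  return null;
}

// Assumed cost: 10 request units per operation (illustrative only).
console.log(maxOperationsPerSecond("S1", 10));
console.log(smallestLevelFor(80, 10));
```

A null result from smallestLevelFor() means that a single collection at the highest level would not cover the assumed load.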
Handling transactions
DocumentDB also supports transactions providing Atomicity, Consistency, Isolation, Durability (ACID) guarantees. Atomicity enables all operations to be executed as a single piece of work, all being committed at once or not at all. Consistency implies that all data is in the right state across transactions. Isolation makes sure that transactions do not interfere with each other, and durability ensures that all changes that are committed to the database will always be available.
Because the JavaScript code runs under snapshot isolation scoped to the collection, SPs and triggers are executed within the same transaction scope, which enables ACID guarantees for all operations inside SPs and triggers. If an error occurs in the JavaScript logic, the transaction is automatically rolled back.
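The all-or-nothing behavior can be mimicked locally with a small sketch. This is not how DocumentDB implements transactions, and runAtomically() is a hypothetical helper: it simply lets a piece of logic work on a private copy of the documents and keeps the result only when the logic finishes without throwing.

```javascript
// Run `work` against a private copy of the documents and "commit" the copy
// only if `work` does not throw. On an error, the original documents are
// returned untouched, which mimics an automatic rollback.
function runAtomically(documents, work) {
  const workingCopy = JSON.parse(JSON.stringify(documents)); // deep copy
  try {
    work(workingCopy);
  } catch (error) {
    return { committed: false, documents: documents, error: error.message };
  }
  return { committed: true, documents: workingCopy };
}

const accounts = [
  { id: "a", balance: 100 },
  { id: "b", balance: 0 }
];

// Logic that changes two documents and fails halfway when funds are short.
function transfer(amount) {
  return function (docs) {
    docs[0].balance -= amount;
    if (docs[0].balance < 0) {
      throw new Error("Insufficient balance");
    }
    docs[1].balance += amount;
  };
}

const succeeded = runAtomically(accounts, transfer(40));
const failed = runAtomically(accounts, transfer(500));
console.log(succeeded.committed, succeeded.documents);
console.log(failed.committed, failed.documents);
```

In the failing case the first document has already been changed inside the working copy when the error is thrown, yet none of that partial work is visible afterwards.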
Common use cases
Now that we have seen a little of DocumentDB, how can we decide whether DocumentDB is applicable for our own problem scenario? In which scenarios is it a good fit and are there any trade-offs?
Building the Internet of Things
A good example of a problem domain in which DocumentDB fits is the domain of the Internet of Things (IoT). The IoT is all about ingesting, egressing, processing, and storing data (visit https://en.wikipedia.org/wiki/Internet_of_Things). It involves data flowing to and from devices, backend services processing that data or controlling devices, storage services persisting that data, or running statistical analysis or analytics on that data. Because DocumentDB can connect to HDInsight (http://azure.microsoft.com/en-us/services/hdinsight/) and Hadoop, the data can be analyzed easily.
Another good area in the IoT domain is device registration. Each and every device in the field is described inside a single document and stored in DocumentDB. These documents contain the information a device needs to take part in an IoT solution, such as the keys and endpoints it uses to communicate and to enable its ingress and egress dataflows.
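A device registration document could look something like the following sketch. Every field name, the placeholder key, and the example.invalid endpoints are assumptions made up for this illustration; the only point taken from the text is that one document per device carries its keys and endpoints.

```javascript
// Hypothetical device registration document: one document per device.
const deviceRegistration = {
  id: "device-0001",
  type: "temperature-sensor",
  keys: { primary: "<primary-key-placeholder>" },
  endpoints: {
    ingress: "https://example.invalid/ingress",
    egress: "https://example.invalid/egress"
  }
};

// Minimal check that a registration carries what a device needs to communicate.
function isValidRegistration(doc) {
  return Boolean(
    doc &&
    typeof doc.id === "string" && doc.id.length > 0 &&
    doc.keys && typeof doc.keys.primary === "string" &&
    doc.endpoints &&
    typeof doc.endpoints.ingress === "string" &&
    typeof doc.endpoints.egress === "string"
  );
}

console.log(isValidRegistration(deviceRegistration));
console.log(JSON.stringify(deviceRegistration, null, 2));
```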
Throughout this book, we will also take the IoT domain as our main example domain. Examples and code snippets will focus on this area because it demonstrates the possibilities of DocumentDB well.
Storing user profile information
Storing user profile information inside DocumentDB can be really helpful when it comes to personalized user interfaces or other preferences that can influence an application's behavior or user interface settings.
Note
JavaScript can easily interpret JSON data and is therefore an excellent candidate for describing the markup of a personalized user interface. Extending this thought, the schema-free approach of DocumentDB also makes it an excellent candidate for a CMS system.
Every user is reflected in a single document that describes all user preferences. The list of preferences can be easily extended by adding information to the document. Consider that users authenticate at an authentication service, for example, Azure Active Directory, Facebook, or Twitter, and that these services return a claim set, including a unique identifier called nameidentifier. This field is an excellent candidate for providing the unique entry point in our DocumentDB system and retrieving the user's profile information after logging in.
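The idea of one profile document per user, keyed by nameidentifier, can be sketched with an in-memory Map standing in for the database. The claim set is simplified to a plain object, and saveProfile() and loadProfile() are hypothetical helpers, not DocumentDB calls.

```javascript
// In-memory stand-in for a collection of profile documents.
const profiles = new Map();

// Create or extend the profile document that belongs to the authenticated user.
function saveProfile(claims, preferences) {
  const id = claims.nameidentifier;
  if (!id) {
    throw new Error("Claim set has no nameidentifier");
  }
  const existing = profiles.get(id) || { id: id, preferences: {} };
  const updated = {
    id: id,
    // New preferences are merged into the ones that are already stored.
    preferences: Object.assign({}, existing.preferences, preferences)
  };
  profiles.set(id, updated);
  return updated;
}

// Retrieve the profile after login, or null if the user has none yet.
function loadProfile(claims) {
  return profiles.get(claims.nameidentifier) || null;
}

const claims = { nameidentifier: "user-123" }; // simplified claim set
saveProfile(claims, { theme: "dark" });
saveProfile(claims, { language: "en" }); // extends the same document
console.log(loadProfile(claims));
```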
Logging information
A well-designed system usually emits logging information in large quantities and of different types. Each log entry is straightforward and contains information about a specific event, for example, a user logging in to the system, an exception raised by the system, or an audit trail record that needs to be persisted.
Because DocumentDB automatically indexes all documents, querying data and finding fault causes can be very quick. You can take DocumentDB information offline and store it in a datacenter for further analysis with tools like Hadoop or Power BI.
Building mobile solutions
Building and releasing mobile solutions is tough because we might have millions of customers. Using a schema-free database, it is easier to release new apps with additional data while still being able to service your old versions as well. Remember the troubles we had releasing a new schema of our SQL Server or Oracle database? Adding new tables and columns because of new features, and writing conversion scripts for every new release of the system?
By using a JSON document, we can easily add or remove information, release at a faster pace, and enable development in sprints—changing the data each sprint without the pain of conversion scripts.
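A small sketch of what this means for application code: documents written by an older app version and by a newer one can live side by side, and the reader supplies a default for anything that is missing. The documents, the locale field, and the default value are assumptions for illustration only.

```javascript
// Documents written by two different app versions, stored side by side.
const storedProfiles = [
  { id: "1", name: "Ann" },                 // written by app version 1
  { id: "2", name: "Bob", locale: "nl-NL" } // written by app version 2
];

// Read a document regardless of which version wrote it, filling in a
// default for the field that older documents do not have.
function readProfile(doc) {
  return {
    id: doc.id,
    name: doc.name,
    locale: doc.locale !== undefined ? doc.locale : "en-US"
  };
}

const normalizedProfiles = storedProfiles.map(readProfile);
console.log(normalizedProfiles);
```

No conversion script touches the stored data here; the older document stays exactly as it was written.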
Of course, the powerful scaling of DocumentDB is also a great help when building global, mobile apps servicing millions of users!