The RethinkDB data model
The RethinkDB data model consists of two main components:
- RethinkDB data types
- RethinkDB model relationship
RethinkDB data types are further classified into basic data types, dates, binary objects, and geospatial queries.
Refer to the following diagram for more details:
Let's go over to each component in detail.
RethinkDB data types
RethinkDB provides six basic data types; they are represented in tabular format, as shown in the following table:
Data type |
Description |
|
Numbers are stored using double-precision floating-point numbers. |
|
Strings are stored in UTF-8 standard. |
|
Stored as a key-value pair with standard JSON format. |
|
Stored as a list of elements. It supports 100,000 elements by default. |
|
Binary objects includes files, images, and other binary data. |
|
RethinkDB stores date and time with millisecond precision. |
|
True and false values. |
|
Stores null values. |
RethinkDB also provides extended data types supported by it. They are as follows:
- Tables: These are RethinkDB tables. You can insert and delete documents from them with proper indexing.
- Streams: Streams provide the transfer of large amounts of data in chunks so that the system won't reach the buffer overflow limit. They are loaded in a lazy fashion. Streams provide the navigation point to a chunk, called cursor, which you can use to traverse the result set. You can club all of them once the streams are collected using the cursor. This makes it easy to read large amounts of data. Streams are read-only operations.
- Selections: Selections are subsets of data, such as 10 documents out of 1,000 from a table. There are two types of selections, one with objects and one with streams. The difference between the two is that objects can be writable while streams are read-only. You can pass an object selected by, say, the
get()
command to other commands to manipulate the data. - Date and time: Date and time are stored in RethinkDB with millisecond precision, along with time zones. Currently, minute-precision time offsets from UTC are supported. RethinkDB internally calculates the time zone difference; hence you don't need to do it at the client end.
You can use native date commands supported by RethinkDB drivers such as in Node.js; you can use the Date()
object.
Binary objects
Binary objects are stored similar to BLOB in SQL databases. You can directly read and upload files, images, and so on directly into the database. Parsing and other dirty tasks will be dealt with by RethinkDB.
One of the amazing functionalities of RethinkDB is calling external APIs from RethinkDB native drivers. So, consider you want to add profile pictures of users directly into the database from Gravatar, a common place of avatars. All you need to do is call the API and convert the detail into binary. Consider the following official code snippet:
var hash = md5(email); gravatarUrl = 'http://gravatar.com/avatar/' + hash + '?d=retro'; r.table('users').get(userId).update({ gravatar: r.http(gravatarUrl, {resultFormat: 'binary'}) }).run(conn, callback)
Assume that conn
is the RethinkDB connection and we are passing the email
server of the user to get the avatar. If you notice, we are calling the gravatar
API using the http()
function and converting resultformat
into binary
.
Specifying the result format is not mandatory here; if the MIME type of the calling server is set correctly, you can just call the HTTP URL and it will be converted to the default MIME type, say binary. It's better to be on the safer side though.
Geospatial queries in RethinkDB
RethinkDB provides and supports geospatial features to help you build location-based applications. By using geospatial queries, you can easily parse, convert, and perform lots of operations, such as computing the distance between two locations, finding an intersecting location, and many more. RethinkDB stores all geographical information in the GeoJSON standard format.
RethinkDB uses a geometry object to perform geographical queries. Geometry objects are derived by plotting two-dimensional objects such as lines and points on a sphere in three-dimensional space.
RethinkDB stores information in a standard geographic system, which is addressing a point on a surface in longitude and latitude. It does not support elevation yet. The range of longitudes is -180 to 180 and the range of latitudes is -90 to 90. To store the same, we use the point()
function of RethinkDB. Here is some sample code, assuming r
is a RethinkDB object:
r.table('demo').insert([ { name : "Mumbai", location : r.point(19.0760,72.8777) } ])
Supported data types
Geospatial data types and functions are derived from three geometric object data types: points, lines, and polygons.
By using these three, RethinkDB provides various geospatial functions such as circle()
, intersect()
, getNearest()
, and so on.
RethinkDB model relationships
RethinkDB, besides being a NoSQL database, provides one of the most requested features by SQL developers, that is, JOINS. RethinkDB allows you to model and structure your data in such a way that allows you to perform JOINS over it. There are two ways to model relationships in RethinkDB:
- By using embedded arrays in the document
- By linking documents stored in multiple tables
Let's see how both of them work, along with their merits and demerits.
Embedded arrays
Embedded arrays basically means using an array of objects in the table in each document. Here is a simple example:
{ "id": "7644aaf2-9928-4231-aa68-4e65e31bf219", "name": "Shahid", "chapters": [ {"title": "Chapter 01", "pages": "30+"}, {"title": "Chapter 02", "pages": "30+"} ] }
Here, the "chapters"
key contains the array of objects, and each object contains the name of the chapter and the page count. In order to perform the query, here is how we do it in RethinkDB.
To display all chapters from table:
r.db("book").table("chapters").run()
To get all the chapters written by a particular author; we can do that by passing the author ID:
r.db("book").table("chapters").get('7644aaf2-9928-4231-aa68-4e65e31bf219').run()
Here, we are passing the ID of the document, which is a system-generated primary key ID by RethinkDB. Upon passing it, we can access a single document from the table. This is a very basic way of modelling and is not recommended for production use.
Merits of embedded arrays
The following merits are as follows:
- Easy to design and query
- Updates on a single document will automatically update chapters and authors both. Hence, both pieces of data will be atomic
Demerits of embedded arrays
As soon as the size of an embedded array exceeds a certain limit, it costs too much computing power to load the entire array every time we have to perform an operation.
Document linking in multiple tables
This is similar to the foreign key relationships we make in traditional SQL databases. Here, we link two or more tables using the index. We can also query the table using functions provided by RethinkDB to perform various joins (inner, outer, and so on):
As shown in the preceding diagram, we have two tables with multiple documents in them. The Authors table will just contain documents related to authors, say name, location, and (obviously) a primary key.
The Chapters table will contain documents of chapters; each chapter is a separated document instead of an array. The only difference is that each document will contain an index with the same value as the primary key from the Authors table. This way, we link both of them together.
As you can see in the diagram, there is a key in every document in the Chapters table named author_id
with the same value as id
in the Authors table.
Here is how we query the tables to perform JOINS:
r.db("books").table("chapters"). filter({"author_id": "abc768332"}). run()
This will return every chapter for the author having ID as abc768332
.
You can also use RethinkDB JOINS functions such as eq_join()
; we will look at the in the next section.
Merits of document linking
The following merits are:
- Data distribution and very neat design
- No need to load all of the data of chapters at once in order to perform any operation
- No matter what the size of documents in chapters table is, the performance won't be affected as in embedded arrays
Demerits of document linking
The following demerits are:
- There are no foreign key constraints, so linking of data will be complicated
- In the case of an update in one table, it won't automatically update the data in another table
Here is a list of JOIN commands provided by RethinkDB:
- eq_Join
- innerJoin
- outerJoin
- Zip
Before looking at these, let's populate some data in two tables. We have authors and chapters in the books database.
We will add author details in the authors table:
r.db('books').table('authors').insert({name : "Shahid", location : "Mumbai"})
We will be adding chapter details in the chapters table:
r.db('books').table('chapters').insert({ author_id : "521b92b1-0d83-483d-a374-b94a400cf699", chapterName : "Chapter 1" }) r.db('books').table('chapters').insert({ author_id : "521b92b1-0d83-483d-a374-b94a400cf699", chapterName : "Chapter 2" })
Performing eq_Join
Here we are joining the chapters table with the authors table and mentioning which field we have mapped to the ID, that is, author_id
:
r.db('books').table('chapters').eqJoin("author_id",r.db('books').table('authors'))
This will return the following:
[ { "left":{ "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 1", "id":"f0b5b2f7-1f82-41ef-a945-f5fa8259dd53" }, "right":{ "id":"521b92b1-0d83-483d-a374-b94a400cf699", "location":"Mumbai", "name":"Shahid" } }, { "left":{ "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 2", "id":"f58826d4-e259-4ae4-91e4-d2e3db2d9ad3" }, "right":{ "id":"521b92b1-0d83-483d-a374-b94a400cf699", "location":"Mumbai", "name":"Shahid" } } ]
The left and right keys supposedly represent the tables on the left side and the right side of the query. If you map the keys and values with the ReQL query, you can easily understand.
However, this is not how we really want our data. We want processed, result-oriented data, and to do that, we need to use the zip()
command. It basically removes all metadata information from the result and gives you only the documents of tables:
r.db('books').table('chapters').eqJoin("author_id",r.db('books').table('authors')).zip()
This returns the following:
[ { "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 1", "id":"521b92b1-0d83-483d-a374-b94a400cf699", "location":"Mumbai", "name":"Shahid" }, { "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 2", "id":"521b92b1-0d83-483d-a374-b94a400cf699", "location":"Mumbai", "name":"Shahid" } ]
Performing inner joins
An inner join, as we all know, equates two tables by comparing each row. It is very similar to working with EQJOIN, except that it compares each document in the table against the target table, while in EQJOIN, we specify which key to compare for.
Here is a simple query to perform an inner join:
r.db('books').table('chapters').innerJoin(r.db('books').table('authors'),function(chapters,authors) { return chapters }).zip()
This function takes the target table as a parameter and the callback function, which contains data of both the tables in the callback argument. If you notice, for understanding, I've named the callback parameters the same as the table name. You can perform a lot of other operations such as comparison, or filtering inside the callback function and then returning the result. Since it's a JOIN, in both the variables, data will be similar-except with different table ID's.
Here is the result for the same:
[ { "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 1", "id":"521b92b1-0d83-483d-a374-b94a400cf699", "location":"Mumbai", "name":"Shahid" }, { "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 2", "id":"521b92b1-0d83-483d-a374-b94a400cf699", "location":"Mumbai", "name":"Shahid" } ]
If you observe, the result set is pretty much the same as EQJOIN (the eq_Join()
function), except that it provides you the result of each document under the callback function. This makes it really slow, and the RethinkDB team does not recommend it for use in production.
Performing outer joins
Outer join union the result of left join and right join and returns it to the client. So, basically, the result test from both the tables will be combined and returned. Here is a sample query.
r.db('books').table('chapters').outerJoin(r.db('books').table('authors'),function(chapters,authors) { return authors }).zip()
This will combine each document of chapters
with each document of authors
and return the result. Again, we can access each document of the query here under the callback.
It shall return the following:
[ { "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 1", "id":"521b92b1-0d83-483d-a374-b94a400cf699", "location":"Mumbai", "name":"Shahid" }, { "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 2", "id":"521b92b1-0d83-483d-a374-b94a400cf699", "location":"Mumbai", "name":"Shahid" } ]
In order to check it is working, let's just create one more author in the authors table and not create any chapter document entry for it:
r.db('books').table('authors').insert({name : "RandomGuy", location : "California"})
Upon running the outer join query again, here is the result:
[ { "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 1", "id":"521b92b1-0d83-483d-a374-b94a400cf699", "location":"Mumbai", "name":"Shahid" }, { "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 2", "id":"521b92b1-0d83-483d-a374-b94a400cf699", "location":"Mumbai", "name":"Shahid" }, { "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 1", "id":"78acabb5-a5b8-434b-acb1-52507b71831d", "location":"California", "name":"RandomGuy" }, { "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", "chapterName":"Chapter 2", "id":"78acabb5-a5b8-434b-acb1-52507b71831d", "location":"California", "name":"RandomGuy" } ]
Hence we get all of the result-the union of each document of the table present in the left table (that is, authors) with each document present in the right table (that is, chapters).
Zip
This function performs the merging of left fields with right fields with a JOIN
operation into a single dataset.