In this article by Mohsin Hijazee, the author of the book Mastering Google App Engine, we will see that learning something new is hard, but unlearning something is even harder. The main reason why learning something is hard is not that it is hard in and of itself, but that most of the time, you have to unlearn a lot in order to learn a little. This is quite true for datastore. It is built to scale to the so-called Google scale, and that's why, in order to be proficient with it, you will have to unlearn some of the things that you know. Your training as a computer science student or a programmer has been so deeply shaped by the relational model that it feels natural to you. Anything else may seem quite hard to grasp, and this is the reason why learning Google datastore is quite hard.
However, if this were the only glitch, things would have been way simpler, because you could simply set the relational world aside and consider the new paradigm afresh. Things are complicated by Google's own official documentation, which presents datastore in a manner that makes it seem close to something such as Django's ORM, Rails ActiveRecord, or SQLAlchemy. Then, all of a sudden, it starts to list datastore's limitations with a very brief mention or, at times, no mention at all of why those limitations exist.
Since you only know the limitations but not why they are there in the first place, this lack of reasoning may leave you unable to work around those limitations or mold your problem space into the new solution space, which is Google datastore.
We will try to fix this. Hence, the following will be our goals in this article:
To understand BigTable and its data model
To have a look at the physical data storage in BigTable and the operations that are available in it
To understand how BigTable scales
To understand datastore and the way it models data on top of BigTable
So, there's a lot more to learn. Let's get started on our journey of exploring datastore.
The BigTable
If you decided to fetch every web page hosted on the planet, download and store a copy of it, and later process every page to extract data from it, you'd find that your own laptop or desktop is not good enough to accomplish this task. It has nowhere near enough storage to store every page. Laptops usually come with a 1 TB hard disk drive, and this seems to be quite enough for a person who is not much into video content such as movies.
Assuming that there are 2 billion websites, each with an average of 50 pages and each page weighing around 250 KB, it sums up to around 23,000+ TB (or roughly 22 petabytes), which would need 23,000 such laptops to store all the web pages with a 1 TB hard drive in each.
Assuming the same statistics, if you are able to download at a whopping speed of 100 MBps, it would take you about seven years to download the whole content to one such gigantic hard drive if you had one in your laptop.
Let's suppose that you downloaded the content in whatever time it took and stored it. Now, you need to analyze and process it too. If processing takes about 50 milliseconds per page, going through the 100 billion pages one by one would take more than 150 years. The world would have changed a lot by then, leaving your data and processed results obsolete.
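A quick back-of-the-envelope calculation makes these numbers concrete. The figures below are simply the assumptions stated above (2 billion sites, 50 pages each, 250 KB per page, a 100 MB/s link, and 50 ms of processing per page), not measurements:

# Back-of-the-envelope check of the figures above. All numbers are the
# assumptions stated in the text, not measurements.
SITES = 2 * 10**9                    # 2 billion websites
PAGES_PER_SITE = 50
PAGE_SIZE_KB = 250
DOWNLOAD_SPEED_MB_PER_SEC = 100
PROCESSING_SECONDS_PER_PAGE = 0.05   # 50 milliseconds

total_pages = SITES * PAGES_PER_SITE
total_kb = total_pages * PAGE_SIZE_KB
total_tb = total_kb / 1024.0 / 1024.0 / 1024.0       # KB -> MB -> GB -> TB

download_seconds = (total_kb / 1024.0) / DOWNLOAD_SPEED_MB_PER_SEC
processing_seconds = total_pages * PROCESSING_SECONDS_PER_PAGE

SECONDS_PER_YEAR = 365 * 24 * 3600.0
print('Storage needed: about %d TB' % total_tb)                                       # ~23,000 TB
print('Download time: about %.1f years' % (download_seconds / SECONDS_PER_YEAR))      # ~7.7 years
print('Processing time: about %.0f years' % (processing_seconds / SECONDS_PER_YEAR))  # ~158 years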
This is the kind of scale for which BigTable is built. Every Google product that you see (Search, Analytics, Finance, Gmail, Docs, Drive, and Google Maps) is built on top of BigTable. If you want to read more about BigTable, you can go through the academic paper from Google Research, which is available at http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf.
The data model
Let's examine the data model of BigTable at a logical level. BigTable is basically a key-value store. So, everything that you store falls under a unique key, just like PHP's arrays, Ruby's hashes, or Python's dicts:
# PHP
$person['name'] = 'Mohsin';
# Ruby or Python
person['name'] = 'Mohsin'
However, this is a partial picture. We will learn the details gradually in a while. So, let's understand this step by step.
A BigTable installation can have multiple tables, just like a MySQL database can have multiple tables. The difference here is that a MySQL installation might have multiple databases, which in turn might have multiple tables. However, in the case of BigTable, the first major storage unit is a table.
Each table can have hundreds of columns, which can be divided into groups called column families. You define column families at the time of creating a table. They cannot be altered later, but each column family might have hundreds of columns that you can define even after the creation of the table. The notation used to address a column within its column family is family:column, as in job:title, where job is the column family and title is the column. So here, you have a job column family that stores all the information about the user's job, and title is supposed to store the job title. However, an important fact about these columns is that there's no concept of datatypes in BigTable as you'd encounter in relational database systems. Everything is just an uninterpreted sequence of bytes, which means nothing to BigTable. What the bytes really mean is entirely up to you. They might be a very long integer, a string, or JSON-encoded data.
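To make the "everything is just bytes" idea concrete, here is a tiny, purely illustrative Python sketch (this is not BigTable's API; the row dict, the job family, and the packing scheme are all assumptions made up for this example) of what a single row's cells could look like: every cell is addressed as family:column, and every value is nothing but a string of bytes.

import json
import struct

# One row, keyed by its row key. Every cell is addressed as 'family:column',
# and every value is just a string of bytes; BigTable attaches no type to it.
row_key = b'Mohsin'
row = {
    b'personal:lastname': b'Hijazee',
    b'personal:age': struct.pack('>q', 29),        # an integer packed into 8 bytes
    b'job:title': b'Senior Designer',
    b'job:history': json.dumps(['Designer', 'Senior Designer']).encode('utf-8'),
}

# Interpreting those bytes is entirely up to the reader of the data.
print(struct.unpack('>q', row[b'personal:age'])[0])   # 29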
Now, let's turn our attention to the rows. There are two major characteristics of the rows that we are concerned about. First, each row has a key, which must be unique. The contents of the key again consist of an uninterpreted string of bytes that is up to 64 KB in length. A key can be anything that you want it to be. All that's required is that it must be unique within the table; if it is not, you will overwrite the contents of the existing row that has the same key.
Which key should you use for a row in your table? That's a question that requires some consideration. To answer it, you need to understand how the data is actually stored. Until then, you can assume that each key has to be a unique string of bytes within the scope of a table and can be up to 64 KB in length.
Now that we know about tables, column families, columns, rows, and row keys, let's look at an example of a BigTable table that stores employees' information. Let's pretend that we are creating something similar to LinkedIn here. So, here's the table:
| Key (name) | personal:lastname | personal:age | professional:company | professional:designation |
|---|---|---|---|---|
| Mohsin | Hijazee | 29 | Sony | Senior Designer |
| Peter | Smith | 34 | Panasonic | General Manager |
| Kim | Yong | 32 | Sony | Director |
| Ricky | Martin | 45 | Panasonic | CTO |
| Paul | Jefferson | 39 | LG | Sales Head |
So, this is a sample BigTable table. The first column is the name, and we have chosen it as the key. It is of course not a good key, because a first name cannot be guaranteed to be unique, even in small groups, let alone in millions of records. However, for the sake of this example, we will assume that the name is unique. Another reason for assuming the name's uniqueness is that we want to build up our understanding gradually. So, the key point here is that we picked the first name as the row's key for now, but we will improve on this as we learn more.
Next, we have two column families. The personal column family holds all the personal attributes of the employees, and the other column family, named professional, has all the attributes pertaining to the professional aspects. When referring to a column within a family, the notation is family:column. So, personal:age contains the age of the employee.
If you look at professional:designation and personal:age, it seems that the first one's contents are strings, while the second one stores integers. That's not the case. No column stores anything but plain bytes, without any distinction of what they mean. The meaning and interpretation of those bytes is up to the user of the data. From BigTable's point of view, each column just contains plain old bytes. Another thing that is drastically different from an RDBMS such as MySQL is that each row need not have the same number of columns. Each row can adopt the layout that it wants. So, the second row's personal column family can have two more columns that store gender and nationality.
For this particular example, the data is in no particular order, and I wrote it down as it came to my mind. Hence, there's no order of any sort in the data at all.
To summarize, BigTable is a key-value storage where keys should be unique and have a length that is less than or equal to 64 KB. The columns are divided into column families, which can be created at the time of defining the table, but each column family might have hundreds of columns created as and when needed. Also, contents have no data type and comprise just plain old bytes.
There's one minor detail left, which is not important for our purpose. However, for the sake of the completeness of BigTable's data model, I will mention it now. Each column value is stored with a timestamp that is accurate to the microsecond, and in this way, multiple versions of a column value are available. The number of recent versions that should be kept is configurable at the table level, but since we are not going to deal with BigTable directly, this detail is not important to us.
How is data stored?
Now that we know about row keys, column families, and columns, we will gradually move towards examining this data model in detail and understand how the data is actually stored. We will examine the logical storage and then dive into the actual structure, as it ends up on the disk.
The data that we presented in the earlier table had no order and was listed as it came to my mind. However, while storing, the data is always sorted by the row key. So now, the data will actually be stored like this:
| Key (name) | personal:lastname | personal:age | professional:company | professional:designation |
|---|---|---|---|---|
| Kim | Yong | 32 | Sony | Director |
| Mohsin | Hijazee | 29 | Sony | Senior Designer |
| Paul | Jefferson | 39 | LG | Sales Head |
| Peter | Smith | 34 | Panasonic | General Manager |
| Ricky | Martin | 45 | Panasonic | CTO |
OK, so what happened here? The name column is the key of the table, and now the whole table is sorted by the key. That's exactly how it is stored on the disk as well. An important thing to note is that the sorting is lexicographic and not semantic. By lexicographic, we mean that rows are sorted by the byte values of their keys and not by their textual or semantic order. This matters because even within the Latin character set, different languages have different sort orders for letters, such as letters in English versus German and French. However, none of this, nor the Unicode collation order, applies here. The data is just sorted by byte values. In our instance, since K has a lower byte value than M, Kim comes first. Now, suppose that some European language sorts M before K. That's not how the data would be laid out here, because this is a plain, blind, simple sort by byte value, with no regard for the semantic value. In fact, for BigTable, this is not even text. It's just a plain string of bytes.
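A two-line experiment in plain Python illustrates what sorting by byte value means; the keys here are just the names from our sample table:

# Keys are compared byte by byte; no locale or collation rules are involved.
keys = [b'Mohsin', b'Kim', b'Peter', b'Ricky', b'Paul']
print(sorted(keys))    # [b'Kim', b'Mohsin', b'Paul', b'Peter', b'Ricky']

# Byte-wise ordering also means that uppercase and lowercase sort apart:
print(sorted([b'apple', b'Zebra']))   # [b'Zebra', b'apple'], since 'Z' (0x5A) < 'a' (0x61)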
Just a hint. This order of keys is something that we will exploit when modeling data. How? We'll see later.
The physical storage
Now that we understand the logical data model and how it is organized, it's time to take a closer look at how this data is actually stored on the disk. On a physical disk, the stored data is sorted by the key. So, key 1 is followed by its respective value, key 2 is followed by its respective value, and so on. At the end of the file, there's a sorted list of just the keys and their offsets in the file from the start, which is something like the block to the right:
Ignore the block on your left that is labeled Index. We will come back to it in a while. This particular format actually has a name, SSTable (Sorted String Table), because it contains strings (the keys) and they are sorted. It is of course tabular data, and hence the name.
Whenever your data is sorted, you have certain advantages, the first and foremost being that lookups for a single item or a range of items are efficient. We will discuss this in detail later in this article. Now, if we start from the beginning of the file and read sequentially, noting down every key and its offset in a format such as key:offset, we effectively create an index of the whole file in a single scan. That's where the first block to your left in the preceding diagram comes from. Since the keys in the file are sorted, we simply read it sequentially till the end of the file, effectively creating an index of the data. Furthermore, since this index contains only the keys and their offsets in the file, it is much smaller in terms of the space it occupies.
Now, assuming that we have an SSTable that is, say, 500 MB in size, we only need to load the index from the end of the file into memory, and whenever we are asked for a key or a range of keys, we just search within the in-memory index (thus not touching the disk at all). Only if we find the key do we seek the disk at the given offset, because we know the offset of that particular key from the index that we loaded into memory.
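The following is a minimal sketch of this layout, not BigTable's real file format: it writes sorted key-value pairs to a file, appends a key-to-offset index plus a small footer at the end, and then serves lookups by loading only the index into memory and seeking to the recorded offset. The file name, the tab-separated record format, and the JSON index are all assumptions made for this illustration.

import json

def write_sstable(path, data):
    # Write a toy SSTable: sorted key/value records, then a JSON index of
    # key -> offset, then a fixed-width footer telling where the index starts.
    index = {}
    with open(path, 'wb') as f:
        for key in sorted(data):
            index[key] = f.tell()
            f.write(('%s\t%s\n' % (key, data[key])).encode('utf-8'))
        index_offset = f.tell()
        f.write(json.dumps(index).encode('utf-8'))
        f.write(('\n%020d' % index_offset).encode('utf-8'))

def load_index(path):
    # Read only the footer and the index; the data itself stays on disk.
    with open(path, 'rb') as f:
        f.seek(-20, 2)                       # the footer is the last 20 bytes
        index_offset = int(f.read(20))
        f.seek(index_offset)
        raw = f.read()
    return json.loads(raw.rsplit(b'\n', 1)[0].decode('utf-8'))

def get(path, index, key):
    # A single seek straight to the record's offset; no scanning of the file.
    if key not in index:
        return None
    with open(path, 'rb') as f:
        f.seek(index[key])
        return f.readline().decode('utf-8').rstrip('\n').split('\t', 1)[1]

write_sstable('people.sst', {'Kim': 'Director', 'Mohsin': 'Senior Designer',
                             'Peter': 'General Manager'})
index = load_index('people.sst')
print(get('people.sst', index, 'Mohsin'))    # Senior Designer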
Some limitations
Pretty smart, neat, and elegant, you would say! Yes, it is. However, there's a catch. If you want to insert a new row, its key must go into its proper place in the sorted order, and even if you know exactly where in the file this key should be placed, you still need to rewrite the whole file, along with the index, in the new sorted order. Hence, a large amount of I/O is required for just a single row insertion.
The same goes for deleting a row, because now the file should be sorted and rewritten again. Updates are OK as long as the key itself is not altered, because altering the key amounts to having a new key altogether. A modified key would have a different place in the sorted order, depending on what the key actually is, and hence the whole file would have to be rewritten. For example, say you have a row with the key all-boys, and then you change the key of that row to x-rays-of-zebra. After this modification, the row will end up near the end of the file, whereas previously, it was probably near the beginning, because all-boys comes before x-rays-of-zebra when sorted.
This seems pretty limiting, and it looks like inserting or removing a key is quite expensive. However, this is not the case, as we will see later.
Random writes and deletion
There's one last thing that's worth a mention before we examine the operations that are available on a BigTable. We'd like to examine how random writes and the deletion of rows are handled because that seems quite expensive, as we just examined in the preceding section.
The idea is very simple. All the reads, writes, and removals don't go straight to the disk. Instead, an in-memory SSTable is created along with its index, both of which are empty when created. We'll call it the MemTable from this point onwards for the sake of simplicity. Every read checks the index of this table first, and if the record is found there, well and good. If it is not, the index of the SSTable on the disk is checked and the desired row is returned.
When a new row has to be written, we don't look at anything on disk and simply enter the row in the MemTable, along with its record in the MemTable's index.
To delete a key, we simply mark it as deleted in memory, regardless of whether it lives in the MemTable or in the on-disk table.
Now, each time the MemTable grows beyond a configured size, it is flushed to the disk as a new SSTable. Since this depends only on the size of the MemTable and happens much less frequently than individual writes, it is much faster. However, the index of each flushed SSTable is still kept in memory so that we can quickly serve incoming read requests and locate a row in any table without touching the disk.
Finally, when the number of SSTables reaches a certain count, the SSTables are merged and collapsed into a single SSTable. Since each SSTable is just a sorted set of keys, a merge sort is applied. This merging process is quite fast.
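Here is a compact, illustrative sketch of this write path; it assumes nothing about BigTable's real code and uses a plain dict as the MemTable, lists of sorted pairs as the flushed SSTables, and a tombstone marker for deletions. The flush threshold and the sample data are made up for the demo.

import heapq

TOMBSTONE = '__deleted__'     # marker value for deleted keys
MEMTABLE_LIMIT = 3            # flush after this many entries (tiny, for the demo)

memtable = {}                 # the in-memory table
sstables = []                 # each flushed run: a list of (key, value) pairs sorted by key

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def delete(key):
    put(key, TOMBSTONE)       # a deletion is just a write of a tombstone marker

def flush():
    # Sorting happens only here, over a small in-memory batch.
    sstables.append(sorted(memtable.items()))
    memtable.clear()

def compact():
    # k-way merge of all sorted runs; for duplicate keys the newest run wins,
    # and tombstoned keys are dropped from the rewritten table.
    runs = [[(key, i, value) for key, value in run] for i, run in enumerate(sstables)]
    merged = {}
    for key, _, value in heapq.merge(*runs):
        merged[key] = value   # equal keys arrive oldest-run first, so newer values overwrite
    sstables[:] = [sorted((k, v) for k, v in merged.items() if v != TOMBSTONE)]

put('Mohsin', 'Senior Designer')
put('Kim', 'Director')
put('Peter', 'General Manager')   # triggers a flush to a new "SSTable"
delete('Kim')
put('Ricky', 'CTO')
put('Paul', 'Sales Head')         # triggers a second flush
compact()
print(sstables[0])   # Kim is gone; Mohsin, Paul, Peter, and Ricky remain, sorted by key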
Congratulations! You've just learned the most atomic storage unit in Big Data solutions such as BigTable, HBase, Hypertable, Cassandra, and LevelDB. That's how they actually store and process the data.
Now that we know how a BigTable table is actually stored on the disk and how reads and writes are handled, it's time to take a closer look at the available operations.
Operations on BigTable
Until this point, we know that a BigTable table is a collection of rows that have unique keys up to 64 KB in length, and that the data is stored according to the lexicographic sort order of the keys. We also examined how it is laid out on the disk and how reads, writes, and removals are handled. Now, the question is, which operations are available on this data? The following are the operations that are available to us:
Fetching a row by using its key
Inserting a new key
Deleting a row
Updating a row
Reading a range of rows from the starting row key to the ending row key
Reading
Now, the first operation is pretty simple. You have a key, and you want the associated row. Since the whole dataset is sorted by key, all we need to do is perform a binary search on it, and we'll be able to locate the desired row within a few lookups, even in a set of millions of rows. In practice, the index at the end of the SSTable is loaded into memory, and the binary search is actually performed on it. In light of what we know from the previous section, the MemTable's index is already in memory. In case there are multiple SSTables, because the MemTable was flushed many times to the disk as it grew too large, all the indexes of all the SSTables are present in memory, and a quick binary search is performed on them.
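In code, that lookup is just a binary search over the sorted, in-memory key list. Here is a small sketch using Python's bisect module; the keys are from our sample table, and the offsets are made-up numbers standing in for positions in the SSTable file:

import bisect

# The in-memory index: sorted keys and, in a parallel list, their file offsets.
keys = ['Kim', 'Mohsin', 'Paul', 'Peter', 'Ricky']
offsets = [0, 37, 81, 120, 163]          # made-up offsets into the SSTable file

def lookup(key):
    i = bisect.bisect_left(keys, key)    # O(log n), even for millions of keys
    if i < len(keys) and keys[i] == key:
        return offsets[i]                # seek the SSTable file to this offset
    return None                          # the key is not in this SSTable

print(lookup('Paul'))    # 81
print(lookup('Alice'))   # None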
Writing
The second operation that is available to us is the ability to insert a new row. So, we have a key and the values that we want to insert in the table. With our new knowledge about physical storage and SSTables, we can understand this very well. The write happens directly on the in-memory MemTable, and its index, which is also in memory, is updated. Since no disk access is required to write the row, and the whole file doesn't have to be rewritten on disk, this operation is very fast and almost instantaneous. However, if the MemTable grows in size, it will be flushed to the disk as a new SSTable, along with its index, while retaining a copy of the index in memory. Finally, as we also saw, when the number of SSTables reaches a certain count, they are merged and collapsed to form a new, bigger table.
Deleting
It seems that since all the keys are in a sorted order on the disk and deleting a key would mean disrupting the sort order, a rewrite of the whole file would be a big I/O overhead. However, it is not, as it can be handled smartly. Since all the indexes, including the MemTable and the tables that were the result of flushing a larger MemTable to the disk, are already in the memory, deleting a row only requires us to find the required key in the in-memory indexes and mark it as deleted.
Now, whenever someone tries to read the row, the in-memory indexes will be checked, and although an entry will be there, it will be marked as deleted and won't be returned. When MemTable is being flushed to the disk or multiple tables are being collapsed, this key and the associated row will be excluded in the write process. Hence, they are totally gone from the storage.
Updating
Updating a row is no different, but there are two cases. The first case is the one in which not only the values, but also the key is modified. In this case, it is like removing the row with the old key and inserting a row with the new key. We have already seen both of these operations in detail, so what happens should be obvious.
However, the case where only the values are modified is even simpler. We only have to locate the row from the indexes, load it into memory if it is not already there, and modify it. That's all.
Scanning a range
This last operation is quite interesting. You can scan a range of keys from a starting key to an ending key. For instance, you can return all the rows that have a key greater than or equal to key1 and less than or equal to key2, effectively forming a range. Since looking up a single key is a fast operation, we only have to locate the first key of the range. Then, we start reading the consecutive keys one after the other till we encounter a key that is greater than key2, at which point we stop scanning, and the keys that we scanned so far are our query's result. This is what it looks like:
| Name | Department | Company |
|---|---|---|
| Chris Harris | Research & Development | Google |
| Christopher Graham | Research & Development | LG |
| Debra Lee | Accounting | Sony |
| Ernest Morrison | Accounting | Apple |
| Fred Black | Research & Development | Sony |
| Janice Young | Research & Development | Google |
| Jennifer Sims | Research & Development | Panasonic |
| Joyce Garrett | Human Resources | Apple |
| Joyce Robinson | Research & Development | Apple |
| Judy Bishop | Human Resources | Google |
| Kathryn Crawford | Human Resources | Google |
| Kelly Bailey | Research & Development | LG |
| Lori Tucker | Human Resources | Sony |
| Nancy Campbell | Accounting | Sony |
| Nicole Martinez | Research & Development | LG |
| Norma Miller | Human Resources | Sony |
| Patrick Ward | Research & Development | Sony |
| Paula Harvey | Research & Development | LG |
| Stephanie Chavez | Accounting | Sony |
| Stephanie Mccoy | Human Resources | Panasonic |
In the preceding table, we said that the starting key will be greater than or equal to Ernest and the ending key will be less than or equal to Kathryn. So, we locate the first key that is greater than or equal to Ernest, which happens to be Ernest Morrison. Then, we start scanning further, picking and returning each key as long as it is less than or equal to Kathryn. Judy Bishop still satisfies this, so it is returned, but the next key, Kathryn Crawford, is greater than Kathryn, so scanning stops and that row is not returned. However, all the rows before it, starting from Ernest Morrison, are returned. This is the last operation that is available to us on BigTable.
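A range scan is therefore just the same binary search for the starting key, followed by a sequential walk. Here is a small illustrative sketch with the bisect module, using a slice of the names from the preceding table:

import bisect

# Row keys, already sorted, as they would sit in the SSTable.
keys = ['Chris Harris', 'Christopher Graham', 'Debra Lee', 'Ernest Morrison',
        'Fred Black', 'Janice Young', 'Jennifer Sims', 'Joyce Garrett',
        'Joyce Robinson', 'Judy Bishop', 'Kathryn Crawford', 'Kelly Bailey']

def scan(start, end):
    # Return every key k with start <= k <= end.
    i = bisect.bisect_left(keys, start)      # locate the first key >= start
    result = []
    while i < len(keys) and keys[i] <= end:  # walk forward until a key exceeds end
        result.append(keys[i])
        i += 1
    return result

print(scan('Ernest', 'Kathryn'))
# ['Ernest Morrison', 'Fred Black', 'Janice Young', 'Jennifer Sims',
#  'Joyce Garrett', 'Joyce Robinson', 'Judy Bishop']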
Selecting a key
Now that we have examined the data model and the storage layout, we are in a better position to talk about key selection for a table. Since the stored data is sorted by the key, the choice of key does not much affect writing, deleting, updating, or fetching a single row. However, the operation that is affected by the key is scanning a range.
Let's think about the previous table again and assume that it is part of some system that processes payrolls for companies, and the companies pay us for the task of processing their payroll. Now, let's suppose that Sony asks us to process their data and generate a payroll for them. Right now, we cannot do anything of this kind efficiently. We can only make our program scan the whole table, and hence all the records (which might be in millions), and pick only the records where job:company has the value Sony. This would be inefficient.
Instead, what we can do is put the sorted nature of row keys to our service. Select the company name as the key and concatenate the department and name along with it. So, the new table will look like this:
| Key | Name | Department | Company |
|---|---|---|---|
| Apple-Accounting-Ernest Morrison | Ernest Morrison | Accounting | Apple |
| Apple-Human Resources-Joyce Garrett | Joyce Garrett | Human Resources | Apple |
| Apple-Research & Development-Joyce Robinson | Joyce Robinson | Research & Development | Apple |
| Google-Human Resources-Judy Bishop | Judy Bishop | Human Resources | Google |
| Google-Human Resources-Kathryn Crawford | Kathryn Crawford | Human Resources | Google |
| Google-Research & Development-Chris Harris | Chris Harris | Research & Development | Google |
| Google-Research & Development-Janice Young | Janice Young | Research & Development | Google |
| LG-Research & Development-Christopher Graham | Christopher Graham | Research & Development | LG |
| LG-Research & Development-Kelly Bailey | Kelly Bailey | Research & Development | LG |
| LG-Research & Development-Nicole Martinez | Nicole Martinez | Research & Development | LG |
| LG-Research & Development-Paula Harvey | Paula Harvey | Research & Development | LG |
| Panasonic-Human Resources-Stephanie Mccoy | Stephanie Mccoy | Human Resources | Panasonic |
| Panasonic-Research & Development-Jennifer Sims | Jennifer Sims | Research & Development | Panasonic |
| Sony-Accounting-Debra Lee | Debra Lee | Accounting | Sony |
| Sony-Accounting-Nancy Campbell | Nancy Campbell | Accounting | Sony |
| Sony-Accounting-Stephanie Chavez | Stephanie Chavez | Accounting | Sony |
| Sony-Human Resources-Lori Tucker | Lori Tucker | Human Resources | Sony |
| Sony-Human Resources-Norma Miller | Norma Miller | Human Resources | Sony |
| Sony-Research & Development-Fred Black | Fred Black | Research & Development | Sony |
| Sony-Research & Development-Patrick Ward | Patrick Ward | Research & Development | Sony |
So, this is the new format. We just welded the company, department, and name together as the key, and since the table is always sorted by the key, it looks like the preceding table. Now, suppose that we receive a request from Google to process their data. All we have to do is perform a scan, starting from the keys greater than or equal to Google and less than L, because that's the next letter. This scan covers exactly the four Google rows in the previous table.
Now, the next request is more specific. Sony asks us to process their data, but only for their accounting department. How do we do that? Quite simply, in this case, our starting key will be greater than or equal to Sony-Accounting, and the ending key can be Sony-Accountinga, where the a is appended to mark the end of the range. This scan returns the three Sony accounting rows in the previous table.
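Both of these requests can be served by the same kind of range scan once the keys are composed as company-department-name. The following sketch only illustrates the key design idea discussed above, using a subset of the rows from the preceding table:

import bisect

# Composite row keys: company-department-name, stored in sorted order.
keys = sorted([
    'Apple-Accounting-Ernest Morrison',
    'Google-Human Resources-Judy Bishop',
    'Google-Human Resources-Kathryn Crawford',
    'Google-Research & Development-Chris Harris',
    'Google-Research & Development-Janice Young',
    'Sony-Accounting-Debra Lee',
    'Sony-Accounting-Nancy Campbell',
    'Sony-Accounting-Stephanie Chavez',
    'Sony-Human Resources-Lori Tucker',
])

def scan(start, end):
    i = bisect.bisect_left(keys, start)
    result = []
    while i < len(keys) and keys[i] <= end:
        result.append(keys[i])
        i += 1
    return result

# All Google rows: everything from 'Google' up to 'L', the next letter.
print(scan('Google', 'L'))

# Only Sony's accounting department, using the 'Sony-Accountinga' trick from the text.
print(scan('Sony-Accounting', 'Sony-Accountinga'))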
BigTable – a hands-on approach
Okay, enough of the theory. It is now time to take a break and perform some hands-on experimentation. By now, we know about 80 percent of BigTable; the other 20 percent of the complexity lies in scaling it to more than one machine. Our discussion so far assumed and focused on a single-machine environment, as if the BigTable table were sitting on our laptop, and that's about it.
You might really want to experiment with what you learned. Fortunately, given that you have the latest version of Google Chrome or Mozilla Firefox, that's easy. You have BigTable right there! How? Let me explain.
Basically, the ideas that we looked at, the sorted key-value storage, the sorted layout, the indexes of the sorted files, and all the operations performed on them, including scanning, were extracted by Google into a separate component called LevelDB. Meanwhile, as HTML was evolving towards HTML5, a need was felt to store data locally. Initially, SQLite3 was embedded in browsers, and there was a querying interface for you to play with. So all in all, you had an SQL database in the browser, which opened up a lot of possibilities. However, in recent years, the W3C deprecated this specification and urged browser vendors not to implement it.
Instead of web databases based on SQLite3, browsers now offer IndexedDB, which in Chrome is backed by LevelDB and is actually a key-value store where storage is always sorted by key. Hence, besides looking up a key, you can scan across a range of keys.
Covering the IndexedDB API here would be beyond the scope of this book, but if you want to understand it and find out what the theory that we talked about looks like in practice, you can try using IndexedDB in your browser by visiting http://code.tutsplus.com/tutorials/working-with-indexeddb--net-34673.
The concepts of keys and scanning key ranges are exactly like those we examined here for BigTable, while the index-related concepts correspond mainly to what we will examine in a later section about datastore.
Scaling BigTable to BigData
By now, you have probably understood the data model of BigTable, how it is laid out on the disk, and the advantages it offers. To recap once again, a BigTable installation may have many tables, each table may have many column families that are defined at the time of creating the table, and each column family may have many columns, as required. Rows are identified by keys, which have a maximum length of 64 KB, and the stored data is sorted by the key. We can retrieve, update, and delete a single row. We can also scan a range of rows from a starting key to an ending key.
So now, the question comes, how does this scale? We will provide a very high-level overview, neglecting the micro details to keep things simple and build a mental model that is useful to us as the consumers of BigTable, as we're not supposed to clone BigTable's implementation after all.
As we saw earlier, the basic storage unit in BigTable is a file format called SSTable that stores key-value pairs sorted by key and has an index at its end. We also examined how reads, writes, and deletes work on an in-memory copy of the table, which is merged periodically with the tables present on the disk. Lastly, we also mentioned that the in-memory table is flushed to the disk as new SSTables, and when these reach a certain configurable count, they are merged into a bigger table.
The view so far presents the data model, its physical layout, and how operations work on it when the data resides on a single machine, such as a situation where your laptop holds a telephone directory of the whole of Europe.
However, how does that work at larger scales? Neglecting the minor implementation details and complexities that arise in distributed systems, the overall architecture and working principles are simple. In the case of a single machine, there's only one SSTable file (or a few, in case they are not yet merged into one) that has to be taken care of, and all the operations are performed on it. However, if this file does not fit on a single machine, we will of course have to add another machine, and half of the SSTable will reside on one machine, while the other half will be on the other machine.
This split would of course mean that each machine would have a range of keys. For instance, if we have 1 million keys (that look like key1, key2, key3, and so on; in practice, such keys would be zero-padded so that their lexicographic order matches their numeric order), then the keys from key1 to key500000 might be on one machine, while the keys from key500001 to key1000000 will be on the second machine. So, we can say that each machine has a different key range for the same table. Now, although the data resides on two different machines, it is of course a single table that sprawls over the two machines. These partitions, or separate parts, are called tablets. Let's see the key allocation on the two machines:
We will keep this system to only two machines and 1 million rows for the sake of discussion, but there may be cases where there are about 20 billion keys sprawling over some 12,000 machines, with each machine having a different range of keys. However, let's continue with this small cluster consisting of only two nodes.
Now, the problem is this: as an external user who has no knowledge of which machine holds which portion of the SSTable (and hence, the key ranges on each machine), how can a key, say, key489087, be located?
For this, we will have to add something like a telephone directory, where we look up the table name and our desired key and find out which machine we should contact to get the data associated with that key. So, we are going to add another node, which will be called the master. This master will again contain a simple, plain SSTable, which is familiar to us. However, its key-value pairs are very interesting ones. Since this table contains data about the other BigTable tables, let's call it the METADATA table. In the METADATA table, we will adopt the following format for the keys:
tablename_ending-row-key
Since we have only two machines and each machine holds one tablet, the METADATA table will look like this:
| Key | Value |
|---|---|
| employees_key500000 | 192.168.0.2 |
| employees_key1000000 | 192.168.0.3 |
The master stores the location of each tablet against a row key that is an encoding of the table name and the ending row key of the tablet; to locate a tablet, this METADATA table has to be scanned. The master assigns tablets to different machines as required. Each tablet is about 100 MB to 200 MB in size. So, if we want to fetch a key, all we need to know is the following:
Location of the master server
Table in which we are looking for the key
The key itself
Now, we will concatenate the table name with the key and perform a scan on the METADATA table on the master node. Let's suppose that we are looking for key600000 in the employees table. So, we would actually be looking for the employees_key600000 key in the table on the master machine. As you are familiar with the scan operation on an SSTable (and METADATA is just an SSTable), we are looking for a key that is greater than or equal to employees_key600000, which happens to be employees_key1000000. From this lookup, the key that we get is employees_key1000000, against which the IP address 192.168.0.3 is listed. This means that this is the machine we should connect to in order to fetch our data. We used the word keys and not key because this is a range scan operation, which will be clearer with another example.
Let's suppose that we want to process rows with keys starting from key400000 to key800000. Now, if you look at the distribution of data across the machines, you'll see that half of the required range is on one machine, while the other half is on the other. In this case, when we consult the METADATA table, two rows will be returned to us, because key400000 is less than key500000 (which is the ending row key for the data on the first machine) and key800000 is less than key1000000, which is the ending row key for the data on the second machine. So, with these two rows returned, we have two locations to fetch our data from.
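To make this concrete, here is an illustrative sketch of how a client-side library could resolve which tablet servers to contact for a single key or a key range by scanning the METADATA rows. The IP addresses come from the METADATA table above; the zero-padded keys and the function itself are assumptions made for this sketch so that lexicographic order matches numeric order:

import bisect

# METADATA rows on the master: '<table>_<ending row key of the tablet>' -> tablet server.
metadata = [
    ('employees_key0500000', '192.168.0.2'),
    ('employees_key1000000', '192.168.0.3'),
]
meta_keys = [row_key for row_key, _ in metadata]

def tablets_for_range(table, start_key, end_key):
    # Return the servers holding tablets that overlap [start_key, end_key].
    lo = bisect.bisect_left(meta_keys, '%s_%s' % (table, start_key))
    hi = bisect.bisect_left(meta_keys, '%s_%s' % (table, end_key))
    return [server for _, server in metadata[lo:hi + 1]]

# A single key lives on exactly one tablet server.
print(tablets_for_range('employees', 'key0600000', 'key0600000'))   # ['192.168.0.3']

# A range that spans the split point is served by both machines, possibly in parallel.
print(tablets_for_range('employees', 'key0400000', 'key0800000'))   # ['192.168.0.2', '192.168.0.3']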
This leads to an interesting side effect. As the data resides on two different machines, it can be read or processed in parallel, which leads to improved system performance. This is one reason why, even with larger datasets, the performance of BigTable won't deteriorate as badly as it would have if all the data were on a single, large machine.
The datastore thyself
So far, everything that we talked about was about BigTable, and we did not mention datastore at all. Now is the time to look at datastore in detail, because we understand BigTable quite well by now. Datastore is effectively a solution that was built on top of BigTable as a persistent NoSQL layer for Google App Engine.
Although a BigTable installation might have many different tables, the data for all the applications is stored in six separate tables, where each table stores a different aspect of, or piece of information about, the data. Don't worry about memorizing things about data modeling and how to use it for now, as this is something that we are going to look into in greater detail later.
The fundamental unit of storage in datastore is called a property. You can think of a property as a column. So, a property has a name and a type. You can group multiple properties into a Kind, which effectively is a Python class, analogous to a table in the RDBMS world. Here's a pseudo code sample:
# 1. Define our Kind and what it looks like.
class Person(object):
    name = StringProperty()
    age = IntegerProperty()

# 2. Create entities of the Person kind
ali = Person(name='Ali', age=24)
bob = Person(name='Bob', age=34)
david = Person(name='David', age=44)
zain = Person(name='Zain', age=54)

# 3. Save them
ali.put()
bob.put()
david.put()
zain.put()
This looks a lot like an ORM such as Django's ORM, SQLAlchemy, or Rails ActiveRecord. So, the Person class is called a Kind in App Engine's terminology. The StringProperty and IntegerProperty property classes indicate the type of data that is supposed to be stored. We created instances of the Person class, such as ali. Each such instance is called an entity in App Engine's terminology. Each entity, when stored, has a key that is not only unique throughout your application but, combined with your application ID, becomes unique throughout all the applications that are hosted on Google App Engine.
All entities of all kinds for all apps are stored in a single BigTable, and they are stored in a way where all the property values are serialized and stored in a single BigTable column. Hence, no separate columns are defined for each property.
This is interesting, and required as well, because as Google App Engine's architects, we would not know what kind of data people are going to store or the number and types of properties that they would define, so it makes sense to serialize the whole thing and store it in one column.
So, this is what it looks like:
| Key | Kind | Data |
|---|---|---|
| agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM | Person | {name: 'Ali', age: 24} |
| agtkZXZ-bWdhZS0wMXIPCxNTVVyc29uIgNBbGkM | Person | {name: 'Bob', age: 34} |
| agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIgNBbBQM | Person | {name: 'David', age: 44} |
| agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIRJ3bGkM | Person | {name: 'Zain', age: 54} |
The key appears to be random, but it is not. A key is formed by concatenating your application ID, your Kind name (Person here), and either a unique identifier that is auto generated by Google App Engine, or a string that is supplied by you.
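Purely as an illustration of these two ideas (all properties serialized into one column, and the key built from the application ID, the Kind name, and an ID), here is a sketch; the real datastore serializes entities as protocol buffers and encodes its keys rather than joining plain strings, so everything below is simplified:

import json

# One BigTable table holds every entity of every kind of every application.
# All of an entity's properties are serialized together into a single column.
entities = {}

def put_entity(app_id, kind, entity_id, properties):
    # Simplified key: the real datastore encodes this rather than joining strings.
    key = '%s-%s-%s' % (app_id, kind, entity_id)
    entities[key] = json.dumps(properties)          # one opaque blob per entity
    return key

key = put_entity('myapp', 'Person', 42, {'name': 'Ali', 'age': 24})
print(key)                          # myapp-Person-42
print(json.loads(entities[key]))    # the deserialized properties of the entity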
The key seems cryptic, but it is not safe to pass around in public, as someone might decode it and take advantage of it. Basically, it is just base64 encoded and can easily be decoded to reveal the entity's Kind name and ID. A better way would be to encrypt it using a secret key before passing it around in public and, on receiving it back, decrypt it using the same key. A gist that serves this purpose is available on GitHub at https://gist.github.com/mohsinhijazee/07cdfc2826a565b50a68. However, for it to work, you need to edit your app.yaml file so that it includes the following:
libraries:
- name: pycrypto
  version: latest
Then, you can call the encrypt() method on the key while passing around and decrypt it back using the decrypt() method, as follows:
person = Person(name='peter', age=10)
key = person.put()
url_safe_key = key.urlsafe()
safe_to_pass_around = encrypt(SECRET_KEY, url_safe_key)
Now, when you have a key from the outside, you should first decrypt it and then use it, as follows:
key_from_outside = request.params.get('key')
url_safe_key = decrypt(SECRET_KEY, key_from_outside)
key = ndb.Key(urlsafe=url_safe_key)
person = key.get()
The key object is now good for use. To summarize, just get the URL safe key by calling the ndb.Key.urlsafe() method and encrypt it so that it can be passed around. On return, just do the reverse.
If you really want to see how the encrypt and decrypt operations are implemented, they are reproduced as follows without any documentation/comments, as cryptography is not our main subject:
import base64
from Crypto.Cipher import AES

BLOCK_SIZE = 32
PADDING = '#'

def _pad(data, pad_with=PADDING):
    return data + (BLOCK_SIZE - len(data) % BLOCK_SIZE) * pad_with

def encrypt(secret_key, data):
    cipher = AES.new(_pad(secret_key, '@')[:32])
    return base64.b64encode(cipher.encrypt(_pad(data)))

def decrypt(secret_key, encrypted_data):
    cipher = AES.new(_pad(secret_key, '@')[:32])
    return cipher.decrypt(base64.b64decode(encrypted_data)).rstrip(PADDING)

KEY = 'your-key-super-duper-secret-key-here-only-first-32-characters-are-used'
encrypted = encrypt(KEY, 'Hello, world!')
print encrypted
print decrypt(KEY, encrypted)
More explanation on how this works is given at https://gist.github.com/mohsinhijazee/07cdfc2826a565b50a68.
Now, let's come back to our main subject, datastore. As you can see, all the data is stored in a single column, and if we want to query something, for instance, people who are older than 25, we have no way to do this. So, how will this work? Let's examine this next.
Supporting queries
Now, what if we want to get information pertaining to all the people who are older than, say, 30? In the current scheme of things, this does not seem doable, because the data is serialized and dumped, as shown in the previous table. Datastore solves this problem by storing the values that are to be queried on as row keys in another table, called the index table. This index table is nothing but a plain BigTable table, where the row keys are actually the property values that you want to query by. Hence, a scan and a quick lookup are possible. Here's what it would look like:
| Key | Entity key |
|---|---|
| Myapp-person-age-24 | agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM |
| Myapp-person-age-34 | agtkZXZ-bWdhZS0wMXIPCxNTVVyc29uIgNBbGkM |
| Myapp-person-age-44 | agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIgNBbBQM |
| Myapp-person-age-54 | agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIRJ3bGkM |
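Here is an illustrative sketch of how such a query could be answered with nothing but a range scan over this index table. The index row keys mirror the table above, while the entity-key values and the fixed-width age encoding are placeholders assumed for this example:

import bisect

# The index table: rows keyed by '<app>-<kind>-<property>-<value>', each row
# pointing at an entity key. Ages are fixed-width so that lexicographic order
# matches numeric order. Entity keys below are placeholders.
index = [
    ('Myapp-person-age-24', 'entity-key-of-Ali'),
    ('Myapp-person-age-34', 'entity-key-of-Bob'),
    ('Myapp-person-age-44', 'entity-key-of-David'),
    ('Myapp-person-age-54', 'entity-key-of-Zain'),
]
index_keys = [row_key for row_key, _ in index]

def people_older_than(age):
    # Scan from the first index row whose age value exceeds the given age.
    start = 'Myapp-person-age-%02d' % (age + 1)
    i = bisect.bisect_left(index_keys, start)
    return [entity_key for row_key, entity_key in index[i:]
            if row_key.startswith('Myapp-person-age-')]

print(people_older_than(30))   # ['entity-key-of-Bob', 'entity-key-of-David', 'entity-key-of-Zain']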
Implementation details
So, all in all, Datastore actually builds a NoSQL solution on top of BigTable by using the following six tables:
A table to store entities
A table to store entities by kind
A table to store indexes for the property values in the ascending order
A table to store indexes for the property values in the descending order
A table to store indexes for multiple properties together
A table to keep track of the next unique ID for each Kind
Let us look at each table in turn. The first table is used to store entities for all the applications. We have examined this in an example.
The second table just stores the Kind names. Nothing fancy here. It's just some metadata that datastore maintains for itself. Think of this—you want to get all the entities that are of the Person Kind. How will you do this? If you look at the entities table alone and the operations that are available to us on a BigTable table, you will know that there's no such way for us to fetch all the entities of a certain Kind. This table does exactly this. It looks like this:
| Key | Entity key |
|---|---|
| Myapp-Person-agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM | agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM |
| Myapp-Person-agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBb854 | agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBb854 |
| Myapp-Person-agtkZXZ-bWdhZS0wMXIQTXIGUGVy748IgNBbGkM | agtkZXZ-bWdhZS0wMXIQTXIGUGVy748IgNBbGkM |
So, as you can see, this is just a simple BigTable table where the keys are of the [app ID]-[Kind name]-[entity key] pattern.
Tables 3, 4, and 5 from the six tables mentioned in the preceding list are similar to the index table that we examined in the Supporting queries section.
This leaves us with the last table. As you know, while storing entities, it is important to have a unique key for each row. Since all the entities from all the apps are stored in a single table, the keys should be unique across the whole table. When datastore generates a key for an entity that has to be stored, it combines your application ID and the Kind name of the entity. This part of the key only distinguishes your entities from those of all the other applications in the table, but it does not make the key unique within the set of your own entities. For that, you need a number appended to it. This is exactly similar to how AUTO INCREMENT works in the RDBMS world, where the value of a column is automatically incremented to ensure that it is unique. So, that's exactly what the last table is for. It keeps track of the last ID that was used by each Kind of each application, and it looks like this:
| Key | Next ID |
|---|---|
| Myapp-Person | 65 |
So, in this table, the key is of the [application ID]-[Kind name] format, and the value is the next ID, which is 65 in this particular case. When a new entity of the Person kind is created, it will be assigned 65 as its ID, and the row will be updated to 66. Our application has only one Kind defined, which is Person, so there's only one row in this table, because we are only keeping track of the next ID for this Kind. If we had another Kind, say, Group, it would have its own row in this table.
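An illustrative sketch of what this last table does is shown below; the dict stands in for the BigTable table, and the starting value for a brand new Kind is just an assumption for this example:

# The sixth table: one counter row per '<application ID>-<Kind name>'.
next_ids = {'Myapp-Person': 65}

def allocate_id(app_id, kind):
    # Hand out the next ID for this Kind and advance the counter,
    # much like AUTO INCREMENT in the RDBMS world.
    row_key = '%s-%s' % (app_id, kind)
    allocated = next_ids.setdefault(row_key, 1)   # 1 is just this sketch's starting value
    next_ids[row_key] = allocated + 1
    return allocated

print(allocate_id('Myapp', 'Person'))   # 65
print(allocate_id('Myapp', 'Person'))   # 66
print(allocate_id('Myapp', 'Group'))    # 1, a new Kind gets its own counter row
print(next_ids)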
Summary
We started this article with the problem of storing huge amounts of data, processing it in bulk, and randomly accessing it. This arose from our ambition to store every single web page on earth and process it to extract some results from it. We introduced a solution called BigTable and examined its data model. We saw that in BigTable, we can define multiple tables, with each table having multiple column families, which are defined at the time of creating the table. We learned that column families are logical groupings of columns, and new columns can be defined in a column family as needed. We also learned that the data stored in BigTable has no meaning on its own; BigTable stores it just as plain bytes, and its interpretation and meaning depend on the user of the data. We also learned that each row in BigTable has a unique row key, which can be up to 64 KB in length.
Lastly, we turned our attention to datastore, a NoSQL storage solution built on top of BigTable for Google App Engine. We briefly mentioned some datastore terminology, such as properties (columns), entities (rows), and kinds (tables). We learned that all data is stored across six different BigTable tables, each of which captures a different aspect of the data. Most importantly, we learned that all the entities of all the apps hosted on Google App Engine are stored in a single BigTable, and all properties go into a single BigTable column. We also learned how querying is supported by additional index tables that are keyed by property values and list the corresponding entity keys.
This concludes our discussion of Google App Engine's datastore and its underlying technology, workings, and related concepts. Next, we will learn how to model our data on top of datastore. What we learned in this article will help us enormously in understanding how to better model our data to take full advantage of the underlying mechanisms.