NoSQL, or non-relational databases, are increasingly used in big data and real-time web applications. These databases are non-relational in nature and they provide a mechanism for storage and the retrieval of information that is not tabular.
There are many advantages of using NoSQL database:
Recently we were discussing the possibility of changing our data storage from HDF5 files to some NoSQL system. HDF5 files are great for the storage and retrieval purposes. But now with huge data coming in we need to scale up, and also the hierarchical schema of HDF5 files is not very well suited for all sorts of data we are using. I am a bioinformatician working on data science applications to genomic data. We have genomic annotation files (GFF format), genotype sequences (FASTA format), phenotype data (tables), and a lot of other data formats. We want to be able to store data in a space and memory efficient way and also the framework should facilitate fast retrieval. I did some research on the NoSQL options and prepared this cheat-sheet. This will be very useful for someone thinking about moving their storage to non-relational databases. Also, data scientists need to be very comfortable with the basic ideas of NoSQL DB's. In the course Introduction to Data Science by Prof. Bill Howe (UWashinton) on Coursera, NoSQL DB's formed a significant part of the lectures. I highly recommend the lectures on these topics and this course in general. This cheat-sheet should also assist aspiring data scientists in their interviews.
Some options for NoSQL databases:
Membase:
This is key-value type database. It is very efficient if you only need to quickly retrieve a value according to a key. It has all of the advantages of memcached when it comes to the low cost of implementation. There is not much emphasis on scalability, but lookups are very fast. It has a JSON format with no predefined schema. The weakness of using it for important data is that it's a pure key-value store, and thus is not queryable on properties.
MongoDB:
If you need to associate a more complex structure, such as a document to a key, then MongoDB is a good option. With a single query, you are going to retrieve the whole document and it can be a huge win. However using these documents like simple key/value stores would not be as fast and as space-efficient as Membase.
Berkeley DB:
It stores records in key-value pairs. Both key and value can be arbitrary byte strings, and can be of variable lengths. You can put native programming language data structures into the database without converting to a foreign record first. Storage and retrieval are very simple, but the application needs to know what the structure of a key and a value is in advance, it can't ask the DB.
Berkeley DB v/s MongoDB:
Redis:
If you need more structures like lists, sets, ordered sets and hashes, then Redis is the best bet. It's very fast and provides useful data-structures. It just works, but don't expect it to handle every use-case. Nevertheless, it is certainly possible to use Redis as your primary data-store. But it is used less for distributed scalability, but optimizes high performance lookups at the cost of no longer supporting relational queries.
Cassandra:
Each key has values as columns and columns are grouped together into sets called column families. Thus each key identifies a row of a variable number of elements.
HBase:
It is modeled after Google's Bigtable DB. The deal use for HBase is in the situations when you need improved flexibility, great performance, scaling and have Big Data. The data structure is similar to Cassandra where you have column families.
HBase V/S Cassandra:
Hbase is more suitable for data warehousing and large scale data processing and analysis (indexing the web as in a search engine) and Cassandra is more apt for real time transaction processing and the serving of interactive data.
Resources
Janu Verma is a Quantitative Researcher at the Buckler Lab, Cornell University, where he works on problems in bioinformatics and genomics. His background is in mathematics and machine learning and he leverages tools from these areas to answer questions in biology.