Chapter 6. Interacting with Databases
Data analysis starts with data. It is therefore beneficial to work with data storage systems that are simple to set up and operate, and where data access does not become a problem in itself. In short, we would like to have database systems that are easy to embed into our data analysis processes and workflows. In this book, we focus mostly on the Python side of the database interaction, and we will learn how to get data into and out of Pandas data structures.
There are numerous ways to store data. In this chapter, we are going to learn to interact with three main categories: text formats, binary formats and databases. Among databases, we will focus on two storage solutions, MongoDB and Redis. MongoDB is a document-oriented database, which is easy to start with, since we can store JSON documents and do not need to define a schema upfront. Redis is a popular in-memory data structure store on top of which many applications can be built. It is possible to use Redis as a fast key-value store, but Redis also supports lists, sets, hashes, bit arrays and even advanced data structures such as HyperLogLog out of the box.
Interacting with data in text format
Text is a great medium and it's a simple way to exchange information. The following statement is taken from a quote attributed to Doug McIlroy: Write programs to handle text streams, because that is the universal interface.
In this section we will start reading and writing data from and to text files.
Reading data from text format
Normally, the raw data logs of a system are stored in multiple text files, which can accumulate a large amount of information over time. Thankfully, it is simple to interact with these kinds of files in Python.
Pandas supports a number of functions for reading data from a text file into a DataFrame object. The simplest one is the read_csv() function. Let's start with a small example file:
$ cat example_data/ex_06-01.txt
Name,age,major_id,sex,hometown
Nam,7,1,male,hcm
Mai,11,1,female,hcm
Lan,25,3,female,hn
Hung,42,3,male,tn
Nghia,26,3,male,dn
Vinh,39,3,male,vl
Hong,28,4,female,dn
Tip
cat is a Unix shell command that prints the content of a file to the screen.
In the above example file, the columns are separated by commas and the first row is a header row containing the column names. To read the data file into a DataFrame object, we type the following commands:
>>> import pandas as pd
>>> df_ex1 = pd.read_csv('example_data/ex_06-01.txt')
>>> df_ex1
    Name  age  major_id     sex hometown
0    Nam    7         1    male      hcm
1    Mai   11         1  female      hcm
2    Lan   25         3  female       hn
3   Hung   42         3    male       tn
4  Nghia   26         3    male       dn
5   Vinh   39         3    male       vl
6   Hong   28         4  female       dn
We see that the read_csv() function uses a comma as the default delimiter between columns in the text file, and that the first row is automatically used as a header for the columns. To change these settings, we can use the sep parameter to set a different separator, and pass header=None when the file does not have a header row.
See the example below:
$ cat example_data/ex_06-02.txt
Nam 7 1 male hcm
Mai 11 1 female hcm
Lan 25 3 female hn
Hung 42 3 male tn
Nghia 26 3 male dn
Vinh 39 3 male vl
Hong 28 4 female dn
>>> df_ex2 = pd.read_csv('example_data/ex_06-02.txt', sep='\t', header=None)
>>> df_ex2
       0   1  2       3    4
0    Nam   7  1    male  hcm
1    Mai  11  1  female  hcm
2    Lan  25  3  female   hn
3   Hung  42  3    male   tn
4  Nghia  26  3    male   dn
5   Vinh  39  3    male   vl
6   Hong  28  4  female   dn
We can also use a specific row as the header row by setting the header parameter to the index of that row. Similarly, when we want to use a column in the data file as the index of the DataFrame, we set index_col to the name or index of that column. We again use the second data file, example_data/ex_06-02.txt, to illustrate this:
>>> df_ex3 = pd.read_csv('example_data/ex_06-02.txt', sep='\t', header=None, index_col=0)
>>> df_ex3
        1  2       3    4
0
Nam     7  1    male  hcm
Mai    11  1  female  hcm
Lan    25  3  female   hn
Hung   42  3    male   tn
Nghia  26  3    male   dn
Vinh   39  3    male   vl
Hong   28  4  female   dn
Apart from those parameters, we still have a lot of useful ones that can help us load data files into Pandas objects more effectively. The following table shows some common parameters:
| Parameter | Value | Description |
|---|---|---|
| dtype | Type name or dictionary of type of columns | Sets the data type for data or columns. By default, it will try to infer the most appropriate data type. |
| skiprows | List-like or integer | Line numbers to skip (starting from 0), or the number of lines to skip at the start of the file. |
| na_values | List-like or dict, default None | Values to recognize as NA/NaN. |
| true_values | List | A list of values to be converted to Boolean True as well. |
| false_values | List | A list of values to be converted to Boolean False as well. |
| | | If the |
| thousands | | The thousands separator. |
| nrows | | Limits the number of rows to read from the file. |
| | | If set to True, a DataFrame is returned, even if an error occurred during parsing. |
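As a minimal sketch of how a few of these options combine, the snippet below parses an in-memory copy of data shaped like the first example file. The sample text, the 'n/a' marker and the chosen column type are assumptions made for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical sample in the shape of ex_06-01.txt, with a comment line
# on top and one missing age marked as 'n/a'.
raw = """# exported sample
Name,age,major_id,sex,hometown
Nam,7,1,male,hcm
Mai,n/a,1,female,hcm
Lan,25,3,female,hn
Hung,42,3,male,tn
"""

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,                 # skip the leading comment line
    na_values=['n/a'],          # treat 'n/a' as a missing value
    nrows=3,                    # read only the first three data rows
    dtype={'major_id': str},    # keep major_id as text instead of integers
)

print(df)
print(df['age'].isna().tolist())
```

Because one age is missing, Pandas can no longer store the age column as integers and falls back to a floating-point column that holds NaN for the missing entry.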
Besides the read_csv() function, we also have some other parsing functions in Pandas:
| Function | Description |
|---|---|
| read_table | Read the general delimited file into DataFrame |
| read_fwf | Read a table of fixed-width formatted lines into DataFrame |
| read_clipboard | Read text from the clipboard and pass to |
In some situations, these functions cannot parse a data file automatically. In that case, we can open the file ourselves and iterate over a reader provided by the csv module in the standard library:
$ cat example_data/ex_06-03.txt
Nam 7 1 male hcm
Mai 11 1 female hcm
Lan 25 3 female hn
Hung 42 3 male tn single
Nghia 26 3 male dn single
Vinh 39 3 male vl
Hong 28 4 female dn
>>> import csv
>>> f = open('example_data/ex_06-03.txt')
>>> r = csv.reader(f, delimiter='\t')
>>> for line in r:
...     print(line)
...
['Nam', '7', '1', 'male', 'hcm']
['Mai', '11', '1', 'female', 'hcm']
['Lan', '25', '3', 'female', 'hn']
['Hung', '42', '3', 'male', 'tn', 'single']
['Nghia', '26', '3', 'male', 'dn', 'single']
['Vinh', '39', '3', 'male', 'vl']
['Hong', '28', '4', 'female', 'dn']
>>> f.close()
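Because csv.reader returns each line as a plain list, rows of unequal length can be normalized before further processing. The following is a minimal sketch, assuming we want to pad short rows with None; the column names and the in-memory sample are made up for illustration.

```python
import csv
import io

# In-memory stand-in for a tab-separated file with ragged rows.
raw = "Nam\t7\t1\tmale\thcm\nHung\t42\t3\tmale\ttn\tsingle\n"

# Hypothetical column names; the sample data itself has no header row.
columns = ['name', 'age', 'major_id', 'sex', 'hometown', 'status']

rows = []
for fields in csv.reader(io.StringIO(raw), delimiter='\t'):
    # Pad rows that have fewer fields than expected with None.
    padded = fields + [None] * (len(columns) - len(fields))
    rows.append(dict(zip(columns, padded)))

for row in rows:
    print(row)
```

Note that csv.reader yields every field as a string, so numeric columns such as the age still need an explicit conversion.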
Writing data to text format
We saw how to load data from a text file into a Pandas data structure. Now, we will learn how to export data from a data object in a program to a text file. Corresponding to the read_csv() function, Pandas also provides the to_csv() function. Let's see an example below:
>>> df_ex3.to_csv('example_data/ex_06-02.out', sep = ';')
The result will look like this:
$ cat example_data/ex_06-02.out
0;1;2;3;4
Nam;7;1;male;hcm
Mai;11;1;female;hcm
Lan;25;3;female;hn
Hung;42;3;male;tn
Nghia;26;3;male;dn
Vinh;39;3;male;vl
Hong;28;4;female;dn
If we want to skip the header line or the index column when writing data out to a file, we can set the header and index parameters to False:
>>> import sys
>>> df_ex3.to_csv(sys.stdout, sep='\t', header=False, index=False)
7 1 male hcm
11 1 female hcm
25 3 female hn
42 3 male tn
26 3 male dn
39 3 male vl
28 4 female dn
We can also write a subset of the columns of the DataFrame to the file by specifying them in the columns parameter:
>>> df_ex3.to_csv(sys.stdout, columns=[3,1,4], header=False, sep='\t')
Nam male 7 hcm
Mai female 11 hcm
Lan female 25 hn
Hung male 42 tn
Nghia male 26 dn
Vinh male 39 vl
Hong female 28 dn
With Series objects, we can use the same function to write data into text files, with mostly the same parameters as above.
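To inspect the output of to_csv() without touching the disk, one option is to call it without a target, in which case Pandas returns the text as a string. The following is a minimal sketch, assuming a small made-up Series and DataFrame rather than the example files above.

```python
import pandas as pd

# Small made-up objects for illustration.
ages = pd.Series([7, 11, 25], index=['Nam', 'Mai', 'Lan'], name='age')
people = pd.DataFrame({'age': [7, 11], 'hometown': ['hcm', 'hcm']},
                      index=['Nam', 'Mai'])

# With no path given, to_csv() returns the CSV text instead of writing a file.
series_text = ages.to_csv(sep=';', header=False)
frame_text = people.to_csv(sep=';', columns=['hometown'], index=False)

print(series_text)
print(frame_text)
```

The Series call writes the index and the values, while the DataFrame call keeps only the hometown column and drops the index.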
Interacting with data in binary format
We can read and write binary serializations of Python objects with the pickle module, which can be found in the standard library. Object serialization can be useful if you work with objects that take a long time to create, like some machine learning models. By pickling such objects, subsequent access to them can be made faster. It also allows you to distribute Python objects in a standardized way.
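As a minimal sketch of this round trip with the standard library alone, the snippet below pickles a made-up dictionary that stands in for an object that is expensive to build. The file name and the temporary directory are chosen only for illustration.

```python
import os
import pickle
import tempfile

# Made-up object standing in for something that is slow to compute.
model = {'name': 'toy-model', 'weights': [0.1, 0.5, 0.4]}

path = os.path.join(tempfile.mkdtemp(), 'model.pkl')

# Serialize the object to a binary file.
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Load it back; the result is an equal, but separate, object.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
print(restored is model)
```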
Pandas includes support for pickling out of the box. The relevant functions are read_pickle() and to_pickle(), which read and write data from and to files easily. They write data to disk in the pickle format, which is a convenient short-term storage format:
>>> df_ex3.to_pickle('example_data/ex_06-03.out')
>>> pd.read_pickle('example_data/ex_06-03.out')
        1  2       3    4
0
Nam     7  1    male  hcm
Mai    11  1  female  hcm
Lan    25  3  female   hn
Hung   42  3    male   tn
Nghia  26  3    male   dn
Vinh   39  3    male   vl
Hong   28  4  female   dn
HDF5
HDF5 is not a database, but a data model and file format. It is suited for write-once, read-many datasets. An HDF5 file includes two kinds of objects: data sets, which are array-like collections of data, and groups, which are folder-like containers that hold data sets and other groups. There are several interfaces for interacting with the HDF5 format in Python, such as h5py, which uses familiar NumPy and Python constructs, such as dictionaries and NumPy array syntax. With h5py, we have a high-level interface to the HDF5 API, which helps us to get started. However, in this book, we will introduce another library for this kind of format called PyTables, which works well with Pandas objects:
>>> store = pd.HDFStore('hdf5_store.h5')
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: hdf5_store.h5
Empty
We created an empty HDF5 file named hdf5_store.h5. Now, we can write data to the file just like adding key-value pairs to a dict:
>>> store['ex3'] = df_ex3
>>> store['name'] = df_ex2[0]
>>> store['hometown'] = df_ex3[4]
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: hdf5_store.h5
/ex3            frame        (shape->[7,4])
/hometown       series       (shape->[1])
/name           series       (shape->[1])
Objects stored in the HDF5 file can be retrieved by specifying the object keys:
>>> store['name']
0      Nam
1      Mai
2      Lan
3     Hung
4    Nghia
5     Vinh
6     Hong
Name: 0, dtype: object
Once we have finished interacting with the HDF5 file, we close it to release the file handle:
>>> store.close()
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: hdf5_store.h5
File is CLOSED
There are other supported functions that are useful for working with the HDF5 format. If you need to work with huge quantities of data, you should explore the two libraries, pytables and h5py, in more detail.
Interacting with data in MongoDB
Many applications require more robust storage systems than text files, which is why they use databases to store data. There are many kinds of databases, but they fall into two broad categories: relational databases, which support a standard declarative language called SQL, and so-called NoSQL databases, which are often able to work without a predefined schema and where a data instance is more properly described as a document, rather than as a row.
MongoDB is a kind of NoSQL database that stores data as documents, which are grouped together in collections. Documents are expressed as JSON objects. It is fast and scalable for storing data, and flexible for querying it. To use MongoDB in Python, we need to import the pymongo package and open a connection to the database by passing a hostname and port. We suppose that we have a MongoDB instance running on the default host (localhost) and port (27017):
>>> import pymongo
>>> conn = pymongo.MongoClient(host='localhost', port=27017)
If we do not pass any parameters to the pymongo.MongoClient() function, it will automatically use the default host and port.
In the next step, we will interact with databases inside the MongoDB instance. We can list all databases that are available in the instance (in PyMongo 3.6 and later, the equivalent call is list_database_names()):
>>> conn.database_names()
['local']
>>> lc = conn.local
>>> lc
Database(MongoClient('localhost', 27017), 'local')
The above snippet says that our MongoDB instance only has one database, named 'local'. If the databases and collections we point to do not exist, MongoDB will create them as necessary:
>>> db = conn.db
>>> db
Database(MongoClient('localhost', 27017), 'db')
Each database contains groups of documents, called collections. We can think of them as tables in a relational database. To list all existing collections in a database, we use the collection_names() function (list_collection_names() in PyMongo 3.6 and later):
>>> lc.collection_names()
['startup_log', 'system.indexes']
>>> db.collection_names()
[]
Our db database does not have any collections yet. Let's create a collection named person, and insert data from a DataFrame object into it:
>>> collection = db.person
>>> collection
Collection(Database(MongoClient('localhost', 27017), 'db'), 'person')
>>> # insert df_ex2 DataFrame into created collection
>>> import json
>>> records = json.loads(df_ex2.T.to_json()).values()
>>> records
dict_values([{'2': 3, '3': 'male', '1': 39, '4': 'vl', '0': 'Vinh'},
{'2': 3, '3': 'male', '1': 26, '4': 'dn', '0': 'Nghia'},
{'2': 4, '3': 'female', '1': 28, '4': 'dn', '0': 'Hong'},
{'2': 3, '3': 'female', '1': 25, '4': 'hn', '0': 'Lan'},
{'2': 3, '3': 'male', '1': 42, '4': 'tn', '0': 'Hung'},
{'2': 1, '3': 'male', '1': 7, '4': 'hcm', '0': 'Nam'},
{'2': 1, '3': 'female', '1': 11, '4': 'hcm', '0': 'Mai'}])
>>> collection.insert(records)
[ObjectId('557da218f21c761d7c176a40'),
ObjectId('557da218f21c761d7c176a41'),
ObjectId('557da218f21c761d7c176a42'),
ObjectId('557da218f21c761d7c176a43'),
ObjectId('557da218f21c761d7c176a44'),
ObjectId('557da218f21c761d7c176a45'),
ObjectId('557da218f21c761d7c176a46')]
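df_ex2 is transposed and converted to a JSON string, which is then parsed into a dictionary that maps each row label to one record. The insert() function receives the records taken from that dictionary and saves them to the collection. (In PyMongo 3.0 and later, insert_many() is the recommended method for inserting several documents.)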
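To see why the DataFrame is transposed first, it helps to look at the JSON that the conversion produces: after transposing, each top-level key is a row label and each value is one record. The sketch below uses a hand-written JSON string in that shape, an assumption standing in for the output of df_ex2.T.to_json(), and needs only the standard library.

```python
import json

# Hand-written stand-in for df_ex2.T.to_json(): row label -> record.
transposed_json = (
    '{"0": {"0": "Nam", "1": 7, "2": 1, "3": "male", "4": "hcm"},'
    ' "1": {"0": "Mai", "1": 11, "2": 1, "3": "female", "4": "hcm"}}'
)

# json.loads parses a string; json.load expects a file-like object.
by_row = json.loads(transposed_json)

# Dropping the row labels leaves a list of documents ready for insertion.
records = list(by_row.values())

for record in records:
    print(record)
```

Wrapping the values in list() gives a plain list of dictionaries, which is the shape that document-insertion methods generally expect.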
If we want to list all data inside the collection, we can execute the following commands:
>>> for cur in collection.find():
...     print(cur)
...
{'4': 'vl', '2': 3, '3': 'male', '1': 39, '_id': ObjectId('557da218f21c761d7c176a40'), '0': 'Vinh'}
{'4': 'dn', '2': 3, '3': 'male', '1': 26, '_id': ObjectId('557da218f21c761d7c176a41'), '0': 'Nghia'}
{'4': 'dn', '2': 4, '3': 'female', '1': 28, '_id': ObjectId('557da218f21c761d7c176a42'), '0': 'Hong'}
{'4': 'hn', '2': 3, '3': 'female', '1': 25, '_id': ObjectId('557da218f21c761d7c176a43'), '0': 'Lan'}
{'4': 'tn', '2': 3, '3': 'male', '1': 42, '_id': ObjectId('557da218f21c761d7c176a44'), '0': 'Hung'}
{'4': 'hcm', '2': 1, '3': 'male', '1': 7, '_id': ObjectId('557da218f21c761d7c176a45'), '0': 'Nam'}
{'4': 'hcm', '2': 1, '3': 'female', '1': 11, '_id': ObjectId('557da218f21c761d7c176a46'), '0': 'Mai'}
If we want to query data from the created collection with some conditions, we can use the find() function and pass in a dictionary describing the documents we want to retrieve. The returned result is a cursor type, which supports the iterator protocol:
>>> cur = collection.find({'3' : 'male'})
>>> type(cur)
pymongo.cursor.Cursor
>>> result = pd.DataFrame(list(cur))
>>> result
       0   1  2     3    4                       _id
0   Vinh  39  3  male   vl  557da218f21c761d7c176a40
1  Nghia  26  3  male   dn  557da218f21c761d7c176a41
2   Hung  42  3  male   tn  557da218f21c761d7c176a44
3    Nam   7  1  male  hcm  557da218f21c761d7c176a45
Sometimes, we want to delete data in MongoDB. All we need to do is pass a query to the remove() method on the collection:
>>> # before removing data
>>> pd.DataFrame(list(collection.find()))
       0   1  2       3    4                       _id
0   Vinh  39  3    male   vl  557da218f21c761d7c176a40
1  Nghia  26  3    male   dn  557da218f21c761d7c176a41
2   Hong  28  4  female   dn  557da218f21c761d7c176a42
3    Lan  25  3  female   hn  557da218f21c761d7c176a43
4   Hung  42  3    male   tn  557da218f21c761d7c176a44
5    Nam   7  1    male  hcm  557da218f21c761d7c176a45
6    Mai  11  1  female  hcm  557da218f21c761d7c176a46
>>> # after removing records which have '2' column as 1 and '3' column as 'male'
>>> collection.remove({'2': 1, '3': 'male'})
{'n': 1, 'ok': 1}
>>> cur_all = collection.find()
>>> pd.DataFrame(list(cur_all))
       0   1  2       3    4                       _id
0   Vinh  39  3    male   vl  557da218f21c761d7c176a40
1  Nghia  26  3    male   dn  557da218f21c761d7c176a41
2   Hong  28  4  female   dn  557da218f21c761d7c176a42
3    Lan  25  3  female   hn  557da218f21c761d7c176a43
4   Hung  42  3    male   tn  557da218f21c761d7c176a44
5    Mai  11  1  female  hcm  557da218f21c761d7c176a46
We learned step by step how to insert, query and delete data in a collection. Now, we will show how to update existing data in a collection in MongoDB:
>>> doc = collection.find_one({'1' : 42})
>>> doc['4'] = 'hcm'
>>> collection.save(doc)
ObjectId('557da218f21c761d7c176a44')
>>> pd.DataFrame(list(collection.find()))
       0   1  2       3    4                       _id
0   Vinh  39  3    male   vl  557da218f21c761d7c176a40
1  Nghia  26  3    male   dn  557da218f21c761d7c176a41
2   Hong  28  4  female   dn  557da218f21c761d7c176a42
3    Lan  25  3  female   hn  557da218f21c761d7c176a43
4   Hung  42  3    male  hcm  557da218f21c761d7c176a44
5    Mai  11  1  female  hcm  557da218f21c761d7c176a46
The following table shows update operators that provide shortcuts for manipulating documents in MongoDB:
| Update Method | Description |
|---|---|
| $inc | Increment a numeric field |
| $set | Set certain fields to new values |
| $unset | Remove a field from the document |
| $push | Append a value onto an array in the document |
| $pushAll | Append several values onto an array in the document |
| $addToSet | Add a value to an array, only if it does not exist |
| $pop | Remove the last value of an array |
| $pull | Remove all occurrences of a value from an array |
| $pullAll | Remove all occurrences of any set of values from an array |
| $rename | Rename a field |
| $bit | Update a value by bitwise operation |
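Operators like these are used as keys of an update document, which is a plain dictionary. The sketch below builds such a document and applies a tiny, purely illustrative in-memory version of $set, $inc and $unset to a dict, to show the intended effect. The helper apply_update is hypothetical and is not part of PyMongo; with a live collection, the update document would instead be passed to a method such as update_one().

```python
import copy


def apply_update(document, update):
    """Toy illustration of $set, $inc and $unset on a plain dict."""
    result = copy.deepcopy(document)
    for field, value in update.get('$set', {}).items():
        result[field] = value                            # overwrite or add
    for field, amount in update.get('$inc', {}).items():
        result[field] = result.get(field, 0) + amount    # missing counts as 0
    for field in update.get('$unset', {}):
        result.pop(field, None)                          # drop if present
    return result


doc = {'0': 'Hung', '1': 42, '2': 3, '3': 'male', '4': 'tn'}

# Update document: change field '4', add one to field '1', drop field '2'.
update = {'$set': {'4': 'hcm'}, '$inc': {'1': 1}, '$unset': {'2': ''}}

updated = apply_update(doc, update)
print(updated)
```

Compared with fetching a document, changing it in Python and saving it back, an update document describes only the change, so the server can apply it without the whole document making a round trip.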
Interacting with data in Redis
Redis is an advanced kind of key-value store where the values can be of different types: string, list, set, sorted set or hash. Like memcached, Redis stores data in memory, but unlike memcached, it can also persist data to disk. Redis supports fast reads and writes, on the order of 100,000 set or get operations per second.
To interact with Redis, we need to install the redis-py module for Python, which is available on PyPI and can be installed with pip:
$ pip install redis
Now, we can connect to Redis via the host and port of the DB server. We assume that we have already installed a Redis server, which is running with the default host (localhost) and port (6379) parameters:
>>> import redis
>>> r = redis.StrictRedis(host='localhost', port=6379)
>>> r
StrictRedis<ConnectionPool<Connection<host=localhost,port=6379,db=0>>>
As a first step to storing data in Redis, we need to define which kind of data structure is suitable for our requirements. In this section, we will introduce four commonly used data structures in Redis: simple value, list, set and ordered set. Though data is stored into Redis in many different data structures, each value must be associated with a key.
The simple value
This is the most basic kind of value in Redis. For every key in Redis, we also have a value that can have a data type, such as string, integer or double. Let's start with an example for setting and getting data to and from Redis:
>>> r.set('gender:An', 'male')
True
>>> r.get('gender:An')
b'male'
In this example, we want to store the gender info of a person named An in Redis. Our key is gender:An and our value is male. Both of them are strings.
The set() function receives two parameters: the key first, and the value second. If we want to update the value of this key, we just call the function again with a different second parameter, and Redis overwrites the stored value.
The get() function retrieves the value of the key that is passed as its parameter. In this case, we want to get the gender information stored under the key gender:An.
In the second example, we show you another kind of value type, an integer:
>>> r.set('visited_time:An', 12)
True
>>> r.get('visited_time:An')
b'12'
>>> r.incr('visited_time:An', 1)
13
>>> r.get('visited_time:An')
b'13'
We saw a new function, incr(), which is used to increment the value of a key by a given amount. If the key does not exist, Redis will create it with the given increment as its value.
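Note that the replies to get() above are bytes objects (b'male', b'13'), because under Python 3 the client returns raw bytes by default. A minimal sketch of converting such replies, using literal values in place of a live server:

```python
# Literal stand-ins for replies such as r.get('gender:An').
raw_gender = b'male'
raw_visits = b'13'

gender = raw_gender.decode('utf-8')         # bytes -> str
visits = int(raw_visits.decode('utf-8'))    # bytes -> str -> int

print(gender, visits + 1)
```

redis-py also accepts decode_responses=True when the client is created, in which case replies are decoded to str automatically.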
List
We have a few methods for interacting with list values in Redis. The following example uses the rpush() and lrange() functions to put and get list data to and from the DB:
>>> r.rpush('name_list', 'Tom')
1
>>> r.rpush('name_list', 'John')
2
>>> r.rpush('name_list', 'Mary')
3
>>> r.rpush('name_list', 'Jan')
4
>>> r.lrange('name_list', 0, -1)
[b'Tom', b'John', b'Mary', b'Jan']
>>> r.llen('name_list')
4
>>> r.lindex('name_list', 1)
b'John'
Besides the rpush() and lrange() functions, the example also uses two other functions. The llen() function returns the length of the list stored under a given key. The lindex() function is another way to retrieve an item of the list; it takes two parameters: a key and the index of the item in the list. The following table lists some other useful functions for processing list data structures with Redis:
| Function | Description |
|---|---|
| rpushx() | Push value onto the tail of the list name if name exists |
| rpop() | Remove and return the last item of the list name |
| lset() | Set item at the index position of the list name to input value |
| lpushx() | Push value on the head of the list name if name exists |
| lpop() | Remove and return the first item of the list name |
Set
This data structure is also similar to the list type. However, in contrast to a list, we cannot store duplicate values in our set:
>>> r.sadd('country', 'USA')
1
>>> r.sadd('country', 'Italy')
1
>>> r.sadd('country', 'Singapore')
1
>>> r.sadd('country', 'Singapore')
0
>>> r.smembers('country')
{b'Italy', b'Singapore', b'USA'}
>>> r.srem('country', 'Singapore')
1
>>> r.smembers('country')
{b'Italy', b'USA'}
As with the list data structure, we also have a number of functions to get, set, update or delete items in a set. The supported functions for the set data structure are listed in the following table:
| Function | Description |
|---|---|
| sadd() | Add value(s) to the set with key name |
| scard() | Return the number of elements in the set with key name |
| smembers() | Return all members of the set with key name |
| srem() | Remove value(s) from the set with key name |
Ordered set
When we add data to an ordered set, it takes an extra attribute called score. An ordered set uses the score to determine the order of the elements in the set. The calls below use the zadd(name, score, value) form of StrictRedis in redis-py 2.x; redis-py 3.0 and later expect a mapping instead, as in zadd('person:A', {'sub:Math': 10}):
>>> r.zadd('person:A', 10, 'sub:Math')
1
>>> r.zadd('person:A', 7, 'sub:Bio')
1
>>> r.zadd('person:A', 8, 'sub:Chem')
1
>>> r.zrange('person:A', 0, -1)
[b'sub:Bio', b'sub:Chem', b'sub:Math']
>>> r.zrange('person:A', 0, -1, withscores=True)
[(b'sub:Bio', 7.0), (b'sub:Chem', 8.0), (b'sub:Math', 10.0)]
By using the zrange(name, start, end) function, we can get a range of values from the sorted set between the start and end positions, with the elements sorted by score in ascending order by default. To reverse the sort order, we can set the desc parameter to True. The withscores parameter is used when we want to get the scores along with the returned values; in that case, the result is a list of (value, score) pairs, as you can see in the above example.
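Positions and scores are easy to mix up here, so the following plain-Python sketch mimics the ordering without a Redis server. The function toy_zrange is a hypothetical stand-in written for illustration only, not part of redis-py; it selects by position in the score-ordered sequence, using the scores from the example.

```python
def toy_zrange(scores, start, end, desc=False, withscores=False):
    """Mimic position-based range selection on a {member: score} mapping."""
    # Order members by score (ties broken by member name).
    ordered = sorted(scores.items(), key=lambda item: (item[1], item[0]),
                     reverse=desc)
    # Redis-style end positions are inclusive, and -1 means the last element.
    stop = len(ordered) if end == -1 else end + 1
    selected = ordered[start:stop]
    return selected if withscores else [member for member, _ in selected]


scores = {'sub:Math': 10.0, 'sub:Bio': 7.0, 'sub:Chem': 8.0}

print(toy_zrange(scores, 0, -1))
print(toy_zrange(scores, 0, 0, withscores=True))
print(toy_zrange(scores, 0, 1, desc=True))
```

Here, start and end select by rank: 0 to 0 returns only the lowest-scored member, whatever its score is. Selecting by score value is a different operation.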
See the below table for more functions available on ordered sets:
| Function | Description |
|---|---|
| zcard() | Return the number of elements in the sorted set with key name |
| zincrby() | Increment the score of value in the sorted set with key name by amount |
| zrangebyscore() | Return a range of values from the sorted set with key name with a score between min and max. If start and |
| zrank() | Return a 0-based value indicating the rank of value in the sorted set with key name |
| zrem() | Remove member value(s) from the sorted set with key name |
The simple value
This is the most basic kind of value in Redis. For every key in Redis, we also have a value that can have a data type, such as string, integer or double. Let's start with an example for setting and getting data to and from Redis:
>>> r.set('gender:An', 'male') True >>> r.get('gender:An') b'male'
In this example we want to store the gender info of a person, named An
into Redis. Our key is gender:An
and our value is male
. Both of them are a type of string.
The set()
function receives two parameters: the key and the value. The first parameter is the key and the second parameter is value. If we want to update the value of this key, we just call the function again and change the value of the second parameter. Redis automatically updates it.
The get()
function will retrieve the value of our key, which is passed as the parameter. In this case, we want to get gender information of the key gender:An
.
In the second example, we show you another kind of value type, an integer:
>>> r.set('visited_time:An', 12) True >>> r.get('visited_time:An') b'12' >>> r.incr('visited_time:An', 1) 13 >>> r.get('visited_time:An') b'13'
We saw a new function, incr()
, which used to increment the value of key by a given amount. If our key does not exist, RedisDB will create the key with the given increment as the value.
List
We have a few methods for interacting with list values in Redis. The following example uses rpush()
and lrange()
functions to put and get list data to and from the DB:
>>> r.rpush('name_list', 'Tom') 1L >>> r.rpush('name_list', 'John') 2L >>> r.rpush('name_list', 'Mary') 3L >>> r.rpush('name_list', 'Jan') 4L >>> r.lrange('name_list', 0, -1) [b'Tom', b'John', b'Mary', b'Jan'] >>> r.llen('name_list') 4 >>> r.lindex('name_list', 1) b'John'
Besides the rpush()
and lrange()
functions we used in the example, we also want to introduce two others functions. First, the llen()
function is used to get the length of our list in the Redis for a given key. The lindex()
function is another way to retrieve an item of the list. We need to pass two parameters into the function: a key and an index of item in the list. The following table lists some other powerful functions in processing list data structure with Redis:
Function |
Description |
---|---|
|
Push value onto the tail of the list name if name exists |
|
Remove and return the last item of the list name |
|
Set item at the index position of the list name to input value |
|
Push value on the head of the list name if name exists |
|
Remove and return the first item of the list name |
Set
This data structure is also similar to the list type. However, in contrast to a list, we cannot store duplicate values in our set:
>>> r.sadd('country', 'USA') 1 >>> r.sadd('country', 'Italy') 1 >>> r.sadd('country', 'Singapore') 1 >>> r.sadd('country', 'Singapore') 0 >>> r.smembers('country') {b'Italy', b'Singapore', b'USA'} >>> r.srem('country', 'Singapore') 1 >>> r.smembers('country') {b'Italy', b'USA'}
Corresponding to the list data structure, we also have a number of functions to get, set, update or delete items in the set. They are listed in the supported functions for set data structure, in the following table:
Function |
Description |
---|---|
|
Add value(s) to the set with key name |
|
Return the number of element in the set with key name |
|
Return all members of the set with key name |
|
Remove value(s) from the set with key name |
Ordered set
The ordered set data structure takes an extra attribute when we add data to a set called score. An ordered set will use the score to determine the order of the elements in the set:
>>> r.zadd('person:A', 10, 'sub:Math') 1 >>> r.zadd('person:A', 7, 'sub:Bio') 1 >>> r.zadd('person:A', 8, 'sub:Chem') 1 >>> r.zrange('person:A', 0, -1) [b'sub:Bio', b'sub:Chem', b'sub:Math'] >>> r.zrange('person:A', 0, -1, withscores=True) [(b'sub:Bio', 7.0), (b'sub:Chem', 8.0), (b'sub:Math', 10.0)]
By using the zrange(name, start, end)
function, we can get a range of values from the sorted set between the start and end score sorted in ascending order by default. If we want to change the way
method of sorting, we can set the desc
parameter to True
. The withscore
parameter is used in case we want to get the scores along with the return values. The return type is a list of (value, score) pairs as you can see in the above example.
See the below table for more functions available on ordered sets:
Function |
Description |
---|---|
|
Return the number of elements in the sorted set with key name |
|
Increment the score of value in the sorted set with key name by amount |
|
Return a range of values from the sorted set with key name with a score between min and max. If If start and |
|
Return a 0-based value indicating the rank of value in the sorted set with key name |
|
Remove member value(s) from the sorted set with key name |
List
We have a few methods for interacting with list values in Redis. The following example uses rpush()
and lrange()
functions to put and get list data to and from the DB:
>>> r.rpush('name_list', 'Tom') 1L >>> r.rpush('name_list', 'John') 2L >>> r.rpush('name_list', 'Mary') 3L >>> r.rpush('name_list', 'Jan') 4L >>> r.lrange('name_list', 0, -1) [b'Tom', b'John', b'Mary', b'Jan'] >>> r.llen('name_list') 4 >>> r.lindex('name_list', 1) b'John'
Besides the rpush()
and lrange()
functions we used in the example, we also want to introduce two others functions. First, the llen()
function is used to get the length of our list in the Redis for a given key. The lindex()
function is another way to retrieve an item of the list. We need to pass two parameters into the function: a key and an index of item in the list. The following table lists some other powerful functions in processing list data structure with Redis:
Function |
Description |
---|---|
|
Push value onto the tail of the list name if name exists |
|
Remove and return the last item of the list name |
|
Set item at the index position of the list name to input value |
|
Push value on the head of the list name if name exists |
|
Remove and return the first item of the list name |
Set
This data structure is also similar to the list type. However, in contrast to a list, we cannot store duplicate values in our set:
>>> r.sadd('country', 'USA') 1 >>> r.sadd('country', 'Italy') 1 >>> r.sadd('country', 'Singapore') 1 >>> r.sadd('country', 'Singapore') 0 >>> r.smembers('country') {b'Italy', b'Singapore', b'USA'} >>> r.srem('country', 'Singapore') 1 >>> r.smembers('country') {b'Italy', b'USA'}
Corresponding to the list data structure, we also have a number of functions to get, set, update or delete items in the set. They are listed in the supported functions for set data structure, in the following table:
Function |
Description |
---|---|
|
Add value(s) to the set with key name |
|
Return the number of element in the set with key name |
|
Return all members of the set with key name |
|
Remove value(s) from the set with key name |
Ordered set
The ordered set data structure takes an extra attribute when we add data to a set called score. An ordered set will use the score to determine the order of the elements in the set:
>>> r.zadd('person:A', 10, 'sub:Math') 1 >>> r.zadd('person:A', 7, 'sub:Bio') 1 >>> r.zadd('person:A', 8, 'sub:Chem') 1 >>> r.zrange('person:A', 0, -1) [b'sub:Bio', b'sub:Chem', b'sub:Math'] >>> r.zrange('person:A', 0, -1, withscores=True) [(b'sub:Bio', 7.0), (b'sub:Chem', 8.0), (b'sub:Math', 10.0)]
By using the zrange(name, start, end) function, we can get a range of values from the sorted set between the start and end index positions, sorted by score in ascending order by default. To reverse the sort order, we can set the desc parameter to True. The withscores parameter is used when we want to get the scores along with the returned values; in that case, the return type is a list of (value, score) pairs, as you can see in the preceding example.
See the following table for more functions available on ordered sets:
Function | Description
---|---
zcard() | Return the number of elements in the sorted set with key name
zincrby() | Increment the score of value in the sorted set with key name by amount
zrangebyscore() | Return a range of values from the sorted set with key name with a score between min and max. If start and …
zrank() | Return a 0-based value indicating the rank of value in the sorted set with key name
zrem() | Remove member value(s) from the sorted set with key name
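The difference between selecting by index position and selecting by score is worth seeing side by side. The following pure-Python sketch models an ordered set as a dictionary from member to score. It is an illustration under stated assumptions, not the redis-py API: the store dictionary and all helpers are hypothetical, and zadd takes a {member: score} mapping here, which is the calling style that redis-py 3.0 and later expect, whereas the session above uses the older positional form.

```python
# Hypothetical in-memory stand-in for a Redis server: key -> {member: score}.
store = {}

def zadd(name, mapping):
    """Add or update members with their scores; return the number of new members."""
    scores = store.setdefault(name, {})
    added = sum(1 for member in mapping if member not in scores)
    scores.update({member: float(score) for member, score in mapping.items()})
    return added

def zrange(name, start, end, desc=False, withscores=False):
    """Return members between two index positions (inclusive), ordered by score."""
    ordered = sorted(store.get(name, {}).items(),
                     key=lambda pair: (pair[1], pair[0]), reverse=desc)
    n = len(ordered)
    # Translate inclusive, possibly negative indices into a Python slice.
    if start < 0:
        start = max(n + start, 0)
    if end < 0:
        end += n
    if end < 0:
        return []
    chosen = ordered[start:end + 1]
    return chosen if withscores else [member for member, _ in chosen]

def zrangebyscore(name, min_score, max_score):
    """Return members whose score lies between min_score and max_score (inclusive)."""
    return [member for member, score in zrange(name, 0, -1, withscores=True)
            if min_score <= score <= max_score]

def zrank(name, value):
    """Return the 0-based rank of value in ascending score order, or None."""
    members = zrange(name, 0, -1)
    return members.index(value) if value in members else None

new_members = zadd('person:A', {'sub:Math': 10, 'sub:Bio': 7, 'sub:Chem': 8})
by_index = zrange('person:A', 0, -1)                         # all members, lowest score first
top = zrange('person:A', 0, 0, desc=True, withscores=True)   # highest score only
by_score = zrangebyscore('person:A', 8, 10)                  # filter on the score itself
rank = zrank('person:A', 'sub:Chem')
```

Here zrange treats 0 and -1 as positions in the ordered sequence, while zrangebyscore treats 8 and 10 as bounds on the scores.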
Summary
We have covered the basics of interacting with data in several commonly used storage mechanisms, from simple ones, such as text files, through more structured ones, such as HDF5, to more sophisticated data storage systems, such as MongoDB and Redis. The most suitable type of storage will depend on your use case. The choice of the data storage layer technology plays an important role in the overall design of data processing systems. Sometimes, we need to combine various database systems to store our data, depending on factors such as the complexity of the data, the performance of the system, or the computation requirements.
Practice exercises
- Take a data set of your choice and design storage options for it. Consider text files, HDF5, a document database, and a data structure store as possible persistence options. Also evaluate how difficult (by some metric, for example, how many lines of code) it would be to update or delete a specific item. Which storage type is the easiest to set up? Which storage type supports the most flexible queries?
- In Chapter 3, Data Analysis with Pandas, we saw that it is possible to create hierarchical indices with Pandas. As an example, assume that you have data on each city with more than 1 million inhabitants and a two-level index, so that you can address individual cities, but also whole countries. How would you represent this hierarchical relationship with the various storage options presented in this chapter: text files, HDF5, MongoDB, and Redis? What do you believe would be most convenient to work with in the long run?