Storing and retrieving from a relational database
The relational database model was invented at IBM in 1970 and went on to dominate data storage and analytics for decades. The model is still widely used, although it has been losing ground to increasingly popular alternatives such as Hadoop. Nevertheless, relational databases are nowhere near their demise, and they remain the tool of choice for many applications.
Getting ready
In order to execute the following recipe, you need the pandas and SQLAlchemy modules installed. SQLAlchemy abstracts the database interactions between Python scripts and a number of relational databases; effectively, you do not have to remember the specifics of each database's syntax, as SQLAlchemy handles that for you.
If you are using the Anaconda distribution, issue the following command:
conda install sqlalchemy
Refer to your distribution of Python to check how to install new modules. Alternatively, check the previous recipe for instructions on how to install modules manually.
In addition, you might need to install the psycopg2 module. This can be accomplished with the following command:
conda install psycopg2
If, however, you are not using Anaconda, psycopg2 can be found at http://initd.org/psycopg/. Follow the installation instructions found on the website.
To execute the script in this recipe, you will also need a PostgreSQL database running locally. Go to http://www.postgresql.org/download/ and follow the installation instructions for your operating system. We assume that you have your PostgreSQL database installed and running before you proceed, and that it can be accessed at localhost:5432. On a Unix-like system, you can check the port registered for PostgreSQL by issuing the following command in the terminal:
cat /etc/services | grep postgre
No other prerequisites are required.
How to do it…
pandas works hand in hand with SQLAlchemy to make storing data in, and retrieving it from, many databases very easy. Reading the data and storing it in the database can be achieved with the following script (the store_postgresql.py file):
import pandas as pd
import sqlalchemy as sa

# name of the CSV file to read from
r_filenameCSV = '../../Data/Chapter01/realEstate_trans.csv'

# database credentials
usr  = 'drabast'
pswd = 'pAck7!B0ok'

# create the connection to the database
engine = sa.create_engine(
    'postgresql://{0}:{1}@localhost:5432/{0}'.format(usr, pswd)
)

# read the data
csv_read = pd.read_csv(r_filenameCSV)

# transform sale_date to a datetime object
csv_read['sale_date'] = pd.to_datetime(csv_read['sale_date'])

# store the data in the database
csv_read.to_sql('real_estate', engine, if_exists='replace')
How it works…
First, we import all the necessary modules: pandas and SQLAlchemy. We also store the name of the CSV file we will be reading the data from in a variable, and we specify the credentials used to connect to our PostgreSQL database. Then, using SQLAlchemy's create_engine(...) method, we create an object that allows us to access the PostgreSQL database. The syntax of the connection string can be broken down as follows:
sa.create_engine('postgresql://<user_name>:<password>@<server>:<port>/<database>')
Here, <user_name> is the username allowed to log in to <database> using <password>. The user needs (at a minimum) the CREATE, DROP, INSERT, and SELECT privileges on the specified database. The <server> tag can be either the IP address of the server running the PostgreSQL database or (as in our case) its name (localhost). The <port> specifies the server port the database listens on.
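To see how SQLAlchemy decomposes such a string, you can parse it with the make_url(...) helper; this is a sketch that uses the placeholder credentials from the script above and makes no connection to any server:

```python
from sqlalchemy.engine.url import make_url

# make_url(...) only parses the connection string, so it works without a
# running database; the credentials are the recipe's placeholders
url = make_url('postgresql://drabast:pAck7!B0ok@localhost:5432/drabast')

print(url.drivername)  # postgresql
print(url.username)    # drabast
print(url.host)        # localhost
print(url.port)        # 5432
print(url.database)    # drabast
```

This can be a handy way to debug a connection string before handing it to create_engine(...).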
Next, we read in the data from the CSV file and convert the sale_date column to a datetime object so that the data is stored in a date format in the database. The read_csv(...) method normally tries to infer the proper type of the data read from a file, but dates can get really tricky, so here we convert the sale_date column explicitly with the to_datetime(...) method.
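When inference is not reliable, you can also pass an explicit format to to_datetime(...). The sketch below uses a small hypothetical series of date strings (not values taken from realEstate_trans.csv) to show the idea:

```python
import pandas as pd

# hypothetical stand-in values for a date column
sale_date = pd.Series(['May 21 2008', 'May 22 2008'])

# without format=, to_datetime(...) has to infer the layout; passing it
# makes the parse explicit and fails loudly on malformed values
parsed = pd.to_datetime(sale_date, format='%b %d %Y')

print(parsed.dt.year.tolist())  # [2008, 2008]
print(parsed.dt.day.tolist())   # [21, 22]
```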
The last line of the script stores the information in the database. The to_sql(...) method specifies the name of the table ('real_estate') and the connector (engine) to be used. The last parameter, if_exists='replace', instructs the method to replace the table if it already exists in the database.
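The if_exists parameter also accepts 'fail' (the default, which raises an error if the table exists) and 'append'. The following sketch demonstrates the difference; it uses an in-memory SQLite engine and a tiny hypothetical DataFrame so that it runs without a PostgreSQL server:

```python
import pandas as pd
import sqlalchemy as sa

# in-memory SQLite engine, so this sketch needs no PostgreSQL server
engine = sa.create_engine('sqlite://')

# a hypothetical two-row stand-in for the real estate data
df = pd.DataFrame({'city': ['SACRAMENTO', 'LINCOLN'], 'beds': [2, 3]})

# 'replace' drops and recreates the table; 'append' adds rows to it
df.to_sql('real_estate', engine, if_exists='replace', index=False)
df.to_sql('real_estate', engine, if_exists='append', index=False)

rows = pd.read_sql_query('SELECT COUNT(*) AS n FROM real_estate', engine)
print(rows['n'][0])  # 4
```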
You can check whether the data has loaded properly using the following command:
query = 'SELECT * FROM real_estate LIMIT 10'
top10 = pd.read_sql_query(query, engine)
print(top10)
We first specify a valid SQL query and then use the read_sql_query(...) method to execute it.
There's more…
SQLite is arguably the most widespread database in the world: it can be found in phones, TV sets, and cars, among other places. SQLite is very lightweight and requires no server to run. It can either store the data on disk or keep it temporarily in your computer's memory. The latter option is useful when speed is required, but the data disappears as soon as your script finishes.
With SQLAlchemy, it is also extremely easy to talk to an SQLite database. The only change required in the preceding example is how we construct the engine (the store_SQLite.py file):
# name of the SQLite database
rw_filenameSQLite = '../../Data/Chapter01/realEstate_trans.db'

# create the connection to the database
engine = sa.create_engine(
    'sqlite:///{0}'.format(rw_filenameSQLite)
)
As you can see, since SQLite databases are serverless, the only required parameter is where to store the database file itself.
Tip
If, instead of storing the database in a file, you would like to keep your data in the computer's memory, use sqlite:// as the connection string.
Note
Note the three slashes in the path; this is to help the innards of the create_engine(...) method. At the most general level, the connection string follows this pattern:
<database_type>://<server_information>/<database>
As SQLite databases do not require a server, <server_information> is empty, hence the three slashes.
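The two variants can be sketched in a few lines; this example uses the in-memory form and a small hypothetical DataFrame, and the round trip works without any server or file on disk:

```python
import pandas as pd
import sqlalchemy as sa

# 'sqlite://' (nothing after the two slashes) keeps the whole database in
# memory; 'sqlite:///realEstate_trans.db' would store it in a file instead
engine = sa.create_engine('sqlite://')

# hypothetical sample data
df = pd.DataFrame({'beds': [2, 3, 4]})
df.to_sql('real_estate', engine, index=False)

# the table is readable for as long as the engine lives; it vanishes,
# along with the rest of the database, when the script ends
back = pd.read_sql_query('SELECT * FROM real_estate', engine)
print(back['beds'].tolist())  # [2, 3, 4]
```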
See also
I highly recommend checking out the documentation for SQLAlchemy, as it is a very powerful middleman between your code and various databases; the documentation can be found at http://docs.sqlalchemy.org/en/rel_1_0/index.html.