Storing and retrieving from a relational database
The relational database model was invented at IBM in 1970 and went on to dominate data storage and analytics for decades. The model is still widely used, although it has been losing ground to increasingly popular alternatives such as Hadoop. Nevertheless, relational databases are nowhere near their demise, and they remain the tool of choice for many applications.
Getting ready
In order to execute the following recipe, you need the pandas and SQLAlchemy modules installed. SQLAlchemy abstracts the database interactions between Python scripts and a number of relational databases; effectively, you do not have to remember the specifics of each database's syntax, as SQLAlchemy handles that for you.
If you are using the Anaconda distribution, issue the following command:
conda install sqlalchemy
Refer to your distribution of Python to check how to install new modules. Alternatively, check the previous recipe for instructions on how to install modules manually.
In addition, you might need to install the psycopg2 module. This can be accomplished with the following command:
conda install psycopg2
If, however, you are not using Anaconda, psycopg2 can be found at http://initd.org/psycopg/. Follow the installation instructions found on the website.
To execute the script in this recipe, you will also need a PostgreSQL database running locally. Go to http://www.postgresql.org/download/ and follow the installation instructions for your operating system. We assume that you have your PostgreSQL database installed and running before you proceed, and that it can be accessed at localhost:5432. On a Unix-like system, you can check the port registered for PostgreSQL by issuing the following command in the terminal:
cat /etc/services | grep postgre
No other prerequisites are required.
How to do it…
pandas works hand in hand with SQLAlchemy to make storing data in, and retrieving it from, many databases very easy. Reading the data and storing it in the database can be achieved with the following script (the store_postgresql.py file):
import pandas as pd
import sqlalchemy as sa

# name of the CSV file to read from
r_filenameCSV = '../../Data/Chapter01/realEstate_trans.csv'

# database credentials
usr  = 'drabast'
pswd = 'pAck7!B0ok'

# create the connection to the database
engine = sa.create_engine(
    'postgresql://{0}:{1}@localhost:5432/{0}'.format(usr, pswd)
)

# read the data
csv_read = pd.read_csv(r_filenameCSV)

# transform sale_date to a datetime object
csv_read['sale_date'] = pd.to_datetime(csv_read['sale_date'])

# store the data in the database
csv_read.to_sql('real_estate', engine, if_exists='replace')
How it works…
First, we import all the necessary modules: pandas and SQLAlchemy. We also store the name of the CSV file we will be reading the data from in a variable, and we specify the credentials used to connect to our PostgreSQL database. Then, using SQLAlchemy's create_engine(...) method, we create an object that allows us to access the PostgreSQL database. The syntax of the connection string can be broken down as follows:
sa.create_engine('postgresql://<user_name>:<password>@<server>:<port>/<database>')
Here, <user_name> is the username allowed to log in to <database> using <password>. The user needs (at a minimum) the CREATE, DROP, INSERT, and SELECT privileges on the specified database. The <server> tag can be either the IP address of the server running the PostgreSQL database or (as in our case) its name (localhost). The <port> specifies the server port the database listens on.
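To see how SQLAlchemy decomposes such a string, you can parse it with the make_url(...) helper; this is a sketch that uses the placeholder credentials from the script above and makes no connection to any server:

```python
from sqlalchemy.engine.url import make_url

# make_url(...) only parses the connection string, so it works without a
# running database; the credentials are the recipe's placeholders
url = make_url('postgresql://drabast:pAck7!B0ok@localhost:5432/drabast')

print(url.drivername)  # postgresql
print(url.username)    # drabast
print(url.host)        # localhost
print(url.port)        # 5432
print(url.database)    # drabast
```

This can be a handy way to debug a connection string before handing it to create_engine(...).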
Next, we read in the data from the CSV file and convert the sale_date column to a datetime object so that the data is stored in a date format in the database. The read_csv(...) method normally tries to infer the proper type of the data read from a file, but dates can get really tricky, so here we convert the sale_date column explicitly with the to_datetime(...) method.
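When inference is not reliable, you can also pass an explicit format to to_datetime(...). The sketch below uses a small hypothetical series of date strings (not values taken from realEstate_trans.csv) to show the idea:

```python
import pandas as pd

# hypothetical stand-in values for a date column
sale_date = pd.Series(['May 21 2008', 'May 22 2008'])

# without format=, to_datetime(...) has to infer the layout; passing it
# makes the parse explicit and fails loudly on malformed values
parsed = pd.to_datetime(sale_date, format='%b %d %Y')

print(parsed.dt.year.tolist())  # [2008, 2008]
print(parsed.dt.day.tolist())   # [21, 22]
```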
The last line of the script stores the information in the database. The to_sql(...) method specifies the name of the table ('real_estate') and the connector (engine) to be used. The last parameter, if_exists='replace', instructs the method to replace the table if it already exists in the database.
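The if_exists parameter also accepts 'fail' (the default, which raises an error if the table exists) and 'append'. The following sketch demonstrates the difference; it uses an in-memory SQLite engine and a tiny hypothetical DataFrame so that it runs without a PostgreSQL server:

```python
import pandas as pd
import sqlalchemy as sa

# in-memory SQLite engine, so this sketch needs no PostgreSQL server
engine = sa.create_engine('sqlite://')

# a hypothetical two-row stand-in for the real estate data
df = pd.DataFrame({'city': ['SACRAMENTO', 'LINCOLN'], 'beds': [2, 3]})

# 'replace' drops and recreates the table; 'append' adds rows to it
df.to_sql('real_estate', engine, if_exists='replace', index=False)
df.to_sql('real_estate', engine, if_exists='append', index=False)

rows = pd.read_sql_query('SELECT COUNT(*) AS n FROM real_estate', engine)
print(rows['n'][0])  # 4
```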
You can check whether the data has loaded properly using the following command:
query = 'SELECT * FROM real_estate LIMIT 10'
top10 = pd.read_sql_query(query, engine)
print(top10)
We first specify a valid SQL query and then use the read_sql_query(...) method to execute it.
There's more…
SQLite is arguably the most widespread database in the world: it can be found in phones, TV sets, and cars, among other places. SQLite is very lightweight and requires no server to run. It can either store the data on disk or keep it temporarily in your computer's memory. The latter option is useful when speed is required, but the data disappears as soon as your script finishes.
With SQLAlchemy, it is also extremely easy to talk to an SQLite database. The only change required in the preceding example is how we construct the engine (the store_SQLite.py file):
# name of the SQLite database
rw_filenameSQLite = '../../Data/Chapter01/realEstate_trans.db'

# create the connection to the database
engine = sa.create_engine(
    'sqlite:///{0}'.format(rw_filenameSQLite)
)
As you can see, since SQLite databases are serverless, the only required parameter is where to store the database file itself.
Tip
If, instead of storing the database in a file, you would like to keep your data in the computer's memory, use sqlite:// as the connection string.
Note
Note the three slashes in the path; this is to help the innards of the create_engine(...) method. At the most general level, the connection string follows this pattern:
<database_type>://<server_information>/<database>
As SQLite databases do not require a server, <server_information> is empty, hence the three slashes.
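The two variants can be sketched in a few lines; this example uses the in-memory form and a small hypothetical DataFrame, and the round trip works without any server or file on disk:

```python
import pandas as pd
import sqlalchemy as sa

# 'sqlite://' (nothing after the two slashes) keeps the whole database in
# memory; 'sqlite:///realEstate_trans.db' would store it in a file instead
engine = sa.create_engine('sqlite://')

# hypothetical sample data
df = pd.DataFrame({'beds': [2, 3, 4]})
df.to_sql('real_estate', engine, index=False)

# the table is readable for as long as the engine lives; it vanishes,
# along with the rest of the database, when the script ends
back = pd.read_sql_query('SELECT * FROM real_estate', engine)
print(back['beds'].tolist())  # [2, 3, 4]
```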
See also
I highly recommend checking out the documentation for SQLAlchemy, as it is a very powerful middleman between your code and various databases; the documentation can be found at http://docs.sqlalchemy.org/en/rel_1_0/index.html.