Importing data from SQL databases
In this recipe, we will use the pymssql and mysql.connector APIs to read data from Microsoft SQL Server and MySQL (now owned by Oracle) databases, respectively. Data from sources such as these tends to be well structured, since it is designed to facilitate simultaneous transactions by members of organizations and those who interact with them. Each transaction is also likely related to some other organizational transaction.
This means that although data tables from enterprise systems such as these are more reliably structured than data from CSV files and Excel files, their logic is less likely to be self-contained. You need to know how the data from one table relates to data from another table to understand its full meaning. These relationships need to be preserved, including the integrity of primary and foreign keys, when pulling data. Moreover, well-structured data tables are not necessarily uncomplicated data tables. There are often sophisticated coding schemes that determine data values, and these coding schemes can change over time. For example, codes for merchandise at a retail store chain might be different in 1998 than they are in 2024. Similarly, there are often codes for missing values, such as 99,999, that pandas will read as valid values.
Since much of this logic is business logic, implemented in stored procedures or other applications, it is lost when the data is pulled out of this larger system. Some of what is lost will eventually have to be reconstructed when preparing data for analysis. This almost always involves combining data from multiple tables, so it is important to preserve the ability to do that. However, it also may involve adding some of the coding logic back after loading the SQL table into a pandas DataFrame. We explore how to do that in this recipe.
Getting ready
This recipe assumes you have the pymssql and mysql.connector APIs installed. If you do not, it is relatively straightforward to install them with pip. From the Terminal, or PowerShell (in Windows), enter pip install pymssql and pip install mysql-connector-python. We will work with data on educational attainment in this recipe.
Data note
The dataset used in this recipe is available for public use at https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip.
How to do it...
We import SQL Server and MySQL data tables into a pandas DataFrame, as follows:
- Import pandas, numpy, pymssql, and mysql.connector.
This step assumes that you have installed the pymssql and mysql.connector APIs:
import pandas as pd
import numpy as np
import pymssql
import mysql.connector
- Use the pymssql API and read_sql to retrieve and load data from a SQL Server instance.
Select the columns we want from the SQL Server data, and use SQL aliases to improve column names (for example, fedu AS fathereducation). Create a connection to the SQL Server data by passing database credentials to the pymssql connect function. Create a pandas DataFrame by passing the SELECT statement and connection object to read_sql. Use close to return the connection to the pool on the server:
sqlselect = "SELECT studentid, school, sex, age, famsize,\
... medu AS mothereducation, fedu AS fathereducation,\
... traveltime, studytime, failures, famrel, freetime,\
... goout, g1 AS gradeperiod1, g2 AS gradeperiod2,\
... g3 AS gradeperiod3 FROM studentmath"
server = "pdcc.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdcctest"
conn = pymssql.connect(server=server,
... user=user, password=password, database=database)
studentmath = pd.read_sql(sqlselect,conn)
conn.close()
Note
Although tools such as pymssql make connecting to a SQL Server instance relatively straightforward, the syntax might still take a little time to get used to if you have not worked with it before. The previous step shows the parameter values you will typically need to pass to a connection object – the name of the server, the name of a user with credentials on the server, the password for that user, and the name of a SQL database on the server.
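If you would rather pass read_sql a SQLAlchemy connectable than a raw pymssql connection – newer versions of pandas may warn about the latter – the following is a minimal sketch, assuming SQLAlchemy is installed and reusing the sqlselect string and credentials from the previous step; it is an alternative, not part of the recipe:
from sqlalchemy import create_engine

# build a SQLAlchemy engine using the pymssql dialect
engine = create_engine(
    "mssql+pymssql://pdccuser:pdccpass@"
    "pdcc.c9sqqzd5fulv.us-west-2.rds.amazonaws.com/pdcctest")

# read_sql manages the connection itself when given an engine
studentmath = pd.read_sql(sqlselect, engine)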
- Check the data types and the first few rows:
studentmath.dtypes
studentid          object
school             object
sex                object
age                 int64
famsize            object
mothereducation     int64
fathereducation     int64
traveltime          int64
studytime           int64
failures            int64
famrel              int64
freetime            int64
goout               int64
gradeperiod1        int64
gradeperiod2        int64
gradeperiod3        int64
dtype: object
studentmath.head()
  studentid school  ...  gradeperiod2  gradeperiod3
0       001     GP  ...             6             6
1       002     GP  ...             5             6
2       003     GP  ...             8            10
3       004     GP  ...            14            15
4       005     GP  ...            10            10

[5 rows x 16 columns]
- Connecting to a MySQL server is not very different from connecting to a SQL Server instance. We can use mysql.connector's connect function to do that and then use read_sql to load the data.
Create a connection to the MySQL data, pass that connection to read_sql to retrieve the data, and load it into a pandas DataFrame (the same data file on student math scores was uploaded to SQL Server and MySQL, so we can use the same SQL SELECT statement we used in the previous step):
host = "pdccmysql.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdccschema"
connmysql = mysql.connector.connect(host=host, \
... database=database,user=user,password=password)
studentmath = pd.read_sql(sqlselect,connmysql)
connmysql.close()
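The recipe closes the connection explicitly. If you want the connection released even when the query raises an error, a small sketch using try/finally – not part of the original steps – is one option:
connmysql = mysql.connector.connect(host=host,
    database=database, user=user, password=password)
try:
    # the connection is closed whether or not read_sql succeeds
    studentmath = pd.read_sql(sqlselect, connmysql)
finally:
    connmysql.close()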
- Rearrange the columns, set an index, and check for missing values.
Move the grade data to the left of the DataFrame, just after studentid. Also, move the freetime column to the right, after traveltime and studytime. Confirm that each row has an ID and that the IDs are unique, and set studentid as the index (an alternative uniqueness check is sketched after the code):
newcolorder = ['studentid', 'gradeperiod1',
... 'gradeperiod2','gradeperiod3', 'school',
... 'sex', 'age', 'famsize','mothereducation',
... 'fathereducation', 'traveltime',
... 'studytime', 'freetime', 'failures',
... 'famrel','goout']
studentmath = studentmath[newcolorder]
studentmath.studentid.count()
395
studentmath.studentid.nunique()
395
studentmath.set_index('studentid', inplace=True)
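As an aside, set_index can enforce the uniqueness check itself if you pass verify_integrity=True. This sketch is an alternative to the count and nunique checks above, not an extra step to run:
# raises a ValueError if studentid contains duplicate values
studentmath.set_index('studentid', inplace=True, verify_integrity=True)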
- Use the DataFrame’s count function to check for missing values:
studentmath.count()
gradeperiod1       395
gradeperiod2       395
gradeperiod3       395
school             395
sex                395
age                395
famsize            395
mothereducation    395
fathereducation    395
traveltime         395
studytime          395
freetime           395
failures           395
famrel             395
goout              395
dtype: int64
- Replace coded data values with more informative values.
Create a dictionary with the replacement values for the columns, and then use replace to set those values (a quick check of the result is sketched after the code):
setvalues= \
... {"famrel":{1:"1:very bad",2:"2:bad",
... 3:"3:neutral",4:"4:good",5:"5:excellent"},
... "freetime":{1:"1:very low",2:"2:low",
... 3:"3:neutral",4:"4:high",5:"5:very high"},
... "goout":{1:"1:very low",2:"2:low",3:"3:neutral",
... 4:"4:high",5:"5:very high"},
... "mothereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
... 3:"3:secondary ed",4:"4:higher ed"},
... "fathereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
... 3:"3:secondary ed",4:"4:higher ed"}}
studentmath.replace(setvalues, inplace=True)
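As a quick sanity check of the result – not part of the original recipe – you can list the distinct values now present in each recoded column to confirm that the numeric codes are gone:
# print the distinct values remaining in each recoded column
for col in setvalues:
    print(col, studentmath[col].unique())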
- Change the type for columns with the changed data to category.
Check any changes in memory usage (an equivalent one-line conversion is sketched after the output):
setvalueskeys = [k for k in setvalues]
studentmath[setvalueskeys].memory_usage(index=False)
famrel 3160
freetime 3160
goout 3160
mothereducation 3160
fathereducation 3160
dtype: int64
for col in studentmath[setvalueskeys].columns:
... studentmath[col] = studentmath[col]. \
... astype('category')
...
studentmath[setvalueskeys].memory_usage(index=False)
famrel 607
freetime 607
goout 607
mothereducation 599
fathereducation 599
dtype: int64
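The loop above can also be written as a single astype call that takes a column-to-dtype dictionary. This is an equivalent sketch, not a change to the recipe:
# convert all of the recoded columns to category in one call
studentmath = studentmath.astype({col: 'category' for col in setvalueskeys})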
- Calculate percentages for values in the famrel column.
Run value_counts, and set normalize to True to generate percentages:
studentmath['famrel'].value_counts(sort=False, normalize=True)
1:very bad 0.02
2:bad 0.05
3:neutral 0.17
4:good 0.49
5:excellent 0.27
Name: famrel, dtype: float64
- Use apply to calculate percentages for multiple columns (a variation that includes missing values is sketched after the output):
studentmath[['freetime','goout']].\
... apply(pd.Series.value_counts, sort=False,
... normalize=True)
             freetime  goout
1:very low       0.05   0.06
2:low            0.16   0.26
3:neutral        0.40   0.33
4:high           0.29   0.22
5:very high      0.10   0.13
studentmath[['mothereducation','fathereducation']].\
... apply(pd.Series.value_counts, sort=False,
... normalize=True)
                mothereducation  fathereducation
1:k-4                      0.15             0.21
2:5-9                      0.26             0.29
3:secondary ed             0.25             0.25
4:higher ed                0.33             0.24
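Because the 0 codes for mothereducation and fathereducation were recoded to missing values in Step 7, value_counts leaves them out of these percentages by default. A sketch of a variation that includes them, using the dropna parameter:
# include the share of missing values in the relative frequencies
studentmath[['mothereducation','fathereducation']].apply(
    pd.Series.value_counts, sort=False, normalize=True, dropna=False)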
The preceding steps retrieved a data table from a SQL database, loaded that data into pandas, and did some initial data checking and cleaning.
How it works…
Since data from enterprise systems is typically better structured than CSV or Excel files, we do not need to do things such as skip rows or deal with different logical data types in a column. However, some massaging is still usually required before we can begin exploratory analysis. There are often more columns than we need, some column names are not intuitive, and the columns may not be ordered in the best way for analysis. The meaning of many data values is not stored in the data table, in order to avoid entry errors and save on storage space. For example, 3 is stored for mother’s education rather than secondary education. It is a good idea to reconstruct that coding as early in the cleaning process as possible.
To pull data from a SQL database server, we need a connection object to authenticate us on the server, as well as a SQL SELECT string. These can be passed to read_sql to retrieve the data and load it into a pandas DataFrame. I usually use the SQL SELECT statement to do a bit of cleanup of column names at this point. I sometimes also reorder columns, but I did that later in this recipe.
We set the index in Step 5, first confirming that every row has a value for studentid and that it is unique. This is often more important when working with enterprise data because we will almost always need to merge the retrieved data with other data files on the system. Although an index is not required for this merging, the discipline of setting one prepares us for the tricky business of merging data further down the road. It will also likely improve the speed of the merge.
We use the DataFrame’s count function to check for missing values, and there are none – the count of non-missing values is 395 (the number of rows) for every column. This is almost too good to be true. There may be values that are logically missing – that is, valid numbers that nonetheless connote missing values, such as -1, 0, 9, or 99. We address this possibility in the next step.
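One way to screen for such codes before doing any recoding is to count how often a few suspicious sentinel values appear in each column. The sentinel list here is purely illustrative:
# count occurrences of common sentinel codes in each column
sentinels = [-1, 0, 9, 99, 99999]
studentmath.isin(sentinels).sum()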
Step 7 demonstrates a useful technique for replacing data values for multiple columns. We create a dictionary to map original values to new values for each column, and then pass that dictionary to replace. To reduce the amount of storage space taken up by the new, more verbose values, we convert the data type of those columns to category. We do this by generating a list of the keys of our setvalues dictionary – setvalueskeys = [k for k in setvalues] generates [famrel, freetime, goout, mothereducation, fathereducation]. We then iterate over those five columns and use the astype method to change the data type to category. Notice that the memory usage for those columns is reduced substantially.
Finally, we check the assignment of new values by using value_counts to view relative frequencies. We use apply because we want to run value_counts on multiple columns. To prevent value_counts from sorting by frequency, we set sort to False.
The DataFrame replace method is also a handy tool for dealing with logical missing values that will not be recognized as missing when retrieved by read_sql. The 0 values for mothereducation and fathereducation seem to fall into that category. We fix this problem in the setvalues dictionary by indicating that the 0 values for mothereducation and fathereducation should be replaced with NaN. It is important to address these kinds of missing values shortly after the initial import because they are not always obvious and can significantly impact all subsequent work.
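A quick way to confirm that those 0 values became missing after the replace – again, just a sanity check rather than part of the recipe – is to count the nulls in the two columns:
# count the missing values created by recoding 0 to NaN
studentmath[['mothereducation','fathereducation']].isnull().sum()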
Users of packages such as SPSS, SAS, and R will notice the difference between this approach and value labels in SPSS and R, as well as proc format in SAS. In pandas, we need to change the actual data to get more informative values. However, we reduce how much data is actually stored by giving the column a category data type. This is similar to factors in R.
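If you want to see what pandas actually stores for a category column – the labels plus compact integer codes, much like an R factor – here is a brief sketch using the cat accessor:
# the category labels and the underlying integer codes pandas stores
studentmath['famrel'].cat.categories
studentmath['famrel'].cat.codes.head()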
There’s more…
I moved the grade data to near the beginning of the DataFrame. I find it helpful to have potential target or dependent variables in the leftmost columns, keeping them at the forefront of your mind. It is also helpful to keep similar columns together. In this example, personal demographic variables (sex and age) are next to one another, as are family variables (mothereducation and fathereducation) and the columns describing how students spend their time (traveltime, studytime, and freetime).
You could have used map instead of replace in Step 7. Prior to version 0.19.2 of pandas, map was significantly more efficient. Since then, the difference in efficiency has been much smaller. If you are working with a very large dataset, the difference may still be enough to consider using map.
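A sketch of what the map alternative might have looked like for a single column in Step 7, before any recoding; note that, unlike replace, map turns any value missing from the dictionary into NaN:
# map the numeric famrel codes to labels; unmapped values become NaN
studentmath['famrel'] = studentmath['famrel'].map(setvalues['famrel'])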
See also
The recipes in Chapter 10, Addressing Data Issues When Combining DataFrames, go into detail on merging data. We will take a closer look at bivariate and multivariate relationships between variables in Chapter 4, Identifying Outliers in Subsets of Data. We will demonstrate how to use some of these same approaches in packages such as SPSS, SAS, and R in subsequent recipes in this chapter.