Python Data Cleaning Cookbook

Importing data from SQL databases

In this recipe, we will use the pymssql and mysql.connector APIs to read data from Microsoft SQL Server and MySQL (now owned by Oracle) databases, respectively. Data from sources such as these tends to be well structured, since it is designed to facilitate simultaneous transactions by members of an organization and those who interact with it. Each transaction is also likely related to some other organizational transaction.

This means that although data tables from enterprise systems such as these are more reliably structured than data from CSV and Excel files, their logic is less likely to be self-contained. You need to know how the data from one table relates to data from another table to understand its full meaning. These relationships need to be preserved, including the integrity of primary and foreign keys, when pulling data. Moreover, well-structured data tables are not necessarily uncomplicated data tables. There are often sophisticated coding schemes that determine data values, and these coding schemes can change over time. For example, codes for merchandise at a retail store chain might be different in 1998 than they are in 2024. Similarly, there are frequently codes for missing values, such as 99999, that pandas will treat as valid values.

Since much of this logic is business logic, and implemented in stored procedures or other applications, it is lost when pulled out of this larger system. Some of what is lost will eventually have to be reconstructed when preparing data for analysis. This almost always involves combining data from multiple tables, so it is important to preserve the ability to do that. However, it also may involve adding some of the coding logic back after loading the SQL table into a pandas DataFrame. We explore how to do that in this recipe.
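
As a concrete illustration of the sentinel-code problem, here is a minimal sketch (using a hypothetical wages column, not part of this recipe's data) of how a code such as 99999 slips through as an ordinary number, and how replace can recode it as missing:

import numpy as np
import pandas as pd

# a hypothetical numeric column that uses 99999 as its missing-value code
wages = pd.Series([52000, 61000, 99999, 48000])
wages.mean()                           # 65249.75 -- the sentinel inflates the mean
wages.replace(99999, np.nan).mean()    # 53666.67 once the code is treated as missing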

Getting ready

This recipe assumes you have the pymssql and mysql-connector-python APIs installed. If you do not, it is relatively straightforward to install them with pip. From the terminal, or PowerShell on Windows, enter pip install pymssql or pip install mysql-connector-python. We will work with data on educational attainment in this recipe.

Data note

The dataset used in this recipe is available for public use at https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip.

How to do it...

We import SQL Server and MySQL data tables into a pandas DataFrame, as follows:

  1. Import pandas, numpy, pymssql, and mysql.connector.

This step assumes that you have installed the pymssql and mysql-connector-python packages:

import pandas as pd
import numpy as np
import pymssql
import mysql.connector
  2. Use the pymssql API and read_sql to retrieve and load data from a SQL Server instance.

Select the columns we want from the SQL Server data, and use SQL aliases to improve column names (for example, fedu AS fathereducation). Create a connection to the SQL Server data by passing database credentials to the pymssql connect function. Create a pandas DataFrame by passing the SELECT statement and connection object to read_sql. Use close to return the connection to the pool on the server:

sqlselect = "SELECT studentid, school, sex, age, famsize,\
...   medu AS mothereducation, fedu AS fathereducation,\
...   traveltime, studytime, failures, famrel, freetime,\
...   goout, g1 AS gradeperiod1, g2 AS gradeperiod2,\
...   g3 AS gradeperiod3 FROM studentmath"
server = "pdcc.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdcctest"
conn = pymssql.connect(server=server,
...   user=user, password=password, database=database)
studentmath = pd.read_sql(sqlselect,conn)
conn.close()

Note

Although tools such as pymssql make connecting to a SQL Server instance relatively straightforward, the syntax still might take a little time to get used to if it is unfamiliar. The previous step shows the parameter values you will typically need to pass to a connection object – the name of the server, the name of a user with credentials on the server, the password for that user, and the name of a SQL database on the server.
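
If you prefer, read_sql also accepts a SQLAlchemy connectable, which is what more recent versions of pandas expect for databases other than SQLite. Here is a minimal sketch of the same query using a SQLAlchemy engine; it reuses the credential variables defined above and assumes the sqlalchemy package is installed:

from sqlalchemy import create_engine

# mssql+pymssql tells SQLAlchemy to use the pymssql driver under the hood
engine = create_engine(f"mssql+pymssql://{user}:{password}@{server}/{database}")
studentmath = pd.read_sql(sqlselect, engine)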

  3. Check the data types and the first few rows:
    studentmath.dtypes
    
    studentid          object
    school             object
    sex                object
    age                int64
    famsize            object
    mothereducation    int64
    fathereducation    int64
    traveltime         int64
    studytime          int64
    failures           int64
    famrel             int64
    freetime           int64
    goout              int64
    gradeperiod1       int64
    gradeperiod2       int64
    gradeperiod3       int64
    dtype: object
    
    studentmath.head()
    
        studentid    school  ...      gradeperiod2    gradeperiod3
    0	001	    GP      ...	6	        6
    1	002	    GP      ...	5	        6
    2	003	    GP      ...	8	        10
    3	004	    GP      ...	14	        15
    4	005	    GP      ...	10	        10
    [5 rows x 16 columns]
    
  4. Connecting to a MySQL server is not very different from connecting to a SQL Server instance. We can use the connect method of mysql.connector to do that, and then use read_sql to load the data.

Create a connection to the MySQL data, pass that connection to read_sql to retrieve the data, and load it into a pandas DataFrame (the same data file on student math scores was uploaded to SQL Server and MySQL, so we can use the same SQL SELECT statement we used in the previous step):

host = "pdccmysql.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdccschema"
connmysql = mysql.connector.connect(host=host, \
...   database=database,user=user,password=password)
studentmath = pd.read_sql(sqlselect,connmysql)
connmysql.close()
  5. Rearrange the columns, set an index, and check for missing values.

Move the grade data to the left of the DataFrame, just after studentid. Also, move the freetime column so that it follows traveltime and studytime. Confirm that each row has an ID and that the IDs are unique, and set studentid as the index:

newcolorder = ['studentid', 'gradeperiod1',
...   'gradeperiod2','gradeperiod3', 'school',
...   'sex', 'age', 'famsize','mothereducation',
...   'fathereducation', 'traveltime',
...   'studytime', 'freetime', 'failures',
...   'famrel','goout']
studentmath = studentmath[newcolorder]
studentmath.studentid.count()
395
studentmath.studentid.nunique()
395
studentmath.set_index('studentid', inplace=True)
  6. Use the DataFrame’s count function to check for missing values:
    studentmath.count()
    
    gradeperiod1		395
    gradeperiod2		395
    gradeperiod3		395
    school		395
    sex			395
    age			395
    famsize		395
    mothereducation	395
    fathereducation	395
    traveltime		395
    studytime		395
    freetime		395
    failures		395
    famrel		395
    goout			395
    dtype: int64
    
  7. Replace coded data values with more informative values.

Create a dictionary with the replacement values for the columns, and then use replace to set those values:

setvalues= \
...   {"famrel":{1:"1:very bad",2:"2:bad",
...     3:"3:neutral",4:"4:good",5:"5:excellent"},
...   "freetime":{1:"1:very low",2:"2:low",
...     3:"3:neutral",4:"4:high",5:"5:very high"},
...   "goout":{1:"1:very low",2:"2:low",3:"3:neutral",
...     4:"4:high",5:"5:very high"},
...   "mothereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
...     3:"3:secondary ed",4:"4:higher ed"},
...   "fathereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
...     3:"3:secondary ed",4:"4:higher ed"}}
studentmath.replace(setvalues, inplace=True)
  8. Change the type for columns with the changed data to category.

Check any changes in memory usage:

setvalueskeys = [k for k in setvalues]
studentmath[setvalueskeys].memory_usage(index=False)
famrel		3160
freetime		3160
goout			3160
mothereducation	3160
fathereducation	3160
dtype: int64
for col in studentmath[setvalueskeys].columns:
...   studentmath[col] = studentmath[col]. \
...     astype('category')
...
studentmath[setvalueskeys].memory_usage(index=False)
famrel		607
freetime		607
goout			607
mothereducation	599
fathereducation	599
dtype: int64
  9. Calculate percentages for values in the famrel column.

Run value_counts, and set normalize to True to generate percentages:

studentmath['famrel'].value_counts(sort=False, normalize=True)
1:very bad	0.02
2:bad		0.05
3:neutral	0.17
4:good	0.49
5:excellent	0.27
Name: famrel, dtype: float64
  10. Use apply to calculate percentages for multiple columns:
    studentmath[['freetime','goout']].\
    ...   apply(pd.Series.value_counts, sort=False,
    ...   normalize=True)
    
                 freetime   goout
    1:very low	0.05	    0.06
    2:low		0.16	    0.26
    3:neutral	0.40	    0.33
    4:high	0.29	    0.22
    5:very high	0.10	    0.13
    
    studentmath[['mothereducation','fathereducation']].\
    ...   apply(pd.Series.value_counts, sort=False,
    ...   normalize=True)
    
    			mothereducation	fathereducation
    1:k-4			0.15			0.21
    2:5-9			0.26			0.29
    3:secondary ed	0.25			0.25
    4:higher ed		0.33			0.24
    

The preceding steps retrieved a data table from a SQL database, loaded that data into pandas, and did some initial data checking and cleaning.

How it works…

Since data from enterprise systems is typically better structured than CSV or Excel files, we do not need to do things such as skip rows or deal with different logical data types in a column. However, some massaging is still usually required before we can begin exploratory analysis. There are often more columns than we need, and some column names are not intuitive or not ordered in the best way for analysis. The meaning of many data values is not stored in the data table itself; codes are used instead to avoid entry errors and save storage space. For example, a 3 is stored for mother’s education rather than secondary education. It is a good idea to reconstruct that coding as early in the cleaning process as possible.

To pull data from a SQL database server, we need a connection object to authenticate us on the server, as well as a SQL SELECT string. These can be passed to read_sql to retrieve the data and load it into a pandas DataFrame. I usually use the SQL SELECT statement to do a bit of cleanup of column names at this point. I sometimes also reorder columns, but I did that later in this recipe.

We set the index in Step 5, first confirming that every row has a value for studentid and that it is unique. This is often more important when working with enterprise data because we will almost always need to merge the retrieved data with other data files on the system. Although an index is not required for this merging, the discipline of setting one prepares us for the tricky business of merging data further down the road. It will also likely improve the speed of the merge.
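
As a purely hypothetical illustration, if we retrieved a second table that is also indexed by studentid (for example, the Portuguese-course scores that ship in the same UCI dataset), the shared index would let us combine the two with a simple join:

# hypothetical: studentport is another DataFrame with studentid as its index
studentall = studentmath.join(studentport, rsuffix='_port')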

We use the DataFrame’s count function to check for missing values and find that there are none – the count of non-missing values is 395 (the number of rows) for every column. This is almost too good to be true. There may be values that are logically missing – that is, valid numbers that nonetheless connote missing values, such as -1, 0, 9, or 99. We address this possibility in the next step.
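
A quick way to scan for codes like these, sketched here as it would be run before the recoding in Step 7, is to look at the range of each numeric column; in this data, the minimum of 0 in mothereducation and fathereducation is the kind of value worth questioning:

# inspect the range of every numeric column to spot suspicious codes
# such as -1, 0, 9, or 99 (run before the recoding in Step 7)
studentmath.select_dtypes('number').agg(['min', 'max'])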

Step 7 demonstrates a useful technique for replacing data values across multiple columns. We create a dictionary that maps the original values to new values for each column and then run it using replace. To reduce the amount of storage space taken up by the new, more verbose values, we convert the data type of those columns to category. We do this by generating a list of the keys of our setvalues dictionary – setvalueskeys = [k for k in setvalues] generates ['famrel', 'freetime', 'goout', 'mothereducation', 'fathereducation']. We then iterate over those five columns and use the astype method to change the data type to category. Notice that the memory usage for those columns is reduced substantially.

Finally, we check the assignment of new values by using value_counts to view relative frequencies. We use apply because we want to run value_counts on multiple columns. To prevent value_counts sorting by frequency, we set sort to False.

The DataFrame replace method is also a handy tool for dealing with logical missing values that will not be recognized as missing when retrieved by read_sql. The 0 values for mothereducation and fathereducation seem to fall into that category. We fix this problem in the setvalues dictionary by indicating that the 0 values for mothereducation and fathereducation should be replaced with NaN. It is important to address these kinds of missing values shortly after the initial import because they are not always obvious and can significantly impact all subsequent work.

Users of packages such as SPSS, SAS, and R will notice the difference between this approach and value labels in SPSS and R, as well as PROC FORMAT in SAS. In pandas, we need to change the actual data to get more informative values. However, we reduce how much data is actually stored by giving the column a category data type. This is similar to factors in R.

There’s more…

I moved the grade data to near the beginning of the DataFrame. I find it helpful to have potential target or dependent variables in the leftmost columns, to keep them at the forefront of my mind. It is also helpful to keep similar columns together. In this example, personal demographic variables (sex and age) are next to one another, as are family variables (mothereducation and fathereducation) and the columns describing how students spend their time (traveltime, studytime, and freetime).

You could have used map instead of replace in Step 7. Prior to version 0.19.2 of pandas, map was significantly more efficient. Since then, the difference in efficiency has been much smaller. If you are working with a very large dataset, the difference may still be enough to consider using map.
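
If you wanted to try map here, a minimal sketch, run in place of the replace call in Step 7 while the columns still hold their original integer codes, might look like this; note that map sets any value not found in its dictionary to NaN, which is exactly what we want for the 0 codes in the education columns:

# alternative to replace in Step 7: map each column's codes to labels
for col, mapping in setvalues.items():
    studentmath[col] = studentmath[col].map(mapping)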

See also

The recipes in Chapter 10, Addressing Data Issues When Combining DataFrames, go into detail on merging data. We will take a closer look at bivariate and multivariate relationships between variables in Chapter 4, Identifying Outliers in Subsets of Data. We will demonstrate how to use some of these same approaches in packages such as SPSS, SAS, and R in subsequent recipes in this chapter.
