Python Data Cleaning Cookbook

Importing data from SQL databases

In this recipe, we will use the pymssql and mysql.connector APIs to read data from Microsoft SQL Server and MySQL (now owned by Oracle) databases, respectively. Data from sources such as these tends to be well structured, since it is designed to facilitate simultaneous transactions by members of an organization and those who interact with it. Each transaction is also likely related to some other organizational transaction.

This means that although data tables from enterprise systems such as these are more reliably structured than data from CSV and Excel files, their logic is less likely to be self-contained. You need to know how the data from one table relates to data from another table to understand its full meaning. These relationships need to be preserved, including the integrity of primary and foreign keys, when pulling data. Moreover, well-structured data tables are not necessarily uncomplicated data tables. There are often sophisticated coding schemes that determine data values, and these coding schemes can change over time. For example, codes for merchandise at a retail store chain might be different in 1998 than they are in 2024. Similarly, there are frequently codes for missing values, such as 99999, that pandas will treat as valid values.

Since much of this logic is business logic, and implemented in stored procedures or other applications, it is lost when pulled out of this larger system. Some of what is lost will eventually have to be reconstructed when preparing data for analysis. This almost always involves combining data from multiple tables, so it is important to preserve the ability to do that. However, it also may involve adding some of the coding logic back after loading the SQL table into a pandas DataFrame. We explore how to do that in this recipe.
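
As a concrete illustration of the sentinel-code problem, here is a minimal sketch (using a hypothetical wages column, not part of this recipe's data) of how a code such as 99999 slips through as an ordinary number, and how replace can recode it as missing:

import numpy as np
import pandas as pd

# a hypothetical numeric column that uses 99999 as its missing-value code
wages = pd.Series([52000, 61000, 99999, 48000])
wages.mean()                           # 65249.75 -- the sentinel inflates the mean
wages.replace(99999, np.nan).mean()    # 53666.67 once the code is treated as missing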

Getting ready

This recipe assumes you have the pymssql and mysql-connector-python APIs installed. If you do not, it is relatively straightforward to install them with pip. From the terminal, or PowerShell on Windows, enter pip install pymssql or pip install mysql-connector-python. We will work with data on educational attainment in this recipe.

Data note

The dataset used in this recipe is available for public use at https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip.

How to do it...

We import SQL Server and MySQL data tables into a pandas DataFrame, as follows:

  1. Import pandas, numpy, pymssql, and mysql.connector.

This step assumes that you have installed the pymssql and mysql-connector-python packages:

import pandas as pd
import numpy as np
import pymssql
import mysql.connector
  2. Use the pymssql API and read_sql to retrieve and load data from a SQL Server instance.

Select the columns we want from the SQL Server data, and use SQL aliases to improve column names (for example, fedu AS fathereducation). Create a connection to the SQL Server data by passing database credentials to the pymssql connect function. Create a pandas DataFrame by passing the SELECT statement and connection object to read_sql. Use close to return the connection to the pool on the server:

sqlselect = "SELECT studentid, school, sex, age, famsize,\
...   medu AS mothereducation, fedu AS fathereducation,\
...   traveltime, studytime, failures, famrel, freetime,\
...   goout, g1 AS gradeperiod1, g2 AS gradeperiod2,\
...   g3 AS gradeperiod3 FROM studentmath"
server = "pdcc.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdcctest"
conn = pymssql.connect(server=server,
...   user=user, password=password, database=database)
studentmath = pd.read_sql(sqlselect,conn)
conn.close()

Note

Although tools such as pymssql make connecting to a SQL Server instance relatively straightforward, the syntax still might take a little time to get used to if it is unfamiliar. The previous step shows the parameter values you will typically need to pass to a connection object – the name of the server, the name of a user with credentials on the server, the password for that user, and the name of a SQL database on the server.
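
If you prefer, read_sql also accepts a SQLAlchemy connectable, which is what more recent versions of pandas expect for databases other than SQLite. Here is a minimal sketch of the same query using a SQLAlchemy engine; it reuses the credential variables defined above and assumes the sqlalchemy package is installed:

from sqlalchemy import create_engine

# mssql+pymssql tells SQLAlchemy to use the pymssql driver under the hood
engine = create_engine(f"mssql+pymssql://{user}:{password}@{server}/{database}")
studentmath = pd.read_sql(sqlselect, engine)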

  3. Check the data types and the first few rows:
    studentmath.dtypes
    
    studentid          object
    school             object
    sex                object
    age                int64
    famsize            object
    mothereducation    int64
    fathereducation    int64
    traveltime         int64
    studytime          int64
    failures           int64
    famrel             int64
    freetime           int64
    goout              int64
    gradeperiod1       int64
    gradeperiod2       int64
    gradeperiod3       int64
    dtype: object
    
    studentmath.head()
    
        studentid    school  ...      gradeperiod2    gradeperiod3
    0	001	    GP      ...	6	        6
    1	002	    GP      ...	5	        6
    2	003	    GP      ...	8	        10
    3	004	    GP      ...	14	        15
    4	005	    GP      ...	10	        10
    [5 rows x 16 columns]
    
  4. Connecting to a MySQL server is not very different from connecting to a SQL Server instance. We can use the connect method of mysql.connector to do that, and then use read_sql to load the data.

Create a connection to the MySQL data, pass that connection to read_sql to retrieve the data, and load it into a pandas DataFrame (the same data file on student math scores was uploaded to SQL Server and MySQL, so we can use the same SQL SELECT statement we used in the previous step):

host = "pdccmysql.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdccschema"
connmysql = mysql.connector.connect(host=host, \
...   database=database,user=user,password=password)
studentmath = pd.read_sql(sqlselect,connmysql)
connmysql.close()
  5. Rearrange the columns, set an index, and check for missing values.

Move the grade data to the left of the DataFrame, just after studentid. Also, move the freetime column so that it follows traveltime and studytime. Confirm that each row has an ID and that the IDs are unique, and set studentid as the index:

newcolorder = ['studentid', 'gradeperiod1',
...   'gradeperiod2','gradeperiod3', 'school',
...   'sex', 'age', 'famsize','mothereducation',
...   'fathereducation', 'traveltime',
...   'studytime', 'freetime', 'failures',
...   'famrel','goout']
studentmath = studentmath[newcolorder]
studentmath.studentid.count()
395
studentmath.studentid.nunique()
395
studentmath.set_index('studentid', inplace=True)
  6. Use the DataFrame’s count function to check for missing values:
    studentmath.count()
    
    gradeperiod1		395
    gradeperiod2		395
    gradeperiod3		395
    school		395
    sex			395
    age			395
    famsize		395
    mothereducation	395
    fathereducation	395
    traveltime		395
    studytime		395
    freetime		395
    failures		395
    famrel		395
    goout			395
    dtype: int64
    
  7. Replace coded data values with more informative values.

Create a dictionary with the replacement values for the columns, and then use replace to set those values:

setvalues= \
...   {"famrel":{1:"1:very bad",2:"2:bad",
...     3:"3:neutral",4:"4:good",5:"5:excellent"},
...   "freetime":{1:"1:very low",2:"2:low",
...     3:"3:neutral",4:"4:high",5:"5:very high"},
...   "goout":{1:"1:very low",2:"2:low",3:"3:neutral",
...     4:"4:high",5:"5:very high"},
...   "mothereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
...     3:"3:secondary ed",4:"4:higher ed"},
...   "fathereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
...     3:"3:secondary ed",4:"4:higher ed"}}
studentmath.replace(setvalues, inplace=True)
  8. Change the type for columns with the changed data to category.

Check any changes in memory usage:

setvalueskeys = [k for k in setvalues]
studentmath[setvalueskeys].memory_usage(index=False)
famrel		3160
freetime		3160
goout			3160
mothereducation	3160
fathereducation	3160
dtype: int64
for col in studentmath[setvalueskeys].columns:
...   studentmath[col] = studentmath[col]. \
...     astype('category')
...
studentmath[setvalueskeys].memory_usage(index=False)
famrel		607
freetime		607
goout			607
mothereducation	599
fathereducation	599
dtype: int64
  9. Calculate percentages for values in the famrel column.

Run value_counts, and set normalize to True to generate percentages:

studentmath['famrel'].value_counts(sort=False, normalize=True)
1:very bad	0.02
2:bad		0.05
3:neutral	0.17
4:good	0.49
5:excellent	0.27
Name: famrel, dtype: float64
  10. Use apply to calculate percentages for multiple columns:
    studentmath[['freetime','goout']].\
    ...   apply(pd.Series.value_counts, sort=False,
    ...   normalize=True)
    
                 freetime   goout
    1:very low	0.05	    0.06
    2:low		0.16	    0.26
    3:neutral	0.40	    0.33
    4:high	0.29	    0.22
    5:very high	0.10	    0.13
    
    studentmath[['mothereducation','fathereducation']].\
    ...   apply(pd.Series.value_counts, sort=False,
    ...   normalize=True)
    
    			mothereducation	fathereducation
    1:k-4			0.15			0.21
    2:5-9			0.26			0.29
    3:secondary ed	0.25			0.25
    4:higher ed		0.33			0.24
    

The preceding steps retrieved a data table from a SQL database, loaded that data into pandas, and did some initial data checking and cleaning.

How it works…

Since data from enterprise systems is typically better structured than CSV or Excel files, we do not need to do things such as skip rows or deal with different logical data types in a column. However, some massaging is still usually required before we can begin exploratory analysis. There are often more columns than we need, and some column names are not intuitive or not ordered in the best way for analysis. The meaning of many data values is not stored in the data table itself; codes are used instead to avoid entry errors and save storage space. For example, a 3 is stored for mother’s education rather than secondary education. It is a good idea to reconstruct that coding as early in the cleaning process as possible.

To pull data from a SQL database server, we need a connection object to authenticate us on the server, as well as a SQL SELECT string. These can be passed to read_sql to retrieve the data and load it into a pandas DataFrame. I usually use the SQL SELECT statement to do a bit of cleanup of column names at this point. I sometimes also reorder columns, but I did that later in this recipe.

We set the index in Step 5, first confirming that every row has a value for studentid and that it is unique. This is often more important when working with enterprise data because we will almost always need to merge the retrieved data with other data files on the system. Although an index is not required for this merging, the discipline of setting one prepares us for the tricky business of merging data further down the road. It will also likely improve the speed of the merge.
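
As a purely hypothetical illustration, if we retrieved a second table that is also indexed by studentid (for example, the Portuguese-course scores that ship in the same UCI dataset), the shared index would let us combine the two with a simple join:

# hypothetical: studentport is another DataFrame with studentid as its index
studentall = studentmath.join(studentport, rsuffix='_port')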

We use the DataFrame’s count function to check for missing values and find that there are none – the count of non-missing values is 395 (the number of rows) for every column. This is almost too good to be true. There may be values that are logically missing – that is, valid numbers that nonetheless connote missing values, such as -1, 0, 9, or 99. We address this possibility in the next step.
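
A quick way to scan for codes like these, sketched here as it would be run before the recoding in Step 7, is to look at the range of each numeric column; in this data, the minimum of 0 in mothereducation and fathereducation is the kind of value worth questioning:

# inspect the range of every numeric column to spot suspicious codes
# such as -1, 0, 9, or 99 (run before the recoding in Step 7)
studentmath.select_dtypes('number').agg(['min', 'max'])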

Step 7 demonstrates a useful technique for replacing data values across multiple columns. We create a dictionary that maps the original values to new values for each column and then run it using replace. To reduce the amount of storage space taken up by the new, more verbose values, we convert the data type of those columns to category. We do this by generating a list of the keys of our setvalues dictionary – setvalueskeys = [k for k in setvalues] generates ['famrel', 'freetime', 'goout', 'mothereducation', 'fathereducation']. We then iterate over those five columns and use the astype method to change the data type to category. Notice that the memory usage for those columns is reduced substantially.

Finally, we check the assignment of new values by using value_counts to view relative frequencies. We use apply because we want to run value_counts on multiple columns. To prevent value_counts sorting by frequency, we set sort to False.

The DataFrame replace method is also a handy tool for dealing with logical missing values that will not be recognized as missing when retrieved by read_sql. The 0 values for mothereducation and fathereducation seem to fall into that category. We fix this problem in the setvalues dictionary by indicating that the 0 values for mothereducation and fathereducation should be replaced with NaN. It is important to address these kinds of missing values shortly after the initial import because they are not always obvious and can significantly impact all subsequent work.

Users of packages such as SPSS, SAS, and R will notice the difference between this approach and value labels in SPSS and R, as well as PROC FORMAT in SAS. In pandas, we need to change the actual data to get more informative values. However, we reduce how much data is actually stored by giving the column a category data type. This is similar to factors in R.

There’s more…

I moved the grade data to near the beginning of the DataFrame. I find it helpful to have potential target or dependent variables in the leftmost columns, to keep them at the forefront of my mind. It is also helpful to keep similar columns together. In this example, personal demographic variables (sex and age) are next to one another, as are family variables (mothereducation and fathereducation) and the columns describing how students spend their time (traveltime, studytime, and freetime).

You could have used map instead of replace in Step 7. Prior to version 0.19.2 of pandas, map was significantly more efficient. Since then, the difference in efficiency has been much smaller. If you are working with a very large dataset, the difference may still be enough to consider using map.
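
If you wanted to try map here, a minimal sketch, run in place of the replace call in Step 7 while the columns still hold their original integer codes, might look like this; note that map sets any value not found in its dictionary to NaN, which is exactly what we want for the 0 codes in the education columns:

# alternative to replace in Step 7: map each column's codes to labels
for col, mapping in setvalues.items():
    studentmath[col] = studentmath[col].map(mapping)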

See also

The recipes in Chapter 10, Addressing Data Issues When Combining DataFrames, go into detail on merging data. We will take a closer look at bivariate and multivariate relationships between variables in Chapter 4, Identifying Outliers in Subsets of Data. We will demonstrate how to use some of these same approaches in packages such as SPSS, SAS, and R in subsequent recipes in this chapter.
