Python Data Cleaning Cookbook - Second Edition

Anticipating Data Cleaning Issues When Importing Tabular Data with pandas

Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python. The recipes in this book demonstrate how many common data preparation tasks can be done more easily with pandas than with other tools. While we work with pandas extensively in this book, we also use other popular packages such as Numpy, matplotlib, and scipy.

A key pandas object is the DataFrame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores we discuss in this chapter. However, a pandas DataFrame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward, as the recipes in this book will demonstrate.

Before we can make use of this great functionality, we have to import our data into pandas. Data comes to us in a wide variety of formats: as CSV or Excel files, as tables from SQL databases, from statistical analysis packages such as SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web pages.

We examine tools to import tabular data in this recipe. Specifically, we cover the following topics:

Importing CSV files
Importing Excel files
Importing data from SQL databases
Importing SPSS, Stata, and SAS data
Importing R data
Persisting tabular data

Importing CSV files

The read_csv method of the pandas library can be used to read a file with comma separated values (CSV) and load it into memory as a pandas DataFrame. In this recipe, we import a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data.

Raw data is often stored as CSV files. These files have a carriage return at the end of each line of data to demarcate a row, and a comma between each data value to delineate columns. Something other than a comma can be used as the delimiter, such as a tab. Quotation marks may be placed around values, which can be helpful when the delimiter occurs naturally within certain values, which sometimes happens with commas.

All data in a CSV file are characters, regardless of the logical data type. This is why it is easy to view a CSV file, presuming it is not too large, in a text editor. The pandas read_csv method will make an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark.

Getting ready

Create a folder for this chapter, and then create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder, and then place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository, including the data files. Here is a screenshot of the beginning of the CSV file:

Figure 1.1: Land Temperatures Data

Data note

This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly. I used the data from version 4. The data in this recipe uses a 100,000-row sample of the full dataset, which is also available in the repository.

How to do it…

We will import a CSV file into pandas, taking advantage of some very useful read_csv options:

Import the pandas library, and set up the environment to make viewing the output easier:

import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.width', 85)
pd.set_option('display.max_columns', 8)

Read the data file, set new names for the headings, and parse the date column.

Pass an argument of 1 to the skiprows parameter to skip the first row, pass a list of columns to parse_dates to create a pandas datetime column from those columns, and set low_memory to False. This will cause pandas to load all of the data into memory at once, rather than in chunks. We do this so that pandas can identify the data type of each column automatically. In the There’s more… section, we see how to set the data type for each column manually:

landtemps = pd.read_csv('data/landtempssample.csv',
...     names=['stationid','year','month','avgtemp','latitude',
...       'longitude','elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)
type(landtemps)

<class 'pandas.core.frame.DataFrame'>

Note

We have to use skiprows because we are passing a list of column names to read_csv. If we use the column names in the CSV file, we do not need to specify values for either names or skiprows.

Get a quick glimpse of the data.

View the first few rows. Show the data type for all columns, as well as the number of rows and columns:

landtemps.head(7)

  month_year    stationid  ...  countryid              country
0 2000-04-01  USS0010K01S  ...         US        United States
1 1940-05-01  CI000085406  ...         CI                Chile
2 2013-12-01  USC00036376  ...         US        United States
3 1963-02-01  ASN00024002  ...         AS            Australia
4 2001-11-01  ASN00028007  ...         AS            Australia
5 1991-04-01  USW00024151  ...         US        United States
6 1993-12-01  RSM00022641  ...         RS               Russia
[7 rows x 9 columns]

landtemps.dtypes

month_year       datetime64[ns]
stationed        object
avgtemp          float64
latitude         float64
longitude        float64
elevation        float64
station          object
countryid        object
country          object
dtype: object

landtemps.shape

(100000, 9)

Give the date column a more appropriate name and view the summary statistics for average monthly temperature:

landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
landtemps.dtypes

measuredate      datetime64[ns]
stationid        object
avgtemp          float64
latitude         float64
longitude        float64
elevation        float64
station          object
countryid        object
country          object
dtype:           object

landtemps.avgtemp.describe()

count   85,554.00
mean    10.92
std     11.52
min     -70.70
25%     3.46
50%     12.22
75%     19.57
max     39.95
Name: avgtemp, dtype: float64

Look for missing values for each column.

Use isnull, which returns True for each value that is missing for each column, and False when not missing. Chain this with sum to count the missing values for each column. (When working with Boolean values, sum treats True as 1 and False as 0. I will discuss method chaining in the There’s more... section of this recipe):

landtemps.isnull().sum()

measuredate    0
stationed      0
avgtemp        14446
latitude       0
longitude      0
elevation      0
station        0
countryid      0
country        5
dtype: int64

Remove rows with missing data for avgtemp.

Use the subset parameter to tell dropna to drop rows when avgtemp is missing. Set inplace to True. Leaving inplace at its default value of False would display the DataFrame, but the changes we have made would not be retained. Use the shape attribute of the DataFrame to get the number of rows and columns:

landtemps.dropna(subset=['avgtemp'], inplace=True)
landtemps.shape

(85554, 9)

That’s it! Importing CSV files into pandas is as simple as that.

How it works...

Almost all of the recipes in this book use the pandas library. We refer to it as pd to make it easier to reference later. This is customary. We also use float_format to display float values in a readable way and set_option to make the Terminal output wide enough to accommodate the number of variables.

Much of the work is done by the first line in Step 2. We use read_csv to load a pandas DataFrame in memory and call it landtemps. In addition to passing a filename, we set the names parameter to a list of our preferred column headings. We also tell read_csv to skip the first row, by setting skiprows to 1, since the original column headings are in the first row of the CSV file. If we do not tell it to skip the first row, read_csv will treat the header row in the file as actual data.

read_csv also solves a date conversion issue for us. We use the parse_dates parameter to ask it to convert the month and year columns to a date value.

Step 3 runs through a few standard data checks. We use head(7) to print out all columns for the first seven rows. We use the dtypes attribute of the DataFrame to show the data type of all columns. Each column has the expected data type. In pandas, character data has the object data type, a data type that allows for mixed values. shape returns a tuple, whose first element is the number of rows in the DataFrame (100,000 in this case) and whose second element is the number of columns (9).

When we used read_csv to parse the month and year columns, it gave the resulting column the name month_year. We used the rename method in Step 4 to give that column a more appropriate name. We need to specify inplace=True to replace the old column name with the new column name in memory. The describe method provides summary statistics on the avgtemp column.

Notice that the count for avgtemp indicates that there are 85,554 rows that have valid values for avgtemp. This is out of 100,000 rows for the whole DataFrame, as provided by the shape attribute. The listing of missing values for each column in Step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 = 14,446.

Step 6 drops all rows where avgtemp is NaN. (The NaN value, not a number, is the pandas representation of missing values.) subset is used to indicate which column to check for missing values. The shape attribute for landtemps now indicates that there are 85,554 rows, which is what we would expect, given the previous count from describe.

There’s more...

If the file you are reading uses a delimiter other than a comma, such as a tab, this can be specified in the sep parameter of read_csv. When creating the pandas DataFrame, an index was also created. The numbers to the far left of the output when head was run are index values. Any number of rows can be specified for head. The default value is 5.

Instead of setting low_memory to False, to get pandas to make good guesses regarding data types, we could have set data types manually:

landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp','latitude',
      'longitude','elevation','station','countryid','country'],
    skiprows=1,
    parse_dates=[['month','year']],
    dtype={'stationid':'object', 'avgtemp':'float64',
     'latitude':'float64','longitude':'float64',
     'elevation':'float64','station':'object',
     'countryid':'object','country':'object'},
    )
landtemps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype       
---  ------      --------------   -----       
 0   month_year  100000 non-null  datetime64[ns]
 1   stationid   100000 non-null  object      
 2   avgtemp     85554 non-null   float64     
 3   latitude    100000 non-null  float64     
 4   longitude   100000 non-null  float64     
 5   elevation   100000 non-null  float64     
 6   station     100000 non-null  object      
 7   countryid   100000 non-null  object      
 8   country     99995 non-null   object      
dtypes: datetime64[ns](1), float64(4), object(4)
memory usage: 6.9+ MB

The landtemps.isnull().sum() statement is an example of chaining methods. First, isnull returns a DataFrame of True and False values, resulting from testing whether each column value is null. The sum function takes that DataFrame and sums the True values for each column, interpreting the True values as 1 and the False values as 0. We would have obtained the same result if we had used the following two steps:

checknull = landtemps.isnull()
checknull.sum()

There is no hard and fast rule for when to chain methods and when not to do so. I find chaining helpful when the overall operation feels like a single step, even if it’s two or more steps mechanically. Chaining also has the side benefit of not creating extra objects that I might not need.

The dataset used in this recipe is just a sample from the full land temperatures database, with almost 17 million records. You can run the larger file if your machine can handle it, with the following code:

landtemps = pd.read_csv('data/landtemps.zip',
...   compression='zip', names=['stationid','year',
...     'month','avgtemp','latitude','longitude',
...     'elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)

read_csv can read a compressed ZIP file. We get it to do this by passing the name of the ZIP file and the type of compression.

Importing Excel files

The read_excel method of the pandas library can be used to import data from an Excel file and load it into memory as a pandas DataFrame. In this recipe, we import an Excel file and handle some common issues when working with Excel files: extraneous header and footer information, selecting specific columns, removing rows with no data, and connecting to particular sheets.

Despite the tabular structure of Excel, which invites the organization of data into rows and columns, spreadsheets are not datasets and do not require people to store data in that way. Even when some data conforms with those expectations, there is often additional information in rows or columns before or after the data to be imported. Data types are not always as clear as they are to the person who created the spreadsheet. This will be all too familiar to anyone who has ever battled with importing leading zeros. Moreover, Excel does not insist that all data in a column be of the same type, or that column headings be appropriate for use with a programming language such as Python.

Fortunately, read_excel has a number of options for handling messiness in Excel data. These options make it relatively easy to skip rows, select particular columns, and pull data from a particular sheet or sheets.

Getting ready

You can download the GDPpercapita22b.xlsx file, as well as the code for this recipe, from the GitHub repository for this book. The code assumes that the Excel file is in a data subfolder. Here is a view of the beginning of the file (some columns were hidden for display purposes):

Figure 1.2: View of the dataset

And here is a view of the end of the file:

Figure 1.3: View of the dataset

Data note

This dataset, from the Organisation for Economic Co-operation and Development, is available for public use at https://stats.oecd.org/.

How to do it…

We import an Excel file into pandas and do some initial data cleaning:

Import the pandas library:
```
import pandas as pd
```
Read the Excel per capita GDP data.

Select the sheet with the data we need, but skip the columns and rows that we do not want. Use the sheet_name parameter to specify the sheet. Set skiprows to 4 and skipfooter to 1 to skip the first four rows (the first row is hidden) and the last row. We provide values for usecols to get data from column A and columns C through W (column B is blank). Use head to view the first few rows and shape to get the number of rows and columns:

percapitaGDP = pd.read_excel("data/GDPpercapita22b.xlsx",
...    sheet_name="OECD.Stat export",
...    skiprows=4,
...    skipfooter=1,
...    usecols="A,C:W")
percapitaGDP.head()

                                  Year	2000	...	2019	2020
0	Metropolitan areas	       ...	NaN 	...	NaN	NaN
1	AUS: Australia	..	       ...	...  	...	...	...
2	AUS01: Greater Sydney	       ...	... 	... 	45576	45152
3	AUS02: Greater Melbourne     ...	... 	... 	42299	40848
4	AUS03: Greater Brisbane      ...	... 	... 	42145	40741
[5 rows x 22 columns]

percapitaGDP.shape

(731, 22)

Note

You may encounter a problem with read_excel if the Excel file does not use utf-8 encoding. One way to resolve this is to save the Excel file as a CSV file, reopen it, and then save it with utf-8 encoding.

Use the info method of the DataFrame to view data types and the non-null count. Notice that all columns have the object data type:

percapitaGDP.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 22 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Year    731 non-null    object
 1   2000    730 non-null    object
 2   2001    730 non-null    object
 3   2002    730 non-null    object
 4   2003    730 non-null    object
 5   2004    730 non-null    object
 6   2005    730 non-null    object
 7   2006    730 non-null    object
 8   2007    730 non-null    object
 9   2008    730 non-null    object
 10  2009    730 non-null    object
 11  2010    730 non-null    object
 12  2011    730 non-null    object
 13  2012    730 non-null    object
 14  2013    730 non-null    object
 15  2014    730 non-null    object
 16  2015    730 non-null    object
 17  2016    730 non-null    object
 18  2017    730 non-null    object
 19  2018    730 non-null    object
 20  2019    730 non-null    object
 21  2020    730 non-null    object
dtypes: object(22)
memory usage: 125.8+ KB

Rename the Year column to metro, and remove the leading spaces.

Give an appropriate name to the metropolitan area column. There are extra spaces before the metro values in some cases. We can test for leading spaces with startswith(' ') and then use any to establish whether there are one or more occasions when the first character is blank. We can use endswith(' ') to examine trailing spaces. We use strip to remove both leading and trailing spaces. When we test for trailing spaces again, we see that there are none:

percapitaGDP.rename(columns={'Year':'metro'}, inplace=True)
percapitaGDP.metro.str.startswith(' ').any()

True

percapitaGDP.metro.str.endswith(' ').any()

False

percapitaGDP.metro = percapitaGDP.metro.str.strip()
percapitaGDP.metro.str.startswith(' ').any()

False

Convert the data columns to numeric.

Iterate over all of the GDP year columns (2000–2020) and convert the data type from object to float. Coerce the conversion even when there is character data – the .. in this example. We want character values in those columns to become missing, which is what happens. Rename the year columns to better reflect the data in those columns:

for col in percapitaGDP.columns[1:]:
...   percapitaGDP[col] = pd.to_numeric(percapitaGDP[col],
...     errors='coerce')
...   percapitaGDP.rename(columns={col:'pcGDP'+col},
...     inplace=True)
...
percapitaGDP.head()

                      metro  pcGDP2000  pcGDP2001  ...  \
0        Metropolitan areas        NaN        NaN  ... 
1            AUS: Australia        NaN        NaN  ... 
2     AUS01: Greater Sydney        NaN      41091  ... 
3  AUS02: Greater Melbourne        NaN      40488  ... 
4   AUS03: Greater Brisbane        NaN      35276  ... 
   pcGDP2018  pcGDP2019  pcGDP2020
0        NaN        NaN        NaN
1        NaN        NaN        NaN
2      47171      45576      45152
3      43237      42299      40848
4      44328      42145      40741
[5 rows x 22 columns]

percapitaGDP.dtypes

metro          object
pcGDP2000      float64
pcGDP2001      float64
abbreviated to save space
pcGDP2019      float64
pcGDP2020      float64
dtype: object

Use the describe method to generate summary statistics for all numeric data in the DataFrame:

percapitaGDP.describe()

       pcGDP2000  pcGDP2001  pcGDP2002  ...  pcGDP2018  \
count        158        450        479  ...        692 
mean       33961      38874      39621  ...      41667 
std        15155      13194      13061  ...      17440 
min         2686       7805       7065  ...       5530 
25%        21523      30790      31064  ...      31322 
50%        35836      38078      39246  ...      41428 
75%        42804      46576      47874  ...      51130 
max        95221      96941      98929  ...     147760 
       pcGDP2019  pcGDP2020
count        596        425
mean       42709      39792
std        18893      19230
min         5698       5508
25%        29760      24142
50%        43505      41047
75%        53647      51130
max       146094     131082
[8 rows x 21 columns]

Remove rows where all of the per capita GDP values are missing.

Use the subset parameter of dropna to inspect all columns, starting with the second column (it is zero-based) and going through to the last column. Use how to specify that we want to drop rows only if all of the columns specified in subset are missing. Use shape to show the number of rows and columns in the resulting DataFrame:

percapitaGDP.dropna(subset=percapitaGDP.columns[1:], how="all", inplace=True)
percapitaGDP.shape

(692, 22)

Set the index for the DataFrame using the metropolitan area column.

Confirm that there are 692 valid values for metro and that there are 692 unique values, before setting the index:

percapitaGDP.metro.count()

percapitaGDP.metro.nunique()

percapitaGDP.set_index('metro', inplace=True)
percapitaGDP.head()

                          pcGDP2000  pcGDP2001  ...  \
metro                                           ... 
AUS01: Greater Sydney           NaN      41091  ... 
AUS02: Greater Melbourne        NaN      40488  ... 
AUS03: Greater Brisbane         NaN      35276  ... 
AUS04: Greater Perth            NaN      43355  ... 
AUS05: Greater Adelaide         NaN      36081  ... 
                          pcGDP2019  pcGDP2020
metro                                         
AUS01: Greater Sydney         45576      45152
AUS02: Greater Melbourne      42299      40848
AUS03: Greater Brisbane       42145      40741
AUS04: Greater Perth          70970      78489
AUS05: Greater Adelaide       38314      39181
[5 rows x 21 columns]
percapitaGDP.loc['AUS02: Greater Melbourne']
pcGDP2000     NaN
pcGDP2001   40488
...
pcGDP2019   42299
pcGDP2020   40848
Name: AUS02: Greater Melbourne, dtype: float64

We have now imported the Excel data into a pandas DataFrame and cleaned up some of the messiness in the spreadsheet.

How it works…

We mostly manage to get the data we want in Step 2 by skipping rows and columns we do not want, but there are still a number of issues – read_excel interprets all of the GDP data as character data, many rows are loaded with no useful data, and the column names do not represent the data well. In addition, the metropolitan area column might be useful as an index, but there are leading and trailing blanks, and there may be missing or duplicated values.

read_excel interprets Year as the column name for the metropolitan area data because it looks for a header above the data for that Excel column and finds Year there. We rename that column metro in Step 4. We also use strip to fix the problem with leading and trailing blanks. We could have just used lstrip to remove leading blanks, or rstrip if there had been trailing blanks. It is a good idea to assume that there might be leading or trailing blanks in any character data, cleaning that data shortly after the initial import.

The spreadsheet authors used .. to represent missing data. Since this is actually valid character data, those columns get the object data type (that is how pandas treats columns with character or mixed data). We coerce a conversion to numeric type in Step 5. This also results in the original values of .. being replaced with NaN (not a number), how pandas represents missing values for numbers. This is what we want.

We can fix all of the per capita GDP columns with just a few lines because pandas makes it easy to iterate over the columns of a DataFrame. By specifying [1:], we iterate from the second column to the last column. We can then change those columns to numeric and rename them to something more appropriate.

There are several reasons why it is a good idea to clean up the column headings for the annual GDP columns – it helps us to remember what the data actually is; if we merge it with other data by metropolitan area, we will not have to worry about conflicting variable names; and we can use attribute access to work with pandas Series based on those columns, which I will discuss in more detail in the There’s more… section of this recipe.

describe in Step 6 shows us that fewer than 500 rows have valid data for per capita GDP for some years. When we drop all rows that have missing values for all per capita GDP columns in step 7, we end up with 692 rows in the DataFrame.

There’s more…

Once we have a pandas DataFrame, we have the ability to treat columns as more than just columns. We can use attribute access (such as percapitaGPA.metro) or bracket notation (percapitaGPA['metro']) to get the functionality of a pandas Series. Either method makes it possible to use Series string inspecting methods, such as str.startswith, and counting methods, such as nunique. Note that the original column names of 20## did not allow attribute access because they started with a number, so percapitaGDP.pcGDP2001.count() works, but percapitaGDP.2001.count() returns a syntax error because 2001 is not a valid Python identifier (since it starts with a number).

pandas is rich with features for string manipulation and for Series operations. We will try many of them out in subsequent recipes. This recipe showed those that I find most useful when importing Excel data.

Importing data from SQL databases

In this recipe, we will use pymssql and mysql apis to read data from Microsoft SQL Server and MySQL (now owned by Oracle) databases, respectively. Data from sources such as these tends to be well structured, since it is designed to facilitate simultaneous transactions by members of organizations and those who interact with them. Each transaction is also likely related to some other organizational transaction.

This means that although data tables from enterprise systems such as these are more reliably structured than data from CSV files and Excel files, their logic is less likely to be self-contained. You need to know how the data from one table relates to data from another table to understand its full meaning. These relationships need to be preserved, including the integrity of primary and foreign keys, when pulling data. Moreover, well-structured data tables are not necessarily uncomplicated data tables. There are often sophisticated coding schemes that determine data values, and these coding schemes can change over time. For example, codes for merchandise at a retail store chain might be different in 1998 than they are in 2024. Similarly, frequently there are codes for missing values, such as 99,999, that pandas will understand as valid values.

Since much of this logic is business logic, and implemented in stored procedures or other applications, it is lost when pulled out of this larger system. Some of what is lost will eventually have to be reconstructed when preparing data for analysis. This almost always involves combining data from multiple tables, so it is important to preserve the ability to do that. However, it also may involve adding some of the coding logic back after loading the SQL table into a pandas DataFrame. We explore how to do that in this recipe.

Getting ready

This recipe assumes you have pymssql and mysql apis installed. If you do not, it is relatively straightforward to install them with pip. From the Terminal, or powershell (in Windows), enter pip install pymssql or pip install mysql-connector-python. We will work with data on educational attainment in this recipe.

Data note

The dataset used in this recipe is available for public use at https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip.

How to do it...

We import SQL Server and MySQL data tables into a pandas DataFrame, as follows:

Import pandas, numpy, pymssql, and mysql.

This step assumes that you have installed pymssql and mysql apis:

import pandas as pd
import numpy as np
import pymssql
import mysql.connector

Use pymssql api and read_sql to retrieve and load data from a SQL Server instance.

Select the columns we want from the SQL Server data, and use SQL aliases to improve column names (for example, fedu AS fathereducation). Create a connection to the SQL Server data by passing database credentials to the pymssql connect function. Create a pandas DataFrame by passing the SELECT statement and connection object to read_sql. Use close to return the connection to the pool on the server:

sqlselect = "SELECT studentid, school, sex, age, famsize,\
...   medu AS mothereducation, fedu AS fathereducation,\
...   traveltime, studytime, failures, famrel, freetime,\
...   goout, g1 AS gradeperiod1, g2 AS gradeperiod2,\
...   g3 AS gradeperiod3 From studentmath"
server = "pdcc.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdcctest"
conn = pymssql.connect(server=server,
...   user=user, password=password, database=database)
studentmath = pd.read_sql(sqlselect,conn)
conn.close()

Note

Although tools such as pymssql make connecting to a SQL Server instance relatively straightforward, the syntax still might take a little time to get used to if it is unfamiliar. The previous step shows the parameter values you will typically need to pass to a connection object – the name of the server, the name of a user with credentials on the server, the password for that user, and the name of a SQL database on the server.

Check the data types and the first few rows:

studentmath.dtypes

studentid          object
school             object
sex                object
age                int64
famsize            object
mothereducation    int64
fathereducation    int64
traveltime         int64
studytime          int64
failures           int64
famrel             int64
freetime           int64
gout               int64
gradeperiod1       int64
gradeperiod2       int64
gradeperiod3       int64
dtype: object

studentmath.head()

    studentid    school  ...      gradeperiod2    gradeperiod3
0	001	    GP      ...	6	        6
1	002	    GP      ...	5	        6
2	003	    GP      ...	8	        10
3	004	    GP      ...	14	        15
4	005	    GP      ...	10	        10
[5 rows x 16 columns]

Connecting to a MySQL server is not very different from connecting to a SQL Server instance. We can use the connect method of the mysql connector to do that and then use read_sql to load the data.

Create a connection to the mysql data, pass that connection to read_sql to retrieve the data, and load it into a pandas DataFrame (the same data file on student math scores was uploaded to SQL Server and MySQL, so we can use the same SQL SELECT statement we used in the previous step):

host = "pdccmysql.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
user = "pdccuser"
password = "pdccpass"
database = "pdccschema"
connmysql = mysql.connector.connect(host=host, \
...   database=database,user=user,password=password)
studentmath = pd.read_sql(sqlselect,connmysql)
connmysql.close()

Rearrange the columns, set an index, and check for missing values.

Move the grade data to the left of the DataFrame, just after studentid. Also, move the freetime column to the right after traveltime and studytime. Confirm that each row has an ID and that the IDs are unique, and set studentid as the index:

newcolorder = ['studentid', 'gradeperiod1',
...   'gradeperiod2','gradeperiod3', 'school',
...   'sex', 'age', 'famsize','mothereducation',
...   'fathereducation', 'traveltime',
...   'studytime', 'freetime', 'failures',
...   'famrel','goout']
studentmath = studentmath[newcolorder]
studentmath.studentid.count()

studentmath.studentid.nunique()

studentmath.set_index('studentid', inplace=True)

Use the DataFrame’s count function to check for missing values:

studentmath.count()

gradeperiod1		395
gradeperiod2		395
gradeperiod3		395
school		395
sex			395
age			395
famsize		395
mothereducation	395
fathereducation	395
traveltime		395
studytime		395
freetime		395
failures		395
famrel		395
goout			395
dtype: int64

Replace coded data values with more informative values.

Create a dictionary with the replacement values for the columns, and then use replace to set those values:

setvalues= \
...   {"famrel":{1:"1:very bad",2:"2:bad",
...     3:"3:neutral",4:"4:good",5:"5:excellent"},
...   "freetime":{1:"1:very low",2:"2:low",
...     3:"3:neutral",4:"4:high",5:"5:very high"},
...   "goout":{1:"1:very low",2:"2:low",3:"3:neutral",
...     4:"4:high",5:"5:very high"},
...   "mothereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
...     3:"3:secondary ed",4:"4:higher ed"},
...   "fathereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
...     3:"3:secondary ed",4:"4:higher ed"}}
studentmath.replace(setvalues, inplace=True)

Change the type for columns with the changed data to category.

Check any changes in memory usage:

setvalueskeys = [k for k in setvalues]
studentmath[setvalueskeys].memory_usage(index=False)

famrel		3160
freetime		3160
goout			3160
mothereducation	3160
fathereducation	3160
dtype: int64

for col in studentmath[setvalueskeys].columns:
...   studentmath[col] = studentmath[col]. \
...     astype('category')
...
studentmath[setvalueskeys].memory_usage(index=False)

famrel		607
freetime		607
goout			607
mothereducation	599
fathereducation	599
dtype: int64

Calculate percentages for values in the famrel column.

Run value_counts, and set normalize to True to generate percentages:

studentmath['famrel'].value_counts(sort=False, normalize=True)

1:very bad	0.02
2:bad		0.05
3:neutral	0.17
4:good	0.49
5:excellent	0.27
Name: famrel, dtype: float64

Use apply to calculate percentages for multiple columns:

studentmath[['freetime','goout']].\
...   apply(pd.Series.value_counts, sort=False,
...   normalize=True)

             freetime   goout
1:very low	0.05	    0.06
2:low		0.16	    0.26
3:neutral	0.40	    0.33
4:high	0.29	    0.22
5:very high	0.10	    0.13

studentmath[['mothereducation','fathereducation']].\
...   apply(pd.Series.value_counts, sort=False,
...   normalize=True)

			mothereducation	fathereducation
1:k-4			0.15			0.21
2:5-9			0.26			0.29
3:secondary ed	0.25			0.25
4:higher ed		0.33			0.24

The preceding steps retrieved a data table from a SQL database, loaded that data into pandas, and did some initial data checking and cleaning.

How it works…

Since data from enterprise systems is typically better structured than CSV or Excel files, we do not need to do things such as skip rows or deal with different logical data types in a column. However, some massaging is still usually required before we can begin exploratory analysis. There are often more columns than we need, and some column names are not intuitive or not ordered in the best way for analysis. The meaningfulness of many data values is not stored in the data table to avoid entry errors and save on storage space. For example, 3 is stored for mother’s education rather than secondary education. It is a good idea to reconstruct that coding as early in the cleaning process as possible.

To pull data from a SQL database server, we need a connection object to authenticate us on the server, as well as a SQL select string. These can be passed to read_sql to retrieve the data and load it into a pandas DataFrame. I usually use the SQL SELECT statement to do a bit of cleanup of column names at this point. I sometimes also reorder columns, but I did that later in this recipe.

We set the index in Step 5, first confirming that every row has a value for studentid and that it is unique. This is often more important when working with enterprise data because we will almost always need to merge the retrieved data with other data files on the system. Although an index is not required for this merging, the discipline of setting one prepares us for the tricky business of merging data further down the road. It will also likely improve the speed of the merge.

We use the DataFrame’s count function to check for missing values and that there are no missing values – for non-missing values, the count is 395 (the number of rows) for every column. This is almost too good to be true. There may be values that are logically missing – that is, valid numbers that nonetheless connote missing values, such as -1, 0, 9, or 99. We address this possibility in the next step.

Step 7 demonstrates a useful technique for replacing data values for multiple columns. We create a dictionary to map original values to new values for each column and then run it using replace. To reduce the amount of storage space taken up by the new verbose values, we convert the data type of those columns to category. We do this by generating a list of the keys of our setvalues dictionary – setvalueskeys = [k for k in setvalues] generates [famrel, freetime, goout, mothereducation, and fathereducation]. We then iterate over those five columns and use the astype method to change the data type to category. Notice that the memory usage for those columns is reduced substantially.

Finally, we check the assignment of new values by using value_counts to view relative frequencies. We use apply because we want to run value_counts on multiple columns. To prevent value_counts sorting by frequency, we set sort to False.

The DataFrame replace method is also a handy tool for dealing with logical missing values that will not be recognized as missing when retrieved by read_sql. The 0 values for mothereducation and fathereducation seem to fall into that category. We fix this problem in the setvalues dictionary by indicating that the 0 values for mothereducation and fathereducation should be replaced with NaN. It is important to address these kinds of missing values shortly after the initial import because they are not always obvious and can significantly impact all subsequent work.

Users of packages such as SPPS, SAS, and R will notice the difference between this approach and value labels in SPSS and R, as well as the proc format in SAS. In pandas, we need to change the actual data to get more informative values. However, we reduce how much data is actually stored by giving the column a category data type. This is similar to factors in R.

There’s more…

I moved the grade data to near the beginning of the DataFrame. I find it helpful to have potential target or dependent variables in the leftmost columns, keeping them at the forefront of your mind. It is also helpful to keep similar columns together. In this example, personal demographic variables (sex and age) are next to one another, as are family variables (mothereducation and fathereducation), and how students spend their time (traveltime, studytime, and freetime).

You could have used map instead of replace in Step 7. Prior to version 19.2 of pandas, map was significantly more efficient. Since then, the difference in efficiency has been much smaller. If you are working with a very large dataset, the difference may still be enough to consider using map.

Importing SPSS, Stata, and SAS data

We will use pyreadstat to read data from three popular statistical packages into pandas. The key advantage of pyreadstat is that it allows data analysts to import data from these packages without losing metadata, such as variable and value labels.

The SPSS, Stata, and SAS data files we receive often come to us with the data issues of CSV and Excel files and SQL databases having been resolved. We do not typically have the invalid column names, changes in data types, and unclear missing values that we can get with CSV or Excel files, nor do we usually get the detachment of data from business logic, such as the meaning of data codes, that we often get with SQL data. When someone or some organization shares a data file from one of these packages with us, they have often added variable labels and value labels for categorical data. For example, a hypothetical data column called presentsat has the overall satisfaction with presentation variable label and 1–5 value labels, with 1 being not at all satisfied and 5 being highly satisfied.

The challenge is retaining that metadata when importing data from those systems into pandas. There is no precise equivalent to variable and value labels in pandas, and built-in tools for importing SAS, Stata, and SAS data lose the metadata. In this recipe, we will use pyreadstat to load variable and value label information and use a couple of techniques to represent that information in pandas.

Getting ready

This recipe assumes you have installed the pyreadstat package. If it is not installed, you can install it with pip. From the Terminal, or Powershell (in Windows), enter pip install pyreadstat. You will need the SPSS, Stata, and SAS data files for this recipe to run the code.

We will work with data from the United States National Longitudinal Surveys (NLS) of Youth.

Data note

The NLS of Youth is conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997. Each survey respondent was high school age when they first completed the survey, having been born between 1980 and 1985. There were annual follow-up surveys each year through 2023. For this recipe, I pulled 42 variables on grades, employment, income, and attitudes toward government, from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository.

The original NLS data can be downloaded from https://www.nlsinfo.org/investigator/pages/search, along with code for creating SPSS, Stata, or SAS files from the ASCII data files included in the download.

How to do it...

We will import data from SPSS, Stata, and SAS, retaining metadata such as value labels:

Import pandas, numpy, and pyreadstat.

This step assumes that you have installed pyreadstat:

import pandas as pd
import numpy as np
import pyreadstat

Retrieve the SPSS data.

Pass a path and filename to the read_sav method of pyreadstat. Display the first few rows and a frequency distribution. Note that the column names and value labels are non-descriptive, and that read_sav returns both a pandas DataFrame and a meta object:

nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')
nls97spss.dtypes

R0000100	float64
R0536300	float64
R0536401	float64
...
U2962900	float64
U2963000	float64
Z9063900	float64
dtype: object

nls97spss.head()

   R0000100  R0536300  ...  U2963000  Z9063900
0	1	2         ...  nan       52
1	2	1         ...  6         0
2	3	2         ...  6         0
3	4	2         ...  6         4
4	5	1         ...  5         12
[5 rows x 42 columns]

nls97spss['R0536300'].value_counts(normalize=True)

1.00	0.51
2.00	0.49
Name: R0536300, dtype: float64

Grab the metadata to improve column labels and value labels.

The metaspss object created when we called read_sav has the column labels and the value labels from the SPSS file. Use the variable_value_labels dictionary to map values to value labels for one column (R0536300). (This does not change the data. It only improves our display when we run value_counts.) Use the set_value_labels method to actually apply the value labels to the DataFrame:

metaspss.variable_value_labels['R0536300']

{0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}

nls97spss['R0536300'].\
...   map(metaspss.variable_value_labels['R0536300']).\
...   value_counts(normalize=True)

Male		0.51
Female	0.49
Name: R0536300, dtype: float64

nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss, formats_as_category=True)

Use column labels in the metadata to rename the columns.

To use the column labels from metaspss in our DataFrame, we can simply assign the column labels in metaspss to our DataFrame’s column names. Clean up the column names a bit by changing them to lowercase, changing spaces to underscores, and removing all remaining non-alphanumeric characters:

nls97spss.columns = metaspss.column_labels
nls97spss['KEY!SEX (SYMBOL) 1997'].value_counts(normalize=True)

Male		0.51
Female	0.49
Name: KEY!SEX (SYMBOL) 1997, dtype: float64

nls97spss.dtypes

PUBID - YTH ID CODE 		1997	float64
KEY!SEX (SYMBOL) 			1997	category
KEY!BDATE M/Y (SYMBOL)		1997	float64
KEY!BDATE M/Y (SYMBOL) 		1997	float64
CV_SAMPLE_TYPE 			1997	category
KEY!RACE_ETHNICITY (SYMBOL) 	1997	category
"... abbreviated to save space"
HRS/WK R WATCHES TELEVISION 	2017	category
HRS/NIGHT R SLEEPS 		 	2017	float64
CVC_WKSWK_YR_ALL L99			float64
dtype: object

nls97spss.columns = nls97spss.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)

Simplify the process by applying the value labels from the beginning.

The data values can actually be applied in the initial call to read_sav by setting apply_value_formats to True. This eliminates the need to call the set_value_labels function later:

nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav', apply_value_formats=True, formats_as_category=True)
nls97spss.columns = metaspss.column_labels
nls97spss.columns = nls97spss.columns.\
...   str.lower().\
...   str.replace(' ','_').\
...   str.replace('[^a-z0-9_]', '', regex=True)

Show the columns and a few rows:

nls97spss.dtypes

pubid__yth_id_code_1997	float64
keysex_symbol_1997		category
keybdate_my_symbol_1997	float64
keybdate_my_symbol_1997	float64
hrsnight_r_sleeps_2017	float64
cvc_wkswk_yr_all_l99	float64
dtype: object

nls97spss.head()

   pubid__yth_id_code_1997 keysex_symbol_1997  ...  \
0	1	Female  ... 
1	2	Male  ... 
2	3	Female  ... 
3	4	Female  ... 
4	5	Male  ... 
   hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
0	nan	52
1	6	0
2	6	0
3	6	4
4	5	12
[5 rows x 42 columns]

Run frequencies on one of the columns, and set the index:

nls97spss.govt_responsibility__provide_jobs_2006.\
...   value_counts(sort=False)

Definitely should be	454
Definitely should not be	300
Probably should be		617
Probably should not be	462
Name: govt_responsibility__provide_jobs_2006, dtype: int64

nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)

That demonstrated how to convert data from SPSS. Let’s try that with Stata data.
Import the Stata data, apply value labels, and improve the column headings.

Use the same methods for the Stata data that we used for the SPSS data:

nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta', apply_value_formats=True, formats_as_category=True)
nls97stata.columns = metastata.column_labels
nls97stata.columns = nls97stata.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97stata.dtypes

pubid__yth_id_code_1997	float64
keysex_symbol_1997		category
keybdate_my_symbol_1997	float64
keybdate_my_symbol_1997	float64
hrsnight_r_sleeps_2017	float64
cvc_wkswk_yr_all_l99	float64
dtype: object

View a few rows of the data and run frequencies:

nls97stata.head()

   pubid__yth_id_code_1997    keysex_symbol_1997  ...  \
0                        1                Female  ... 
1                        2                  Male  ... 
2                        3                Female  ... 
3                        4                Female  ... 
4                        5                  Male  ... 
   hrsnight_r_sleeps_2017    cvc_wkswk_yr_all_l99
0                      -5                      52
1                       6                       0
2                       6                       0
3                       6                       4
4                       5                      12
[5 rows x 42 columns]

nls97stata.govt_responsibility__provide_jobs_2006.\
...   value_counts(sort=False)

-5.0	1425
-4.0	5665
-2.0	56
-1.0	5
Definitely should be	454
Definitely should not be	300
Probably should be		617
Probably should not be	462
Name: govt_responsibility__provide_jobs_2006, dtype: int64

Fix the logical missing values that show up with the Stata data and set an index. We can use the replace method to set any value that is between –9 and –1 in any column to missing:

nls97stata.min(numeric_only=True)

pubid__yth_id_code_1997          1
keybdate_my_symbol_1997          1
keybdate_my_symbol_1997      1,980
trans_sat_verbal_hstr           -4
trans_sat_math_hstr             -4
trans_crd_gpa_overall_hstr      -9
trans_crd_gpa_eng_hstr          -9
trans_crd_gpa_math_hstr         -9
trans_crd_gpa_lp_sci_hstr       -9
cv_ba_credits_l1_2011           -5
cv_bio_child_hh_2017            -5
cv_bio_child_nr_2017            -5
hrsnight_r_sleeps_2017          -5
cvc_wkswk_yr_all_l99            -4
dtype: float64

nls97stata.replace(list(range(-9,0)), np.nan, inplace=True)
nls97stata.min(numeric_only=True)

pubid__yth_id_code_1997          1
keybdate_my_symbol_1997          1
keybdate_my_symbol_1997      1,980
trans_sat_verbal_hstr           14
trans_sat_math_hstr              7
trans_crd_gpa_overall_hstr      10
trans_crd_gpa_eng_hstr           0
trans_crd_gpa_math_hstr          0
trans_crd_gpa_lp_sci_hstr        0
cv_ba_credits_l1_2011            0
cv_bio_child_hh_2017             0
cv_bio_child_nr_2017             0
hrsnight_r_sleeps_2017           0
cvc_wkswk_yr_all_l99             0
dtype: float64

nls97stata.set_index('pubid__yth_id_code_1997', inplace=True)

The process is fairly similar when working with SAS data files, as the next few steps illustrate.

Retrieve the SAS data, using the SAS catalog file for value labels:

The data values for SAS are stored in a catalog file. Setting the catalog file path and filename retrieves the value labels and applies them:

nls97sas, metasas = pyreadstat.read_sas7bdat('data/nls97.sas7bdat', catalog_file='data/nlsformats3.sas7bcat', formats_as_category=True)
nls97sas.columns = metasas.column_labels
nls97sas.columns = nls97sas.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '', regex=True)
nls97sas.head()

   pubid__yth_id_code_1997   keysex_symbol_1997    ...  \
0			     1		       Female    ... 
1			     2		         Male    ... 
2			     3		       Female    ... 
3			     4		       Female    ... 
4			     5		         Male    ... 
   hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
0			  nan			   52
1			    6			    0
2			    6			    0
3			    6			    4
4			    5			   12
[5 rows x 42 columns]

nls97sas.keysex_symbol_1997.value_counts()

Male		4599
Female	4385
Name: keysex_symbol_1997, dtype: int64

nls97sas.set_index('pubid__yth_id_code_1997', inplace=True)

This demonstrates how to import SPSS, SAS, and Stata data without losing important metadata.

How it works...

The read_sav, read_dta, and read_sas7bdat methods of Pyreadstat, for SPSS, Stata, and SAS data files, respectively, work in a similar manner. Value labels can be applied when reading in the data by setting apply_value_formats to True for SPSS and Stata files (Steps 5 and 8), or by providing a catalog file path and filename for SAS (Step 12).

We can set formats_as_category to True to change the data type to category for those columns where the data values will change. The meta object has the column names and the column labels from the statistical package, so metadata column labels can be assigned to pandas DataFrame column names at any point (nls97spss.columns = metaspss.column_labels). We can even revert to the original column headings after assigning meta column labels to them by setting pandas column names to the metadata column names (nls97spss.columns = metaspss.column_names).

In Step 3, we looked at some of the SPSS data before applying value labels. We looked at the dictionary for one variable (metaspss.variable_value_labels['R0536300']), but we could have viewed it for all variables (metaspss.variable_value_labels). When we are satisfied that the labels make sense, we can set them by calling the set_value_labels function. This is a good approach when you do not know the data well and want to inspect the labels before applying them.

The column labels from the meta object are often a better choice than the original column headings. Column headings can be quite cryptic, particularly when the SPSS, Stata, or SAS file is based on a large survey, as in this example. However, the labels are not usually ideal for column headings either. They sometimes have spaces, capitalization that is not helpful, and non-alphanumeric characters. We chain some string operations to switch to lowercase, replace spaces with underscores, and remove non-alphanumeric characters.

Handling missing values is not always straightforward with these data files, since there are often many reasons why data is missing. If the file is from a survey, the missing value may be because of a survey skip pattern, or a respondent failed to respond, or the response was invalid, and so on. The NLS has nine possible values for missing, from –1 to –9. The SPSS import automatically set those values to NaN, while the Stata import retained the original values. (We could have gotten the SPSS import to retain those values by setting user_missing to True.) For the Stata data, we need to tell it to replace all values from –1 to –9 with NaN. We do this by using the DataFrame’s replace function and passing it a list of integers from –9 to –1 (list(range(-9,0))).

There’s more…

You may have noticed similarities between this recipe and the previous one in terms of how value labels are set. The set_value_labels function is like the DataFrame replace operation we used to set value labels in that recipe. We passed a dictionary to replace that mapped columns to value labels. The set_value_labels function in this recipe essentially does the same thing, using the variable_value_labels property of the meta object as the dictionary.

Data from statistical packages is often not as well structured as SQL databases tend to be in one significant way. Since they are designed to facilitate analysis, they often violate database normalization rules. There is often an implied relational structure that might have to be unflattened at some point. For example, the data may combine individual and event-level data – a person and hospital visits, a brown bear and the date it emerged from hibernation. Often, this data will need to be reshaped for some aspects of the analysis.

Importing R data

We will use pyreadr to read an R data file into pandas. Since pyreadr cannot capture the metadata, we will write code to reconstruct value labels (analogous to R factors) and column headings. This is similar to what we did in the Importing data from SQL databases recipe.

The R statistical package is, in many ways, similar to the combination of Python and pandas, at least in its scope. Both have strong tools across a range of data preparation and data analysis tasks. Some data scientists work with both R and Python, perhaps doing data manipulation in Python and statistical analysis in R, or vice versa, depending on their preferred packages. However, there is currently a scarcity of tools for reading data saved in R, as rds or rdata files, into Python. The analyst often saves the data as a CSV file first and then loads it into Python. We will use pyreadr, from the same author as pyreadstat, because it does not require an installation of R.

When we receive an R file, or work with one we have created ourselves, we can count on it being fairly well structured, at least compared to CSV or Excel files. Each column will have only one data type, column headings will have appropriate names for Python variables, and all rows will have the same structure. However, we may need to restore some of the coding logic, as we did when working with SQL data.

Getting ready

This recipe assumes you have installed the pyreadr package. If it is not installed, you can install it with pip. From the Terminal, or Powershell (in Windows), enter pip install pyreadr.

We will again work with the NLS in this recipe. You will need to download the rds file used in this recipe from the GitHub repository in order to run the code.

How to do it…

We will import data from R without losing important metadata:

Load pandas, numpy, pprint, and the pyreadr package:

import pandas as pd
import numpy as np
import pyreadr
import pprint

Get the R data.

Pass the path and filename to the read_r method to retrieve the R data, and load it into memory as a pandas DataFrame. read_r can return one or more objects. When reading an rds file (as opposed to an rdata file), it will return one object, having the key None. We indicate None to get the pandas DataFrame:

nls97r = pyreadr.read_r('data/nls97.rds')[None]
nls97r.dtypes

R0000100	int32
R0536300	int32
...
U2962800	int32
U2962900	int32
U2963000	int32
Z9063900	int32
dtype: object

nls97r.head(10)

     R0000100  R0536300  ...      U2963000    Z9063900
0	1	  2         ...      -5          52
1	2	  1         ...       6          0
2	3	  2         ...       6          0
3	4	  2         ...       6          4
4	5	  1         ...       5          12
5	6	  2         ...       6          6
6	7	  1         ...      -5          0
7	8	  2         ...      -5          39
8	9	  1         ...       4          0
9	10	  1         ...       6          0
[10 rows x 42 columns]

Set up dictionaries for value labels and column headings.

Load a dictionary that maps columns to the value labels and create a list of preferred column names as follows:

with open('data/nlscodes.txt', 'r') as reader:
...     setvalues = eval(reader.read())
...
pprint.pprint(setvalues)

{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
 'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'},
 'S8646900': {1.0: '1. Definitely',
              2.0: '2. Probably ',
              3.0: '3. Probably not',
              4.0: '4. Definitely not'}}
...abbreviated to save space
newcols = ['personid','gender','birthmonth',
...   'birthyear','sampletype','category',
...   'satverbal','satmath','gpaoverall',
...   'gpaeng','gpamath','gpascience','govjobs',
...   'govprices','govhealth','goveld','govind',
...   'govunemp','govinc','govcollege',
...   'govhousing','govenvironment','bacredits',
...   'coltype1','coltype2','coltype3','coltype4',
...   'coltype5','coltype6','highestgrade',
...   'maritalstatus','childnumhome','childnumaway',
...   'degreecol1','degreecol2','degreecol3',
...   'degreecol4','wageincome','weeklyhrscomputer',
...   'weeklyhrstv','nightlyhrssleep',
...   'weeksworkedlastyear']

Set value labels and missing values, and change selected columns to the category data type.

Use the setvalues dictionary to replace existing values with value labels. Replace all values from –9 to –1 with NaN:

nls97r.replace(setvalues, inplace=True)
nls97r.head()

     R0000100   R0536300  ...     U2963000  Z9063900
0    1          Female    ...     -5        52
1    2          Male      ...     6         0
2    3          Female    ...     6         0
3    4          Female    ...     6         4
4    5          Male      ...     5         12
[5 rows x 42 columns]

nls97r.replace(list(range(-9,0)), np.nan, inplace=True)
for col in nls97r[[k for k in setvalues]].columns:
...     nls97r[col] = nls97r[col].astype('category')
...
nls97r.dtypes

R0000100	int64
R0536300	category
R0536401	int64
R0536402	int64
R1235800	category
              ... 
U2857300	category
U2962800	category
U2962900	category
U2963000	float64
Z9063900	float64
Length: 42, dtype: object

Set meaningful column headings:

nls97r.columns = newcols
nls97r.dtypes

personid	int64
gender	category
birthmonth	int64
birthyear	int64
sampletype	category
                         ... 
wageincome	category
weeklyhrscomputer	category
weeklyhrstv	category
nightlyhrssleep	float64
weeksworkedlastyear	float64
Length: 42, dtype: object

This shows how R data files can be imported into pandas and value labels assigned.

How it works…

Reading R data into pandas with pyreadr is fairly straightforward. Passing a filename to the read_r function is all that is required. Since read_r can return multiple objects with one call, we need to specify which object. When reading an rds file (as opposed to an rdata file), only one object is returned. It has the key None.

In Step 3, we loaded a dictionary that maps our variables to value labels, and a list for our preferred column headings. In Step 4 we applied the value labels. We also changed the data type to category for the columns where we applied the values. We did this by generating a list of the keys in our setvalues dictionary with [k for k in setvalues] and then iterating over those columns.

We change the column headings in Step 5 to ones that are more intuitive. Note that the order matters here. We need to set the value labels before changing the column names, since the setvalues dictionary is based on the original column headings.

The main advantage of using pyreadr to read R files directly into pandas is that we do not have to convert the R data into a CSV file first. Once we have written our Python code to read the file, we can just rerun it whenever the R data changes. This is particularly helpful when we do not have R on the machine where we work.

There’s more…

Pyreadr is able to return multiple DataFrames. This is useful when we save several data objects in R as an rdata file. We can return all of them with one call.

Pprint is a handy tool for improving the display of Python dictionaries.

We could have used rpy2 instead of pyreadr to import R data. rpy2 requires that R also be installed, but it is more powerful than pyreadr. It will read R factors and automatically set them to pandas DataFrame values. See the following code:

import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
readRDS = robjects.r['readRDS']
nls97withvalues = readRDS('data/nls97withvalues.rds')
nls97withvalues

          R0000100      R0536300     ...    U2963000         Z9063900
1         1             Female       ...    -2147483648      52
2         2             Male         ...    6                0
3         3             Female       ...    6                0
4         4             Female       ...    6                4
5         5             Male         ...    5                12
...       ...           ...          ...    ...             ...
8980     9018           Female       ...    4                49
8981     9019           Male         ...    6                0
8982     9020           Male         ...    -2147483648      15
8983     9021           Male         ...    7                50
8984     9022           Female       ...    7                20
[8984 rows x 42 columns]

This generates unusual –2147483648 values. This is what happened when readRDS interpreted missing data in numeric columns. A global replacement of that number with NaN, after confirming that that is not a valid value, would be a good next step.

Persisting tabular data

We persist data, copy it from memory to local or remote storage, for several reasons: to be able to access the data without having to repeat the steps we used to generate it, to share the data with others, or to make it available for use with different software. In this recipe, we save data that we have loaded into a pandas DataFrame as different file types (CSV, Excel, Pickle, and Feather).

Another important, but sometimes overlooked, reason to persist data is to preserve some segment of our data that needs to be examined more closely; perhaps it needs to be scrutinized by others before our analysis can be completed. For analysts who work with operational data in medium- to large-sized organizations, this process is part of the daily data-cleaning workflow.

In addition to these reasons for persisting data, our decisions about when and how to serialize data are shaped by several other factors: where we are in terms of our data analysis projects, the hardware and software resources of the machine(s) saving and reloading the data, and the size of our dataset. Analysts end up having to be much more intentional when saving data than they are when pressing Ctrl + S in their word-processing application.

Once we persist data, it is stored separately from the logic that we used to create it. I find this to be one of the most important threats to the integrity of our analysis. Often, we end up loading data that we saved some time in the past (a week ago? A month ago? A year ago?) and forget how a variable was defined and how it relates to other variables. If we are in the middle of a data-cleaning task, it is best not to persist our data, so long as our workstation and network can easily handle the burden of regenerating the data. It is a good idea to persist data only once we have reached milestones in our work.

Beyond the question of when to persist data, there is the question of how. If we are persisting it for our own reuse with the same software, it is best to save it in a binary format native to that software. That is pretty straightforward for tools such as SPSS, SAS, Stata, and R, but not so much for pandas. But that is good news in a way. We have lots of choices, from CSV and Excel to Pickle and Feather. We save as all these file types in this recipe.

Note

Pickle and Feather are binary file formats that can be used to store pandas DataFrames.

Getting ready

You will need to install Feather if you do not have it on your system. You can do that by entering pip install pyarrow in a Terminal window or powershell (in Windows). If you do not already have a subfolder named Views in your chapter 1 folder, you will need to create it in order to run the code for this recipe.

Data note

How to do it…

We will load a CSV file into pandas and then save it as a Pickle and a Feather file. We will also save subsets of the data to the CSV and Excel formats:

Import pandas and pyarrow.

pyarrow needs to be imported in order to save pandas to Feather:

import pandas as pd
import pyarrow

Load the land temperatures CSV file into pandas, drop rows with missing data, and set an index:

landtemps = \
...   pd.read_csv('data/landtempssample.csv',
...     names=['stationid','year','month','avgtemp',
...       'latitude','longitude','elevation',
...       'station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)
landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
landtemps.dropna(subset=['avgtemp'], inplace=True)
landtemps.dtypes

measuredate	datetime64[ns]
stationid	object
avgtemp	float64
latitude	float64
longitude	float64
elevation	float64
station	object
countryid	object
country	object
dtype: object

landtemps.set_index(['measuredate','stationid'], inplace=True)

Write extreme values for temperature to CSV and Excel files.

Use the quantile method to select outlier rows, which are those at the 1 in 1,000 level at each end of the distribution:

extremevals = landtemps[(landtemps.avgtemp < landtemps.avgtemp.quantile(.001)) | (landtemps.avgtemp > landtemps.avgtemp.quantile(.999))]
extremevals.shape

(171, 7)

extremevals.sample(7)

                           avgtemp  ...   country
measuredate  stationid              ...       
2013-08-01	QAM00041170	35.30    ...	Qatar
2005-01-01	RSM00024966	-40.09   ...	Russia
1973-03-01	CA002401200	-40.26   ...	Canada
2007-06-01	KU000405820	37.35    ...	Kuwait
1987-07-01	SUM00062700	35.50    ...	Sudan
1998-02-01	RSM00025325	-35.71   ...	Russia
1968-12-01	RSM00024329	-43.20   ...	Russia
[7 rows x 7 columns]

extremevals.to_excel('views/tempext.xlsx')
extremevals.to_csv('views/tempext.csv')

Save to Pickle and Feather files.

The index needs to be reset in order to save a Feather file:

landtemps.to_pickle('data/landtemps.pkl')
landtemps.reset_index(inplace=True)
landtemps.to_feather("data/landtemps.ftr")

Load the Pickle and Feather files we just saved.

Note that our index was preserved when saving and loading the Pickle file:

landtemps = pd.read_pickle('data/landtemps.pkl')
landtemps.head(2).T

measuredate	2000-04-01	1940-05-01
stationid	USS0010K01S	CI000085406
avgtemp	5.27		18.04
latitude	39.90		-18.35
longitude	-110.75		-70.33
elevation	2,773.70	58.00
station	INDIAN_CANYON	ARICA
countryid	US		CI
country	United States	Chile

landtemps = pd.read_feather("data/landtemps.ftr")
landtemps.head(2).T

                               0                    1
measuredate	2000-04-01 00:00:00	1940-05-01 00:00:00
stationid	USS0010K01S		CI000085406
avgtemp	5.27			18.04
latitude	39.90			-18.35
longitude	-110.75			-70.33
elevation	2,773.70		58.00
station	INDIAN_CANYON		ARICA
countryid	US			CI
country	United States		Chile

The previous steps demonstrated how to serialize pandas DataFrames using two different formats, Pickle and Feather.

How it works...

Persisting pandas data is quite straightforward. DataFrames have the to_csv, to_excel, to_pickle, and to_feather methods. Pickling preserves our index.

There’s more...

The advantage of storing data in CSV files is that saving it uses up very little additional memory. The disadvantage is that writing CSV files is quite slow, and we lose important metadata, such as data types. (read_csv can often figure out the data type when we reload the file, but not always.) Pickle files keep that data but can burden a system that is low on resources when serializing. Feather is easier on resources and can be easily loaded in R as well as Python, but we have to sacrifice our index in order to serialize. Also, the authors of Feather make no promises regarding long-term support.

You may have noticed that I do not make a global recommendation about what to use for data serialization – other than to limit your persistence of full datasets to project milestones. This is definitely one of those “right tools for the right job” kind of situations. I use CSV or Excel files when I want to share a segment of a file with colleagues for discussion. I use Feather for ongoing Python projects, particularly when I am using a machine with sub-par RAM and an outdated chip and also using R. When I am wrapping up a project, I pickle the DataFrames.