You're reading from Python Data Cleaning Cookbook Modern techniques and Python tools to detect and remove dirty data and extract key insights

Product type Paperback

Published in Dec 2020

Publisher Packt

ISBN-13 9781800565661

Length 436 pages

Edition 1st Edition

Languages

Python

Tools

Pandas

Concepts

Data Analysis

Authors (2):

Michael B Walker

Michael Walker

View More author details

Table of Contents (12) Chapters

Preface

1. Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas

2. Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas FREE CHAPTER

3. Chapter 3: Taking the Measure of Your Data

4. Chapter 4: Identifying Missing Values and Outliers in Subsets of Data

5. Chapter 5: Using Visualizations for the Identification of Unexpected Values

6. Chapter 6: Cleaning and Exploring Data with Series Operations

7. Chapter 7: Fixing Messy Data when Aggregating

8. Chapter 8: Addressing Data Issues When Combining DataFrames

9. Chapter 9: Tidying and Reshaping Data

10. Chapter 10: User-Defined Functions and Classes to Automate Data Cleaning

11. Other Books You May Enjoy

Leave a review - let other readers know what you think

Importing more complicated JSON data from an API

In the previous recipe, we discussed one significant advantage (and challenge) of working with JSON data – its flexibility. A JSON file can have just about any structure its authors can imagine. This often means that this data does not have the tabular structure of the data sources we have discussed so far, and that pandas DataFrames have. Often, analysts and application developers use JSON precisely because it does not insist on a tabular structure. I know I do!

Retrieving data from multiple tables often requires us to do a one-to-many merge. Saving that data to one table or file means duplicating data on the "one" side of the one-to-many relationship. For example, student demographic data is merged with data on the courses studied, and the demographic data is repeated for each course. With JSON, duplication is not required to capture these items of data in one file. We can have data on the courses studied nested within the data for each student.

But doing analysis with JSON structured in this way will eventually require us to either: 1) manipulate the data in a very different way than we are used to doing; or 2) convert the JSON to a tabular form. We examine the first approach in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning. This recipe takes the second approach. It uses a very handy tool for converting selected nodes of JSON to a tabular structure – json_normalize.

We first use an API to get JSON data because that is how JSON is frequently consumed. One advantage of retrieving the data with an API, rather than working from a file we have saved locally, is that it is easier to rerun our code when the source data is refreshed.

Getting ready

This recipe assumes you have the requests and pprint libraries already installed. If they are not installed, you can install them with pip. From the terminal (or PowerShell in Windows), enter pip install requests and pip install pprint.

The following is the structure of the JSON file that is created when using the collections API of the Cleveland Museum of Art. There is a helpful info section at the beginning, but we are interested in the data section. This data does not fit nicely into a tabular data structure. There may be several citations objects and several creators objects for each collection object. I have abbreviated the JSON file to save space:

{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, 
"data": [
{
"id": 165157, 
"accession_number": "2007.158", 
"title": "Fulton and Nostrand", 
"creation_date": "1958", 
"citations": [
  {
   "citation": "Annual Exhibition: Sculpture, Paintings...", 
   "page_number": "Unpaginated, [8],[12]", 
   "url": null
   }, 
  {
   "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...",   
   "page_number": "P. 60",
   "url": null
  }]
"creators": [
      {
     "description": "Jacob Lawrence (American, 1917-2000)", 
     "extent": null, 
     "qualifier": null, 
     "role": "artist", 
     "birth_year": "1917", 
     "death_year": "2000"
     }
  ]
 }

Note

The API used in this recipe is provided by the Cleveland Museum of Art. It is available for public use at https://openaccess-api.clevelandart.org/.

How to do it...

Create a DataFrame from the museum's collections data with one row for each citation, and the title and creation_date duplicated:

Import the json, requests, and pprint libraries.
We need the requests library to use an API to retrieve JSON data. pprint improves the display of lists and dictionaries:
```
>>> import pandas as pd
>>> import numpy as np
>>> import json
>>> import pprint
>>> import requests
```

Use an API to load the JSON data.

Make a get request to the collections API of the Cleveland Museum of Art. Use the query string to indicate that you just want collections from African-American artists. Display the first collection item. I have truncated the output for the first item to save space:

>>> response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists")
>>> camcollections = json.loads(response.text)
>>> print(len(camcollections['data']))
778
>>> pprint.pprint(camcollections['data'][0])
{'accession_number': '2007.158',
 'catalogue_raisonne': None,
 'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                'page_number': 'Unpaginated, [8],[12]',
                'url': None},
               {'citation': '"Moscow to See Modern U.S....',
                'page_number': 'P. 60',
                'url': None}]
 'collection': 'American - Painting',
 'creation_date': '1958',
 'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
               'birth_year': '1917',
               'description': 'Jacob Lawrence (American...)',
               'role': 'artist'}],
 'type': 'Painting'}

Flatten the JSON data.

Create a DataFrame from the JSON data using the json_normalize method. Indicate that the number of citations will determine the number of rows, and that accession_number, title, creation_date, collection, creators, and type will be repeated. Observe that the data has been flattened by displaying the first two observations, transposing them with the .T option to make it easier to view:

>>> camcollectionsdf=pd.json_normalize(camcollections['data'],/
 'citations',['accession_number','title','creation_date',/
 'collection','creators','type'])
>>> camcollectionsdf.head(2).T
                          0                       1
citation        Annual Exhibiti...  "Moscow to See Modern...
page_number           Unpaginated,                     P. 60
url                          None                       None
accession_number         2007.158                   2007.158
title            Fulton and No...           Fulton and No...
creation_date                1958                       1958
collection       American - Pa...           American - Pa...
creators   [{'description': 'J...     [{'description': 'J...
type                     Painting                   Painting

Pull the birth_year value from creators:

>>> creator = camcollectionsdf[:1].creators[0]
>>> type(creator[0])
<class 'dict'>
>>> pprint.pprint(creator)
[{'biography': 'Jacob Lawrence (born 1917) has been a prominent art...',
  'birth_year': '1917',
  'death_year': '2000',
  'description': 'Jacob Lawrence (American, 1917-2000)',
  'extent': None,
  'name_in_original_language': None,
  'qualifier': None,
  'role': 'artist'}]
>>> camcollectionsdf['birthyear'] = camcollectionsdf.\
...   creators.apply(lambda x: x[0]['birth_year'])
>>> camcollectionsdf.birthyear.value_counts().\
...   sort_index().head()
1821    18
1886     2
1888     1
1892    13
1899    17
Name: birthyear, dtype: int64

This gives us a pandas DataFrame with one row for each citation for each collection item, with the collection information (title, creation_date, and so on) duplicated.

How it works…

We work with a much more interesting JSON file in this recipe than in the previous one. Each object in the JSON file is an item in the collection of the Cleveland Museum of Art. Nested within each collection item are one or more citations. The only way to capture this information in a tabular DataFrame is to flatten it. There are also one or more dictionaries for creators of the collection item (the artist or artists). That dictionary (or dictionaries) contains the birth_year value that we want.

We want one row for every citation for all collection items. To understand this, imagine that we are working with relational data and have a collections table and a citations table, and that we are doing a one-to-many merge from collections to citations. We do something similar with json_normalize by using citations as the second parameter. That tells json_normalize to create one row for each citation and use the key values in each citation dictionary – for citation, page_number, and url – as data values.

The third parameter in the call to json_normalize has the list of column names for the data that will be repeated with each citation. Notice that access_number, title, creation_date, collection, creators, and type are repeated in observations one and two. Citation and page_number change. (url is the same value for the first and second citations. Otherwise, it would also change.)

This still leaves us with the problem of the creators dictionaries (there can be more than one creator). When we ran json_normalize it grabbed the value for each key we indicated (in the third parameter) and stored it in the data for that column and row, whether that value was simple text or a list of dictionaries, as is the case for creators. We take a look at the first (and in this case, only) creators item for the first collections row in step 10, naming it creator. (Note that the creators list is duplicated across all citations for a collection item, just as the values for title, creation_date, and so on are.)

We want the birth year for the first creator for each collection item, which can be found at creator[0]['birth_year']. To create a birthyear series using this, we use apply and a lambda function:

>>> camcollectionsdf['birthyear'] = camcollectionsdf.\
...   creators.apply(lambda x: x[0]['birth_year'])

We take a closer look at lambda functions in Chapter 6, Cleaning and Exploring Data with Series Operations. Here, it is helpful to think of the x as representing the creators series, so x[0] gives us the list item we want, creators[0]. We grab the value from the birth_year key.

There's more…

You may have noticed that we left out some of the JSON returned by the API in our call to json_normalize. The first parameter that we passed to json_normalize was camcollections['data']. Effectively, we ignore the info object at the beginning of the JSON data. The information we want does not start until the data object. This is not very different conceptually from the skiprows parameter in the second recipe of the previous chapter. There is sometimes metadata like this at the beginning of JSON files.