You're reading from Python Data Cleaning Cookbook Modern techniques and Python tools to detect and remove dirty data and extract key insights

Product type Paperback

Published in Dec 2020

Publisher Packt

ISBN-13 9781800565661

Length 436 pages

Edition 1st Edition

Languages

Python

Tools

Pandas

Concepts

Data Analysis

Authors (2):

Michael B Walker

Michael Walker

View More author details

Table of Contents (12) Chapters

Preface

1. Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas

2. Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas FREE CHAPTER

3. Chapter 3: Taking the Measure of Your Data

4. Chapter 4: Identifying Missing Values and Outliers in Subsets of Data

5. Chapter 5: Using Visualizations for the Identification of Unexpected Values

6. Chapter 6: Cleaning and Exploring Data with Series Operations

7. Chapter 7: Fixing Messy Data when Aggregating

8. Chapter 8: Addressing Data Issues When Combining DataFrames

9. Chapter 9: Tidying and Reshaping Data

10. Chapter 10: User-Defined Functions and Classes to Automate Data Cleaning

11. Other Books You May Enjoy

Leave a review - let other readers know what you think

Persisting JSON data

There are several reasons why we might want to serialize a JSON file:

We may have retrieved the data with an API, but need to keep a snapshot of the data.
The data in the JSON file is relatively static and informs our data cleaning and analysis over multiple phases of a project.
We might decide that the flexibility of a schema-less format such as JSON helps us solve many data cleaning and analysis problems.

It is worth highlighting this last reason to use JSON – that it can solve many data problems. Although tabular data structures clearly have many benefits, particularly for operational data, they are often not the best way to store data for analysis purposes. In preparing data for analysis, a substantial amount of time is spent either merging data from different tables or dealing with data redundancy when working with flat files. Not only are these processes time consuming, but every merge or reshaping leaves the door open to a data error of broad scope. This can also mean that we end up paying too much attention to the mechanics of manipulating data and too little to the conceptual issues at the core of our work.

We return to the Cleveland Museum of Art collections data in this recipe. There are at least three possible units of analysis for this data file – the collection item level, the creator level, and the citation level. JSON allows us to nest citations and creators within collections. (You can examine the structure of the JSON file in the Getting ready section of this recipe.) This data cannot be persisted in a tabular structure without flattening the file, which we did in an earlier recipe in this chapter. In this recipe, we will use two different methods to persist JSON data, each with its own advantages and disadvantages.

Getting ready

We will be working with data on the Cleveland Museum of Art's collection of works by African-American artists. The following is the structure of the JSON data returned by the API. It has been abbreviated to save space:

{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, 
"data": [
{
"id": 165157, 
"accession_number": "2007.158", 
"title": "Fulton and Nostrand", 
"creation_date": "1958", 
"citations": [
  {
   "citation": "Annual Exhibition: Sculpture, Paintings...", 
   "page_number": "Unpaginated, [8],[12]", 
   "url": null
   }, 
  {
   "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...",   
   "page_number": "P. 60",
   "url": null
  }]
"creators": [
      {
     "description": "Jacob Lawrence (American, 1917-2000)", 
     "extent": null, 
     "qualifier": null, 
     "role": "artist", 
     "birth_year": "1917", 
     "death_year": "2000"
     }
  ]
 }

How to do it...

We will serialize the JSON data using two different methods:

Load the pandas, json, pprint, requests, and msgpack libraries:

>>> import pandas as pd
>>> import json
>>> import pprint
>>> import requests
>>> import msgpack

Load the JSON data from an API. I have abbreviated the JSON output:

>>> response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists")
>>> camcollections = json.loads(response.text)
>>> print(len(camcollections['data']))
778
>>> pprint.pprint(camcollections['data'][0])
{'accession_number': '2007.158',
 'catalogue_raisonne': None,
 'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                'page_number': 'Unpaginated, [8],[12]',
                'url': None},
               {'citation': '"Moscow to See Modern U.S....',
                'page_number': 'P. 60',
                'url': None}]
 'collection': 'American - Painting',
 'creation_date': '1958',
 'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
               'birth_year': '1917',
               'description': 'Jacob Lawrence (American...',
               'role': 'artist'}],
 'type': 'Painting'}

Save and reload the JSON file using Python's json library.

Persist the JSON data in human-readable form. Reload it from the saved file and confirm that it worked by retrieving the creators data from the first collections item:

>>> with open("data/camcollections.json","w") as f:
...   json.dump(camcollections, f)
... 
>>> with open("data/camcollections.json","r") as f:
...   camcollections = json.load(f)
... 
>>> pprint.pprint(camcollections['data'][0]['creators'])
[{'biography': 'Jacob Lawrence (born 1917) has been a prominent artist since...'
  'birth_year': '1917',
  'description': 'Jacob Lawrence (American, 1917-2000)',
  'role': 'artist'}]

Save and reload the JSON file using msgpack:

>>> with open("data/camcollections.msgpack", "wb") as outfile:
...     packed = msgpack.packb(camcollections)
...     outfile.write(packed)
... 
1586507
>>> with open("data/camcollections.msgpack", "rb") as data_file:
...     msgbytes = data_file.read()
... 
>>> camcollections = msgpack.unpackb(msgbytes)
>>> pprint.pprint(camcollections['data'][0]['creators'])
[{'biography': 'Jacob Lawrence (born 1917) has been a prominent...',
  'birth_year': '1917',
  'death_year': '2000',
  'description': 'Jacob Lawrence (American, 1917-2000)',
  'role': 'artist'}]

How it works…

We use the Cleveland Museum of Art's collections API to retrieve collections items. The african_american_artists flag in the query string indicates that we just want collections for those creators. json.loads returns a dictionary called info and a list of dictionaries called data. We check the length of the data list. This tells us that there are 778 items in collections. We then display the first item of collections to get a better look at the structure of the data. (I have abbreviated the JSON output.)

We save and then reload the data using Python's JSON library in step 3. The advantage of persisting the data in this way is that it keeps the data in human-readable form. Unfortunately, it has two disadvantages: saving takes longer than alternative serialization methods, and it uses more storage space.

In step 4, we use msgpack to persist our data. This is faster than Python's json library, and the saved file uses less space. Of course, the disadvantage is that the resulting JSON is binary rather than text-based.

There's more…

I use both methods for persisting JSON data in my work. When I am working with small amounts of data, and that data is relatively static, I prefer human-readable JSON. A great use case for this is the recipes in the previous chapter where we needed to create value labels.

I use msgpack when I am working with large amounts of data, where that data changes regularly. msgpack files are also great when you want to take regular snapshots of key tables in enterprise databases.

The Cleveland Museum of Art's collections data is similar in at least one important way to the data we work with every day. The unit of analysis frequently changes. Here we are looking at collections, citations, and creators. In our work, we might have to simultaneously look at students and courses, or households and deposits. An enterprise database system for the museum data would likely have separate collections, citations, and creators tables that we would eventually need to merge. The resulting merged file would have data redundancy issues that we would need to account for whenever we changed the unit of analysis.

When we alter our data cleaning process to work directly from JSON or parts of it, we end up eliminating a major source of errors. We do more data cleaning with JSON in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning.