Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Python Data Cleaning Cookbook

You're reading from   Python Data Cleaning Cookbook Modern techniques and Python tools to detect and remove dirty data and extract key insights

Arrow left icon
Product type Paperback
Published in Dec 2020
Publisher Packt
ISBN-13 9781800565661
Length 436 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Michael B Walker Michael B Walker
Author Profile Icon Michael B Walker
Michael B Walker
Michael Walker Michael Walker
Author Profile Icon Michael Walker
Michael Walker
Arrow right icon
View More author details
Toc

Table of Contents (12) Chapters Close

Preface 1. Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas 2. Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas FREE CHAPTER 3. Chapter 3: Taking the Measure of Your Data 4. Chapter 4: Identifying Missing Values and Outliers in Subsets of Data 5. Chapter 5: Using Visualizations for the Identification of Unexpected Values 6. Chapter 6: Cleaning and Exploring Data with Series Operations 7. Chapter 7: Fixing Messy Data when Aggregating 8. Chapter 8: Addressing Data Issues When Combining DataFrames 9. Chapter 9: Tidying and Reshaping Data 10. Chapter 10: User-Defined Functions and Classes to Automate Data Cleaning 11. Other Books You May Enjoy

Persisting JSON data

There are several reasons why we might want to serialize a JSON file:

  • We may have retrieved the data with an API, but need to keep a snapshot of the data.
  • The data in the JSON file is relatively static and informs our data cleaning and analysis over multiple phases of a project.
  • We might decide that the flexibility of a schema-less format such as JSON helps us solve many data cleaning and analysis problems.

It is worth highlighting this last reason to use JSON – that it can solve many data problems. Although tabular data structures clearly have many benefits, particularly for operational data, they are often not the best way to store data for analysis purposes. In preparing data for analysis, a substantial amount of time is spent either merging data from different tables or dealing with data redundancy when working with flat files. Not only are these processes time consuming, but every merge or reshaping leaves the door open to a data error of broad scope. This can also mean that we end up paying too much attention to the mechanics of manipulating data and too little to the conceptual issues at the core of our work.

We return to the Cleveland Museum of Art collections data in this recipe. There are at least three possible units of analysis for this data file – the collection item level, the creator level, and the citation level. JSON allows us to nest citations and creators within collections. (You can examine the structure of the JSON file in the Getting ready section of this recipe.) This data cannot be persisted in a tabular structure without flattening the file, which we did in an earlier recipe in this chapter. In this recipe, we will use two different methods to persist JSON data, each with its own advantages and disadvantages.

Getting ready

We will be working with data on the Cleveland Museum of Art's collection of works by African-American artists. The following is the structure of the JSON data returned by the API. It has been abbreviated to save space:

{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, 
"data": [
{
"id": 165157, 
"accession_number": "2007.158", 
"title": "Fulton and Nostrand", 
"creation_date": "1958", 
"citations": [
  {
   "citation": "Annual Exhibition: Sculpture, Paintings...", 
   "page_number": "Unpaginated, [8],[12]", 
   "url": null
   }, 
  {
   "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...",   
   "page_number": "P. 60",
   "url": null
  }]
"creators": [
      {
     "description": "Jacob Lawrence (American, 1917-2000)", 
     "extent": null, 
     "qualifier": null, 
     "role": "artist", 
     "birth_year": "1917", 
     "death_year": "2000"
     }
  ]
 }

How to do it...

We will serialize the JSON data using two different methods:

  1. Load the pandas, json, pprint, requests, and msgpack libraries:
    >>> import pandas as pd
    >>> import json
    >>> import pprint
    >>> import requests
    >>> import msgpack
  2. Load the JSON data from an API. I have abbreviated the JSON output:
    >>> response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists")
    >>> camcollections = json.loads(response.text)
    >>> print(len(camcollections['data']))
    778
    >>> pprint.pprint(camcollections['data'][0])
    {'accession_number': '2007.158',
     'catalogue_raisonne': None,
     'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                    'page_number': 'Unpaginated, [8],[12]',
                    'url': None},
                   {'citation': '"Moscow to See Modern U.S....',
                    'page_number': 'P. 60',
                    'url': None}]
     'collection': 'American - Painting',
     'creation_date': '1958',
     'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
                   'birth_year': '1917',
                   'description': 'Jacob Lawrence (American...',
                   'role': 'artist'}],
     'type': 'Painting'}
  3. Save and reload the JSON file using Python's json library.

    Persist the JSON data in human-readable form. Reload it from the saved file and confirm that it worked by retrieving the creators data from the first collections item:

    >>> with open("data/camcollections.json","w") as f:
    ...   json.dump(camcollections, f)
    ... 
    >>> with open("data/camcollections.json","r") as f:
    ...   camcollections = json.load(f)
    ... 
    >>> pprint.pprint(camcollections['data'][0]['creators'])
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent artist since...'
      'birth_year': '1917',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'role': 'artist'}]
  4. Save and reload the JSON file using msgpack:
    >>> with open("data/camcollections.msgpack", "wb") as outfile:
    ...     packed = msgpack.packb(camcollections)
    ...     outfile.write(packed)
    ... 
    1586507
    >>> with open("data/camcollections.msgpack", "rb") as data_file:
    ...     msgbytes = data_file.read()
    ... 
    >>> camcollections = msgpack.unpackb(msgbytes)
    >>> pprint.pprint(camcollections['data'][0]['creators'])
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent...',
      'birth_year': '1917',
      'death_year': '2000',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'role': 'artist'}]

How it works…

We use the Cleveland Museum of Art's collections API to retrieve collections items. The african_american_artists flag in the query string indicates that we just want collections for those creators. json.loads returns a dictionary called info and a list of dictionaries called data. We check the length of the data list. This tells us that there are 778 items in collections. We then display the first item of collections to get a better look at the structure of the data. (I have abbreviated the JSON output.)

We save and then reload the data using Python's JSON library in step 3. The advantage of persisting the data in this way is that it keeps the data in human-readable form. Unfortunately, it has two disadvantages: saving takes longer than alternative serialization methods, and it uses more storage space.

In step 4, we use msgpack to persist our data. This is faster than Python's json library, and the saved file uses less space. Of course, the disadvantage is that the resulting JSON is binary rather than text-based.

There's more…

I use both methods for persisting JSON data in my work. When I am working with small amounts of data, and that data is relatively static, I prefer human-readable JSON. A great use case for this is the recipes in the previous chapter where we needed to create value labels.

I use msgpack when I am working with large amounts of data, where that data changes regularly. msgpack files are also great when you want to take regular snapshots of key tables in enterprise databases.

The Cleveland Museum of Art's collections data is similar in at least one important way to the data we work with every day. The unit of analysis frequently changes. Here we are looking at collections, citations, and creators. In our work, we might have to simultaneously look at students and courses, or households and deposits. An enterprise database system for the museum data would likely have separate collections, citations, and creators tables that we would eventually need to merge. The resulting merged file would have data redundancy issues that we would need to account for whenever we changed the unit of analysis.

When we alter our data cleaning process to work directly from JSON or parts of it, we end up eliminating a major source of errors. We do more data cleaning with JSON in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning.

You have been reading a chapter from
Python Data Cleaning Cookbook
Published in: Dec 2020
Publisher: Packt
ISBN-13: 9781800565661
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime