Persisting JSON data
There are several reasons why we might want to serialize data to a JSON file:
- We may have retrieved the data from an API, but need to keep a snapshot of it.
- The data in the JSON file is relatively static and informs our data cleaning and analysis over multiple phases of a project.
- We might decide that the flexibility of a schema-less format such as JSON helps us solve many data cleaning and analysis problems.
It is worth highlighting this last reason to use JSON – that it can solve many data problems. Although tabular data structures clearly have many benefits, particularly for operational data, they are often not the best way to store data for analysis purposes. In preparing data for analysis, a substantial amount of time is spent either merging data from different tables or dealing with data redundancy when working with flat files. Not only are these processes time-consuming, but every merge or reshaping leaves the door open to a data error of broad scope. This can also mean that we end up paying too much attention to the mechanics of manipulating data and too little to the conceptual issues at the core of our work.
We return to the Cleveland Museum of Art collections data in this recipe. There are at least three possible units of analysis for this data file – the collection item level, the creator level, and the citation level. JSON allows us to nest citations and creators within collections. (You can examine the structure of the JSON file in the Getting ready section of this recipe.) This data cannot be persisted in a tabular structure without flattening the file, which we did in an earlier recipe in this chapter. In this recipe, we will use two different methods to persist JSON data, each with its own advantages and disadvantages.
Getting ready
We will be working with data on the Cleveland Museum of Art's collection of works by African-American artists. The following is the structure of the JSON data returned by the API. It has been abbreviated to save space:
{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, "data": [ { "id": 165157, "accession_number": "2007.158", "title": "Fulton and Nostrand", "creation_date": "1958", "citations": [   {    "citation": "Annual Exhibition: Sculpture, Paintings...",    "page_number": "Unpaginated, [8],[12]",    "url": null    },   {    "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...",      "page_number": "P. 60",    "url": null   }] "creators": [       {      "description": "Jacob Lawrence (American, 1917-2000)",      "extent": null,      "qualifier": null,      "role": "artist",      "birth_year": "1917",      "death_year": "2000"      }   ] }
How to do it...
We will serialize the JSON data using two different methods:
1. Load the pandas, json, pprint, requests, and msgpack libraries:
>>> import pandas as pd
>>> import json
>>> import pprint
>>> import requests
>>> import msgpack
2. Load the JSON data from an API. I have abbreviated the JSON output:
>>> response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists")
>>> camcollections = json.loads(response.text)
>>> print(len(camcollections['data']))
778
>>> pprint.pprint(camcollections['data'][0])
{'accession_number': '2007.158',
 'catalogue_raisonne': None,
 'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                'page_number': 'Unpaginated, [8],[12]',
                'url': None},
               {'citation': '"Moscow to See Modern U.S....',
                'page_number': 'P. 60',
                'url': None}],
 'collection': 'American - Painting',
 'creation_date': '1958',
 'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
               'birth_year': '1917',
               'description': 'Jacob Lawrence (American...',
               'role': 'artist'}],
 'type': 'Painting'}
3. Save and reload the JSON file using Python's json library. Persist the JSON data in human-readable form, then reload it from the saved file and confirm that it worked by retrieving the creators data from the first collections item:
>>> with open("data/camcollections.json", "w") as f:
...   json.dump(camcollections, f)
...
>>> with open("data/camcollections.json", "r") as f:
...   camcollections = json.load(f)
...
>>> pprint.pprint(camcollections['data'][0]['creators'])
[{'biography': 'Jacob Lawrence (born 1917) has been a prominent artist since...',
  'birth_year': '1917',
  'description': 'Jacob Lawrence (American, 1917-2000)',
  'role': 'artist'}]
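Since human readability is the point of this method, it can be worth trading a little extra file size for pretty-printed output. A minimal variation; indent is a standard json.dump parameter:
>>> with open("data/camcollections.json", "w") as f:
...   json.dump(camcollections, f, indent=2)  # pretty-prints the file, at the cost of a larger size
...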
4. Save and reload the JSON file using msgpack:
>>> with open("data/camcollections.msgpack", "wb") as outfile:
...     packed = msgpack.packb(camcollections)
...     outfile.write(packed)
...
1586507
>>> with open("data/camcollections.msgpack", "rb") as data_file:
...     msgbytes = data_file.read()
...
>>> camcollections = msgpack.unpackb(msgbytes)
>>> pprint.pprint(camcollections['data'][0]['creators'])
[{'biography': 'Jacob Lawrence (born 1917) has been a prominent...',
  'birth_year': '1917',
  'death_year': '2000',
  'description': 'Jacob Lawrence (American, 1917-2000)',
  'role': 'artist'}]
How it works…
We use the Cleveland Museum of Art's collections API to retrieve collections items. The african_american_artists flag in the query string indicates that we just want collections for those creators. json.loads returns a dictionary with two keys: info, which holds query metadata, and data, which is a list of dictionaries, one per collection item. We check the length of the data list, which tells us that there are 778 items in collections. We then display the first item of collections to get a better look at the structure of the data. (I have abbreviated the JSON output.)
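As a quick sketch of navigating the parsed structure, the following drills into the nested dictionaries; the only outputs shown are values that appear earlier in this recipe:
>>> camcollections['info']['total']
778
>>> camcollections['data'][0]['creators'][0]['birth_year']
'1917'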
We save and then reload the data using Python's json library in step 3. The advantage of persisting the data in this way is that it keeps the data in human-readable form. Unfortunately, it has two disadvantages: saving takes longer than alternative serialization methods, and it uses more storage space.
In step 4, we use msgpack to persist our data. This is faster than Python's json library, and the saved file uses less space. Of course, the disadvantage is that the saved file is binary rather than human-readable text.
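If you want to verify this trade-off on your own data, a minimal timing-and-size sketch like the following will do; the numbers you get will depend on your machine and on the size of camcollections:
>>> import os, time
>>> start = time.perf_counter()
>>> with open("data/camcollections.json", "w") as f:
...     json.dump(camcollections, f)
...
>>> print(time.perf_counter() - start, os.path.getsize("data/camcollections.json"))
>>> start = time.perf_counter()
>>> with open("data/camcollections.msgpack", "wb") as f:
...     f.write(msgpack.packb(camcollections))
...
>>> print(time.perf_counter() - start, os.path.getsize("data/camcollections.msgpack"))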
There's more…
I use both methods for persisting JSON data in my work. When I am working with small amounts of data, and that data is relatively static, I prefer human-readable JSON. A great use case for this is the recipes in the previous chapter where we needed to create value labels.
I use msgpack when I am working with large amounts of data that changes regularly. msgpack files are also great when you want to take regular snapshots of key tables in enterprise databases.
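As one illustration of that snapshot pattern, here is a minimal sketch. The orders table, the conn connection object, and the file-naming scheme are all hypothetical; read_sql, to_dict, and packb are standard pandas and msgpack calls:
>>> from datetime import date
>>> df = pd.read_sql("SELECT * FROM orders", conn)  # orders and conn are hypothetical
>>> rows = df.to_dict(orient="records")  # list of row dictionaries, like the 'data' list above
>>> with open(f"data/orders_{date.today()}.msgpack", "wb") as f:
...     f.write(msgpack.packb(rows, default=str))  # default=str handles dates and other types msgpack cannot serialize
...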
The Cleveland Museum of Art's collections data is similar in at least one important way to the data we work with every day. The unit of analysis frequently changes. Here we are looking at collections, citations, and creators. In our work, we might have to simultaneously look at students and courses, or households and deposits. An enterprise database system for the museum data would likely have separate collections, citations, and creators tables that we would eventually need to merge. The resulting merged file would have data redundancy issues that we would need to account for whenever we changed the unit of analysis.
When we alter our data cleaning process to work directly from JSON or parts of it, we end up eliminating a major source of errors. We do more data cleaning with JSON in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning.
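To see what working directly from the JSON looks like, here is a minimal sketch that moves to the citation level of analysis with pandas' json_normalize, with no database merge involved. It assumes every item has a citations list; the meta columns are fields we saw in the structure in the Getting ready section:
>>> citations = pd.json_normalize(
...     camcollections['data'],
...     record_path='citations',  # one row per citation
...     meta=['id', 'title'])  # carry item-level fields onto each citation row
>>> citations.head()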