Importing more complicated JSON data from an API
In the previous recipe, we discussed one significant advantage (and challenge) of working with JSON data – its flexibility. A JSON file can have just about any structure its authors can imagine. This often means that this data does not have the tabular structure of the data sources we have discussed so far, and that pandas DataFrames have. Often, analysts and application developers use JSON precisely because it does not insist on a tabular structure. I know I do!
Retrieving data from multiple tables often requires us to do a one-to-many merge. Saving that data to one table or file means duplicating data on the "one" side of the one-to-many relationship. For example, student demographic data is merged with data on the courses studied, and the demographic data is repeated for each course. With JSON, duplication is not required to capture these items of data in one file. We can have data on the courses studied nested within the data for each student.
But doing analysis with JSON structured in this way will eventually require us to either: 1) manipulate the data in a very different way than we are used to doing; or 2) convert the JSON to a tabular form. We examine the first approach in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning. This recipe takes the second approach. It uses a very handy tool for converting selected nodes of JSON to a tabular structure – json_normalize
.
We first use an API to get JSON data because that is how JSON is frequently consumed. One advantage of retrieving the data with an API, rather than working from a file we have saved locally, is that it is easier to rerun our code when the source data is refreshed.
Getting ready
This recipe assumes you have the requests
and pprint
libraries already installed. If they are not installed, you can install them with pip. From the terminal (or PowerShell in Windows), enter pip
install
requests
and pip
install
pprint
.
The following is the structure of the JSON file that is created when using the collections API of the Cleveland Museum of Art. There is a helpful info
section at the beginning, but we are interested in the data
section. This data does not fit nicely into a tabular data structure. There may be several citations
objects and several creators
objects for each collection object. I have abbreviated the JSON file to save space:
{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, "data": [ { "id": 165157, "accession_number": "2007.158", "title": "Fulton and Nostrand", "creation_date": "1958", "citations": [ { "citation": "Annual Exhibition: Sculpture, Paintings...", "page_number": "Unpaginated, [8],[12]", "url": null }, { "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...", "page_number": "P. 60", "url": null }] "creators": [ { "description": "Jacob Lawrence (American, 1917-2000)", "extent": null, "qualifier": null, "role": "artist", "birth_year": "1917", "death_year": "2000" } ] }
Note
The API used in this recipe is provided by the Cleveland Museum of Art. It is available for public use at https://openaccess-api.clevelandart.org/.
How to do it...
Create a DataFrame from the museum's collections data with one row for each citation
, and the title
and creation_date
duplicated:
- Import the
json
,requests
, andpprint
libraries.We need the
requests
library to use an API to retrieve JSON data.pprint
improves the display of lists and dictionaries:>>> import pandas as pd >>> import numpy as np >>> import json >>> import pprint >>> import requests
- Use an API to load the JSON data.
Make a
get
request to the collections API of the Cleveland Museum of Art. Use the query string to indicate that you just want collections from African-American artists. Display the first collection item. I have truncated the output for the first item to save space:>>> response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists") >>> camcollections = json.loads(response.text) >>> print(len(camcollections['data'])) 778 >>> pprint.pprint(camcollections['data'][0]) {'accession_number': '2007.158', 'catalogue_raisonne': None, 'citations': [{'citation': 'Annual Exhibition: Sculpture...', 'page_number': 'Unpaginated, [8],[12]', 'url': None}, {'citation': '"Moscow to See Modern U.S....', 'page_number': 'P. 60', 'url': None}] 'collection': 'American - Painting', 'creation_date': '1958', 'creators': [{'biography': 'Jacob Lawrence (born 1917)...', 'birth_year': '1917', 'description': 'Jacob Lawrence (American...)', 'role': 'artist'}], 'type': 'Painting'}
- Flatten the JSON data.
Create a DataFrame from the JSON data using the
json_normalize
method. Indicate that the number of citations will determine the number of rows, and thataccession_number
,title
,creation_date
,collection
,creators
, andtype
will be repeated. Observe that the data has been flattened by displaying the first two observations, transposing them with the.T
option to make it easier to view:>>> camcollectionsdf=pd.json_normalize(camcollections['data'],/ 'citations',['accession_number','title','creation_date',/ 'collection','creators','type']) >>> camcollectionsdf.head(2).T 0 1 citation Annual Exhibiti... "Moscow to See Modern... page_number Unpaginated, P. 60 url None None accession_number 2007.158 2007.158 title Fulton and No... Fulton and No... creation_date 1958 1958 collection American - Pa... American - Pa... creators [{'description': 'J... [{'description': 'J... type Painting Painting
- Pull the
birth_year
value fromcreators
:>>> creator = camcollectionsdf[:1].creators[0] >>> type(creator[0]) <class 'dict'> >>> pprint.pprint(creator) [{'biography': 'Jacob Lawrence (born 1917) has been a prominent art...', 'birth_year': '1917', 'death_year': '2000', 'description': 'Jacob Lawrence (American, 1917-2000)', 'extent': None, 'name_in_original_language': None, 'qualifier': None, 'role': 'artist'}] >>> camcollectionsdf['birthyear'] = camcollectionsdf.\ ... creators.apply(lambda x: x[0]['birth_year']) >>> camcollectionsdf.birthyear.value_counts().\ ... sort_index().head() 1821 18 1886 2 1888 1 1892 13 1899 17 Name: birthyear, dtype: int64
This gives us a pandas DataFrame with one row for each citation
for each collection item, with the collection information (title
, creation_date
, and so on) duplicated.
How it works…
We work with a much more interesting JSON file in this recipe than in the previous one. Each object in the JSON file is an item in the collection of the Cleveland Museum of Art. Nested within each collection item are one or more citations. The only way to capture this information in a tabular DataFrame is to flatten it. There are also one or more dictionaries for creators of the collection item (the artist or artists). That dictionary (or dictionaries) contains the birth_year
value that we want.
We want one row for every citation for all collection items. To understand this, imagine that we are working with relational data and have a collections table and a citations table, and that we are doing a one-to-many merge from collections to citations. We do something similar with json_normalize
by using citations as the second parameter. That tells json_normalize
to create one row for each citation and use the key values in each citation dictionary – for citation
, page_number
, and url
– as data values.
The third parameter in the call to json_normalize
has the list of column names for the data that will be repeated with each citation. Notice that access_number
, title
, creation_date
, collection
, creators
, and type
are repeated in observations one and two. Citation
and page_number
change. (url
is the same value for the first and second citations. Otherwise, it would also change.)
This still leaves us with the problem of the creators dictionaries (there can be more than one creator). When we ran json_normalize
it grabbed the value for each key we indicated (in the third parameter) and stored it in the data for that column and row, whether that value was simple text or a list of dictionaries, as is the case for creators
. We take a look at the first (and in this case, only) creators
item for the first collections row in step 10, naming it creator
. (Note that the creators
list is duplicated across all citations
for a collection item, just as the values for title
, creation_date
, and so on are.)
We want the birth year for the first creator for each collection item, which can be found at creator[0]['birth_year']
. To create a birthyear
series using this, we use apply
and a lambda
function:
>>> camcollectionsdf['birthyear'] = camcollectionsdf.\ ... creators.apply(lambda x: x[0]['birth_year'])
We take a closer look at lambda functions in Chapter 6, Cleaning and Exploring Data with Series Operations. Here, it is helpful to think of the x
as representing the creators
series, so x[0]
gives us the list item we want, creators[0]
. We grab the value from the birth_year
key.
There's more…
You may have noticed that we left out some of the JSON returned by the API in our call to json_normalize
. The first parameter that we passed to json_normalize
was camcollections['data']
. Effectively, we ignore the info
object at the beginning of the JSON data. The information we want does not start until the data
object. This is not very different conceptually from the skiprows
parameter in the second recipe of the previous chapter. There is sometimes metadata like this at the beginning of JSON files.
See also
The preceding recipe demonstrates some useful techniques for doing data integrity checks without pandas, including list operations and comprehensions. Those are all relevant for the data in this recipe as well.