Python Data Cleaning Cookbook: Modern techniques and Python tools to detect and remove dirty data and extract key insights

By Michael Walker

Paperback | Dec 2020 | 436 pages | 1st Edition

Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas

This chapter continues our work on importing data from a variety of sources, and the initial checks we should do on the data after importing it. Gradually, over the last 25 years, data analysts have found that they increasingly need to work with data in non-tabular, semi-structured forms. Sometimes they even create and persist data in those forms themselves. We work with a common alternative to traditional tabular datasets in this chapter, JSON, but the general concepts can be extended to XML and NoSQL data stores such as MongoDB. We also go over common issues that occur when scraping data from websites.

In this chapter, we will work through the following recipes:

  • Importing simple JSON data
  • Importing more complicated JSON data from an API
  • Importing data from web pages
  • Persisting JSON data

Technical requirements

The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook

Importing simple JSON data

JavaScript Object Notation (JSON) has turned out to be an incredibly useful standard for transferring data from one machine, process, or node to another. Often a client sends a data request to a server, upon which that server queries the data in the local storage and then converts it from something like a SQL Server table or tables into JSON, which the client can consume. This is sometimes complicated further by the first server (say, a web server) forwarding the request to a database server. JSON facilitates this, as does XML, by doing the following:

  • Being readable by humans
  • Being consumable by most client devices
  • Not being limited in structure

JSON is quite flexible, which means that it can accommodate just about anything. The structure can even change within a JSON file, so different keys might be present at different points. For example, the file might begin with some explanatory keys that have a very different structure than the remaining data keys. Or some keys might be present in some cases, but not others. We go over some approaches for dealing with that messiness (uh, I mean flexibility).

Getting ready

We are going to work with data on news stories about political candidates in this recipe. This data is made available for public use at dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0ZLHOK. I have combined the JSON files there into one file and randomly selected 60,000 news stories from the combined data. This sample (allcandidatenewssample.json) is available in the GitHub repository of this book.

We will do a little work with list and dictionary comprehensions in this recipe. DataCamp has good guides to list comprehensions (https://www.datacamp.com/community/tutorials/python-list-comprehension) and dictionary comprehensions (https://www.datacamp.com/community/tutorials/python-dictionary-comprehension) if you are feeling a little rusty.
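
If you are feeling rusty, here is a minimal sketch of both kinds of comprehension, using made-up values rather than the news data we load later: a list comprehension that collects the number of keys in each dictionary, and a dictionary comprehension that builds a small lookup:

>>> sizes = [len(item) for item in [{'a': 1}, {'a': 1, 'b': 2}]]
>>> sizes
[1, 2]
>>> squares = {num: num**2 for num in range(3)}
>>> squares
{0: 0, 1: 1, 2: 4}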

How to do it…

We will import a JSON file into pandas after doing some data checking and cleaning:

  1. Import the json and pprint libraries.

    pprint improves the display of the lists and dictionaries that are returned when we load JSON data:

    >>> import pandas as pd
    >>> import numpy as np
    >>> import json
    >>> import pprint
    >>> from collections import Counter
  2. Load the JSON data and look for potential issues.

    Use the json load method to return data on news stories about political candidates. load returns a list of dictionaries. Use len to get the size of the list, which is the total number of news stories in this case. (Each list item is a dictionary with keys for the title, source, and so on, and their respective values.) Use pprint to display the first two dictionaries. Get the value from the source key for the first list item:

    >>> with open('data/allcandidatenewssample.json') as f:
    ...   candidatenews = json.load(f)
    ... 
    >>> len(candidatenews)
    60000
    >>> pprint.pprint(candidatenews[0:2])
    [{'date': '2019-12-25 10:00:00',
      'domain': 'www.nbcnews.com',
      'panel_position': 1,
      'query': 'Michael Bloomberg',
      'source': 'NBC News',
      'story_position': 6,
      'time': '18 hours ago',
      'title': 'Bloomberg cuts ties with company using prison inmates to make '
               'campaign calls',
      'url': 'https://www.nbcnews.com/politics/2020-election/bloomberg-cuts-ties-company-using-prison-inmates-make-campaign-calls-n1106971'},
     {'date': '2019-11-09 08:00:00',
      'domain': 'www.townandcountrymag.com',
      'panel_position': 1,
      'query': 'Amy Klobuchar',
      'source': 'Town & Country Magazine',
      'story_position': 3,
      'time': '18 hours ago',
      'title': "Democratic Candidates React to Michael Bloomberg's Potential Run",
      'url': 'https://www.townandcountrymag.com/society/politics/a29739854/michael-bloomberg-democratic-candidates-campaign-reactions/'}]
    >>> pprint.pprint(candidatenews[0]['source'])
    'NBC News'
  3. Check for differences in the structure of the dictionaries.

    Use Counter to check for any dictionaries in the list with fewer, or more, than the nine keys that are typical. Look at a few of the dictionaries with almost no data (those with just two keys) before removing them. Confirm that the remaining list of dictionaries has the expected length – 60000-2382=57618:

    >>> Counter([len(item) for item in candidatenews])
    Counter({9: 57202, 2: 2382, 10: 416})
    >>> pprint.pprint(next(item for item in candidatenews if len(item)<9))
    {'date': '2019-09-11 18:00:00', 'reason': 'Not collected'}
    >>> pprint.pprint(next(item for item in candidatenews if len(item)>9))
    {'category': 'Satire',
     'date': '2019-08-21 04:00:00',
     'domain': 'politics.theonion.com',
     'panel_position': 1,
     'query': 'John Hickenlooper',
     'source': 'Politics | The Onion',
     'story_position': 8,
     'time': '4 days ago',
      'title': "'And Then There Were 23,' Says Wayne Messam Crossing Out "
               'Hickenlooper Photo \n'
               'In Elaborate Grid Of Rivals',
     'url': 'https://politics.theonion.com/and-then-there-were-23-says-wayne-messam-crossing-ou-1837311060'}
    >>> pprint.pprint([item for item in candidatenews if len(item)==2][0:10])
    [{'date': '2019-09-11 18:00:00', 'reason': 'Not collected'},
     {'date': '2019-07-24 00:00:00', 'reason': 'No Top stories'},
    ... 
     {'date': '2019-01-03 00:00:00', 'reason': 'No Top stories'}]
    >>> candidatenews = [item for item in candidatenews if len(item)>2]
    >>> len(candidatenews)
    57618
  4. Generate counts from the JSON data.

    Get the dictionaries just for Politico (a website that covers political news) and display a couple of dictionaries:

    >>> politico = [item for item in candidatenews if item["source"] == "Politico"]
    >>> len(politico)
    2732
    >>> pprint.pprint(politico[0:2])
    [{'date': '2019-05-18 18:00:00',
      'domain': 'www.politico.com',
      'panel_position': 1,
      'query': 'Marianne Williamson',
      'source': 'Politico',
      'story_position': 7,
      'time': '1 week ago',
      'title': 'Marianne Williamson reaches donor threshold for Dem debates',
      'url': 'https://www.politico.com/story/2019/05/09/marianne-williamson-2020-election-1315133'},
     {'date': '2018-12-27 06:00:00',
      'domain': 'www.politico.com',
      'panel_position': 1,
      'query': 'Julian Castro',
      'source': 'Politico',
      'story_position': 1,
      'time': '1 hour ago',
      'title': "O'Rourke and Castro on collision course in Texas",
      'url': 'https://www.politico.com/story/2018/12/27/orourke-julian-castro-collision-texas-election-1073720'}]
  5. Get the source data and confirm that it has the anticipated length.

    Show the first few items in the new sources list. Generate a count of news stories by source and display the 10 most popular sources. Notice that stories from The Hill can have TheHill (without a space) or The Hill as the value for source:

    >>> sources = [item.get('source') for item in candidatenews]
    >>> type(sources)
    <class 'list'>
    >>> len(sources)
    57618
    >>> sources[0:5]
    ['NBC News', 'Town & Country Magazine', 'TheHill', 'CNBC.com', 'Fox News']
    >>> pprint.pprint(Counter(sources).most_common(10))
    [('Fox News', 3530),
     ('CNN.com', 2750),
     ('Politico', 2732),
     ('TheHill', 2383),
     ('The New York Times', 1804),
     ('Washington Post', 1770),
     ('Washington Examiner', 1655),
     ('The Hill', 1342),
     ('New York Post', 1275),
     ('Vox', 941)]
  6. Fix any errors in the values in the dictionary.

    Fix the source values for The Hill. Notice that The Hill is now the most frequent source for news stories:

    >>> for newsdict in candidatenews:
    ...     newsdict.update((k, "The Hill") for k, v in newsdict.items()
    ...      if k == "source" and v == "TheHill")
    ... 
    >>> sources = [item.get('source') for item in candidatenews]
    >>> pprint.pprint(Counter(sources).most_common(10))
    [('The Hill', 3725),
     ('Fox News', 3530),
     ('CNN.com', 2750),
     ('Politico', 2732),
     ('The New York Times', 1804),
     ('Washington Post', 1770),
     ('Washington Examiner', 1655),
     ('New York Post', 1275),
     ('Vox', 941),
     ('Breitbart', 799)]
  7. Create a pandas DataFrame.

    Pass the JSON data to the pandas DataFrame method:

    >>> candidatenewsdf = pd.DataFrame(candidatenews)
    >>> candidatenewsdf.dtypes
    title             object
    url               object
    source            object
    time              object
    date              object
    query             object
    story_position     int64
    panel_position    object
    domain            object
    category          object
    dtype: object
  8. Confirm that we are getting the expected values for source.

    Also, rename the date column and convert it to a datetime data type:

    >>> candidatenewsdf.rename(columns={'date':'storydate'}, inplace=True)
    >>> candidatenewsdf.storydate = candidatenewsdf.storydate.astype('datetime64[ns]')
    >>> candidatenewsdf.shape
    (57618, 10)
    >>> candidatenewsdf.source.value_counts(sort=True).head(10)
    The Hill               3725
    Fox News               3530
    CNN.com                2750
    Politico               2732
    The New York Times     1804
    Washington Post        1770
    Washington Examiner    1655
    New York Post          1275
    Vox                     941
    Breitbart               799
    Name: source, dtype: int64

We now have a pandas DataFrame with only the news stories where there is meaningful data, and with the values for source fixed.

How it works…

The json.load method returns a list of dictionaries. This makes it possible to use a number of familiar tools when working with this data: list methods, slicing, list comprehensions, dictionary updates, and so on. There are times, maybe when you just have to populate a list or count the number of individuals in a given category, when there is no need to use pandas.

In steps 2 to 6, we use list methods to do many of the same checks we have done with pandas in previous recipes. In step 3 we use Counter with a list comprehension (Counter([len(item) for item in candidatenews])) to get the number of keys in each dictionary. This tells us that there are 2,382 dictionaries with just 2 keys and 416 with 10. We use next to look for an example of dictionaries with fewer than 9 keys or more than 9 keys to get a sense of the structure of those items. We use slicing to show 10 dictionaries with 2 keys to see if there is any data in those dictionaries. We then select only those dictionaries with more than 2 keys.

In step 4 we create a subset of the list of dictionaries, one that just has source equal to Politico, and take a look at a couple of items. We then create a list with just the source data and use Counter to list the 10 most common sources in step 5.

Step 6 demonstrates how to replace key values conditionally in a list of dictionaries. In this case, we update the key value to The Hill whenever key (k) is source and value (v) is TheHill. The for k, v in newsdict.items() section is the unsung hero of this line. It loops through all key/value pairs for all dictionaries in candidatenews.
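
If that generator expression feels opaque, a plain loop does the same job. This is just an equivalent sketch, not the code used in the recipe:

>>> for newsdict in candidatenews:
...     if newsdict.get('source') == 'TheHill':
...         newsdict['source'] = 'The Hill'
... 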

It is easy to create a pandas DataFrame by passing the list of dictionaries to the pandas DataFrame method. We do this in step 7. The main complication is that we need to convert the date column from a string to a date, since dates are just strings in JSON.

There's more…

In steps 5 and 6 we use item.get('source') instead of item['source']. This is handy when there might be missing keys in a dictionary. get returns None when the key is missing, but we can use an optional second argument to specify a value to return.
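
For example, using one of the two-key dictionaries we saw in step 3, get lets us supply a fallback value instead of raising a KeyError:

>>> story = {'date': '2019-09-11 18:00:00', 'reason': 'Not collected'}
>>> print(story.get('source'))
None
>>> story.get('source', 'source missing')
'source missing'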

I renamed the date column to storydate in step 8. This is not necessary, but is a good idea. Not only does date not tell you anything about what the dates actually represent, it is also so generic a column name that it is bound to cause problems at some point.

The news stories data fits nicely into a tabular structure. It makes sense to represent each list item as one row, and the key/value pairs as columns and column values for that row. There are no significant complications, such as key values that are themselves lists of dictionaries. Imagine an authors key for each story with a list item for each author as the key value, and that list item is a dictionary of information about the author. This is not at all unusual when working with JSON data in Python. The next recipe shows how to work with data structured in this way.
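
Before moving on, here is a hypothetical sketch of what such a nested structure might look like. It is for illustration only and is not part of the news dataset we just loaded:

story = {
  'title': 'Bloomberg cuts ties with company using prison inmates to make campaign calls',
  'source': 'NBC News',
  'authors': [
    {'name': 'First Author', 'role': 'Political Reporter'},
    {'name': 'Second Author', 'role': 'Correspondent'}
  ]
}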

Importing more complicated JSON data from an API

In the previous recipe, we discussed one significant advantage (and challenge) of working with JSON data – its flexibility. A JSON file can have just about any structure its authors can imagine. This often means that this data does not have the tabular structure of the data sources we have discussed so far, and that pandas DataFrames have. Often, analysts and application developers use JSON precisely because it does not insist on a tabular structure. I know I do!

Retrieving data from multiple tables often requires us to do a one-to-many merge. Saving that data to one table or file means duplicating data on the "one" side of the one-to-many relationship. For example, student demographic data is merged with data on the courses studied, and the demographic data is repeated for each course. With JSON, duplication is not required to capture these items of data in one file. We can have data on the courses studied nested within the data for each student.
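
A hypothetical sketch of what that nesting might look like, with course data stored inside each student record rather than repeated across rows:

student = {
  'studentid': 101,
  'name': 'A. Student',
  'courses': [
    {'coursename': 'Statistics I', 'grade': 'B+'},
    {'coursename': 'Data Analysis', 'grade': 'A-'}
  ]
}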

But doing analysis with JSON structured in this way will eventually require us to either: 1) manipulate the data in a very different way than we are used to doing; or 2) convert the JSON to a tabular form. We examine the first approach in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning. This recipe takes the second approach. It uses a very handy tool for converting selected nodes of JSON to a tabular structure – json_normalize.

We first use an API to get JSON data because that is how JSON is frequently consumed. One advantage of retrieving the data with an API, rather than working from a file we have saved locally, is that it is easier to rerun our code when the source data is refreshed.

Getting ready

This recipe assumes you have the requests library already installed. If it is not, you can install it with pip by entering pip install requests from the terminal (or PowerShell in Windows). The pprint module is part of the Python standard library, so it does not need to be installed separately.

The following is the structure of the JSON file that is created when using the collections API of the Cleveland Museum of Art. There is a helpful info section at the beginning, but we are interested in the data section. This data does not fit nicely into a tabular data structure. There may be several citations objects and several creators objects for each collection object. I have abbreviated the JSON file to save space:

{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, 
"data": [
{
"id": 165157, 
"accession_number": "2007.158", 
"title": "Fulton and Nostrand", 
"creation_date": "1958", 
"citations": [
  {
   "citation": "Annual Exhibition: Sculpture, Paintings...", 
   "page_number": "Unpaginated, [8],[12]", 
   "url": null
   }, 
  {
   "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...",   
   "page_number": "P. 60",
   "url": null
   }],
"creators": [
      {
     "description": "Jacob Lawrence (American, 1917-2000)", 
     "extent": null, 
     "qualifier": null, 
     "role": "artist", 
     "birth_year": "1917", 
     "death_year": "2000"
     }
  ]
 }

Note

The API used in this recipe is provided by the Cleveland Museum of Art. It is available for public use at https://openaccess-api.clevelandart.org/.

How to do it...

Create a DataFrame from the museum's collections data with one row for each citation, and the title and creation_date duplicated:

  1. Import the json, requests, and pprint libraries.

    We need the requests library to use an API to retrieve JSON data. pprint improves the display of lists and dictionaries:

    >>> import pandas as pd
    >>> import numpy as np
    >>> import json
    >>> import pprint
    >>> import requests
  2. Use an API to load the JSON data.

    Make a get request to the collections API of the Cleveland Museum of Art. Use the query string to indicate that you just want collections from African-American artists. Display the first collection item. I have truncated the output for the first item to save space:

    >>> response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists")
    >>> camcollections = json.loads(response.text)
    >>> print(len(camcollections['data']))
    778
    >>> pprint.pprint(camcollections['data'][0])
    {'accession_number': '2007.158',
     'catalogue_raisonne': None,
     'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                    'page_number': 'Unpaginated, [8],[12]',
                    'url': None},
                   {'citation': '"Moscow to See Modern U.S....',
                    'page_number': 'P. 60',
                    'url': None}],
     'collection': 'American - Painting',
     'creation_date': '1958',
     'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
                   'birth_year': '1917',
                   'description': 'Jacob Lawrence (American...)',
                   'role': 'artist'}],
     'type': 'Painting'}
  3. Flatten the JSON data.

    Create a DataFrame from the JSON data using the json_normalize method. Indicate that the number of citations will determine the number of rows, and that accession_number, title, creation_date, collection, creators, and type will be repeated. Observe that the data has been flattened by displaying the first two observations, transposing them with the .T option to make it easier to view:

    >>> camcollectionsdf = pd.json_normalize(camcollections['data'],
    ...   'citations', ['accession_number', 'title', 'creation_date',
    ...   'collection', 'creators', 'type'])
    >>> camcollectionsdf.head(2).T
                              0                       1
    citation        Annual Exhibiti...  "Moscow to See Modern...
    page_number           Unpaginated,                     P. 60
    url                          None                       None
    accession_number         2007.158                   2007.158
    title            Fulton and No...           Fulton and No...
    creation_date                1958                       1958
    collection       American - Pa...           American - Pa...
    creators   [{'description': 'J...     [{'description': 'J...
    type                     Painting                   Painting
  4. Pull the birth_year value from creators:
    >>> creator = camcollectionsdf[:1].creators[0]
    >>> type(creator[0])
    <class 'dict'>
    >>> pprint.pprint(creator)
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent art...',
      'birth_year': '1917',
      'death_year': '2000',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'extent': None,
      'name_in_original_language': None,
      'qualifier': None,
      'role': 'artist'}]
    >>> camcollectionsdf['birthyear'] = camcollectionsdf.\
    ...   creators.apply(lambda x: x[0]['birth_year'])
    >>> camcollectionsdf.birthyear.value_counts().\
    ...   sort_index().head()
    1821    18
    1886     2
    1888     1
    1892    13
    1899    17
    Name: birthyear, dtype: int64

This gives us a pandas DataFrame with one row for each citation for each collection item, with the collection information (title, creation_date, and so on) duplicated.

How it works…

We work with a much more interesting JSON file in this recipe than in the previous one. Each object in the JSON file is an item in the collection of the Cleveland Museum of Art. Nested within each collection item are one or more citations. The only way to capture this information in a tabular DataFrame is to flatten it. There are also one or more dictionaries for creators of the collection item (the artist or artists). That dictionary (or dictionaries) contains the birth_year value that we want.

We want one row for every citation for all collection items. To understand this, imagine that we are working with relational data and have a collections table and a citations table, and that we are doing a one-to-many merge from collections to citations. We do something similar with json_normalize by using citations as the second parameter. That tells json_normalize to create one row for each citation and use the key values in each citation dictionary – for citation, page_number, and url – as data values.

The third parameter in the call to json_normalize has the list of column names for the data that will be repeated with each citation. Notice that accession_number, title, creation_date, collection, creators, and type are repeated in observations one and two. The citation and page_number values change. (url is the same value for the first and second citations. Otherwise, it would also change.)
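
If it helps to see json_normalize in isolation, here is a minimal sketch on a toy list of dictionaries (not the museum data) with the same shape. The second argument names the nested list whose items become the rows, and the third argument lists the parent keys to repeat on each of those rows:

>>> toy = [{'id': 1, 'title': 'A',
...         'citations': [{'citation': 'x'}, {'citation': 'y'}]}]
>>> pd.json_normalize(toy, 'citations', ['id', 'title'])
  citation id title
0        x  1     A
1        y  1     A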

This still leaves us with the problem of the creators dictionaries (there can be more than one creator). When we ran json_normalize it grabbed the value for each key we indicated (in the third parameter) and stored it in the data for that column and row, whether that value was simple text or a list of dictionaries, as is the case for creators. We take a look at the first (and in this case, only) creators item for the first collections row in step 4, naming it creator. (Note that the creators list is duplicated across all citations for a collection item, just as the values for title, creation_date, and so on are.)

We want the birth year for the first creator for each collection item, which can be found at creator[0]['birth_year']. To create a birthyear series using this, we use apply and a lambda function:

>>> camcollectionsdf['birthyear'] = camcollectionsdf.\
...   creators.apply(lambda x: x[0]['birth_year'])

We take a closer look at lambda functions in Chapter 6, Cleaning and Exploring Data with Series Operations. Here, it is helpful to think of x as representing one value of the creators column, which is a list of creator dictionaries, so x[0] gives us the list item we want, creators[0]. We grab the value from the birth_year key.
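
If the lambda syntax is unfamiliar, an equivalent named function makes the logic explicit. This sketch does the same thing as the apply call above:

>>> def get_first_birth_year(creators):
...     return creators[0]['birth_year']
... 
>>> camcollectionsdf['birthyear'] = camcollectionsdf.creators.apply(get_first_birth_year)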

There's more…

You may have noticed that we left out some of the JSON returned by the API in our call to json_normalize. The first parameter that we passed to json_normalize was camcollections['data']. Effectively, we ignore the info object at the beginning of the JSON data. The information we want does not start until the data object. This is not very different conceptually from the skiprows parameter in the second recipe of the previous chapter. There is sometimes metadata like this at the beginning of JSON files.
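
You can look at what we are skipping by displaying the info object on its own. Based on the abbreviated structure shown in the Getting ready section, the output would look something like the following (the live API may include additional metadata):

>>> pprint.pprint(camcollections['info'])
{'parameters': {'african_american_artists': ''}, 'total': 778}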

See also

The preceding recipe demonstrates some useful techniques for doing data integrity checks without pandas, including list operations and comprehensions. Those are all relevant for the data in this recipe as well.

Importing data from web pages

We use Beautiful Soup in this recipe to scrape data from a web page and load that data into pandas. Web scraping is very useful when there is data at a website that is updated regularly, but there is no API. We can rerun our code to generate new data whenever the page is updated.

Unfortunately, the web scrapers we build can break when the structure of the targeted page changes. That is less likely to happen with APIs because they are designed for data exchange, and carefully curated with that end in mind. The priority for most web designers is the quality of the display of information, not the reliability and ease of data exchange. This causes data cleaning challenges unique to web scraping, including HTML elements that house the data being in surprising and changing locations, formatting tags that obfuscate the underlying data, and explanatory text that aids data interpretation being difficult to retrieve. In addition to these challenges, scraping presents familiar data cleaning issues, such as changing data types in columns, less than ideal headings, and missing values. We deal with the data issues that occur most frequently in this recipe.

Getting ready

You will need Beautiful Soup installed to run the code in this recipe. You can install it with pip by entering pip install beautifulsoup4 in a terminal window or Windows PowerShell.

We will scrape data from a web page, find the following table in that page, and load it into a pandas DataFrame:

Figure 2.1 – COVID-19 data from six countries

Note

I created this web page, http://www.alrb.org/datacleaning/covidcaseoutliers.html, based on COVID-19 data for public use from Our World in Data, available at https://ourworldindata.org/coronavirus-source-data.

How to do it…

We scrape the COVID data from the website and do some routine data checks:

  1. Import the pprint, requests, and BeautifulSoup libraries:
    >>> import pandas as pd
    >>> import numpy as np
    >>> import json
    >>> import pprint
    >>> import requests
    >>> from bs4 import BeautifulSoup
  2. Parse the web page and get the header row of the table.

    Use Beautiful Soup's find method to get the table we want, and then use find_all to retrieve the th elements nested within that table's thead. Create a list of column labels based on the text of those th elements:

    >>> webpage = requests.get("http://www.alrb.org/datacleaning/covidcaseoutliers.html")
    >>> bs = BeautifulSoup(webpage.text, 'html.parser')
    >>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
    >>> type(theadrows)
    <class 'bs4.element.ResultSet'>
    >>> labelcols = [j.get_text() for j in theadrows]
    >>> labelcols[0] = "rowheadings"
    >>> labelcols
    ['rowheadings', 'Cases', 'Deaths', 'Cases per Million', 'Deaths per Million', 'population', 'population_density', 'median_age', 'gdp_per_capita', 'hospital_beds_per_100k']
  3. Get the data from the table cells.

    Find all of the table rows for the table we want. For each table row, find the th element and retrieve the text. We will use that text for our row labels. Also, for each row, find all the td elements (the table cells with the data) and save the text from all of them in a list. This gives us datarows, which has all the numeric data in the table. (You can confirm that it matches the table from the web page.) We then insert the corresponding label from labelrows (which has the row headings) at the beginning of each list in datarows:

    >>> rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')
    >>> datarows = []
    >>> labelrows = []
    >>> for row in rows:
    ...   rowlabels = row.find('th').get_text()
    ...   cells = row.find_all('td', {'class':'data'})
    ...   if (len(rowlabels)>3):
    ...     labelrows.append(rowlabels)
    ...   if (len(cells)>0):
    ...     cellvalues = [j.get_text() for j in cells]
    ...     datarows.append(cellvalues)
    ... 
    >>> pprint.pprint(datarows[0:2])
    [['9,394', '653', '214', '15', '43,851,043', '17', '29', '13,914', '1.9'],
     ['16,642', '668', '1848', '74', '9,006,400', '107', '44', '45,437', '7.4']]
    >>> pprint.pprint(labelrows[0:2])
    ['Algeria', 'Austria']
    >>> 
    >>> for i in range(len(datarows)):
    ...   datarows[i].insert(0, labelrows[i])
    ... 
    >>> pprint.pprint(datarows[0:1])
    [['Algeria','9,394','653','214','15','43,851,043','17','29','13,914','1.9']]
  4. Load the data into pandas.

    Pass the datarows list to the DataFrame method of pandas. Notice that all data is read into pandas with the object data type, and that some data has values that cannot be converted into numeric values in their current form (due to the commas):

    >>> totaldeaths = pd.DataFrame(datarows, columns=labelcols)
    >>> totaldeaths.head()
      rowheadings    Cases Deaths  ... median_age gdp_per_capita  \
    0     Algeria    9,394    653  ...         29         13,914   
    1     Austria   16,642    668  ...         44         45,437   
    2  Bangladesh   47,153    650  ...         28          3,524   
    3     Belgium   58,381   9467  ...         42         42,659   
    4      Brazil  514,849  29314  ...         34         14,103   
    >>> totaldeaths.dtypes
    rowheadings               object
    Cases                     object
    Deaths                    object
    Cases per Million         object
    Deaths per Million        object
    population                object
    population_density        object
    median_age                object
    gdp_per_capita            object
    hospital_beds_per_100k    object
    dtype: object
  5. Fix the column names and convert the data to numeric values.

    Remove spaces from column names. Remove all non-numeric characters, including the commas, from the columns with data (str.replace("[^0-9]","")). Convert the columns to numeric values, except for the rowheadings column:

    >>> totaldeaths.columns = totaldeaths.columns.str.replace(" ", "_").str.lower()
    >>> for col in totaldeaths.columns[1:-1]:
    ...   totaldeaths[col] = totaldeaths[col].\
    ...     str.replace("[^0-9]","").astype('int64')
    ... 
    >>> totaldeaths['hospital_beds_per_100k'] = totaldeaths['hospital_beds_per_100k'].astype('float')
    >>> totaldeaths.head()
      rowheadings   cases  deaths  ...  median_age  gdp_per_capita  \
    0     Algeria    9394     653  ...          29           13914   
    1     Austria   16642     668  ...          44           45437   
    2  Bangladesh   47153     650  ...          28            3524   
    3     Belgium   58381    9467  ...          42           42659   
    4      Brazil  514849   29314  ...          34           14103   
    >>> totaldeaths.dtypes
    rowheadings                object
    cases                       int64
    deaths                      int64
    cases_per_million           int64
    deaths_per_million          int64
    population                  int64
    population_density          int64
    median_age                  int64
    gdp_per_capita              int64
    hospital_beds_per_100k    float64
    dtype: object

We have now created a pandas DataFrame from an HTML table.

How it works…

Beautiful Soup is a very useful tool for finding specific HTML elements in a web page and retrieving text from them. You can get one HTML element with find and get one or more with find_all. The first argument for both find and find_all is the HTML element to get. The second argument takes a Python dictionary of attributes. You can retrieve text from all of the HTML elements you find with get_text.
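
A minimal, self-contained sketch of those three methods on an inline HTML snippet (not the COVID page itself):

>>> from bs4 import BeautifulSoup
>>> html = "<table id='tbl'><tr><th>country</th><th>cases</th></tr></table>"
>>> soup = BeautifulSoup(html, 'html.parser')
>>> [th.get_text() for th in soup.find('table', {'id':'tbl'}).find_all('th')]
['country', 'cases']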

Some amount of looping is usually necessary to process the elements and text, as with step 2 and step 3. These two statements in step 2 are fairly typical:

>>> theadrows = bs.find('table', {'id':'tblDeaths'}).thead.find_all('th')
>>> labelcols = [j.get_text() for j in theadrows]

The first statement finds all the th elements we want and creates a Beautiful Soup result set called theadrows from the elements it found. The second statement iterates over the theadrows Beautiful Soup result set using the get_text method to get the text from each element, and stores it in the labelcols list.

Step 3 is a little more involved, but makes use of the same Beautiful Soup methods. We find all of the table rows (tr) in the target table (rows = bs.find('table', {'id':'tblDeaths'}).tbody.find_all('tr')). We then iterate over each of those rows, finding the th element and getting the text in that element (rowlabels = row.find('th').get_text()). We also find all of the table cells (td) for each row (cells = row.find_all('td', {'class':'data'})) and get the text from all table cells (cellvalues = [j.get_text() for j in cells]). Note that this code is dependent on the class of the td elements being data. Finally, we insert the row labels we get from the th elements at the beginning of each list in datarows:

>>> for i in range(len(datarows)):
...   datarows[i].insert(0, labelrows[i])

In step 4, we use the DataFrame method to load the list we created in steps 2 and 3 into pandas. We then do some cleaning similar to what we have done in previous recipes in this chapter. We use string replace to remove spaces from column names and to remove all non-numeric data, including commas, from what are otherwise valid numeric values. We convert all columns, except for the rowheadings column, to numeric.
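
One caveat: in newer versions of pandas (2.0 and later), str.replace treats the pattern as a literal string by default, so the regular expression above needs regex=True to behave as it does here. A sketch of the adjusted loop, assuming the same totaldeaths DataFrame:

>>> for col in totaldeaths.columns[1:-1]:
...   totaldeaths[col] = totaldeaths[col].\
...     str.replace("[^0-9]", "", regex=True).astype('int64')
... 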

There's more…

Our scraping code is dependent on several aspects of the web page's structure not changing: the ID of the main table, the presence of th tags with column and row labels, and the td elements continuing to have their class equal to data. The good news is that if the structure of the web page does change, this will likely only affect the find and find_all calls. The rest of the code would not need to change.

Persisting JSON data

There are several reasons why we might want to serialize a JSON file:

  • We may have retrieved the data with an API, but need to keep a snapshot of the data.
  • The data in the JSON file is relatively static and informs our data cleaning and analysis over multiple phases of a project.
  • We might decide that the flexibility of a schema-less format such as JSON helps us solve many data cleaning and analysis problems.

It is worth highlighting this last reason to use JSON – that it can solve many data problems. Although tabular data structures clearly have many benefits, particularly for operational data, they are often not the best way to store data for analysis purposes. In preparing data for analysis, a substantial amount of time is spent either merging data from different tables or dealing with data redundancy when working with flat files. Not only are these processes time consuming, but every merge or reshaping leaves the door open to a data error of broad scope. This can also mean that we end up paying too much attention to the mechanics of manipulating data and too little to the conceptual issues at the core of our work.

We return to the Cleveland Museum of Art collections data in this recipe. There are at least three possible units of analysis for this data file – the collection item level, the creator level, and the citation level. JSON allows us to nest citations and creators within collections. (You can examine the structure of the JSON file in the Getting ready section of this recipe.) This data cannot be persisted in a tabular structure without flattening the file, which we did in an earlier recipe in this chapter. In this recipe, we will use two different methods to persist JSON data, each with its own advantages and disadvantages.

Getting ready

We will be working with data on the Cleveland Museum of Art's collection of works by African-American artists. The following is the structure of the JSON data returned by the API. It has been abbreviated to save space:

{"info": { "total": 778, "parameters": {"african_american_artists": "" }}, 
"data": [
{
"id": 165157, 
"accession_number": "2007.158", 
"title": "Fulton and Nostrand", 
"creation_date": "1958", 
"citations": [
  {
   "citation": "Annual Exhibition: Sculpture, Paintings...", 
   "page_number": "Unpaginated, [8],[12]", 
   "url": null
   }, 
  {
   "citation": "\"Moscow to See Modern U.S. Art,\"<em> New York...",   
   "page_number": "P. 60",
   "url": null
  }],
"creators": [
      {
     "description": "Jacob Lawrence (American, 1917-2000)", 
     "extent": null, 
     "qualifier": null, 
     "role": "artist", 
     "birth_year": "1917", 
     "death_year": "2000"
     }
  ]
 }

How to do it...

We will serialize the JSON data using two different methods:

  1. Load the pandas, json, pprint, requests, and msgpack libraries:
    >>> import pandas as pd
    >>> import json
    >>> import pprint
    >>> import requests
    >>> import msgpack
  2. Load the JSON data from an API. I have abbreviated the JSON output:
    >>> response = requests.get("https://openaccess-api.clevelandart.org/api/artworks/?african_american_artists")
    >>> camcollections = json.loads(response.text)
    >>> print(len(camcollections['data']))
    778
    >>> pprint.pprint(camcollections['data'][0])
    {'accession_number': '2007.158',
     'catalogue_raisonne': None,
     'citations': [{'citation': 'Annual Exhibition: Sculpture...',
                    'page_number': 'Unpaginated, [8],[12]',
                    'url': None},
                   {'citation': '"Moscow to See Modern U.S....',
                    'page_number': 'P. 60',
                    'url': None}],
     'collection': 'American - Painting',
     'creation_date': '1958',
     'creators': [{'biography': 'Jacob Lawrence (born 1917)...',
                   'birth_year': '1917',
                   'description': 'Jacob Lawrence (American...',
                   'role': 'artist'}],
     'type': 'Painting'}
  3. Save and reload the JSON file using Python's json library.

    Persist the JSON data in human-readable form. Reload it from the saved file and confirm that it worked by retrieving the creators data from the first collections item:

    >>> with open("data/camcollections.json","w") as f:
    ...   json.dump(camcollections, f)
    ... 
    >>> with open("data/camcollections.json","r") as f:
    ...   camcollections = json.load(f)
    ... 
    >>> pprint.pprint(camcollections['data'][0]['creators'])
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent artist since...',
      'birth_year': '1917',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'role': 'artist'}]
  4. Save and reload the JSON file using msgpack:
    >>> with open("data/camcollections.msgpack", "wb") as outfile:
    ...     packed = msgpack.packb(camcollections)
    ...     outfile.write(packed)
    ... 
    1586507
    >>> with open("data/camcollections.msgpack", "rb") as data_file:
    ...     msgbytes = data_file.read()
    ... 
    >>> camcollections = msgpack.unpackb(msgbytes)
    >>> pprint.pprint(camcollections['data'][0]['creators'])
    [{'biography': 'Jacob Lawrence (born 1917) has been a prominent...',
      'birth_year': '1917',
      'death_year': '2000',
      'description': 'Jacob Lawrence (American, 1917-2000)',
      'role': 'artist'}]

How it works…

We use the Cleveland Museum of Art's collections API to retrieve collections items. The african_american_artists flag in the query string indicates that we just want collections for those creators. json.loads returns a dictionary called info and a list of dictionaries called data. We check the length of the data list. This tells us that there are 778 items in collections. We then display the first item of collections to get a better look at the structure of the data. (I have abbreviated the JSON output.)

We save and then reload the data using Python's JSON library in step 3. The advantage of persisting the data in this way is that it keeps the data in human-readable form. Unfortunately, it has two disadvantages: saving takes longer than alternative serialization methods, and it uses more storage space.

In step 4, we use msgpack to persist our data. This is faster than Python's json library, and the saved file uses less space. Of course, the disadvantage is that the resulting JSON is binary rather than text-based.
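
Assuming both files were written as in steps 3 and 4, a quick way to see the difference for yourself is to compare their sizes on disk (the exact numbers will depend on the API response at the time you run it):

>>> import os
>>> print(os.path.getsize("data/camcollections.json"),
...       os.path.getsize("data/camcollections.msgpack"))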

There's more…

I use both methods for persisting JSON data in my work. When I am working with small amounts of data, and that data is relatively static, I prefer human-readable JSON. A great use case for this is the recipes in the previous chapter where we needed to create value labels.

I use msgpack when I am working with large amounts of data, where that data changes regularly. msgpack files are also great when you want to take regular snapshots of key tables in enterprise databases.

The Cleveland Museum of Art's collections data is similar in at least one important way to the data we work with every day. The unit of analysis frequently changes. Here we are looking at collections, citations, and creators. In our work, we might have to simultaneously look at students and courses, or households and deposits. An enterprise database system for the museum data would likely have separate collections, citations, and creators tables that we would eventually need to merge. The resulting merged file would have data redundancy issues that we would need to account for whenever we changed the unit of analysis.

When we alter our data cleaning process to work directly from JSON or parts of it, we end up eliminating a major source of errors. We do more data cleaning with JSON in the Classes that handle non-tabular data structures recipe in Chapter 10, User-Defined Functions and Classes to Automate Data Cleaning.

Key benefits

  • Get well-versed with various data cleaning techniques to reveal key insights
  • Manipulate data of different complexities to shape them into the right form as per your business needs
  • Clean, monitor, and validate large data volumes to diagnose problems before moving on to data analysis

Description

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data. By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.

Who is this book for?

This book is for anyone looking for ways to handle messy, duplicate, and poor data using different Python tools and techniques. The book takes a recipe-based approach to help you to learn how to clean and manage data. Working knowledge of Python programming is all you need to get the most out of the book.

What you will learn

  • Find out how to read and analyze data from a variety of sources
  • Produce summaries of the attributes of data frames, columns, and rows
  • Filter data and select columns of interest that satisfy given criteria
  • Address messy data issues, including working with dates and missing values
  • Improve your productivity in Python pandas by using method chaining
  • Use visualizations to gain additional insights and identify potential data issues
  • Enhance your ability to learn what is going on in your data
  • Build user-defined functions and classes to automate data cleaning

Product Details

Publication date : Dec 11, 2020
Length: 436 pages
Edition : 1st
Language : English
ISBN-13 : 9781800565661

Table of Contents

Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas
Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas
Chapter 3: Taking the Measure of Your Data
Chapter 4: Identifying Missing Values and Outliers in Subsets of Data
Chapter 5: Using Visualizations for the Identification of Unexpected Values
Chapter 6: Cleaning and Exploring Data with Series Operations
Chapter 7: Fixing Messy Data when Aggregating
Chapter 8: Addressing Data Issues When Combining DataFrames
Chapter 9: Tidying and Reshaping Data
Chapter 10: User-Defined Functions and Classes to Automate Data Cleaning
Other Books You May Enjoy

Customer reviews

Rating: 4.8 out of 5 (28 ratings). 5 star: 82.1%, 4 star: 17.9%, 3 star: 0%, 2 star: 0%, 1 star: 0%

Tej, Jan 11, 2021 (5 stars)
The best part of this book is "how it works" section which explains the working behind the code. First, you start off with writing the code and the explanation of the code is in "how it works". In case you run into any bug, this section should help. This book gives a brief overview of different strategies for data cleaning. A good one time read.

Roy, Dec 26, 2020 (5 stars)
I am a University Professor of Business Analytics and my specialty is Data Science. I have rated this book with 5 stars because Python Data Cleaning Cookbook offers some educational journeys on data cleaning via some examples through the open-source programming of python. The book is excellent for data enthusiasts to experience data cleaning from the eye of an experience data scientist. I would not recommend the book as a definite guide for data cleaning, however, I would defintaly recommend it to people who would like to be exposed to real examples of data clening.

ACEgolden, Feb 02, 2021 (5 stars)
The book is a good start for data scientist or machine learning engineer that comes from more software engineer background. It breaks down the issues with data in real world, for examples, missing values and outliers. Furthermore provides receipt on how to deal with the issues, along with use cases and rough theory behind the code pieces. Readers can find this book as a quick hands-on toolbox to apply for their daily data science job for analyzing numerical values and categorical values. I would recommend this book.

Susmithakolli, Mar 17, 2021 (5 stars)
The book is well structured. The aspects covered in this book are valuable. I started implementing some of these concepts in my job and it is making my work easy and efficient.

Elsie, Jan 27, 2021 (5 stars)
This book offers insights that cannot be gained from a simple google search. The author not only provides sample code, "recipes," for how to clean your data with Python, but teaches an approach to data cleaning. And the best part is that it is written in an engaging and clear manner.
