Analyzing big data history files
In this example, we will use a larger .csv file for analysis: the CSV file of Daily Show guests published by FiveThirtyEight at https://raw.githubusercontent.com/fivethirtyeight/data/master/daily-show-guests/daily_show_guests.csv.
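If the file is not already on disk, it can be downloaded first. The following is a minimal sketch, assuming only urllib from the standard library; the local file name matches the one the script in the next section opens.

# A minimal sketch for fetching the CSV to the local working directory.
import urllib.request

url = ('https://raw.githubusercontent.com/fivethirtyeight/data/master/'
       'daily-show-guests/daily_show_guests.csv')
urllib.request.urlretrieve(url, 'daily_show_guests.csv')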
How to do it...
We can use the following script:
import pyspark
import csv
import operator
import itertools
import collections
import io

if 'sc' not in globals():
    sc = pyspark.SparkContext()

years = {}
occupations = {}
guests = {}

# The file header contains these column descriptors:
# YEAR,GoogleKnowlege_Occupation,Show,Group,Raw_Guest_List
with open('daily_show_guests.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',', quotechar='|')
    try:
        for row in reader:
            # track how many shows occurred in the year
            year = row['YEAR']
            if year in years:
                years[year] = years[year] + 1
            else:
                years[year] = 1
            # what...
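The script is cut off above. One way it could continue is sketched below; this is not the original recipe's code. It tallies the GoogleKnowlege_Occupation column (the column names come from the header comment in the script) and then hands the counts to the SparkContext created earlier so Spark can rank the five most common occupations. The file name and the ranking step are assumptions.

# A hedged sketch, not the original script: count guests per occupation
# and let Spark sort the totals. Assumes daily_show_guests.csv is local.
import csv
import pyspark

if 'sc' not in globals():
    sc = pyspark.SparkContext()

occupations = {}
with open('daily_show_guests.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',', quotechar='|')
    for row in reader:
        # GoogleKnowlege_Occupation is the column name in the file header
        occupation = row['GoogleKnowlege_Occupation']
        occupations[occupation] = occupations.get(occupation, 0) + 1

# Push the per-occupation counts through Spark and take the five largest.
counts = sc.parallelize(list(occupations.items()))
top_five = counts.sortBy(lambda kv: kv[1], ascending=False).take(5)
print(top_five)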