Spark evaluating historical data
In this example, we build on the previous sections to examine some historical data and determine a number of useful attributes.
The historical data we are using is the guest list for The Daily Show with Jon Stewart. A typical record from the data looks as follows:
1999,actor,1/11/99,Acting,Michael J. Fox
Each record contains the year, the guest's occupation, the date of appearance, a logical grouping of occupations, and the guest's name.
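Each line can be split into its five fields with Python's csv module, which handles quoting correctly where a plain split(',') would not. A minimal sketch using the sample record above:

```python
import csv
import io

# The sample record shown above
sample = "1999,actor,1/11/99,Acting,Michael J. Fox"

# csv.reader yields one list of fields per input line
year, occupation, show_date, group, guest = next(csv.reader(io.StringIO(sample)))

print(year, occupation, guest)
```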
For our analysis, we will determine the number of appearances per year, the most frequently appearing occupation, and the guest who appears most often.
We will be using this script:
#Spark Daily Show Guests
import pyspark
import csv
import operator
import itertools
import collections

if not 'sc' in globals():
    sc = pyspark.SparkContext()

years = {}
occupations = {}
guests = {}

#file header contains column descriptors:
#YEAR, GoogleKnowledge_Occupation, Show, Group, Raw_Guest_List
with open('daily_show_guests...
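Since the script above is truncated, the core aggregation it performs can be sketched in plain Python with collections.Counter, without requiring a Spark cluster. The rows below are illustrative stand-ins for the real file, not actual data:

```python
from collections import Counter

# Illustrative records in the same field order as the file:
# year, occupation, date, group, guest
records = [
    ("1999", "actor", "1/11/99", "Acting", "Michael J. Fox"),
    ("1999", "comedian", "1/12/99", "Comedy", "Guest A"),
    ("1999", "actor", "1/13/99", "Acting", "Guest B"),
    ("2000", "actor", "1/14/00", "Acting", "Michael J. Fox"),
]

# Number of appearances per year
years = Counter(r[0] for r in records)

# Most frequent occupation and most frequent guest
occupations = Counter(r[1] for r in records)
guests = Counter(r[4] for r in records)

print(years)
print(occupations.most_common(1))
print(guests.most_common(1))
```

In the Spark version, the same counts would typically be produced by mapping each record to a key and reducing by key across the cluster.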