Spark text file analysis
In this example, we will look through a news article and extract some basic information from it.
We will be using the following script against the 2600 raid news article from https://www.newsitem.com:
import pyspark
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# pull out sentences from the article
sentences = sc.textFile('2600raid.txt') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
print(sentences.count(), "sentences")

# find the bigrams in the sentences
bigrams = sentences.map(lambda x: x.split()) \
    .flatMap(lambda x: [((x[i], x[i+1]), 1) for i in range(0, len(x)-1)])
print(bigrams.count(), "bigrams")

# find the (10) most common bigrams
frequent_bigrams = bigrams.reduceByKey(lambda x, y: x+y) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)
frequent_bigrams.take(10)
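As a side note, the swap-and-sortByKey step at the end can also be expressed with takeOrdered, which retrieves the top results without sorting the entire RDD. The following is a minimal sketch of that alternative, assuming the same bigrams RDD built in the script above:

# Alternative ending: take the ten most common bigrams directly.
# takeOrdered avoids materializing a fully sorted RDD; negating the
# count makes the ordering descending. Assumes the `bigrams` RDD
# from the script above.
top_ten = bigrams.reduceByKey(lambda x, y: x + y) \
    .takeOrdered(10, key=lambda pair: -pair[1])
print(top_ten)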
The code reads in the article and splits it into sentences, as determined by the appearance of a period. From there, the code maps out the bigrams, that is, the pairs of adjacent words in each sentence, tallies how often each pair occurs, and displays the ten most frequent pairs.
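To make the bigram step concrete, here is a minimal plain-Python sketch of the same list comprehension the script passes to flatMap, applied to a single made-up sentence (the sentence text is hypothetical, not taken from the article):

# Plain-Python illustration of the bigram expression from the script.
# The sample sentence below is hypothetical.
words = "the police raided the meeting".split()
bigram_pairs = [((words[i], words[i + 1]), 1) for i in range(len(words) - 1)]
print(bigram_pairs)
# [(('the', 'police'), 1), (('police', 'raided'), 1),
#  (('raided', 'the'), 1), (('the', 'meeting'), 1)]

Each pair is emitted with a count of 1, which is what lets reduceByKey in the script sum the counts for identical pairs across all sentences.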