Analyzing big-text data
We can run an analysis on a large text stream, such as a news article, to try to glean its important themes. Here we pull out bigrams, that is, combinations of two words that appear in sequence throughout the article.
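To make the idea concrete, here is a minimal plain-Python sketch of the sliding-window pairing (the sentence is just an illustration; the Spark script in this recipe applies the same idea in parallel):

words = "college is not for everyone".split()
bigrams = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
print(bigrams)
# [('college', 'is'), ('is', 'not'), ('not', 'for'), ('for', 'everyone')]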
How to do it...
For this example, I am using the text of an online article from The Atlantic called The World Might Be Better Off Without College for Everyone, available at https://www.theatlantic.com/magazine/archive/2018/01/whats-college-good-for/546590/.
I am using the following script (the source listing is cut off at sortByKey, so the final descending sort and take(10) are an assumed completion of the recipe):
import pyspark

# reuse the notebook's SparkContext if one already exists
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# read the article, rejoin the lines of each partition into one string,
# and make a rough split into sentences on periods
sentences = sc.textFile('B09656_09_article.txt') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
print(sentences.count(), "sentences")

# split each sentence into words and emit every adjacent pair with a count of 1
bigrams = sentences.map(lambda x: x.split()) \
    .flatMap(lambda x: [((x[i], x[i+1]), 1) for i in range(0, len(x) - 1)])
print(bigrams.count(), "bigrams")

# total the counts per bigram, swap to (count, bigram), and sort descending;
# the tail after sortByKey is an assumed completion of the truncated listing
frequent_bigrams = bigrams.reduceByKey(lambda x, y: x + y) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)
print(frequent_bigrams.take(10))
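Two details of the script are worth noting: glom() gathers each partition's lines into a list so that join and split can reassemble sentences broken across line endings, and splitting on periods is only a rough form of sentence segmentation. The script also counts raw tokens, so The and the produce different bigram keys. If you want case-insensitive counts, a small optional tweak (not part of the original recipe) is to normalize the tokens before pairing them:

import string

# hypothetical variation: lowercase each token and strip surrounding
# punctuation so that 'The'/'the' collapse into a single bigram key
bigrams = sentences.map(lambda s: [w.strip(string.punctuation).lower()
                                   for w in s.split()]) \
    .flatMap(lambda x: [((x[i], x[i+1]), 1) for i in range(0, len(x) - 1)])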