In this chapter, we introduced text analytics using Spark ML, with an emphasis on text classification. We learned about Transformers and Estimators. We saw how to use Tokenizers to break sentences into words, how to remove stop words, and how to generate n-grams. We also implemented HashingTF and IDF to generate TF-IDF-based features, and used Word2Vec to convert sequences of words into vectors.
We then looked at LDA, a popular technique for generating topics from documents without knowing much about the actual text. Finally, we implemented text classification on a set of 10k tweets from the Twitter dataset to see how it all comes together, using Transformers, Estimators, and the Logistic Regression model to perform binary classification.
In the next chapter, we will dig even deeper into tuning Spark applications for...