"Programs must be written for people to read, and only incidentally for machines to execute."
- Harold Abelson
In this chapter, we will discuss the wonderful field of text analytics using Spark ML. Text analytics is a broad area of machine learning with many applications, such as sentiment analysis, chatbots, email spam detection, and natural language processing. We will learn how to use Spark for text analysis, focusing on text classification with a 10,000-sample set of Twitter data.
In a nutshell, this chapter covers the following topics:
- Understanding text analytics
- Transformers and Estimators
- Tokenizer
- StopWordsRemover
- NGrams
- TF-IDF
- Word2Vec
- CountVectorizer
- Topic modeling using LDA
- Implementing text classification