Introduction
Text analytics sits at the intersection of machine learning, mathematics, linguistics, and natural language processing. Referred to as text mining in older literature, it attempts to extract information and infer higher-level concepts, sentiment, and semantic details from unstructured and semi-structured data. Traditional keyword searches are insufficient for this task, because noisy, ambiguous, and irrelevant tokens and concepts need to be filtered out based on the actual context.
Ultimately, for a given set of documents (text, tweets, web pages, and social media), we are trying to determine the gist of the communication and the topics and concepts it is trying to convey. These days, merely breaking down a document into its parts and taxonomy is too primitive to be considered text analytics. We can do better.
Spark provides a set of tools and facilities to make text analytics easier, but it is up to the users to combine...