Named entity recognition
Building a web scraper that enriches an input dataset of URLs with external, web-based HTML content is of great business value within a big data ingestion service. But while an average data scientist should be able to study the returned content using basic clustering and classification techniques, an expert data scientist will take this data enrichment process to the next level by adding further value in post-processing. Commonly, these value-added post-processing steps include disambiguating the external text content, extracting entities (such as people, places, and dates), and converting raw text into its simplest grammatical form. In this section, we explain how to leverage the Spark framework to create a reliable Natural Language Processing (NLP) pipeline that produces these post-processed outputs and handles English-language content at any scale.
Scala libraries
ScalaNLP (http://www.scalanlp.org/) is...