Implementing OpenNLP - sentence detector over Spark
Partitioning text into sentences is called Sentence Boundary Disambiguation (SBD) or Sentence Detection. Many downstream NLP tasks analyze text one sentence at a time, for instance part-of-speech (POS) tagging and phrase analysis, so they depend on this step. Sentence Detection is language dependent, because the punctuation and abbreviation conventions that mark a boundary differ from one language to another. Most search engines are not concerned with Sentence Detection; they are only interested in a query's tokens and their respective positions. POS taggers and other NLP tasks that extract data, by contrast, frequently process individual sentences. Detecting sentence boundaries keeps apart phrases that would otherwise appear to span two sentences.
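To make the idea of locale-dependent sentence splitting concrete, here is a minimal sketch in plain Java. It does not use OpenNLP or Spark: as a stand-in it relies on the JDK's rule-based java.text.BreakIterator, which needs no model file. The class name SentenceSplitSketch, the split helper, and the sample text are hypothetical and chosen only for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative stand-in only: JDK BreakIterator, not the OpenNLP sentence detector.
public class SentenceSplitSketch {

    // Returns the sentences found in text, using the boundary rules of the given locale.
    public static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String text = "Spark is fast. It runs on YARN. OpenNLP finds sentences.";
        for (String sentence : split(text, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

The Locale argument is where language dependence shows up in this sketch. A trained, model-based detector is usually preferred over fixed rules like these when the text contains many abbreviations.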
Getting ready
To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. To install Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. ...