NER with IPython over Spark
Apart from POS tagging, one of the most common labeling problems is finding entities in text, a task known as named entity recognition (NER). Typically, NER covers the names of persons, locations, and organizations. Some NER systems tag more entity types than these three, labeling named entities using the context and other features. This is an active research area in NLP, where people are trying to tag biomedical entities, product entities, and so on.
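To make the labeling task concrete, the following is a minimal sketch of what NER output commonly looks like and how it can be grouped into entities. It assumes the tokens are already tagged in the widely used IOB scheme (B- begins an entity, I- continues it, O is outside any entity). The function name extract_entities, the sample sentence, and the tag names are illustrative assumptions, not part of this recipe.

```python
def extract_entities(tagged_tokens):
    """Group IOB-tagged (token, tag) pairs into (entity_text, entity_type) pairs.

    Hypothetical helper for illustration only; the tags are assumed to
    follow the IOB scheme, for example B-PERSON, I-PERSON, O.
    """
    entities = []
    current_tokens = []
    current_type = None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A new entity starts; close the previous one first.
            flush()
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag, or an I- tag that does not continue the open entity:
            # close the open entity and treat this token as outside.
            flush()
            current_tokens = []
            current_type = None
    flush()
    return entities


# Made-up sample sentence with hand-assigned tags (not model output).
sample = [
    ("Barack", "B-PERSON"), ("Obama", "I-PERSON"), ("visited", "O"),
    ("Google", "B-ORGANIZATION"), ("in", "O"),
    ("New", "B-LOCATION"), ("York", "I-LOCATION"), (".", "O"),
]
print(extract_entities(sample))
```

The three entity types in the sample correspond to the person, organization, and location categories mentioned above; a system that tags more types would simply emit more tag names.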
Getting ready
To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, have PySpark and IPython installed on the Linux machine, that is, Ubuntu 14.04. For installing IPython, please refer to the Using IPython with PySpark recipe in Chapter 2, Tricky Statistics with Spark.
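Before starting, it can help to confirm that the required Python packages are visible to the interpreter you plan to use. The following is a minimal sketch using only the standard library. The function name check_prerequisites and the module list are assumptions made for illustration (nltk is included because this recipe relies on NLTK data). The check only reflects the current interpreter's import path, so PySpark may show as missing if it is made available only through the IPython profile.

```python
import importlib.util

# Modules this recipe is assumed to rely on (illustrative list).
REQUIRED_MODULES = ("pyspark", "IPython", "nltk")


def check_prerequisites(modules=REQUIRED_MODULES):
    """Return a dict mapping each top-level module name to True if it can
    be located on the current import path, and False otherwise."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in modules
    }


# Print a short availability report for the current interpreter.
for name, available in check_prerequisites().items():
    print("%-10s %s" % (name, "found" if available else "MISSING"))
```

Any module reported as MISSING needs to be installed, or added to the import path, before continuing.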
How to do it…
Download and install NLTK data correctly as follows:
ipython console --profile=pyspark
In [1]: from...