Chapter 10. Text Analysis
Text analysis is a broad topic that is often referred to as Natural Language Processing (NLP). It is used for many different tasks, including text searching, language translation, sentiment analysis, speech recognition, and classification, to mention a few. Analyzing text can be difficult because of the peculiarities and ambiguity found in natural languages. However, a considerable amount of work has been done in this area, and several Java APIs support it.
We will start with an introduction to the basic concepts and tasks used in NLP. These include the following:
- Tokenization: The process of splitting text into individual tokens or words.
- Stop words: These are words that are common and may not be necessary for processing. They include such words as the, a, and to.
- Named Entity Recognition (NER): This is the process of identifying elements of text such as people's names, locations, or things.
- Parts of Speech (POS): This identifies the grammatical...
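
To make the first two concepts concrete, the following is a minimal sketch of tokenization and stop word removal using only the standard Java library. It is not taken from any NLP API: the class name `SimpleTextPipeline`, its method names, and the three-word stop list are assumptions chosen for illustration, and splitting on non-alphanumeric characters is a simplification that real tokenizers refine considerably.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class for illustration only.
class SimpleTextPipeline {

    // Illustrative stop word list (assumption); practical lists are much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "to"));

    // Tokenization: lowercase the text, then split on runs of characters
    // that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String[] pieces = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String piece : pieces) {
            // split() can yield a leading empty string when the text
            // starts with a delimiter, so skip empty pieces.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Stop word removal: keep only tokens that are not in the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The dog ran to a park.");
        System.out.println(tokens);                  // [the, dog, ran, to, a, park]
        System.out.println(removeStopWords(tokens)); // [dog, ran, park]
    }
}
```

Lowercasing before the lookup keeps the stop list small, at the cost of discarding capitalization that later steps such as NER may rely on, so a fuller pipeline would likely keep the original tokens alongside the normalized ones.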