Representing language for NLP applications
For computers to work with natural language, it has to be represented in a form that they can process. These representations can be symbolic, where the words in a text are processed directly, or numeric, where the text is represented as numbers. We will describe both approaches here. Although the numeric approach is currently the dominant one in NLP research and applications, it is worth becoming somewhat familiar with the ideas behind symbolic processing.
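To make the distinction concrete, here is a minimal sketch in Python that represents the same sentence both ways. The example sentence, the whitespace tokenization, and the toy vocabulary (which simply numbers words in order of first appearance) are assumptions chosen for illustration, not a method prescribed in this chapter.

```python
# Symbolic view: the text is handled as a sequence of word strings.
text = "the cat sat on the mat"  # illustrative sentence (assumption)
tokens = text.split()            # naive whitespace tokenization

# Numeric view: each distinct word is mapped to an integer ID.
# The IDs are assigned in order of first appearance, which is an
# arbitrary choice made only for this sketch.
vocab = {}
for word in tokens:
    if word not in vocab:
        vocab[word] = len(vocab)

token_ids = [vocab[word] for word in tokens]

print(tokens)     # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(token_ids)  # [0, 1, 2, 3, 0, 4]
```

Note that both occurrences of "the" receive the same ID. The numeric form keeps the identity of each word but, on its own, carries none of the structure that symbolic analysis adds.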
Symbolic representations
Traditionally, NLP has been based on processing the words in texts directly, as words. The standard approach analyzed the text in a series of steps aimed at converting an input consisting of unanalyzed words into a meaning. In a traditional NLP pipeline, shown in Figure 7.1, each step in processing, from input text to meaning, produces an output that adds more structure to its input...