Overview of text processing tasks
Although there are numerous NLP tasks that can be performed, we will focus only on a subset of these tasks. A brief overview of these tasks is presented here, which is also reflected in the following chapters:
- Finding Parts of Text
- Finding Sentences
- Finding People and Things
- Detecting Parts of Speech
- Classifying Text and Documents
- Extracting Relationships
- Combined Approaches
Many of these tasks are used together with other tasks to achieve some objective. We will see this as we progress through the book. For example, tokenization is frequently used as an initial step in many of the other tasks; it is a fundamental step.
Finding parts of text
Text can be decomposed into a number of different types of elements such as words, sentences, and paragraphs. There are several ways of classifying these elements. When we refer to parts of text in this book, we are referring to words, sometimes called tokens. Morphology is the study of the structure of words. We will use a number of morphology terms in our exploration of NLP. However, there are many ways of classifying words including the following:
- Simple words: These are the common connotations of what a word means including the 17 words of this sentence.
- Morphemes: These are the smallest meaningful units of a word. For example, in the word "bounded", "bound" is considered to be a morpheme. Morphemes also include parts such as the suffix "ed".
- Prefix/Suffix: A prefix precedes, and a suffix follows, the root of a word. For example, in the word "graduation", "ation" is a suffix based on the word "graduate".
- Synonyms: This is a word that has the same meaning as another word. Words such as small and tiny can be recognized as synonyms. Addressing this issue requires word sense disambiguation.
- Abbreviations: These shorten the use of a word. Instead of using Mister Smith, we use Mr. Smith.
- Acronyms: These are used extensively in many fields including computer science. They use a combination of letters for phrases such as FORmula TRANslation for FORTRAN. They can be recursive such as GNU. Of course, the one we will continue to use is NLP.
- Contractions: We'll find these useful for commonly used combinations of words such as the first word of this sentence.
- Numbers: A specialized word that normally uses only digits. However, more complex versions can include a period and a special character to reflect scientific notation or numbers of a specific base.
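As a quick illustration of the last item, a regular expression can recognize many common number forms. The following is a minimal sketch of our own; the pattern handles only an optional sign, decimals, and simple scientific notation, and uses nothing beyond the String class:

String[] tokens = {"123", "3.14", "6.022e23", "avenue."};
// Illustrative pattern: optional sign, digits, an optional decimal
// part, and an optional exponent such as e23
String pattern = "[+-]?\\d+(\\.\\d+)?([eE][+-]?\\d+)?";
for (String token : tokens) {
    System.out.println(token + ": " + token.matches(pattern));
}

This prints true for the first three tokens and false for the last.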
Identifying these parts is useful for other NLP tasks. For example, to determine the boundaries of a sentence, it is necessary to break it apart and determine which elements terminate a sentence.
The process of breaking text apart is called tokenization. The result is a stream of tokens. The elements of the text that determine where the text should be split are called delimiters. For most English text, whitespace is used as a delimiter. This type of delimiter typically includes blanks, tabs, and newline characters.
Tokenization can be simple or complex. Here, we will demonstrate a simple tokenization using the String class's split method. First, declare a string to hold the text that is to be tokenized:
String text = "Mr. Smith went to 123 Washington avenue.";
The split method uses a regular expression argument to specify how the text should be split. In the next code sequence, its argument is the string \\s+. This specifies that one or more whitespace characters be used as the delimiter:
String tokens[] = text.split("\\s+");
A for-each statement is used to display the resulting tokens:
for (String token : tokens) {
    System.out.println(token);
}
When executed, the output will appear as shown here:
Mr.
Smith
went
to
123
Washington
avenue.
In Chapter 2, Finding Parts of Text, we will explore the tokenization process in depth.
Finding sentences
We tend to think of the process of identifying sentences as a simple process. In English, we look for termination characters such as a period, question mark, or exclamation mark. However, as we will see in Chapter 3, Finding Sentences, this is not always that simple. Factors that make it more difficult to find the end of sentences include the use of embedded periods in such phrases as "Dr. Smith" or "204 SW. Park Street".
This process is also called Sentence Boundary Disambiguation (SBD). This is a more significant problem in English than it is in languages such as Chinese or Japanese that have unambiguous sentence delimiters.
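To see why embedded periods cause trouble, consider a naive splitter that treats every period followed by whitespace as a sentence boundary. This sketch is our own illustration, not the approach used in this book:

String text = "He lives at 204 SW. Park Street. Dr. Smith arrived late.";
// Naively split after any period that is followed by whitespace
String[] sentences = text.split("(?<=\\.)\\s+");
for (String sentence : sentences) {
    System.out.println(sentence);
}

Instead of two sentences, this produces four fragments, breaking incorrectly after "SW." and "Dr.".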
Identifying sentences is useful for a number of reasons. Some NLP tasks, such as POS tagging and entity extraction, work on individual sentences. Question-answering applications also need to identify individual sentences. For these processes to work, sentence boundaries must be determined correctly.
The following example demonstrates how sentences can be found using the Stanford DocumentPreprocessor class. This class will generate a list of sentences based on either simple text or an XML document. The class implements the Iterable interface, allowing it to be easily used in a for-each statement.
Start by declaring a string containing the sentences, as shown here:
String paragraph = "The first sentence. The second sentence.";
Create a StringReader object based on the string. This class supports simple read type methods and is used as the argument of the DocumentPreprocessor constructor:
Reader reader = new StringReader(paragraph);
DocumentPreprocessor documentPreprocessor =
    new DocumentPreprocessor(reader);
The DocumentPreprocessor object will now hold the sentences of the paragraph. In the next statement, a list of strings is created and is used to hold the sentences found:
List<String> sentenceList = new LinkedList<String>();
Each element of the documentPreprocessor object is then processed and consists of a list of HasWord objects, as shown in the following block of code. HasWord elements are objects that represent a word. An instance of StringBuilder is used to construct the sentence, with each element of the hasWordList list being appended to it. When the sentence has been built, it is added to the sentenceList list:
for (List<HasWord> element : documentPreprocessor) {
    StringBuilder sentence = new StringBuilder();
    List<HasWord> hasWordList = element;
    for (HasWord token : hasWordList) {
        sentence.append(token).append(" ");
    }
    sentenceList.add(sentence.toString());
}
A for-each statement is then used to display the sentences:
for (String sentence : sentenceList) {
    System.out.println(sentence);
}
The output will appear as shown here:
The first sentence .
The second sentence .
The SBD process is covered in depth in Chapter 3, Finding Sentences.
Finding people and things
Search engines do a pretty good job of meeting the needs of most users. People frequently use a search engine to find the address of a business or movie show times. A word processor can perform a simple search to locate a specific word or phrase in a text. However, this task can get more complicated when we need to consider other factors such as whether synonyms should be used or if we are interested in finding things closely related to a topic.
For example, let's say we visit a website because we are interested in buying a new laptop. After all, who doesn't need a new laptop? When we go to the site, a search engine will be used to find laptops that possess the features we are looking for. The search is frequently conducted based on previous analysis of vendor information. This analysis often requires text to be processed in order to derive useful information that can eventually be presented to a customer.
The presentation may be in the form of facets. These are normally displayed on the left-hand side of a web page. For example, the facets for laptops might include categories such as an Ultrabook, Chromebook, or hard disk size. This is illustrated in the following figure, which is part of an Amazon web page:
Some searches can be very simple. For example, the String class and related classes have methods, such as indexOf and lastIndexOf, that can find the occurrence of a string. In the simple example that follows, the index of the occurrence of the target string is returned by the indexOf method:
String text = "Mr. Smith went to 123 Washington avenue."; String target = "Washington"; int index = text.indexOf(target); System.out.println(index);
The output of this sequence is shown here:
22
This approach is useful for only the simplest problems.
When text is searched, a common technique is to use a data structure called an inverted index. This process involves tokenizing the text and identifying terms of interest along with their positions. The terms and their positions are then stored in the inverted index. When a search is made for a term, it is looked up in the inverted index and the positional information is retrieved. This is faster than scanning the document for the term each time it is needed. This data structure is used frequently in databases, information retrieval systems, and search engines.
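A minimal sketch of the idea, assuming simple whitespace tokenization, is shown here; it needs only the java.util collection classes:

String text = "the cat sat on the mat";
String[] tokens = text.split("\\s+");
// Map each term to the list of token positions where it occurs
Map<String, List<Integer>> index = new HashMap<>();
for (int i = 0; i < tokens.length; i++) {
    index.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
}
// A lookup is now a single map access rather than a scan of the text
System.out.println(index.get("the"));

The output is [0, 4], the two positions where "the" occurs. A production index would also record document identifiers and normalize the terms.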
More sophisticated searches might involve responding to queries such as: "Where are good restaurants in Boston?" To answer this query we might need to perform entity recognition/resolution to identify the significant terms in the query, perform semantic analysis to determine the meaning of the query, search and then rank candidate responses.
To illustrate the process of finding names, we use a combination of a tokenizer and the OpenNLP TokenNameFinderModel class to find names in a text. Since this technique may throw an IOException, we will use a try-catch block to handle it. Declare this block and an array of strings holding the sentences, as shown here:
try {
    String[] sentences = {
        "Tim was a good neighbor. Perhaps not as good a Bob "
        + "Haywood, but still pretty good. Of course Mr. Adam "
        + "took the cake!"};
    // Insert code to find the names here
} catch (IOException ex) {
    ex.printStackTrace();
}
Before the sentences can be processed, we need to tokenize the text. Set up the tokenizer using the Tokenizer class, as shown here:
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
We will need a model to detect names; processing one sentence at a time also avoids grouping terms that span sentence boundaries. We will use the TokenNameFinderModel class based on the model found in the en-ner-person.bin file, which was trained to find person names. An instance of TokenNameFinderModel is created from this file as follows:
TokenNameFinderModel model = new TokenNameFinderModel(
    new File("C:\\OpenNLP Models", "en-ner-person.bin"));
The NameFinderME class will perform the actual task of finding the names. An instance of this class is created using the TokenNameFinderModel instance, as shown here:
NameFinderME finder = new NameFinderME(model);
Use a for-each statement to process each sentence, as shown in the following code sequence. The tokenize method will split the sentence into tokens, and the find method returns an array of Span objects. These objects store the starting and ending indexes of the names identified by the find method:
for (String sentence : sentences) {
    String[] tokens = tokenizer.tokenize(sentence);
    Span[] nameSpans = finder.find(tokens);
    System.out.println(Arrays.toString(
        Span.spansToStrings(nameSpans, tokens)));
}
When executed, it will generate the following output:
[Tim, Bob Haywood, Adam]
The primary focus of Chapter 4, Finding People and Things, is name recognition.
Detecting parts of speech
Another way of classifying the parts of text is at the sentence level. A sentence can be decomposed into individual words or combinations of words according to categories, such as nouns, verbs, adverbs, and prepositions. Most of us learned how to do this in school. We also learned not to end a sentence with a preposition, contrary to what we did in the second sentence of this paragraph.
Detecting the Parts of Speech (POS) is useful in other tasks, such as extracting relationships and determining the meaning of text. Determining these relationships is called parsing. POS processing is useful for enhancing the quality of data sent to other elements of a pipeline.
The internals of a POS process can be complex. Fortunately, most of the complexity is hidden from us and encapsulated in classes and methods. We will use a couple of OpenNLP classes to illustrate this process. We will need a model to detect the POS. The POSModel class will be used, instantiated using the model found in the en-pos-maxent.bin file, as shown here:
POSModel model = new POSModelLoader().load(
    new File("../OpenNLP Models/", "en-pos-maxent.bin"));
The POSTaggerME class is used to perform the actual tagging. Create an instance of this class based on the previous model, as shown here:
POSTaggerME tagger = new POSTaggerME(model);
Next, declare a string containing the text to be processed:
String sentence = "POS processing is useful for enhancing the " + "quality of data sent to other elements of a pipeline.";
Here, we will use a whitespace tokenizer to tokenize the text:
String tokens[] = WhitespaceTokenizer.INSTANCE.tokenize(sentence);
The tag method is then used to find the parts of speech, storing the results in an array of strings:
String[] tags = tagger.tag(tokens);
The tokens and their corresponding tags are then displayed:
for (int i = 0; i < tokens.length; i++) {
    System.out.print(tokens[i] + "[" + tags[i] + "] ");
}
When executed, the following output will be produced:
POS[NNP] processing[NN] is[VBZ] useful[JJ] for[IN] enhancing[VBG] the[DT] quality[NN] of[IN] data[NNS] sent[VBN] to[TO] other[JJ] elements[NNS] of[IN] a[DT] pipeline.[NN]
Each token is followed by an abbreviation, contained within brackets, for its part of speech. For example, NNP means that it is a proper noun. These abbreviations will be covered in Chapter 5, Detecting Parts of Speech, which is devoted to exploring this topic in depth.
Classifying text and documents
Classification is concerned with assigning labels to information found in text or documents. These labels may or may not be known when the process occurs. When labels are known, the process is called classification. When the labels are unknown, the process is called clustering.
Also of interest in NLP is the process of categorization. This is the process of assigning some text element to one of several possible groups. For example, military aircraft can be categorized as fighter, bomber, surveillance, transport, or rescue.
Classifiers can be organized by the type of output they produce. This can be binary, which results in a yes/no output. This type is often used to support spam filters. Other types will result in multiple possible categories.
Classification is more of a process than many of the other NLP tasks. It involves the steps that we will discuss in Understanding NLP models later in the chapter. Due to the length of this process, we will not illustrate the process here. In Chapter 6, Classifying Text and Documents, we will investigate the classification process and provide a detailed example.
Extracting relationships
Relationship extraction identifies relationships that exist in text. For example, with the sentence "The meaning and purpose of life is plain to see", we know that the topic of the sentence is "The meaning and purpose of life". It is related to the last phrase that suggests that it is "plain to see".
Humans can do a pretty good job at determining how things are related to each other, at least at a high level. Determining deep relationships can be more difficult. Using a computer to extract relationships can also be challenging. However, computers can process large datasets to find relationships that would not be obvious to a human or that could not be done in a reasonable period of time.
There are numerous relationships possible. These include relationships such as where something is located, how two people are related to each other, what are the parts of a system, and who is in charge. Relationship extraction is useful for a number of tasks including building knowledge bases, performing analysis of trends, gathering intelligence, and performing product searches. Finding relationships is sometimes called Text Analytics.
There are several techniques that we can use to perform relationship extraction. These are covered in more detail in Chapter 7, Using a Parser to Extract Relationships. Here, we will illustrate one technique to identify relationships within a sentence using the Stanford NLP StanfordCoreNLP class. This class supports a pipeline where annotators are specified and applied to text. Annotators can be thought of as operations to be performed. When an instance of the class is created, the annotators are added using a Properties object found in the java.util package.
First, create an instance of the Properties class. Then assign the annotators as follows:
Properties properties = new Properties();
properties.put("annotators", "tokenize, ssplit, parse");
We used three annotators, which specify the operations to be performed. In this case, these are the minimum required to parse the text. The first one, tokenize, will tokenize the text. The ssplit annotator splits the tokens into sentences. The last annotator, parse, performs the syntactic analysis, or parsing, of the text.
Next, create an instance of the StanfordCoreNLP class using the properties reference variable:
StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
Next, an Annotation instance is created, which uses the text as its argument:
Annotation annotation = new Annotation(
    "The meaning and purpose of life is plain to see.");
Apply the annotate method against the pipeline object to process the annotation object. Finally, use the prettyPrint method to display the results of the processing:
pipeline.annotate(annotation);
pipeline.prettyPrint(annotation, System.out);
The output of this code is shown as follows:
Sentence #1 (11 tokens):
The meaning and purpose of life is plain to see.
[Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT]
[Text=meaning CharacterOffsetBegin=4 CharacterOffsetEnd=11 PartOfSpeech=NN]
[Text=and CharacterOffsetBegin=12 CharacterOffsetEnd=15 PartOfSpeech=CC]
[Text=purpose CharacterOffsetBegin=16 CharacterOffsetEnd=23 PartOfSpeech=NN]
[Text=of CharacterOffsetBegin=24 CharacterOffsetEnd=26 PartOfSpeech=IN]
[Text=life CharacterOffsetBegin=27 CharacterOffsetEnd=31 PartOfSpeech=NN]
[Text=is CharacterOffsetBegin=32 CharacterOffsetEnd=34 PartOfSpeech=VBZ]
[Text=plain CharacterOffsetBegin=35 CharacterOffsetEnd=40 PartOfSpeech=JJ]
[Text=to CharacterOffsetBegin=41 CharacterOffsetEnd=43 PartOfSpeech=TO]
[Text=see CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=VB]
[Text=. CharacterOffsetBegin=47 CharacterOffsetEnd=48 PartOfSpeech=.]
(ROOT
  (S
    (NP
      (NP (DT The) (NN meaning) (CC and) (NN purpose))
      (PP (IN of)
        (NP (NN life))))
    (VP (VBZ is)
      (ADJP (JJ plain)
        (S
          (VP (TO to)
            (VP (VB see))))))
    (. .)))
root(ROOT-0, plain-8)
det(meaning-2, The-1)
nsubj(plain-8, meaning-2)
conj_and(meaning-2, purpose-4)
prep_of(meaning-2, life-6)
cop(plain-8, is-7)
aux(see-10, to-9)
xcomp(plain-8, see-10)
The first part of the output displays the text along with the tokens and POS. This is followed by a tree-like structure showing the organization of the sentence. The last part shows relationships between the elements at a grammatical level. Consider the following example:
prep_of(meaning-2, life-6)
This shows how the preposition, "of", is used to relate the words "meaning" and "life". This information is useful for many text simplification tasks.
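When the relationships are needed programmatically rather than printed, they can be read back from the annotation object. The following sketch assumes the pipeline and annotation objects created earlier, and a CoreNLP release that exposes the collapsed dependencies through SemanticGraphCoreAnnotations; class and method names may vary between versions:

for (CoreMap sentence :
        annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
    SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations
        .CollapsedCCProcessedDependenciesAnnotation.class);
    // Each edge is one grammatical relation, such as prep_of
    for (SemanticGraphEdge edge : graph.edgeIterable()) {
        System.out.println(edge.getRelation() + "("
            + edge.getGovernor().word() + ", "
            + edge.getDependent().word() + ")");
    }
}

This prints each relation in the same governor-dependent form seen in the prettyPrint output.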
Using combined approaches
As suggested earlier, NLP problems often involve using more than one basic NLP task. These are frequently combined in a pipeline to obtain the desired results. We saw one use of a pipeline in the previous section, Extracting relationships.
Most NLP solutions will use pipelines. We will provide several examples of pipelines in Chapter 8, Combined Approaches.