Overview of text-processing tasks

Although there are numerous NLP tasks that can be performed, we will focus on only a subset of them. A brief overview of these tasks is presented here; they are also reflected in the following chapters:

Chapter 2, Finding Parts of Text
Chapter 3, Finding Sentences
Chapter 4, Finding People and Things
Chapter 5, Detecting Parts-of-Speech
Chapter 8, Classifying Text and Documents
Chapter 10, Using Parsers to Extract Relationships
Chapter 11, Combined Approaches

Many of these tasks are used together with other tasks to achieve an objective. We will see this as we progress through the book. For example, tokenization is frequently used as the initial step in many of the other tasks, which makes it a fundamental step.

Finding parts of text

Text can be decomposed into a number of different types of elements, such as words, sentences, and paragraphs. There are several ways of classifying these elements. When we refer to parts of text in this book, we are referring to words, sometimes called tokens. Morphology is the study of the structure of words. We will use a number of morphology terms in our exploration of NLP. There are many ways to classify words, including the following:

Simple words: These are the common connotations of what a word means, including the 17 words in this sentence.
Morphemes: These are the smallest units of a word that are meaningful. For example, in the word bounded, bound is considered to be a morpheme. Morphemes also include parts such as the suffix, ed.
Prefix/suffix: This precedes or follows the root of a word. For example, in the word graduation, the ation is a suffix based on the word graduate.
Synonyms: This is a word that has the same meaning as another word. Words such as small and tiny can be recognized as synonyms. Addressing this issue requires word-sense disambiguation.
Abbreviations: These shorten the use of a word. Instead of using Mister Smith, we use Mr. Smith.
Acronyms: These are used extensively in many fields, including computer science. They use a combination of letters for phrases such as FORmula TRANslation for FORTRAN. They can be recursive, such as GNU. Of course, the one we will continue to use is NLP.
Contractions: We'll find these useful for commonly used combinations of words, such as the first word of this sentence.
Numbers: A specialized word that normally uses only digits. However, more complex versions can include a period and a special character to reflect scientific notation or numbers of a specific base.

Identifying these parts is useful for other NLP tasks. For example, to determine the boundaries of a sentence, it is necessary to break it apart and determine which elements terminate a sentence.

The process of breaking text apart is called tokenization. The result is a stream of tokens. The elements of the text that determine where it should be split are called delimiters. For most English text, whitespace is used as a delimiter. This type of delimiter typically includes blanks, tabs, and newline characters.

Tokenization can be simple or complex. Here, we will demonstrate a simple tokenization using the String class' split method. First, declare a string to hold the text that is to be tokenized:

String text = "Mr. Smith went to 123 Washington avenue.";

The split method uses a regular expression argument to specify how the text should be split. In the following code sequence, its argument is the \\s+ string. This specifies that one or more whitespaces will be used as the delimiter:

String[] tokens = text.split("\\s+");

A for-each statement is used to display the resulting tokens:

for(String token : tokens) { 
  System.out.println(token); 
}

When executed, the output will appear as shown here:

Mr.
Smith
went
to
123
Washington
avenue.

In Chapter 2, Finding Parts of Text, we will explore the tokenization process in depth.
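
Whitespace splitting leaves punctuation attached to tokens such as Mr. and avenue. in the output above. As a rough sketch of a slightly more complex approach, the following uses a regular expression that treats each punctuation mark as its own token. The class name PunctuationAwareTokenizer and the pattern are illustrative assumptions, not part of any NLP library:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class PunctuationAwareTokenizer {
    // A token is either a run of word characters or a single
    // character that is neither a word character nor whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith went to 123 Washington avenue.";
        System.out.println(tokenize(text));
    }
}
```

For the same input string, this prints [Mr, ., Smith, went, to, 123, Washington, avenue, .]. Whether the period after Mr should be a separate token depends on the task, which is one reason tokenization can become complex.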

Finding sentences

We tend to think of the process of identifying sentences as simple. In English, we look for termination characters, such as a period, question mark, or exclamation mark. However, as we will see in Chapter 3, Finding Sentences, this is not always that simple. Factors that make it more difficult to find the end of sentences include the use of embedded periods in such phrases as Dr. Smith or 204 SW. Park Street.

This process is also called sentence boundary disambiguation (SBD). This is a more significant problem in English than it is in languages such as Chinese or Japanese, which have unambiguous sentence delimiters.
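
To see why embedded periods cause trouble, consider a deliberately naive splitter that breaks the text after every period, question mark, or exclamation mark that is followed by whitespace. This is only a sketch; the class name NaiveSentenceSplitter and the sample text (assembled from the phrases mentioned above) are assumptions made for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
class NaiveSentenceSplitter {
    // Split on whitespace that directly follows '.', '?', or '!'.
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=[.?!])\\s+"));
    }

    public static void main(String[] args) {
        String text = "Dr. Smith lives at 204 SW. Park Street. Is he home?";
        for (String piece : split(text)) {
            System.out.println(piece);
        }
    }
}
```

The sample text contains two sentences, but the naive rule produces four pieces, because the periods in Dr. and SW. are treated as sentence terminators.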

Identifying sentences is useful for a number of reasons. Some NLP tasks, such as POS tagging and entity-extraction, work on individual sentences. Question-answering applications also need to identify individual sentences. For these processes to work correctly, sentence boundaries must be determined correctly.

The following example demonstrates how sentences can be found using the Stanford DocumentPreprocessor class. This class will generate a list of sentences based on either simple text or an XML document. The class implements the Iterable interface, allowing it to be easily used in a for-each statement.

Start by declaring a string containing the following sentences:

String paragraph = "The first sentence. The second sentence.";

Create a StringReader object based on the string. This class supports simple read type methods and is used as the argument of the DocumentPreprocessor constructor:

Reader reader = new StringReader(paragraph); 
DocumentPreprocessor documentPreprocessor =  
new DocumentPreprocessor(reader);

The DocumentPreprocessor object will now hold the sentences of the paragraph. In the following statement, a list of strings is created and is used to hold the sentences found:

List<String> sentenceList = new LinkedList<String>();

Each element of the documentPreprocessor object is then processed. Each element consists of a list of HasWord objects, as shown in the following block of code. The HasWord elements are objects that represent a word. An instance of StringBuilder is used to construct the sentence, with each element of the hasWordList list being appended to it. When the sentence has been built, it is added to the sentenceList list:

for (List<HasWord> element : documentPreprocessor) { 
  StringBuilder sentence = new StringBuilder(); 
  List<HasWord> hasWordList = element; 
  for (HasWord token : hasWordList) { 
      sentence.append(token).append(" "); 
  } 
  sentenceList.add(sentence.toString()); 
}

A for-each statement is then used to display the sentences:

for (String sentence : sentenceList) { 
  System.out.println(sentence); 
}

The output will appear as shown here:

The first sentence . 
The second sentence .

The SBD process is covered in depth in Chapter 3, Finding Sentences.

Feature-engineering

Feature-engineering plays an essential role in developing NLP applications; it is very important for machine learning, especially in prediction-based models. It is the process of transforming raw data into features, using domain knowledge, so that machine learning algorithms can work with them. Features give us a more focused view of the raw data. Once the features are identified, feature-selection is done to reduce the dimension of the data. When raw data is processed, patterns or features are detected, but these may not be enough to enhance the training dataset. Engineered features enhance training by providing relevant information that helps in differentiating the patterns in the data. A new feature may not be captured by, or apparent in, the original dataset or the extracted features. Hence, feature-engineering is an art and requires domain expertise. It is still a human craft, something machines are not yet good at.

Chapter 6, Representing Text with Features, will show how text documents can be represented as features, so that traditional techniques that do not work directly on text documents can be applied to them.
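
As a small illustration of turning raw text into features, the following sketch counts how often each word occurs, a representation commonly called a bag of words. This is one simple choice among many and is not necessarily the approach taken in Chapter 6; the class name BagOfWords and the lowercasing and splitting rules are assumptions:

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only.
class BagOfWords {
    // Maps each lowercased word to the number of times it occurs.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "The first sentence. The second sentence.";
        System.out.println(termCounts(text));
    }
}
```

For the sample paragraph, this prints {first=1, second=1, sentence=2, the=2}. Each count can then serve as one numeric feature of the document.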

Finding people and things

Search engines do a pretty good job of meeting the needs of most users. People frequently use search engines to find the address of a business or movie showtimes. A word-processor can perform a simple search to locate a specific word or phrase in a text. However, this task can get more complicated when we need to consider other factors, such as whether synonyms should be used or whether we are interested in finding things closely related to a topic.

For example, let's say we visit a website because we are interested in buying a new laptop. After all, who doesn't need a new laptop? When you go to the site, a search engine will be used to find laptops that possess the features you are looking for. The search is frequently conducted based on a previous analysis of vendor information. This analysis often requires text to be processed in order to derive useful information that can eventually be presented to a customer.

The presentation may be in the form of facets. These are normally displayed on the left-hand side of a web page. For example, the facets for laptops might include categories such as Ultrabook, Chromebook, or Hard Disk Size. This is illustrated in the following screenshot, which is part of an Amazon web page:

Some searches can be very simple. For example, the String class and related classes have methods, such as the indexOf and lastIndexOf methods, that can find an occurrence of one string within another. In the simple example that follows, the index of the first occurrence of the target string is returned by the indexOf method:

String text = "Mr. Smith went to 123 Washington avenue."; 
String target = "Washington"; 
int index = text.indexOf(target); 
System.out.println(index);

The output of this sequence is shown here:

22

This approach is useful for only the simplest problems.

When text is searched, a common technique is to use a data structure called an inverted index. This process involves tokenizing the text and identifying terms of interest in the text along with their position. The terms and their positions are then stored in the inverted index. When a search is made for the term, it is looked up in the inverted index and the positional information is retrieved. This is faster than searching for the term in the document each time it is needed. This data structure is used frequently in databases, information-retrieval systems, and search engines.
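
The inverted index described above can be sketched for a single document as a map from each term to the list of token positions at which it occurs. This is a minimal sketch under simplifying assumptions (one document, lowercased terms, a crude tokenizer); the class name InvertedIndex and its methods are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-document inverted index for illustration only.
class InvertedIndex {
    // Maps each term to the token positions at which it occurs.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    static InvertedIndex build(String text) {
        InvertedIndex index = new InvertedIndex();
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty string
            }
            index.postings
                 .computeIfAbsent(token, key -> new ArrayList<>())
                 .add(position);
            position++;
        }
        return index;
    }

    // Looks the term up instead of rescanning the document.
    List<Integer> positionsOf(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        InvertedIndex index = build("The first sentence. The second sentence.");
        System.out.println(index.positionsOf("sentence"));
        System.out.println(index.positionsOf("the"));
    }
}
```

Running this prints [2, 5] followed by [0, 3]. A production index would typically also record which document each position belongs to.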

More sophisticated searches might involve responding to queries such as: "What are some good restaurants in Boston?" To answer this query, we might need to perform entity-recognition/resolution to identify the significant terms in the query, perform semantic analysis to determine the meaning of the query, search, and then rank the candidate responses.

To illustrate the process of finding names, we use a combination of a tokenizer and the OpenNLP TokenNameFinderModel class to find names in a text. Since this technique may throw IOException, we will use a try...catch block to handle it. Declare this block and an array of strings holding the sentences, as shown here:

try {
    String[] sentences = {
        "Tim was a good neighbor. Perhaps not as good a Bob " +
        "Haywood, but still pretty good. Of course Mr. Adam " +
        "took the cake!"};
    // Insert code to find the names here
} catch (IOException ex) {
    ex.printStackTrace();
}

Before the sentences can be processed, we need to tokenize the text. Set up the tokenizer using the Tokenizer class, as shown here:

Tokenizer tokenizer = SimpleTokenizer.INSTANCE;

We will need to use a model to detect names. Processing the text one sentence at a time avoids grouping terms that may span sentence boundaries. We will use the TokenNameFinderModel class based on the model found in the en-ner-person.bin file. An instance of TokenNameFinderModel is created from this file as follows:

TokenNameFinderModel model = new TokenNameFinderModel( 
new File("C:\\OpenNLP Models", "en-ner-person.bin"));

The NameFinderME class will perform the actual task of finding the name. An instance of this class is created using the TokenNameFinderModel instance, as shown here:

NameFinderME finder = new NameFinderME(model);

Use a for-each statement to process each sentence, as shown in the following code sequence. The tokenize method will split the sentence into tokens and the find method returns an array of Span objects. These objects store the starting and ending indexes for the names identified by the find method:

for (String sentence : sentences) { 
    String[] tokens = tokenizer.tokenize(sentence); 
    Span[] nameSpans = finder.find(tokens); 
    System.out.println(Arrays.toString( 
    Span.spansToStrings(nameSpans, tokens))); 
}

When executed, it will generate the following output:

[Tim, Bob Haywood, Adam]

The primary focus of Chapter 4, Finding People and Things, is name recognition.

Detecting parts of speech

Another way of classifying the parts of text is at the sentence level. A sentence can be decomposed into individual words or combinations of words according to categories, such as nouns, verbs, adverbs, and prepositions. Most of us learned how to do this in school. We also learned not to end a sentence with a preposition, contrary to what we did in the second sentence of this paragraph.

Detecting the POS is useful in other tasks, such as extracting relationships and determining the meaning of text. Determining these relationships is called parsing. POS processing is useful for enhancing the quality of data sent to other elements of a pipeline.

The internals of a POS process can be complex. Fortunately, most of the complexity is hidden from us and encapsulated in classes and methods. We will use a couple of OpenNLP classes to illustrate this process. We will need a model to detect the POS. The POSModel class will be used and instantiated using the model found in the en-pos-maxent.bin file, as shown here:

POSModel model = new POSModelLoader().load( 
    new File("../OpenNLP Models/", "en-pos-maxent.bin"));

The POSTaggerME class is used to perform the actual tagging. Create an instance of this class based on the previous model, as shown here:

POSTaggerME tagger = new POSTaggerME(model);

Next, declare a string containing the text to be processed:

String sentence = "POS processing is useful for enhancing the "  
   + "quality of data sent to other elements of a pipeline.";

Here, we will use WhitespaceTokenizer to tokenize the text:

String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(sentence);

The tag method is then used to find the parts of speech, and the results are stored in an array of strings:

String[] tags = tagger.tag(tokens);

The tokens and their corresponding tags are then displayed:

for(int i=0; i<tokens.length; i++) { 
    System.out.print(tokens[i] + "[" + tags[i] + "] "); 
}

When executed, the following output will be produced:

    POS[NNP] processing[NN] is[VBZ] useful[JJ] for[IN] enhancing[VBG] the[DT] quality[NN] of[IN] data[NNS] sent[VBN] to[TO] other[JJ] elements[NNS] of[IN] a[DT] pipeline.[NN]

Each token is followed by an abbreviation, contained within brackets, for its POS. For example, NNP means that it is a proper noun. These abbreviations will be covered in Chapter 5, Detecting Parts-of-Speech, which is devoted to exploring this topic in depth.
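
As a reading aid for the output above, the following sketch parses token[TAG] pairs and looks up a short description for each tag. It assumes the tags follow the Penn Treebank tag set, which the output appears to use; the class name TagGlossary is hypothetical, and the glossary covers only the tags that occur in this example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class TagGlossary {
    // Assumed Penn Treebank meanings for the tags in the example.
    private static final Map<String, String> DESCRIPTIONS = new HashMap<>();
    static {
        DESCRIPTIONS.put("NNP", "proper noun, singular");
        DESCRIPTIONS.put("NN", "noun, singular or mass");
        DESCRIPTIONS.put("NNS", "noun, plural");
        DESCRIPTIONS.put("VBZ", "verb, third-person singular present");
        DESCRIPTIONS.put("VBG", "verb, gerund or present participle");
        DESCRIPTIONS.put("VBN", "verb, past participle");
        DESCRIPTIONS.put("JJ", "adjective");
        DESCRIPTIONS.put("IN", "preposition or subordinating conjunction");
        DESCRIPTIONS.put("DT", "determiner");
        DESCRIPTIONS.put("TO", "to");
    }

    // Matches pairs written as token[TAG].
    private static final Pattern PAIR =
            Pattern.compile("(\\S+)\\[([^\\]]+)\\]");

    static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "unknown tag");
    }

    // Returns {token, tag} pairs in the order they appear.
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        Matcher matcher = PAIR.matcher(tagged);
        while (matcher.find()) {
            pairs.add(new String[] {matcher.group(1), matcher.group(2)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tagged = "POS[NNP] processing[NN] is[VBZ] useful[JJ]";
        for (String[] pair : parse(tagged)) {
            System.out.println(pair[0] + "\t" + pair[1] + "\t" + describe(pair[1]));
        }
    }
}
```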

Classifying text and documents

Classification is concerned with assigning labels to information found in text or documents. These labels may or may not be known when the process occurs. When labels are known, the process is called classification. When the labels are unknown, the process is called clustering.

Also of interest in NLP is the process of categorization. This is the process of assigning some text element to one of several possible groups. For example, military aircraft can be categorized as fighter, bomber, surveillance, transport, or rescue.

Classifiers can be organized by the type of output they produce. This can be binary, which results in a yes/no output. This type is often used to support spam filters. Other types will result in multiple possible categories.
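
To make the idea of a binary output concrete, here is a toy rule-based filter that answers yes or no. It is not a trained classifier of the kind discussed in Chapter 8; the class name KeywordSpamFilter, the keyword list, and the threshold of two hits are all invented for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy filter for illustration only.
class KeywordSpamFilter {
    // Invented keyword list; a real filter would learn from data.
    private static final Set<String> SPAM_TERMS =
            new HashSet<>(Arrays.asList("free", "winner", "prize"));

    // Binary output: true (spam) if at least two keywords occur.
    static boolean isSpam(String text) {
        int hits = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (SPAM_TERMS.contains(token)) {
                hits++;
            }
        }
        return hits >= 2;
    }

    public static void main(String[] args) {
        System.out.println(isSpam("You are a WINNER! Claim your free prize now."));
        System.out.println(isSpam("Mr. Smith went to 123 Washington avenue."));
    }
}
```

The first call prints true and the second prints false. A classifier with multiple categories would return a label instead of a boolean.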

Classification is more of a process than many of the other NLP tasks. It involves the steps that we will discuss in the Understanding NLP models section. Due to the length of this process, we will not illustrate it here. In Chapter 8, Classifying Text and Documents, we will investigate the classification process and provide a detailed example.

Extracting relationships

Relationship-extraction identifies relationships that exist in text. For example, with the sentence, "The meaning and purpose of life is plain to see," we know that the topic of the sentence is "The meaning and purpose of life." It is related to the last phrase that suggests that it is "plain to see."

Humans can do a pretty good job of determining how things are related to each other, at least at a high level. Determining deep relationships can be more difficult. Using a computer to extract relationships can also be challenging. However, computers can process large datasets to find relationships that would not be obvious to a human, or that a human could not find in a reasonable period of time.

Numerous relationships are possible. These include relationships such as where something is located, how two people are related to each other, the parts of a system, and who is in charge. Relationship-extraction is useful for a number of tasks, including building knowledge bases, performing trend-analysis, gathering intelligence, and performing product searches. Finding relationships is sometimes called text analytics.

There are several techniques that we can use to perform relationship-extraction. These are covered in more detail in Chapter 10, Using Parsers to Extract Relationships. Here, we will illustrate one technique to identify relationships within a sentence using the Stanford NLP StanfordCoreNLP class. This class supports a pipeline where annotators are specified and applied to text. Annotators can be thought of as operations to be performed. When an instance of the class is created, the annotators are added using a Properties object found in the java.util package.

First, create an instance of the Properties class. Then, assign the annotators as follows:

Properties properties = new Properties();         
properties.put("annotators", "tokenize, ssplit, parse");

We used three annotators, which specify the operations to be performed. In this case, these are the minimum required to parse the text. The first one, tokenize, will tokenize the text. The ssplit annotator splits the tokens into sentences. The last annotator, parse, performs the syntactic analysis, the parsing of the text.

Next, create an instance of the StanfordCoreNLP class using the properties reference variable:

StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);

Then, an Annotation instance is created, which uses the text as its argument:

Annotation annotation = new Annotation( 
    "The meaning and purpose of life is plain to see.");

Apply the annotate method against the pipeline object to process the annotation object. Finally, use the prettyPrint method to display the result of the processing:

pipeline.annotate(annotation); 
pipeline.prettyPrint(annotation, System.out);

The output of this code is shown as follows:

    Sentence #1 (11 tokens):
    The meaning and purpose of life is plain to see.
    [Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT] [Text=meaning CharacterOffsetBegin=4 CharacterOffsetEnd=11 PartOfSpeech=NN] [Text=and CharacterOffsetBegin=12 CharacterOffsetEnd=15 PartOfSpeech=CC] [Text=purpose CharacterOffsetBegin=16 CharacterOffsetEnd=23 PartOfSpeech=NN] [Text=of CharacterOffsetBegin=24 CharacterOffsetEnd=26 PartOfSpeech=IN] [Text=life CharacterOffsetBegin=27 CharacterOffsetEnd=31 PartOfSpeech=NN] [Text=is CharacterOffsetBegin=32 CharacterOffsetEnd=34 PartOfSpeech=VBZ] [Text=plain CharacterOffsetBegin=35 CharacterOffsetEnd=40 PartOfSpeech=JJ] [Text=to CharacterOffsetBegin=41 CharacterOffsetEnd=43 PartOfSpeech=TO] [Text=see CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=VB] [Text=. CharacterOffsetBegin=47 CharacterOffsetEnd=48 PartOfSpeech=.] 
    (ROOT
      (S
        (NP
          (NP (DT The) (NN meaning)
            (CC and)
            (NN purpose))
          (PP (IN of)
            (NP (NN life))))
        (VP (VBZ is)
          (ADJP (JJ plain)
            (S
              (VP (TO to)
                (VP (VB see))))))
        (. .)))
    
    root(ROOT-0, plain-8)
    det(meaning-2, The-1)
    nsubj(plain-8, meaning-2)
    conj_and(meaning-2, purpose-4)
    prep_of(meaning-2, life-6)
    cop(plain-8, is-7)
    aux(see-10, to-9)
    xcomp(plain-8, see-10)

The first part of the output displays the text along with the tokens and POS. This is followed by a tree-like structure that shows the organization of the sentence. The last part shows the relationships between the elements at a grammatical level. Consider the following example:

prep_of(meaning-2, life-6)

This shows how the preposition, of, is used to relate the words meaning and life. This information is useful for many text-simplification tasks.
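
Lines in this relation(governor-index, dependent-index) format are easy to take apart for further processing. The following sketch does so with a regular expression; the class name TypedDependency and its fields are hypothetical and are not part of the Stanford API, which provides its own classes for this purpose:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for illustration only.
class TypedDependency {
    // Matches text such as prep_of(meaning-2, life-6).
    private static final Pattern FORMAT = Pattern.compile(
            "(\\w+)\\((\\S+)-(\\d+), (\\S+)-(\\d+)\\)");

    final String relation;
    final String governor;
    final int governorIndex;
    final String dependent;
    final int dependentIndex;

    private TypedDependency(Matcher m) {
        relation = m.group(1);
        governor = m.group(2);
        governorIndex = Integer.parseInt(m.group(3));
        dependent = m.group(4);
        dependentIndex = Integer.parseInt(m.group(5));
    }

    static TypedDependency parse(String line) {
        Matcher matcher = FORMAT.matcher(line.trim());
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unrecognized format: " + line);
        }
        return new TypedDependency(matcher);
    }

    @Override
    public String toString() {
        return governor + " --" + relation + "--> " + dependent;
    }

    public static void main(String[] args) {
        System.out.println(parse("prep_of(meaning-2, life-6)"));
    }
}
```

Running this prints meaning --prep_of--> life, which makes the direction of the relationship explicit.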

Using combined approaches

As suggested earlier, NLP problems often involve using more than one basic NLP task. These are frequently combined in a pipeline to obtain the desired results. We saw one use of a pipeline in the previous section, Extracting relationships.

Most NLP solutions will use pipelines. We will provide several examples of pipelines in Chapter 11, Combined Pipeline.
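
The general shape of a pipeline can be sketched with plain Java function composition, where the output of each stage becomes the input of the next. This is only an illustration of the idea and is unrelated to how StanfordCoreNLP is implemented; the class name TextPipeline and its three stages are assumptions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical three-stage pipeline for illustration only.
class TextPipeline {
    // Stage 1: trim and lowercase the raw text.
    static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT);
    }

    // Stage 2: split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    // Stage 3: remove leading and trailing punctuation from tokens.
    static List<String> stripPunctuation(List<String> tokens) {
        List<String> cleaned = new ArrayList<>();
        for (String token : tokens) {
            String stripped =
                    token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!stripped.isEmpty()) {
                cleaned.add(stripped);
            }
        }
        return cleaned;
    }

    // Chains the stages so each feeds the next.
    static Function<String, List<String>> build() {
        Function<String, String> normalize = TextPipeline::normalize;
        Function<String, List<String>> tokenize = TextPipeline::tokenize;
        Function<List<String>, List<String>> strip =
                TextPipeline::stripPunctuation;
        return normalize.andThen(tokenize).andThen(strip);
    }

    public static void main(String[] args) {
        System.out.println(
                build().apply("Mr. Smith went to 123 Washington avenue."));
    }
}
```

Running this prints [mr, smith, went, to, 123, washington, avenue]. Adding a further stage, such as a tagger, only requires one more andThen call.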