Natural Language Processing with Java

You're reading from Natural Language Processing with Java: Explore various approaches to organize and extract useful text from unstructured data using Java

Product type: Paperback
Published: Mar 2015
ISBN-13: 9781784391799
Length: 262 pages
Edition: 1st Edition
Author: Richard M. Reese

Overview of text processing tasks

Although there are numerous NLP tasks that can be performed, we will focus only on a subset of these tasks. A brief overview of these tasks is presented here, which is also reflected in the following chapters:

  • Finding Parts of Text
  • Finding Sentences
  • Finding People and Things
  • Detecting Parts of Speech
  • Classifying Text and Documents
  • Extracting Relationships
  • Combined Approaches

Many of these tasks are used together with other tasks to achieve some objective. We will see this as we progress through the book. For example, tokenization is frequently used as an initial step in many of the other tasks. It is a fundamental and basic step.

Finding parts of text

Text can be decomposed into a number of different types of elements such as words, sentences, and paragraphs. There are several ways of classifying these elements. When we refer to parts of text in this book, we are referring to words, sometimes called tokens. Morphology is the study of the structure of words. We will use a number of morphology terms in our exploration of NLP. However, there are many ways of classifying words including the following:

  • Simple words: These are the common connotations of what a word means including the 17 words of this sentence.
  • Morphemes: These are the smallest units of a word that are meaningful. For example, in the word "bounded", "bound" is considered to be a morpheme. Morphemes also include parts such as the suffix, "ed".
  • Prefix/Suffix: This precedes or follows the root of a word. For example, in the word graduation, the "ation" is a suffix based on the word "graduate".
  • Synonyms: This is a word that has the same meaning as another word. Words such as small and tiny can be recognized as synonyms. Addressing this issue requires word sense disambiguation.
  • Abbreviations: These shorten the use of a word. Instead of using Mister Smith, we use Mr. Smith.
  • Acronyms: These are used extensively in many fields including computer science. They use a combination of letters for phrases such as FORmula TRANslation for FORTRAN. They can be recursive such as GNU. Of course, the one we will continue to use is NLP.
  • Contractions: We'll find these useful for commonly used combinations of words such as the first word of this sentence.
  • Numbers: A specialized word that normally uses only digits. However, more complex versions can include a period and a special character to reflect scientific notation or numbers of a specific base.

Identifying these parts is useful for other NLP tasks. For example, to determine the boundaries of a sentence, it is necessary to break it apart and determine which elements terminate a sentence.

The process of breaking text apart is called tokenization. The result is a stream of tokens. The elements of the text that determine where it should be split are called delimiters. For most English text, whitespace is used as a delimiter. This type of delimiter typically includes blanks, tabs, and newline characters.

Tokenization can be simple or complex. Here, we will demonstrate a simple tokenization using the String class' split method. First, declare a string to hold the text that is to be tokenized:

String text = "Mr. Smith went to 123 Washington avenue.";

The split method uses a regular expression argument to specify how the text should be split. In the next code sequence, its argument is the string \\s+. This specifies that one or more whitespace characters be used as the delimiter:

String tokens[] = text.split("\\s+");

A for-each statement is used to display the resulting tokens:

for(String token : tokens) {
  System.out.println(token);
}

When executed, the output will appear as shown here:

Mr.
Smith
went
to
123
Washington
avenue.

In Chapter 2, Finding Parts of Text, we will explore the tokenization process in depth.
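
The fragments above can be collected into a single runnable class. The following is a minimal sketch that uses only the JDK; the class name SplitTokenizerDemo and the helper method tokenize are illustrative names and do not come from the book's code.

```java
class SplitTokenizerDemo {

    // Splits the text on runs of whitespace (blanks, tabs, newlines).
    static String[] tokenize(String text) {
        // trim() avoids an empty first token when the text starts with whitespace
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Mr. Smith went to 123 Washington avenue.";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

Note that punctuation stays attached to the neighboring word, as in "Mr." and "avenue.", which is one reason whitespace splitting is usually only a starting point.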

Finding sentences

We tend to think of the process of identifying sentences as a simple process. In English, we look for termination characters such as a period, question mark, or exclamation mark. However, as we will see in Chapter 3, Finding Sentences, this is not always that simple. Factors that make it more difficult to find the end of sentences include the use of embedded periods in such phrases as "Dr. Smith" or "204 SW. Park Street".

This process is also called Sentence Boundary Disambiguation (SBD). This is a more significant problem in English than it is in languages such as Chinese or Japanese that have unambiguous sentence delimiters.

Identifying sentences is useful for a number of reasons. Some NLP tasks, such as POS tagging and entity extraction, work on individual sentences. Question-answering applications also need to identify individual sentences. For these processes to work correctly, sentence boundaries must be determined correctly.

The following example demonstrates how sentences can be found using the Stanford DocumentPreprocessor class. This class will generate a list of sentences based on either simple text or an XML document. The class implements the Iterable interface allowing it to be easily used in a for-each statement.

Start by declaring a string containing the sentences, as shown here:

String paragraph = "The first sentence. The second sentence.";

Create a StringReader object based on the string. This class supports simple read type methods and is used as the argument of the DocumentPreprocessor constructor:

Reader reader = new StringReader(paragraph);
DocumentPreprocessor documentPreprocessor =
    new DocumentPreprocessor(reader);

The DocumentPreprocessor object will now hold the sentences of the paragraph. In the next statement, a list of strings is created and is used to hold the sentences found:

List<String> sentenceList = new LinkedList<String>();

Each element of the documentPreprocessor object is then processed. Each element is a list of HasWord objects, as shown in the following block of code. The HasWord elements are objects that represent a word. An instance of StringBuilder is used to construct the sentence, with each element of hasWordList being appended to it, followed by a space. When the sentence has been built, it is added to the sentenceList list:

for (List<HasWord> element : documentPreprocessor) {
  StringBuilder sentence = new StringBuilder();
  List<HasWord> hasWordList = element;
  for (HasWord token : hasWordList) {
    sentence.append(token).append(" ");
  }
  sentenceList.add(sentence.toString());
}

A for-each statement is then used to display the sentences:

for (String sentence : sentenceList) {
  System.out.println(sentence);
}

The output will appear as shown here:

The first sentence . 
The second sentence . 

The SBD process is covered in depth in Chapter 3, Finding Sentences.
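
For comparison, the JDK ships its own rule-based sentence splitter, java.text.BreakIterator. The sketch below is not the Stanford approach shown above; it is a swapped-in technique that needs no external library. The class name BreakIteratorSentences and the method findSentences are illustrative names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSentences {

    // Returns the sentences found by the JDK's rule-based sentence iterator.
    static List<String> findSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            // Each boundary pair delimits one sentence, including trailing spaces
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String paragraph = "The first sentence. The second sentence.";
        for (String sentence : findSentences(paragraph)) {
            System.out.println(sentence);
        }
    }
}
```

Because no tokenization takes place here, the period is not separated from the preceding word as it is in the Stanford output. A rule-based splitter like this may still break after abbreviations such as "Dr.", which is part of what makes SBD difficult.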

Finding people and things

Search engines do a pretty good job of meeting the needs of most users. People frequently use a search engine to find the address of a business or movie show times. A word processor can perform a simple search to locate a specific word or phrase in a text. However, this task can get more complicated when we need to consider other factors such as whether synonyms should be used or if we are interested in finding things closely related to a topic.

For example, let's say we visit a website because we are interested in buying a new laptop. After all, who doesn't need a new laptop? When you go to the site, a search engine will be used to find laptops that possess the features you are looking for. The search is frequently conducted based on previous analysis of vendor information. This analysis often requires text to be processed in order to derive useful information that can eventually be presented to a customer.

The presentation may be in the form of facets. These are normally displayed on the left-hand side of a web page. For example, the facets for laptops might include categories such as an Ultrabook, Chromebook, or hard disk size. This is illustrated in the following figure, which is part of an Amazon web page:

[Figure: Finding people and things]

Some searches can be very simple. For example, the String class and related classes have methods such as indexOf and lastIndexOf that can find an occurrence of one string within another. In the simple example that follows, the indexOf method returns the index of the first occurrence of the target string:

String text = "Mr. Smith went to 123 Washington avenue.";
String target = "Washington";
int index = text.indexOf(target);
System.out.println(index);

The output of this sequence is shown here:

22

This approach is useful for only the simplest problems.
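
Even within these limits, indexOf can be called repeatedly with a start offset to locate every occurrence rather than only the first. The following is a minimal sketch; the class name FindAllOccurrences and the method findAll are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

class FindAllOccurrences {

    // Returns the start index of every occurrence of target in text.
    static List<Integer> findAll(String text, String target) {
        List<Integer> positions = new ArrayList<>();
        if (target.isEmpty()) {
            return positions; // an empty target would match at every index
        }
        int index = text.indexOf(target);
        while (index >= 0) {
            positions.add(index);
            // Resume one character later so overlapping matches are also found
            index = text.indexOf(target, index + 1);
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith went to 123 Washington avenue.";
        System.out.println(findAll(text, "Washington"));
    }
}
```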

When text is searched, a common technique is to use a data structure called an inverted index. This process involves tokenizing the text and identifying terms of interest in the text along with their position. The terms and their positions are then stored in the inverted index. When a search is made for the term, it is looked up in the inverted index and the positional information is retrieved. This is faster than searching for the term in the document each time it is needed. This data structure is used frequently in databases, information retrieval systems, and search engines.
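
The idea can be sketched for a single document with a map from each term to the token positions where it occurs. This is a simplified illustration, not a production design: the class name InvertedIndexSketch, the whitespace tokenization, and the normalization (lowercasing and stripping surrounding punctuation) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class InvertedIndexSketch {

    // Maps each normalized term to the token positions at which it occurs
    private final Map<String, List<Integer>> index = new HashMap<>();

    // Tokenizes the text on whitespace and records each term's position.
    void add(String text) {
        String[] tokens = text.trim().split("\\s+");
        for (int position = 0; position < tokens.length; position++) {
            String term = normalize(tokens[position]);
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
            }
        }
    }

    // Looks the term up in the index instead of rescanning the text.
    List<Integer> lookup(String term) {
        return index.getOrDefault(normalize(term), Collections.emptyList());
    }

    // Lowercases the token and strips leading and trailing non-word characters
    private static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT).replaceAll("^\\W+|\\W+$", "");
    }

    public static void main(String[] args) {
        InvertedIndexSketch sketch = new InvertedIndexSketch();
        sketch.add("Mr. Smith went to 123 Washington avenue.");
        System.out.println(sketch.lookup("Washington"));
    }
}
```

The positions stored here are token offsets rather than character offsets. An index over many documents would typically also store a document identifier alongside each position.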

More sophisticated searches might involve responding to queries such as: "Where are good restaurants in Boston?" To answer this query, we might need to perform entity recognition/resolution to identify the significant terms in the query, perform semantic analysis to determine the meaning of the query, and then search for and rank candidate responses.

To illustrate the process of finding names, we use a combination of a tokenizer and the OpenNLP TokenNameFinderModel class to find names in a text. Since this technique may throw an IOException, we will use a try-catch block to handle it. Declare this block and an array of strings holding the sentences, as shown here:

try {
    String[] sentences = { "Tim was a good neighbor. Perhaps not as good a Bob " +
        "Haywood, but still pretty good. Of course Mr. Adam " +
        "took the cake!"};
    // Insert code to find the names here
} catch (IOException ex) {
    ex.printStackTrace();
}

Before the sentences can be processed, we need to tokenize the text. Set up the tokenizer using the Tokenizer class, as shown here:

Tokenizer tokenizer = SimpleTokenizer.INSTANCE;

We will need to use a model to detect sentences. This is needed to avoid grouping terms that may span sentence boundaries. We will use the TokenNameFinderModel class based on the model found in the en-ner-person.bin file. An instance of TokenNameFinderModel is created from this file as follows:

TokenNameFinderModel model = new TokenNameFinderModel(
    new File("C:\\OpenNLP Models", "en-ner-person.bin"));

The NameFinderME class will perform the actual task of finding the name. An instance of this class is created using the TokenNameFinderModel instance, as shown here:

NameFinderME finder = new NameFinderME(model);

Use a for-each statement to process each sentence as shown in the following code sequence. The tokenize method will split the sentence into tokens and the find method returns an array of Span objects. These objects store the starting and ending indexes for the names identified by the find method:

for (String sentence : sentences) {
    String[] tokens = tokenizer.tokenize(sentence);
    Span[] nameSpans = finder.find(tokens);
    System.out.println(Arrays.toString(
    Span.spansToStrings(nameSpans, tokens)));
}

When executed, it will generate the following output:

[Tim, Bob Haywood, Adam]

The primary focus of Chapter 4, Finding People and Things, is name recognition.

Detecting parts of speech

Another way of classifying the parts of text is at the sentence level. A sentence can be decomposed into individual words or combinations of words according to categories, such as nouns, verbs, adverbs, and prepositions. Most of us learned how to do this in school. We also learned not to end a sentence with a preposition, contrary to what we did in the second sentence of this paragraph.

Detecting the Parts of Speech (POS) is useful in other tasks such as extracting relationships and determining the meaning of text. Determining these relationships is called Parsing. POS processing is useful for enhancing the quality of data sent to other elements of a pipeline.

The internals of a POS process can be complex. Fortunately, most of the complexity is hidden from us and encapsulated in classes and methods. We will use a couple of OpenNLP classes to illustrate this process. We will need a model to detect the POS. The POSModel class will be used and instantiated using the model found in the en-pos-maxent.bin file, as shown here:

POSModel model = new POSModelLoader().load(
    new File("../OpenNLP Models/", "en-pos-maxent.bin"));

The POSTaggerME class is used to perform the actual tagging. Create an instance of this class based on the previous model as shown here:

POSTaggerME tagger = new POSTaggerME(model);

Next, declare a string containing the text to be processed:

String sentence = "POS processing is useful for enhancing the " 
   + "quality of data sent to other elements of a pipeline.";

Here, we will use a whitespace tokenizer to tokenize the text:

String tokens[] = WhitespaceTokenizer.INSTANCE.tokenize(sentence);

The tag method is then used to find the parts of speech. It returns the results in an array of strings:

String[] tags = tagger.tag(tokens);

The tokens and their corresponding tags are then displayed:

for(int i=0; i<tokens.length; i++) {
    System.out.print(tokens[i] + "[" + tags[i] + "] ");
}

When executed, the following output will be produced:

POS[NNP] processing[NN] is[VBZ] useful[JJ] for[IN] enhancing[VBG] the[DT] quality[NN] of[IN] data[NNS] sent[VBN] to[TO] other[JJ] elements[NNS] of[IN] a[DT] pipeline.[NN]

Each token is followed by an abbreviation, contained within brackets, for its part of speech. For example, NNP means that it is a proper noun. These abbreviations will be covered in Chapter 5, Detecting Parts of Speech, which is devoted to exploring this topic in depth.
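
As a quick reference for reading the output above, the following sketch maps the tags that appear in it to short descriptions and decodes items in the token[TAG] format. It assumes the tags follow the Penn Treebank tag set, which these abbreviations match. The class name PosTagLegend and the method describe are illustrative names, and the table covers only the tags shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PosTagLegend {

    // Descriptions for the tags seen in the sample output (Penn Treebank names)
    static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();
    static {
        DESCRIPTIONS.put("NNP", "proper noun, singular");
        DESCRIPTIONS.put("NN", "noun, singular or mass");
        DESCRIPTIONS.put("NNS", "noun, plural");
        DESCRIPTIONS.put("VBZ", "verb, third person singular present");
        DESCRIPTIONS.put("VBG", "verb, gerund or present participle");
        DESCRIPTIONS.put("VBN", "verb, past participle");
        DESCRIPTIONS.put("JJ", "adjective");
        DESCRIPTIONS.put("IN", "preposition or subordinating conjunction");
        DESCRIPTIONS.put("DT", "determiner");
        DESCRIPTIONS.put("TO", "to");
    }

    // Returns a description for the tag, or a placeholder for tags not listed.
    static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "unknown tag");
    }

    public static void main(String[] args) {
        String tagged = "POS[NNP] processing[NN] is[VBZ] useful[JJ]";
        for (String item : tagged.split("\\s+")) {
            // Each item has the form token[TAG]
            int open = item.lastIndexOf('[');
            String token = item.substring(0, open);
            String tag = item.substring(open + 1, item.length() - 1);
            System.out.println(token + " -> " + describe(tag));
        }
    }
}
```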

Classifying text and documents

Classification is concerned with assigning labels to information found in text or documents. These labels may or may not be known when the process occurs. When labels are known, the process is called classification. When the labels are unknown, the process is called clustering.

Also of interest in NLP is the process of categorization. This is the process of assigning a text element to one of several possible groups. For example, military aircraft can be categorized as either fighter, bomber, surveillance, transport, or rescue.

Classifiers can be organized by the type of output they produce. This can be binary, which results in a yes/no output. This type is often used to support spam filters. Other types will result in multiple possible categories.
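
To make the idea of a binary output concrete, here is a deliberately naive sketch of a yes/no spam check. It is a hand-written rule, not a trained classifier of the kind discussed in Chapter 6. The class name KeywordSpamFilter, the trigger words, and the threshold are all hypothetical values chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class KeywordSpamFilter {

    // Hypothetical trigger words, chosen for illustration only
    private static final Set<String> TRIGGERS =
        new HashSet<>(Arrays.asList("free", "winner", "prize"));

    // Hypothetical cutoff: this many trigger words marks a message as spam
    private static final int THRESHOLD = 2;

    // Binary classification: returns true (spam) or false (not spam).
    static boolean isSpam(String message) {
        int hits = 0;
        for (String token : message.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (TRIGGERS.contains(token)) {
                hits++;
            }
        }
        return hits >= THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isSpam("You are a WINNER! Claim your free prize now"));
        System.out.println(isSpam("Lunch is free today"));
    }
}
```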

Classification is more of a process than many of the other NLP tasks. It involves the steps that we will discuss in Understanding NLP models later in the chapter. Due to the length of this process, we will not illustrate the process here. In Chapter 6, Classifying Text and Documents, we will investigate the classification process and provide a detailed example.

Extracting relationships

Relationship extraction identifies relationships that exist in text. For example, with the sentence "The meaning and purpose of life is plain to see", we know that the topic of the sentence is "The meaning and purpose of life". It is related to the last phrase that suggests that it is "plain to see".

Humans can do a pretty good job at determining how things are related to each other, at least at a high level. Determining deep relationships can be more difficult. Using a computer to extract relationships can also be challenging. However, computers can process large datasets to find relationships that would not be obvious to a human or that a human could not find in a reasonable period of time.

There are numerous relationships possible. These include relationships such as where something is located, how two people are related to each other, what the parts of a system are, and who is in charge. Relationship extraction is useful for a number of tasks including building knowledge bases, performing analysis of trends, gathering intelligence, and performing product searches. Finding relationships is sometimes called Text Analytics.

There are several techniques that we can use to perform relationship extractions. These are covered in more detail in Chapter 7, Using a Parser to Extract Relationships. Here, we will illustrate one technique to identify relationships within a sentence using the Stanford NLP StanfordCoreNLP class. This class supports a pipeline where annotators are specified and applied to text. Annotators can be thought of as operations to be performed. When an instance of the class is created, the annotators are added using a Properties object found in the java.util package.

First, create an instance of the Properties class. Then assign the annotators as follows:

Properties properties = new Properties();
properties.setProperty("annotators", "tokenize, ssplit, parse");

We used three annotators, which specify the operations to be performed. In this case, these are the minimum required to parse the text. The first one, tokenize, will tokenize the text. The ssplit annotator splits the tokens into sentences. The last annotator, parse, performs the syntactic analysis, parsing, of the text.

Next, create an instance of the StanfordCoreNLP class using the properties reference variable:

StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);

Next, an Annotation instance is created, which uses the text as its argument:

Annotation annotation = new Annotation(
    "The meaning and purpose of life is plain to see.");

Apply the annotate method against the pipeline object to process the annotation object. Finally, use the prettyPrint method to display the result of the processing:

pipeline.annotate(annotation);
pipeline.prettyPrint(annotation, System.out);

The output of this code is shown as follows:

Sentence #1 (11 tokens):
The meaning and purpose of life is plain to see.
[Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT] [Text=meaning CharacterOffsetBegin=4 CharacterOffsetEnd=11 PartOfSpeech=NN] [Text=and CharacterOffsetBegin=12 CharacterOffsetEnd=15 PartOfSpeech=CC] [Text=purpose CharacterOffsetBegin=16 CharacterOffsetEnd=23 PartOfSpeech=NN] [Text=of CharacterOffsetBegin=24 CharacterOffsetEnd=26 PartOfSpeech=IN] [Text=life CharacterOffsetBegin=27 CharacterOffsetEnd=31 PartOfSpeech=NN] [Text=is CharacterOffsetBegin=32 CharacterOffsetEnd=34 PartOfSpeech=VBZ] [Text=plain CharacterOffsetBegin=35 CharacterOffsetEnd=40 PartOfSpeech=JJ] [Text=to CharacterOffsetBegin=41 CharacterOffsetEnd=43 PartOfSpeech=TO] [Text=see CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=VB] [Text=. CharacterOffsetBegin=47 CharacterOffsetEnd=48 PartOfSpeech=.] 
(ROOT
  (S
    (NP
      (NP (DT The) (NN meaning)
        (CC and)
        (NN purpose))
      (PP (IN of)
        (NP (NN life))))
    (VP (VBZ is)
      (ADJP (JJ plain)
        (S
          (VP (TO to)
            (VP (VB see))))))
    (. .)))

root(ROOT-0, plain-8)
det(meaning-2, The-1)
nsubj(plain-8, meaning-2)
conj_and(meaning-2, purpose-4)
prep_of(meaning-2, life-6)
cop(plain-8, is-7)
aux(see-10, to-9)
xcomp(plain-8, see-10)

The first part of the output displays the text along with the tokens and POS. This is followed by a tree-like structure showing the organization of the sentence. The last part shows relationships between the elements at a grammatical level. Consider the following example:

prep_of(meaning-2, life-6)

This shows how the preposition, "of", is used to relate the words "meaning" and "life". This information is useful for many text simplification tasks.
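
Lines in this format are easy to post-process. The following sketch uses a regular expression to split a dependency line into its relation name, the two words, and their token indexes. It assumes the textual layout shown in the output above; the class name DependencyLineParser and the method parse are illustrative names and are not part of the Stanford API.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DependencyLineParser {

    // Matches relation(word-index, word-index), e.g. prep_of(meaning-2, life-6)
    private static final Pattern LINE =
        Pattern.compile("(\\w+)\\((.+)-(\\d+), (.+)-(\\d+)\\)");

    // Returns {relation, first word, first index, second word, second index}.
    static String[] parse(String line) {
        Matcher matcher = LINE.matcher(line.trim());
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Not a dependency line: " + line);
        }
        return new String[] {
            matcher.group(1), matcher.group(2), matcher.group(3),
            matcher.group(4), matcher.group(5)
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("prep_of(meaning-2, life-6)")));
    }
}
```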

Using combined approaches

As suggested earlier, NLP problems often involve using more than one basic NLP task. These are frequently combined in a pipeline to obtain the desired results. We saw one use of a pipeline in the previous section, Extracting relationships.

Most NLP solutions will use pipelines. We will provide several examples of pipelines in Chapter 8, Combined Approaches.
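
The notion of a pipeline can be sketched with plain JDK functions, where the output of one stage becomes the input of the next. This is only an illustration of the chaining idea under simplifying assumptions: the stages use naive regular expressions rather than trained models, and the class name PipelineSketch and the stage names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class PipelineSketch {

    // Stage 1: naive sentence splitting on whitespace that follows ., ? or !
    static final Function<String, List<String>> SPLIT_SENTENCES =
        text -> Arrays.asList(text.trim().split("(?<=[.?!])\\s+"));

    // Stage 2: whitespace tokenization of each sentence
    static final Function<List<String>, List<String[]>> TOKENIZE = sentences -> {
        List<String[]> tokenized = new ArrayList<>();
        for (String sentence : sentences) {
            tokenized.add(sentence.split("\\s+"));
        }
        return tokenized;
    };

    // The pipeline feeds the sentences from stage 1 into stage 2
    static final Function<String, List<String[]>> PIPELINE =
        SPLIT_SENTENCES.andThen(TOKENIZE);

    public static void main(String[] args) {
        String text = "The first sentence. The second sentence.";
        for (String[] tokens : PIPELINE.apply(text)) {
            System.out.println(Arrays.toString(tokens));
        }
    }
}
```

Further stages, such as tagging or name finding, could be appended with additional andThen calls as long as each stage accepts the previous stage's output type.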
