Search engines do a pretty good job of meeting the needs of most users. People frequently use them to find the address of a business or movie showtimes. A word processor can perform a simple search to locate a specific word or phrase in a text. However, searching becomes more complicated when we need to consider other factors, such as whether synonyms should be matched or whether we are interested in finding things closely related to a topic.
For example, let's say we visit a website because we are interested in buying a new laptop. After all, who doesn't need a new laptop? When we go to the site, a search engine will be used to find laptops that possess the features we are looking for. The search is frequently conducted based on a previous analysis of vendor information. This analysis often requires text to be processed in order to derive useful information that can eventually be presented to a customer.
The presentation may be in the form of facets. These are normally displayed on the left-hand side of a web page. For example, the facets for laptops might include categories such as Ultrabook, Chromebook, or Hard Disk Size. This is illustrated in the following screenshot, which is part of an Amazon web page:
Some searches can be very simple. For example, the String class and related classes have methods, such as indexOf and lastIndexOf, that find the position of a substring within a string. In the simple example that follows, the indexOf method returns the index of the first occurrence of the target string, or -1 if the target is not found:
String text = "Mr. Smith went to 123 Washington avenue.";
String target = "Washington";
int index = text.indexOf(target);
System.out.println(index);
The output of this sequence is shown here:
22
This approach is useful for only the simplest problems.
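To make the limitation concrete, the following is a minimal sketch that repeats the indexOf call to collect every occurrence of a target. The class name IndexOfSketch and the helper findAll are hypothetical names introduced here for illustration. The two-argument form of indexOf, which resumes the search from a given position, is part of the standard String class.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfSketch {
    // Hypothetical helper: collects the starting index of every occurrence of target.
    static List<Integer> findAll(String text, String target) {
        List<Integer> positions = new ArrayList<>();
        if (target.isEmpty()) {
            return positions; // an empty target would match at every position
        }
        int index = text.indexOf(target);
        while (index >= 0) {
            positions.add(index);
            // Resume the search one character past the previous match
            index = text.indexOf(target, index + 1);
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith went to 123 Washington avenue.";
        System.out.println(findAll(text, "Washington")); // [22]
        System.out.println(findAll(text, "t"));          // [7, 13, 15, 29]
        System.out.println(text.lastIndexOf("t"));       // 29
        // The match is exact and case-sensitive, so this finds nothing
        System.out.println(findAll(text, "washington")); // []
    }
}
```

The last call shows why this style of search does not scale to harder problems: a change in case, let alone a synonym or a related term, is enough to miss the match.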
When text is searched, a common technique is to use a data structure called an inverted index. This process involves tokenizing the text and identifying terms of interest in the text along with their position. The terms and their positions are then stored in the inverted index. When a search is made for the term, it is looked up in the inverted index and the positional information is retrieved. This is faster than searching for the term in the document each time it is needed. This data structure is used frequently in databases, information-retrieval systems, and search engines.
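The idea can be sketched in a few lines of plain Java. This is a simplified illustration rather than how any particular search engine implements it: the class name InvertedIndexSketch and its methods are hypothetical, it indexes a single document, and it assumes that splitting on non-alphanumeric characters and lowercasing is an adequate tokenizer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class InvertedIndexSketch {
    // Maps each lowercased term to the token positions where it occurs
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Tokenizes the text (assumption: split on non-alphanumeric runs)
    // and records the position of every term.
    void add(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // a leading delimiter produces an empty first token
            }
            postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            position++;
        }
    }

    // Looks the term up directly instead of rescanning the document
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
            Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("The quick brown fox jumps over the lazy dog. The dog barks.");
        System.out.println(index.lookup("the")); // [0, 6, 9]
        System.out.println(index.lookup("dog")); // [8, 10]
        System.out.println(index.lookup("cat")); // []
    }
}
```

The cost of tokenizing is paid once when the text is added. Each later lookup is a single map access, which is where the speed advantage over repeated scanning comes from.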
More sophisticated searches might involve responding to queries such as: "What are some good restaurants in Boston?" To answer this query, we might need to perform entity-recognition/resolution to identify the significant terms in the query, perform semantic analysis to determine the meaning of the query, search, and then rank the candidate responses.
To illustrate the process of finding names, we use a combination of a tokenizer and the OpenNLP TokenNameFinderModel class to find names in a text. Since loading the model from a file may throw an IOException, we will use a try...catch block to handle it. Declare this block and an array of strings holding the text to be processed, as shown here:
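// Imports used by the snippets in this example (place at the top of the file)
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.Span;
try {
    String[] sentences = {
        "Tim was a good neighbor. Perhaps not as good a Bob " +
        "Haywood, but still pretty good. Of course Mr. Adam " +
        "took the cake!"};
    // Insert code to find the names here
} catch (IOException ex) {
    ex.printStackTrace();
}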
Before the sentences can be processed, we need to tokenize the text. Tokenizer is an interface, so we set up the tokenizer by assigning the shared SimpleTokenizer instance to a variable of that type, as shown here:
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
We will also need a model to detect the names. In a fuller pipeline, text is normally split into sentences first so that the name finder does not group terms that span sentence boundaries. We will use the TokenNameFinderModel class based on the model found in the en-ner-person.bin file, which is trained to recognize the names of people. An instance of TokenNameFinderModel is created from this file as follows:
TokenNameFinderModel model = new TokenNameFinderModel(
    new File("C:\\OpenNLP Models", "en-ner-person.bin"));
The NameFinderME class will perform the actual task of finding the names. An instance of this class is created using the TokenNameFinderModel instance, as shown here:
NameFinderME finder = new NameFinderME(model);
Use a for-each statement to process each string in the sentences array, as shown in the following code sequence. The tokenize method splits the text into tokens, and the find method returns an array of Span objects. These objects store the starting and ending token indexes (the ending index is exclusive) for the names identified by the find method. The static spansToStrings method then converts the spans back into the text they cover:
for (String sentence : sentences) {
    String[] tokens = tokenizer.tokenize(sentence);
    Span[] nameSpans = finder.find(tokens);
    System.out.println(Arrays.toString(
        Span.spansToStrings(nameSpans, tokens)));
}
When executed, it will generate the following output:
[Tim, Bob Haywood, Adam]
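It may help to see how a span of token indexes maps back to text. The following is a minimal sketch in plain Java that does not use OpenNLP. The class SpanSketch and the method covered are hypothetical, the token array is written out by hand, and the index pair 5 and 7 is chosen for illustration rather than taken from model output. It assumes the convention that the ending index is exclusive and that the covered tokens are joined with single spaces.

```java
import java.util.Arrays;

class SpanSketch {
    // Hypothetical helper: joins tokens[start] through tokens[end - 1]
    // with single spaces. The end index is exclusive.
    static String covered(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        // Hand-written tokens for one sentence of the example text
        String[] tokens = {"Perhaps", "not", "as", "good", "a", "Bob",
            "Haywood", ",", "but", "still", "pretty", "good", "."};
        // A span covering token indexes 5 and 6
        System.out.println(covered(tokens, 5, 7)); // Bob Haywood
    }
}
```

Because the spans index tokens rather than characters, they are only meaningful together with the exact token array that was passed to the find method.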
The primary focus of Chapter 4, Finding People and Things, is name recognition.