Preparing data
An important step in NLP is finding and preparing data for processing. This includes both training data and the data to be processed. There are several factors to consider. Here, we will focus on the support Java provides for working with characters.
We need to consider how characters are represented. Although we will deal primarily with English text, other languages present unique problems. Not only are there differences in how a character can be encoded, but the order in which text is read also varies. For example, traditional Japanese orders its text in columns read from right to left.
There are also a number of possible encodings. These include ASCII, Latin, and Unicode, to mention a few. A more complete list is found in the following table. Unicode, in particular, is a complex and extensive encoding scheme:
| Encoding | Description |
| --- | --- |
| ASCII | A character encoding that uses 128 values (0-127). |
| Latin | There are several Latin variants that use 256 values. They include various accented characters, such as the umlaut (for example, ü), and other characters. Various versions of Latin have been introduced to address different languages, such as Turkish and Esperanto. |
| Big5 | A two-byte encoding that addresses the Chinese character set. |
| Unicode | There are three encodings for Unicode: UTF-8, UTF-16, and UTF-32. UTF-8 uses 1 to 4 bytes per character, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes. Unicode is able to represent virtually all written languages in use today; even constructed scripts such as Klingon and Elvish have unofficial mappings in its Private Use Area. |
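Java models these encodings with the java.nio.charset.Charset class. As a quick sanity check, the following minimal sketch lists every charset the running JVM supports:

```java
import java.nio.charset.Charset;

public class CharsetListing {
    public static void main(String[] args) {
        // Print the name of every charset this JVM supports,
        // for example US-ASCII, ISO-8859-1, Big5, UTF-8, and UTF-16
        Charset.availableCharsets()
               .keySet()
               .forEach(System.out::println);
    }
}
```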
Java is capable of handling these encoding schemes. The javac executable's -encoding command-line option is used to specify the encoding scheme to use. In the following command line, the Big5 encoding scheme is specified:

```
javac -encoding Big5
```
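The encoding also matters at runtime when reading text. The following minimal sketch reads a Big5-encoded file by wrapping its input stream in an InputStreamReader configured with the Big5 Charset; the file name big5-sample.txt is a hypothetical placeholder:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Big5ReaderExample {
    public static void main(String[] args) throws IOException {
        // "big5-sample.txt" is a hypothetical file encoded in Big5
        Charset big5 = Charset.forName("Big5");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        Files.newInputStream(Paths.get("big5-sample.txt")), big5))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line is decoded from Big5 into Java's internal UTF-16
                System.out.println(line);
            }
        }
    }
}
```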
Character processing is supported using the primitive data type char, the Character class, and several other classes and interfaces, as summarized in the following table:
| Character type | Description |
| --- | --- |
| char | Primitive data type. |
| Character | Wrapper class for char. |
| CharBuffer | This class supports a buffer of char values, providing get/put operations on the character data. |
| CharSequence | An interface implemented by CharBuffer, Segment, String, StringBuffer, and StringBuilder. It provides uniform read access to a sequence of characters. |
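The following minimal sketch illustrates how these types relate: a char is boxed into a Character, a CharBuffer wraps character data, and the various sequence classes can all be handled uniformly through CharSequence:

```java
import java.nio.CharBuffer;

public class CharacterTypesExample {
    public static void main(String[] args) {
        char letter = 'a';                             // primitive char
        Character boxed = letter;                      // autoboxed into the wrapper class
        System.out.println(Character.isLetter(boxed)); // true

        // A CharBuffer wraps a sequence of chars for get/put operations
        CharBuffer buffer = CharBuffer.wrap("sample text");

        // String, StringBuilder, and CharBuffer can all be treated
        // uniformly through the CharSequence interface
        CharSequence[] sequences = { "a String", new StringBuilder("a builder"), buffer };
        for (CharSequence sequence : sequences) {
            System.out.println(sequence.length() + ": " + sequence);
        }
    }
}
```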
Java also provides a number of classes and interfaces to support strings. These are summarized in the following table. We will use these in many of our examples. The String, StringBuffer, and StringBuilder classes provide similar string processing capabilities but differ in whether they can be modified and whether they are thread-safe. The CharacterIterator interface and the StringCharacterIterator class provide techniques to traverse character sequences. The Segment class represents a fragment of text.
| Class/Interface | Description |
| --- | --- |
| String | An immutable string. |
| StringBuffer | Represents a modifiable string. It is thread-safe. |
| StringBuilder | Compatible with the StringBuffer class, but it is not thread-safe. |
| Segment | Represents a fragment of text in a character array. It provides rapid access to character data in an array. |
| CharacterIterator | Defines an iterator for text. It supports bidirectional traversal of text. |
| StringCharacterIterator | A class that implements the CharacterIterator interface for a String. |
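The following minimal sketch shows several of these classes together: a StringBuilder is used to build up text, and a StringCharacterIterator traverses the result in both directions:

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class StringTraversalExample {
    public static void main(String[] args) {
        // StringBuilder: modifiable, but not thread-safe
        StringBuilder builder = new StringBuilder("NLP");
        builder.append(" with Java");
        System.out.println(builder);   // NLP with Java

        // Bidirectional traversal with StringCharacterIterator
        CharacterIterator iterator = new StringCharacterIterator(builder.toString());

        // Forward: from first() until past the end (DONE)
        for (char c = iterator.first(); c != CharacterIterator.DONE; c = iterator.next()) {
            System.out.print(c);
        }
        System.out.println();

        // Backward: from last() back to the beginning
        for (char c = iterator.last(); c != CharacterIterator.DONE; c = iterator.previous()) {
            System.out.print(c);
        }
        System.out.println();
    }
}
```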
We also need to consider the file format if we are reading from a file. Often, data is obtained from sources where the words are annotated. For example, if we use a web page as the source of text, we will find that it is marked up with HTML tags. These are not necessarily relevant to the analysis process and may need to be removed.
The Multipurpose Internet Mail Extensions (MIME) type is used to characterize the format used by a file. Common file types are listed in the following table. We either need to explicitly remove or alter the markup found in a file, or use specialized software to deal with it. Some of the NLP APIs provide tools to deal with specialized file formats.
| File format | MIME type | Description |
| --- | --- | --- |
| Text | text/plain | Simple text file |
| Office Type Document | application/msword, application/vnd.oasis.opendocument.text | Microsoft Office, Open Office |
| PDF | application/pdf | Adobe Portable Document Format |
| HTML | text/html | Web pages |
| XML | text/xml | eXtensible Markup Language |
| Database | Not applicable | Data can be in a number of different formats |
Many of the NLP APIs assume that the data is clean. When it is not, it needs to be cleaned lest we get unreliable and misleading results.
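As a simple illustration of such cleaning, the following sketch strips HTML markup from a string with a naive regular expression. This is only a rough approach; it ignores scripts, comments, and character entities, and a dedicated HTML parser would be more robust:

```java
public class HtmlCleaningExample {
    public static void main(String[] args) {
        String html = "<html><body><p>The <b>quick</b> brown fox.</p></body></html>";

        // Naive cleanup: replace anything that looks like a tag with a space,
        // then collapse the leftover whitespace
        String text = html.replaceAll("<[^>]+>", " ")
                          .replaceAll("\\s+", " ")
                          .trim();

        System.out.println(text);   // The quick brown fox.
    }
}
```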