NLP is not easy. Several factors make it hard: there are hundreds of natural languages, each with its own syntax rules, and words can be ambiguous, with meanings that depend on their context. Here, we will examine a few of the more significant problem areas.
At the character level, several factors need to be considered. One is the encoding scheme used for a document: text can be encoded using schemes such as ASCII, UTF-8, UTF-16, or Latin-1. Another is whether the text should be treated as case-sensitive. Punctuation and numbers may require special processing. We sometimes need to handle emoticons (character combinations and special character images), hyperlinks, repeated punctuation (... or ---), file extensions, and usernames with embedded periods. Many of these are handled by preprocessing text, as we will discuss in the Preparing data section.
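The effect of encoding and simple character-level cleanup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample string and the `normalize` helper are hypothetical, and whether to fold case or collapse repeated punctuation depends on the application.

```python
import re

# Hypothetical sample text, stored as UTF-8 bytes.
raw = "Café costs 12.50!!! Really...".encode("utf-8")

# Decoding with the wrong scheme does not fail here; it silently
# garbles the non-ASCII character instead.
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")


def normalize(text: str) -> str:
    """Fold case and collapse runs of repeated punctuation (illustrative helper)."""
    text = text.casefold()
    # Replace a run of two or more identical marks with a single mark.
    return re.sub(r"([.!?\-])\1+", r"\1", text)


normalized = normalize(as_utf8)
```

Note that the single period inside 12.50 is left alone, because only runs of two or more identical marks are collapsed.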
When we tokenize text, we usually break it up into a sequence of words. These words are called tokens, and the process is referred to as tokenization. When a language uses whitespace characters to delineate words, this process is not too difficult. With a language such as Chinese, it can be quite difficult, since words are not separated by whitespace and a word may span one or more characters.
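As a minimal sketch of the difference, the following Python compares splitting on whitespace with a simple regular expression that also separates punctuation. Both function names are hypothetical, and the pattern is a simplification that would not cover every case a production tokenizer handles.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def regex_tokenize(text):
    """Return words (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


sentence = "Don't stop; tokenizing isn't hard."
by_space = whitespace_tokenize(sentence)
by_regex = regex_tokenize(sentence)

# Text without whitespace between words comes back as a single token.
unsegmented = whitespace_tokenize("我爱自然语言处理")
```

Whitespace splitting leaves tokens such as `stop;` and `hard.` with punctuation attached, and it returns the Chinese sample as one token, which is why such languages need a dedicated segmentation step.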
Words and morphemes may need to be assigned a Part-of-Speech (POS) label that identifies what type of unit each one is. A morpheme is the smallest division of text that has meaning. Prefixes and suffixes are examples of morphemes. Often, we need to consider synonyms, abbreviations, acronyms, and spelling variations when we work with words.
Stemming is another task that may need to be applied. Stemming is the process of finding the stem of a word. For example, words such as walking, walked, and walks have the word stem walk. Search engines often use stemming to assist in matching a query against documents.
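To make the idea concrete, here is a deliberately naive suffix-stripping sketch in Python. It is not the Porter stemmer or any other published algorithm; the suffix list, the minimum stem length, and the `documents` data are assumptions chosen only to cover the walking example.

```python
def naive_stem(word):
    """Strip one common English suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical document collection keyed by document ID.
documents = {1: "She walked home", 2: "He walks daily", 3: "They ran fast"}


def matching_documents(query, docs):
    """Return IDs of documents containing a word with the same stem as the query."""
    target = naive_stem(query)
    return [
        doc_id
        for doc_id, text in docs.items()
        if target in {naive_stem(w) for w in text.split()}
    ]


hits = matching_documents("walking", documents)
```

Because the query and the document words are reduced to the same stem, a search for walking finds documents that contain walked or walks.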
Closely related to stemming is the process of lemmatization. This process determines the base form of a word, called its lemma. For example, the stem of the word operating is oper, but its lemma is operate. Lemmatization is a more refined process than stemming, and uses vocabulary and morphological techniques to find a lemma. This can result in more precise analysis in some situations.
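One way to picture the vocabulary-based side of lemmatization is a lookup table. The sketch below assumes a tiny hand-built dictionary (`LEMMAS` is hypothetical); real lemmatizers combine much larger vocabularies with morphological analysis.

```python
# Hypothetical vocabulary mapping inflected forms to their lemmas.
LEMMAS = {
    "operating": "operate",
    "operated": "operate",
    "operates": "operate",
    "mice": "mouse",
    "better": "good",
}


def lemmatize(word):
    """Look up the lemma; fall back to the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMAS.get(key, key)


lemma = lemmatize("Operating")
```

Unlike suffix stripping, a vocabulary lookup always yields a real word and can handle irregular forms such as mice, at the cost of needing an entry for every form it should recognize.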
Words are combined into phrases and sentences. Sentence detection can be problematic and is not as simple as looking for the periods at the end of a sentence. Periods are found in many places, including abbreviations such as Ms., and in numbers such as 12.834.
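The problem can be illustrated with a short Python sketch that compares splitting on every period with a slightly more careful approach. The abbreviation list and function names are assumptions made for illustration; a robust sentence detector needs far more than this.

```python
# Hypothetical, deliberately short list of abbreviations that end in a period.
ABBREVIATIONS = {"ms.", "mr.", "mrs.", "dr."}


def naive_split(text):
    """Split on every period, which breaks abbreviations and numbers."""
    return [part.strip() for part in text.split(".") if part.strip()]


def split_sentences(text):
    """End a sentence only at a token that ends in . ! or ? and is not an abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks a terminator
        sentences.append(" ".join(current))
    return sentences


text = "Ms. Lee paid 12.834 dollars. She left."
fragments = naive_split(text)
sentences = split_sentences(text)
```

The naive version produces four fragments from two sentences, while the token-based version keeps Ms. and 12.834 intact because the period in 12.834 is not at the end of its token and Ms. is in the abbreviation list.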
We often need to understand which words in a sentence are nouns and which are verbs. We are also often concerned with the relationships between words. For example, coreference resolution determines the relationship between certain words in one or more sentences. Consider the following sentence:
"The city is large but beautiful. It fills the entire valley."
The word "it" is a coreference to "city". When a word has multiple meanings, we might need to perform word-sense disambiguation (WSD) to determine the intended meaning. This can be difficult to do at times. For example, consider "John went back home." Does "home" refer to a house, a city, or some other unit? Its meaning can sometimes be inferred from the context in which it is used. For example, "John went back home. It was situated at the end of a cul-de-sac."
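A rough sketch of context-based disambiguation is to count how many cue words for each candidate sense appear in the surrounding text, loosely in the spirit of gloss-overlap methods. The `SENSES` table and its cue words below are invented for this example and are not taken from any lexical resource.

```python
# Hypothetical cue words for two candidate senses of "home".
SENSES = {
    "house": {"house", "building", "street", "cul-de-sac", "door", "garden"},
    "city": {"city", "town", "population", "mayor", "downtown"},
}


def disambiguate(context, senses):
    """Return the sense whose cue words overlap the context most, or None if no cue appears."""
    words = {w.strip('.,"').lower() for w in context.split()}
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


without_context = disambiguate("John went back home.", SENSES)
with_context = disambiguate(
    "John went back home. It was situated at the end of a cul-de-sac.", SENSES
)
```

With only the first sentence there is no evidence for either sense, so the function returns `None`; once the second sentence is included, the word cul-de-sac tips the result toward the house sense.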
Summarization is the process of producing a short description of different units. These units can include multiple sentences, paragraphs, a document, or multiple documents. The intent may be to identify those sentences that convey the meaning of the unit, to determine the prerequisites for understanding a unit, or to find items within these units. Frequently, the context of the text is important in accomplishing this task.
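One simple way to identify sentences that convey the meaning of a unit is to score each sentence by how frequent its words are in the whole text. The following is a minimal extractive sketch under that assumption; the stopword list, the scoring formula, and the sample text are illustrative choices, not a standard algorithm.

```python
import re
from collections import Counter

# Hypothetical, very small stopword list.
STOPWORDS = {"the", "is", "a", "of", "and", "it", "but", "in", "to"}


def content_words(text):
    """Lowercase the text and return its words, excluding stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return [s for s in sentences if s in top]


text = "The city is large. The city fills the valley. Birds sing."
summary = summarize(text, max_sentences=1)
```

Here the sentences mentioning city score highest because that word occurs most often. This approach ignores context entirely, which is one reason it falls short of the summarization task described above.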