Tokenization is the process of breaking unstructured text, such as paragraphs, sentences, or phrases, down into a list of text values called tokens. A token is the smallest unit that NLP functions use to identify and work with the data. The process creates a natural hierarchy that expresses the relationship from the highest unit down to the lowest. Depending on the source data, a token can represent a word, a sentence, or an individual character.
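The sketch below illustrates this hierarchy, assuming NLTK as the tokenization library (any comparable package, such as spaCy, would work the same way): the text is first split into sentence tokens, and each sentence is then split into word tokens.

# A minimal sketch using NLTK's tokenizers; assumes the 'punkt'
# sentence model has been fetched with nltk.download('punkt').
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization breaks text apart. Each token is the lowest unit."

# Highest to lowest unit: text -> sentences -> words.
for sentence in sent_tokenize(text):
    print(word_tokenize(sentence))
# ['Tokenization', 'breaks', 'text', 'apart', '.']
# ['Each', 'token', 'is', 'the', 'lowest', 'unit', '.']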
Tokenizing a body of text, a sentence, or a phrase typically starts with splitting words on the white space between them. However, identifying each token correctly requires the library to handle exceptions such as hyphens and apostrophes, and to consult a language dictionary, so that each value is properly recognized. Tokenization therefore requires that the language of origin of the text be known before it can be processed. Google Translate, for example, is an NLP solution that can identify...
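A short sketch of why white space alone is not enough, again assuming NLTK's word tokenizer: a naive split leaves punctuation glued to words and cannot separate contractions, while a dictionary-aware tokenizer handles both.

# Contrasting naive whitespace splitting with a proper tokenizer
# (NLTK's Treebank-style word_tokenize assumed here).
from nltk.tokenize import word_tokenize

phrase = "Don't split state-of-the-art ideas, please!"

print(phrase.split())
# ["Don't", 'split', 'state-of-the-art', 'ideas,', 'please!']
# Punctuation stays attached and the apostrophe is not analyzed.

print(word_tokenize(phrase))
# ['Do', "n't", 'split', 'state-of-the-art', 'ideas', ',', 'please', '!']
# The contraction is split, punctuation becomes separate tokens,
# and the hyphenated word is correctly kept as one token.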