Most of the time, text data cannot be used as is, because stray symbols, emoticons, and links make it noisy and unfit for analysis. Data cleaning is the art of extracting the meaningful portions of data by eliminating unnecessary details. Consider the sentence: He tweeted, 'Live coverage of General Elections available at this.tv/show/ge2019. _/\_ Please tune in :) '.
Various symbols, such as "_/\_" and ":)", are present in the sentence. They contribute little to its meaning, so we need to remove them. This not only keeps the focus on the actual content but also reduces the amount of computation needed in later processing. To achieve this, methods such as tokenization and stemming are used. We will learn about them one by one in the upcoming sections.
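To make the idea concrete, here is a minimal sketch that strips the link and the symbol-only tokens from the example tweet using Python's built-in re module. The names clean_text, URL_PATTERN, and SYMBOL_PATTERN are hypothetical, and the two patterns are simplifying assumptions chosen for this one sentence. It is not the tokenization or stemming approach covered in the upcoming sections.

```python
import re

# Assumption: a "link" is a dotted name followed by a slash and a path,
# optionally prefixed with http:// or https://.
URL_PATTERN = re.compile(r"(?:https?://)?\w+(?:\.\w+)+/\S*")

# Assumption: an unwanted symbol is a whitespace-delimited token that
# contains no letters or digits at all, such as "_/\_" or ":)".
SYMBOL_PATTERN = re.compile(r"(?<!\S)[^A-Za-z0-9\s]+(?!\S)")


def clean_text(text):
    """Remove link-like and symbol-only tokens, then tidy the spacing."""
    text = URL_PATTERN.sub(" ", text)
    text = SYMBOL_PATTERN.sub(" ", text)
    # Collapse the runs of spaces left behind by the removals.
    return " ".join(text.split())


tweet = r"Live coverage of General Elections available at this.tv/show/ge2019. _/\_ Please tune in :) "
cleaned = clean_text(tweet)
print(cleaned)
# Live coverage of General Elections available at Please tune in
```

Note that removing the link also removes the period attached to it, so the sentence boundary is lost. Hand-written patterns like these tend to be brittle, which is one reason to prefer more systematic methods.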
Tokenization
Tokenization and word tokenizers were briefly described in Chapter 1, Introduction to Natural Language Processing. Tokenization is the process of splitting sentences...