Natural language processing
Once the text content of a web page has been extracted, it is usually preprocessed to remove parts that carry no relevant information. The text is tokenized, that is, split into a list of words (tokens), and all punctuation marks are removed. Another common step is to remove stopwords: words that build the syntax of a sentence but carry little information by themselves, such as conjunctions, articles, and prepositions (a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, to, was, what, when, where, who, will, with, and many others).
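The two steps above can be sketched in a few lines of plain Python. The stopword list here is only a small illustrative subset; in practice one would use a full list such as the one shipped with NLTK or spaCy.

```python
import re

# Small illustrative subset of English stopwords; real pipelines use
# much larger lists (e.g. from NLTK's stopwords corpus).
STOPWORDS = {
    "a", "an", "the", "is", "are", "of", "on", "in", "to", "and",
    "that", "this", "with", "for", "was", "what", "when", "where",
}

def preprocess(text):
    """Tokenize, lowercase, strip punctuation, and drop stopwords."""
    # Keeping only alphabetic runs both tokenizes the text and
    # discards punctuation in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The quick brown fox is on the hill."))
# ['quick', 'brown', 'fox', 'hill']
```

The regular-expression tokenizer is deliberately crude; a production system would use a proper tokenizer that also handles contractions, numbers, and hyphenated words.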
Many words in English (or any other language) share the same root but carry different suffixes or prefixes. For example, the words think, thinking, and thinker all share the root think, so their core meaning is the same, but their roles in a sentence differ (verb, noun, and so on). The procedure that reduces all such variants to their common root is called stemming.
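A toy suffix-stripping stemmer makes the idea concrete. This sketch only removes a handful of common suffixes and is not a real stemming algorithm; in practice one would use an established stemmer such as the Porter stemmer (available, for example, through NLTK).

```python
def stem(word):
    """Approximate a word's root by stripping a few common suffixes.

    A toy illustration only; real systems use a full stemming
    algorithm such as the Porter stemmer.
    """
    # Check longer suffixes first so e.g. "ers" is not matched as "er" + "s".
    for suffix in ("ing", "ers", "er", "s"):
        # Require at least three characters left so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("think", "thinking", "thinker"):
    print(stem(w))
# think, think, think
```

All three forms collapse to the single token think, which is exactly what lets later stages (such as word counting or indexing) treat them as one term.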