Cleaning and stemming text variables
Some variables in our dataset come from free text fields, which users complete manually. People have different writing styles, and we use a variety of punctuation marks, capitalization patterns, and verb conjugations to convey the content, as well as the emotions surrounding it. We can extract some information from text without reading it by computing statistics that summarize the text’s complexity, its keywords, and the relevance of words in a document. We discussed these methods in the previous recipes of this chapter. However, before deriving these statistics and aggregated features, we should clean the text variables.
Text cleaning or preprocessing involves punctuation removal, stop word elimination, character case setting, and word stemming. Punctuation removal consists of deleting characters that are not letters, numbers, or spaces; in some cases, we also remove numbers. The elimination of stop words...
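As a minimal sketch of the punctuation removal and case setting described above, the following uses only Python's standard `re` module. The function name `clean_text` and its parameters are hypothetical, chosen for illustration, and treating the underscore as punctuation is an assumption rather than a rule stated in the text.

```python
import re


def clean_text(text: str, drop_digits: bool = False, lowercase: bool = True) -> str:
    """Remove characters that are not letters, numbers, or whitespace.

    Hypothetical helper for illustration only.
    """
    # [^\w\s] matches anything that is not a word character or whitespace.
    # \w also matches the underscore, so we strip it explicitly (an assumption).
    cleaned = re.sub(r"[^\w\s]|_", "", text)

    # In some cases we also remove numbers.
    if drop_digits:
        cleaned = re.sub(r"\d", "", cleaned)

    # Character case setting: here we normalize everything to lowercase.
    if lowercase:
        cleaned = cleaned.lower()

    return cleaned


sample = "Hello, World! It's 2024... isn't it?"
print(clean_text(sample))
print(clean_text(sample, drop_digits=True))
```

Removing digits can leave consecutive spaces behind; whether to collapse them depends on how the text is tokenized afterward.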