Cleaning Text Data
A key component of all successful modeling exercises is a clean dataset that has been appropriately and sufficiently preprocessed for the specific data type and analysis being performed. Text data is no exception, as it is virtually unusable in its raw form. It does not matter what algorithm is being run: if the data isn't properly prepared, the results will be at best meaningless and at worst misleading. As the saying goes, "garbage in, garbage out." For topic modeling, the goal of data cleaning is to isolate the words in each document that could be relevant by removing everything that could be obstructive.
Data cleaning and preprocessing is almost always specific to the dataset, meaning that each dataset will require a unique set of cleaning and preprocessing steps selected to specifically handle the issues in the dataset being worked on. With text data, cleaning and preprocessing steps can include language filtering, removing URLs and screen names, lemmatizing, and stop...