First, we generate word clouds of the most frequent keywords in brand posts and in consumer comments across the whole dataset.
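A word cloud is drawn from keyword frequencies, so the counting step can be sketched on its own. The following is a minimal sketch using only the standard library. The function name `keyword_frequencies`, the simple regular-expression tokenizer, and the sample posts are assumptions made for illustration and are not taken from the dataset.

```python
import re
from collections import Counter


def keyword_frequencies(texts, top_n=20):
    """Return the top_n most common lowercase word tokens across texts."""
    counts = Counter()
    for text in texts:
        # Naive tokenizer (an assumption): runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)


# Illustrative input only; in practice this would be the post or comment texts.
sample_posts = ["New flavour out now", "Try the new flavour today"]
top_keywords = keyword_frequencies(sample_posts, top_n=2)
```

The resulting (word, count) pairs can be turned into a dictionary and handed to a plotting library. For instance, the `wordcloud` package accepts such a dictionary through `WordCloud.generate_from_frequencies`.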
In the following screenshot, you can see the most frequent keywords in brand posts:
In the following screenshot, you can see the most frequent keywords used in comments:
The keywords are clearly polluted by many comments on political and religious issues. Since we don't want our analysis to focus on these topics, we'll create a filtering method to remove the irrelevant words.
We define a list of keywords associated with comments considered noise in a global variable, CLEANING_LST. The list can also be saved in a file and loaded into the variable:
CLEANING_LST = ['gulf', 'd', 'ban', 'persic']  # ... plus any further noise keywords
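As a rough illustration of how such a list might be stored in a file, reloaded, and applied, here is a minimal sketch. The helper names (`save_cleaning_list`, `load_cleaning_list`, `remove_noise`), the JSON file format, the temporary file path, and the sample tokens are all assumptions. The list contains only the four entries shown above.

```python
import json
import os
import tempfile

# Only the entries shown in the text; the full list is longer.
CLEANING_LST = ['gulf', 'd', 'ban', 'persic']


def save_cleaning_list(words, path):
    """Write the noise keywords to a JSON file (format is an assumption)."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(words, f)


def load_cleaning_list(path):
    """Read the noise keywords back from the JSON file."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def remove_noise(tokens, noise_words):
    """Drop every token that matches a noise keyword, ignoring case."""
    noise = {w.lower() for w in noise_words}
    return [t for t in tokens if t.lower() not in noise]


# Hypothetical file location used only for this example.
path = os.path.join(tempfile.mkdtemp(), 'cleaning_list.json')
save_cleaning_list(CLEANING_LST, path)
loaded = load_cleaning_list(path)

# Illustrative tokens, not real data.
cleaned = remove_noise(['Gulf', 'coffee', 'ban', 'taste'], loaded)
```

Keeping the list in a file means new noise keywords can be added between iterations without editing the code. The set lookup keeps the filter fast on large token lists.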
Cleaning irrelevant words is an iterative process and you can add any other...