New constraints lead to new forms, and Twitter is no exception. Because the text has to fit into 280 characters, people naturally develop shortcuts to say the same thing in fewer characters. So far, we have ignored all the diverse emoticons and abbreviations. Let's see how much we can improve by taking them into account. For this endeavor, we will have to pass our own preprocessor() to TfidfVectorizer.
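Before building the real thing, it may help to see the shape of such a hook. A preprocessor is simply a callable that takes the raw text and returns a string, and TfidfVectorizer accepts one through its preprocessor argument. Because a custom preprocessor replaces the built-in lowercasing step, the function has to lower-case the text itself. The sketch below is only an illustration: the name toy_preprocessor and its two-entry table are hypothetical and are not the dictionary we define next.

```python
def toy_preprocessor(tweet):
    """Lower-case a tweet and swap two emoticons for plain words."""
    # Hypothetical two-entry table for illustration only;
    # it is not the emo_repl dictionary from the text.
    toy_repl = {":-)": " good ", ":-(": " bad "}
    text = tweet.lower()
    for emoticon, word in toy_repl.items():
        text = text.replace(emoticon, word)
    return text


# The callable could then be handed to the vectorizer, for example:
#   TfidfVectorizer(preprocessor=toy_preprocessor)
print(toy_preprocessor("Great phone :-)"))
```

The replacement words are padded with spaces so that they remain separate tokens even when an emoticon is glued to a neighboring word.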
First, we define a range of frequent emoticons and their replacements in a dictionary. Although we could choose more distinctive replacements, we stick with obviously positive or negative words to help the classifier:
emo_repl = {
    # positive emoticons
    "<3": " good ",
    ":d": " good ",  # :D in lower case
    ":dd": " good ",  # :DD in lower case
    "8)": "...