Cleaning tweets
New constraints lead to new forms, and Twitter is no exception. Because text has to fit into 140 characters, people naturally develop new language shortcuts to say the same thing in fewer characters. So far, we have ignored all the diverse emoticons and abbreviations. Let's see how much we can improve by taking them into account. For this endeavor, we will have to provide our own preprocessor() to TfidfVectorizer.
First, we define a range of frequent emoticons and their replacements in a dictionary. Although we could find more distinct replacements, we go with obvious positive or negative words to help the classifier:
emo_repl = {
    # positive emoticons
    "<3": " good ",
    ":d": " good ",  # :D in lower case
    ":dd": " good ",  # :DD in lower case
    "8)": " good ",
    ":-)": " good ",
    ":)": " good ",
    ";)": " good ",
    "(-:": " good ",
    "(:": " good ",
    # negative emoticons:
    ":/": " bad ",
    ":>": " sad ",
    ":')": " sad "...
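To make the idea concrete, here is a minimal sketch of how a replacement dictionary of this kind might be applied inside a preprocessor function. The names sample_repl, sample_order, and preprocess are hypothetical and chosen only for illustration, and the dictionary is a small subset rather than the full list. The sketch assumes that longer emoticons should be replaced first, so that ":dd" is not partially consumed by ":d".

```python
# Hypothetical subset of an emoticon dictionary, for illustration only.
sample_repl = {
    ":dd": " good ",
    ":d": " good ",
    ":)": " good ",
    ":/": " bad ",
}

# Assumption: replace longer emoticons first, so that ":dd" is matched
# as a whole instead of leaving a stray "d" behind after ":d" is replaced.
sample_order = sorted(sample_repl, key=len, reverse=True)


def preprocess(tweet):
    """Lower-case a tweet and substitute emoticons with sentiment words."""
    text = tweet.lower()
    for emoticon in sample_order:
        text = text.replace(emoticon, sample_repl[emoticon])
    return text


print(preprocess("Great game :DD"))
print(preprocess("Missed the bus :/"))
```

A function like this could then be handed to the vectorizer through its preprocessor parameter, for example TfidfVectorizer(preprocessor=preprocess). Because a custom preprocessor replaces the vectorizer's built-in lower-casing step, the sketch lower-cases the text itself before matching the lower-case dictionary keys.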