Preprocessing text data
Tweets frequently contain URLs, user mentions, and hashtags, so we first need to preprocess them as follows. All tokens are separated by a single space, and each tweet is lowercased.
URLs, user mentions, and hashtags are replaced by the <url>, <user>, and <hashtag> tokens, respectively. This step is done using the process function, which takes a tweet as input, tokenizes it using the NLTK TweetTokenizer, preprocesses it, and returns the list of tokens in the tweet:
import re
from nltk.tokenize import TweetTokenizer

def process(tweet):
    # Tokenize the tweet and rejoin the tokens with single spaces
    tknz = TweetTokenizer()
    tokens = tknz.tokenize(tweet)
    tweet = " ".join(tokens)
    tweet = tweet.lower()
    # Replace URLs, user mentions, and hashtags with placeholder tokens
    tweet = re.sub(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', '<url>', tweet)  # URLs
    tweet = re.sub(r'(?:@[\w_]+)', '<user>', tweet)  # user mentions
    tweet = re.sub(r'(?:#[\w_]+)', '<hashtag>', tweet)  # hashtags
    return tweet.split()
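As a quick sanity check, here is a minimal usage sketch; the sample tweet, URL, and user handle below are made up purely for illustration:

# Hypothetical example tweet to exercise all three replacements
print(process("Check out https://example.com @some_user #NLP"))
# ['check', 'out', '<url>', '<user>', '<hashtag>']

Replacing these elements with placeholder tokens, rather than deleting them, preserves the fact that a URL, mention, or hashtag occurred at that position while collapsing the unbounded variety of raw values into three shared vocabulary entries.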