7. Topic Modeling
Activity 7.01: Loading and Cleaning Twitter Data
Solution:
- Import the necessary libraries:
import warnings
warnings.filterwarnings('ignore')
import langdetect
import matplotlib.pyplot
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
import numpy
import pandas
import pyLDAvis
import pyLDAvis.sklearn
import regex
import sklearn
- Load the LA Times health Twitter data (latimeshealth.txt) from https://packt.live/2Xje5xF.
Note
Pay close attention to the delimiter (it is neither a comma nor a tab) and double-check the header status.
The code looks as follows:
path = 'latimeshealth.txt'
df = pandas.read_csv(path, sep="|", header=None)
df.columns = ["id", "datetime", "tweettext"]
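If you are unsure which delimiter a file uses or whether it has a header row, one way to check is to inspect the first few raw lines before calling pandas.read_csv. The following is a minimal sketch of that idea, not part of the original solution. The helper guess_delimiter and the two sample lines are hypothetical and stand in for the first lines of latimeshealth.txt; the real file contents may differ.

```python
# Hypothetical sample lines standing in for the top of the data file.
# In practice you could read them with:
#     with open(path) as f: sample_lines = [next(f).rstrip("\n") for _ in range(5)]
sample_lines = [
    "1001|Mon Jan 05 10:00:00 +0000 2015|Flu season, explained http://example.com/a",
    "1002|Mon Jan 05 11:30:00 +0000 2015|New study on sleep http://example.com/b",
]


def guess_delimiter(lines, candidates=("|", ",", "\t", ";")):
    """Return the candidate that occurs the same non-zero number of
    times on every line, preferring the one with the highest count.
    Returns None if no candidate qualifies."""
    best, best_count = None, 0
    for cand in candidates:
        counts = {line.count(cand) for line in lines}
        if len(counts) == 1:          # same count on every line
            n = counts.pop()
            if n > best_count:
                best, best_count = cand, n
    return best


delimiter = guess_delimiter(sample_lines)

# Rough header heuristic (an assumption): if the first field of the first
# line is purely numeric, it is probably a data row rather than a header.
first_field = sample_lines[0].split(delimiter)[0]
has_header = not first_field.isdigit()

print("delimiter:", repr(delimiter))
print("has header:", has_header)
```

With a consistent delimiter and a numeric first field, the result supports passing sep="|" and header=None, as in the solution code. Tweets often contain commas, which is why the check requires the count to match on every line rather than simply picking the most frequent character.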
- Run a quick exploratory analysis to ascertain the data size and structure:
def dataframe_quick_look(df, nrows): print("SHAPE:\n{shape}\n".format(shape...