Data pull and pre-processing
Once the crawling is finished, all the data is stored in the MongoDB database. We can now query the database and load all the posts into a pandas DataFrame:
import pandas as pd
from pymongo import MongoClient

client = MongoClient('HOST:PORT')
db = client.teamspeed
collection = db.forum_teamspeed

dataset = []
for element in collection.find():
    dataset.append(element)

df = pd.DataFrame(dataset)
At this stage, we will also create a new column called full_verbatim, where we concatenate the subject (thread title) and post content:
df['full_verbatim'] = df.apply(lambda x: x['subject'] + " " + x['post'], axis=1)
The thread title and the post are directly linked, so taken together they are likely to express a single thought of the forum user. Combining them helps us capture the broader, contextual meaning of the ideas expressed in forum posts.
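The row-wise concatenation above works when every document has both a subject and a post stored as strings. As a minimal sketch, assuming some documents might lack one of the two fields (the source does not say whether this happens), a vectorized variant could treat missing values as empty text. The sample rows and the helper name build_full_verbatim are hypothetical and only serve as illustration:

```python
import pandas as pd

# Hypothetical sample rows standing in for the crawled forum posts.
sample = pd.DataFrame({
    "subject": ["Brake upgrade", None, "Exhaust options"],
    "post": ["Which pads do you recommend?", "Thanks for the tip", None],
})


def build_full_verbatim(frame):
    """Join subject and post with a space, treating missing values as empty."""
    subject = frame["subject"].fillna("").astype(str)
    post = frame["post"].fillna("").astype(str)
    # str.strip() drops the stray space left when one side is empty.
    return (subject + " " + post).str.strip()


sample["full_verbatim"] = build_full_verbatim(sample)
```

Operating on whole columns avoids calling a Python function once per row, and filling missing values first prevents a type error when a string is added to a missing value.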
Data cleaning
Thereafter, as seen in the earlier chapters, we need to clean and structure...