For this example, we will use a dataset of references to news web pages collected by a news aggregator. The dataset covers four news categories: science and technology, business, entertainment, and health. The complete Jupyter Notebook for this example can be found at Chapter05/03_example.ipynb in this book's code repository.
We will first look at a sample of the data from this dataset:
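import pandas as pd

# The file is tab-separated and read without a header row, so column names are supplied explicitly
news_df = pd.read_csv('data/newsCorpora.csv', delimiter='\t', header=None,
                      names=['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP'])
# frac=1.0 returns every row in random order, which shuffles the dataset
news_df = news_df.sample(frac=1.0)
news_df.head(5)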
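To try the load-and-shuffle step without the real file, the following is a minimal sketch that reads a small in-memory stand-in instead of data/newsCorpora.csv. The three rows, the `SAMPLE_TSV` and `COLUMNS` names, the `load_and_shuffle` helper, and the `seed` parameter are all assumptions made for illustration; they are not part of the book's notebook, and the category values are placeholders rather than the dataset's real codes.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/newsCorpora.csv: tab-separated, no header row.
# Every value below is made up for illustration only.
SAMPLE_TSV = (
    "1\tTitle A\thttp://example.com/a\tPublisher A\tcat_a\tstory1\texample.com\t1000\n"
    "2\tTitle B\thttp://example.com/b\tPublisher B\tcat_b\tstory2\texample.com\t2000\n"
    "3\tTitle C\thttp://example.com/c\tPublisher C\tcat_c\tstory3\texample.com\t3000\n"
)

COLUMNS = ['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP']


def load_and_shuffle(source, seed=None):
    """Read a header-less, tab-separated source and return its rows in random order."""
    df = pd.read_csv(source, delimiter='\t', header=None, names=COLUMNS)
    # frac=1.0 keeps every row; random_state makes the shuffle repeatable when set.
    return df.sample(frac=1.0, random_state=seed)


demo_df = load_and_shuffle(io.StringIO(SAMPLE_TSV), seed=0)
print(demo_df[['ID', 'TITLE', 'CATEGORY']])
```

Passing a fixed `seed` is a design choice for reproducibility: without it, as in the original call, each run yields a different row order, so `head(5)` shows different rows every time.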
The dataset is displayed in table format as follows:
ID | TITLE | CATEGORY |
---|---|---|
225897... |