Understanding text datasets – loading, managing, and visualizing the Enron Email dataset
Another field that has grown considerably in DL in recent years is natural language processing (NLP). As with CV, this field aims to match or surpass human performance on real-world datasets.
In this recipe, we will explore one of the simplest NLP tasks: text classification. Given a set of sentences and paragraphs, our task is to assign each text to the correct label (class) from a given set.
One of the most classic text classification tasks is to determine whether a received email is spam or not (ham). Datasets for this task are binary text classification datasets: there are only two labels to assign, 0 and 1, or ham and spam.
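To make the binary labeling concrete, the following is a minimal sketch of how the two classes could be represented in code. The mapping of ham to 0 and spam to 1, the helper name `encode_labels`, and the three sample messages are all assumptions made for illustration; the messages are invented and are not taken from the Enron dataset.

```python
from collections import Counter

# Hypothetical label mapping; which class gets 0 or 1 is an illustrative choice.
LABEL_TO_ID = {"ham": 0, "spam": 1}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}

# Toy, made-up examples (not taken from the Enron dataset).
toy_emails = [
    ("Meeting moved to 3 pm, see agenda attached.", "ham"),
    ("You have won a free prize, click here now!", "spam"),
    ("Please review the quarterly report draft.", "ham"),
]


def encode_labels(samples):
    """Convert (text, label_name) pairs into (text, label_id) pairs."""
    return [(text, LABEL_TO_ID[label]) for text, label in samples]


encoded = encode_labels(toy_emails)

# Count how many samples fall into each class, keyed by label name.
class_counts = Counter(ID_TO_LABEL[label_id] for _, label_id in encoded)
print(class_counts)
```

Counting the samples per class, as done at the end of the sketch, is a quick way to check whether a binary dataset is balanced before training a classifier on it.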
In our specific scenario, we will use a real-world email dataset. This set of emails was made public by the US Government during the investigation of the Enron scandal in the early 2000s. This dataset was first published in 2004 and is composed of emails from ~150 users,...