Before diving into the machine learning (ML) problems in text classification, we will look at some of the open datasets available on the internet. Many classification tasks require large amounts of labeled text data. These datasets can be broadly grouped into binary-class, multi-class, and multi-label datasets. The following are some of the popular datasets used for benchmarking in research and in competitions such as those hosted on Kaggle:
| | Dataset name | Class type | Source |
|---|---|---|---|
| 1 | IMDb movie Dataset | Binary classes | |
| 2 | Twitter Sentiment Analysis Dataset | Binary classes | http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/ |
| 3 | YouTube Spam Collection Dataset | Binary classes | https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection |

...
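To make the three class types concrete, here is a minimal sketch in plain Python of how the targets for each type are commonly encoded. The label names and the two helper functions are hypothetical, invented for illustration, and are not taken from any of the datasets listed above.

```python
# Toy labels only: none of these values come from a real dataset.

def encode_single_label(labels):
    """Map each distinct label to an integer id (binary or multi-class)."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return classes, [index[name] for name in labels]


def encode_multi_label(label_sets):
    """Turn each set of tags into a 0/1 indicator row (multi-label)."""
    classes = sorted(set().union(*label_sets))
    rows = [[1 if name in tags else 0 for name in classes] for tags in label_sets]
    return classes, rows


# Binary classes: every document has exactly one of two labels.
binary_classes, binary_y = encode_single_label(["spam", "ham", "spam"])

# Multi-class: every document has exactly one of several labels.
multi_classes, multi_y = encode_single_label(
    ["sports", "politics", "tech", "sports"]
)

# Multi-label: every document may carry any number of labels at once.
tag_classes, tag_y = encode_multi_label(
    [{"python", "nlp"}, {"nlp"}, {"python", "web"}]
)

print(binary_classes, binary_y)
print(multi_classes, multi_y)
print(tag_classes, tag_y)
```

The practical difference is in the shape of the target: binary and multi-class problems have one integer per document, while a multi-label problem has one indicator row per document with a column for each possible label.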