Real-world text datasets
Kaggle is an online community platform for data scientists and machine learning enthusiasts, and it hosts thousands of real-world datasets. Pluto found a little over 2,900 NLP datasets there and selected two of them for this chapter.
In Chapter 2, Pluto uses the Netflix and Amazon datasets as examples for understanding biases. Pluto keeps the Netflix NLP dataset because the movie reviews are curated. There are a few syntactical errors, but overall, the input texts are of high quality.
The second NLP dataset is Twitter Sentiment Analysis (TSA). The 29,530 real-world tweets contain many grammatical errors and misspelled words. The challenge is to classify the tweets into two categories: (1) normal or (2) racist and sexist.
The dataset was published in 2021 by Mayur Dalvi, and the license is CC0: Public Domain, https://creativecommons.org/publicdomain/zero/1.0/.
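Before working with a two-category dataset such as TSA, it can help to check how many tweets fall into each category. The following is a minimal sketch using only the Python standard library. The column names (`label`, `tweet`), the 0/1 label encoding, and the sample rows are assumptions made for illustration; they are not taken from the actual dataset file.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the dataset file. The column names, the
# 0/1 label encoding, and these rows are illustrative assumptions only.
SAMPLE_CSV = """label,tweet
0,what a lovely mornin at the beach
0,cant wait for the weekend!!
1,<placeholder for an offensive tweet>
0,new phone arrived today
"""


def count_labels(csv_text, label_column="label"):
    """Count how many rows carry each value of the label column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


counts = count_labels(SAMPLE_CSV)
total = sum(counts.values())

# Report the share of each category to reveal any class imbalance.
for label, n in sorted(counts.items()):
    print(f"label {label}: {n} tweets ({n / total:.0%})")
```

To run the same check on a downloaded file, pass the file's contents to `count_labels` and adjust `label_column` to match the actual header.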
After selecting the two NLP datasets, you can use the...