Hands-on exercise
For this example, I used a dataset created by Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith and described in their paper, The Multilingual Amazon Reviews Corpus, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, https://arxiv.org/abs/2010.02573. The data files include thousands of reviews of Amazon products.
This example will process the reviews to find out which nouns are more frequent in positive reviews and which are more frequent in negative ones. While this is quite a basic approach, it could be the first iteration of an application that tells us which products our company should focus on and which ones we should consider dropping.
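To make the idea concrete before touching the real data, here is a minimal sketch of the counting step. It is not the script used in this exercise. It assumes each review is a dictionary with a stars rating and a review_body text field, that four or more stars means positive and two or fewer means negative, and it replaces real part-of-speech tagging with a tiny hand-written noun list (SAMPLE_NOUNS, a hypothetical stand-in). The reviews in the sample are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real part-of-speech tagger: a tiny noun list.
SAMPLE_NOUNS = {"battery", "screen", "price", "charger", "box"}


def noun_counts(reviews, nouns=SAMPLE_NOUNS, positive_min=4, negative_max=2):
    """Count listed nouns separately for positive and negative reviews.

    The star thresholds are assumptions; reviews that fall between them
    (3 stars by default) are treated as neutral and ignored.
    """
    positive, negative = Counter(), Counter()
    for review in reviews:
        # Lowercase the text and keep only alphabetic tokens.
        tokens = re.findall(r"[a-z]+", review["review_body"].lower())
        found = [token for token in tokens if token in nouns]
        if review["stars"] >= positive_min:
            positive.update(found)
        elif review["stars"] <= negative_max:
            negative.update(found)
    return positive, negative


# Made-up sample reviews, only to show the shape of the input.
reviews = [
    {"stars": 5, "review_body": "Great screen and a fair price."},
    {"stars": 1, "review_body": "The battery died and the charger broke."},
    {"stars": 4, "review_body": "Battery lasts long, screen is sharp."},
    {"stars": 3, "review_body": "The box arrived on time."},
]

positive, negative = noun_counts(reviews)
print(positive.most_common(3))
print(negative.most_common(3))
```

In a real run, the noun list would be replaced by a proper tagger, and the counts would be normalized by the number of reviews in each group so that the two sides are comparable.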
To create the input file for our exercise, I downloaded all six JSON files from the /dev directory of the dataset, available at https://docs.opendata.aws/amazon-reviews-ml/readme.html, which requires an AWS account to access. I created a small Python script...