Text Analysis Is All You Need
In this chapter, we will learn how to analyze text data and build machine learning models on top of it. We will use the dataset from the Jigsaw Unintended Bias in Toxicity Classification competition (see Reference 1). The objective of this competition was to build models that detect toxicity while reducing unintended bias toward minorities that might be wrongly associated with toxic comments. We use this competition to introduce the field of Natural Language Processing (NLP).
The data used in the competition originates from the Civil Comments platform, which was founded by Aja Bogdanoff and Christa Mrgan in 2015 (see Reference 2) with the aim of solving the problem of civility in online discussions. When the platform closed in 2017, the founders chose to preserve around 2 million comments for researchers who want to understand and improve civility in online conversations. Jigsaw sponsored this effort and then started a competition for language toxicity classification...