What is in the data?
The Jigsaw Unintended Bias in Toxicity Classification competition dataset contains 1.8 million rows in the training set and 97,300 rows in the test set. The test data contains only a comment column and has no target column (the value to predict). Besides the comment column, the training data contains another 43 columns, including the target feature. The target is a number between 0 and 1 and is the annotation that models must predict in this competition. It represents the degree of toxicity of a comment (0 means no toxicity and 1 means maximum toxicity). The other 42 columns are flags related to the presence of certain sensitive topics in the comments. These topics fall into five categories: race and ethnicity, gender, sexual orientation, religion, and disability. In more detail, these are the flags for each of the five categories:
- Race and ethnicity: