Generating a synthetic dataset for text classification problems
In this recipe, we will generate a synthetic dataset for a binary text classification problem. The dataset has two primary fields: a text field containing a statement in string format, and a target label that specifies whether the text is POSITIVE or NEGATIVE.
In Figure 8.2, we can see that the sentences with the POSITIVE tag have the __label__positive label while the sentences with the NEGATIVE tag have the __label__negative label. We will use this dataset to train and deploy a BlazingText model in the next recipes to solve a sentiment analysis requirement.
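To make the tag-to-label mapping concrete, here is a minimal sketch of how such rows could be produced. The word lists, the function names (to_label, generate_rows, to_labeled_line), and the alternating-tag scheme are all assumptions made for illustration and are not the recipe's actual code. The only details taken from the text are the two tags and their __label__ forms.

```python
import random

# Hypothetical vocabulary; the recipe's actual word lists are not shown here.
SUBJECTS = ["the food", "the phone", "the service"]
POSITIVE_PHRASES = ["tastes great", "works well", "looks amazing"]
NEGATIVE_PHRASES = ["tastes awful", "breaks easily", "looks terrible"]

LABEL_PREFIX = "__label__"


def to_label(tag):
    """Map a POSITIVE/NEGATIVE tag to its __label__ form."""
    return LABEL_PREFIX + tag.lower()


def generate_rows(n, seed=0):
    """Return n (tag, text) pairs, alternating between the two tags."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for i in range(n):
        tag = "POSITIVE" if i % 2 == 0 else "NEGATIVE"
        phrases = POSITIVE_PHRASES if tag == "POSITIVE" else NEGATIVE_PHRASES
        text = f"{rng.choice(SUBJECTS)} {rng.choice(phrases)}"
        rows.append((tag, text))
    return rows


def to_labeled_line(tag, text):
    """Format one row as '<label> <text>', one example per line."""
    return f"{to_label(tag)} {text}"


rows = generate_rows(4)
lines = [to_labeled_line(tag, text) for tag, text in rows]
print("\n".join(lines))
```

Each printed line starts with either __label__positive or __label__negative, followed by a space and the statement text.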
Getting ready
A SageMaker Studio notebook running the Python 3 (Data Science) kernel is the only prerequisite for this recipe.
How to do it…
The first steps in this recipe focus on generating a list of POSITIVE and NEGATIVE...