Synthetic data generation for classification problems
In this recipe, we will generate a synthetic dataset using scikit-learn. It will serve as a dummy dataset for the classification problems in this chapter. The dataset has only three columns: label, a, and b. Figure 5.3 shows a scatterplot of the dataset, with the two groups of points distinguished by their label values.
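To make the shape of this dataset concrete, here is a minimal sketch of how a two-class dataset with the columns label, a, and b could be produced. It assumes scikit-learn's make_blobs as the generator, which may differ from the function used in the recipe steps; the generate_dataset helper, the sample count, the cluster spread, and the random seed are illustrative choices, not values taken from this chapter.

```python
import pandas as pd
from sklearn.datasets import make_blobs


def generate_dataset(n_samples=1000, random_state=0):
    """Return a DataFrame with the columns label, a, and b.

    Hypothetical helper: make_blobs and every parameter value here
    are assumptions made for illustration.
    """
    # Two Gaussian clusters in two dimensions; y holds the cluster index (0 or 1).
    X, y = make_blobs(
        n_samples=n_samples,
        n_features=2,
        centers=2,
        cluster_std=2.0,
        random_state=random_state,
    )
    # Put the label first, followed by the two feature columns.
    return pd.DataFrame({"label": y, "a": X[:, 0], "b": X[:, 1]})


df = generate_dataset()
print(df.head())
```

Changing cluster_std or the number of centers is one way to make the two groups overlap more or less, which is useful when we want a harder or easier classification problem.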
We will split this dataset into training, validation, and test sets using a train-test split and upload them to an Amazon S3 bucket. Once these are ready, we can run ML experiments with SageMaker Debugger and SageMaker Experiments in the recipes that follow in this chapter.
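The split and upload described above can be sketched as follows. This is only an illustration under stated assumptions: the 60/20/20 proportions, the stand-in data from make_blobs, the upload_splits helper, and the CSV file names are all hypothetical, and the bucket name and prefix must be supplied by us. The upload function is defined but not called, since it needs AWS credentials and an existing bucket.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Stand-in data with the same three columns (assumed generator and sizes).
X, y = make_blobs(n_samples=1000, centers=2, random_state=0)
df = pd.DataFrame({"label": y, "a": X[:, 0], "b": X[:, 1]})

# A single train-test split yields two sets, so we apply it twice:
# first hold out 20% for testing, then take 25% of the remainder
# for validation (0.25 * 0.8 = 0.2 of the full dataset).
train_val, test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=0
)
train, val = train_test_split(
    train_val, test_size=0.25, stratify=train_val["label"], random_state=0
)


def upload_splits(bucket, prefix, splits):
    """Write each split to a local CSV file and upload it to S3.

    Hypothetical helper: bucket and prefix are placeholders we provide.
    """
    import boto3  # imported here so the split above runs without boto3

    s3 = boto3.client("s3")
    for name, frame in splits.items():
        filename = f"{name}.csv"
        frame.to_csv(filename, index=False)
        s3.upload_file(filename, bucket, f"{prefix}/{filename}")


# Example call (requires AWS credentials and an existing bucket):
# upload_splits("<bucket-name>", "<prefix>",
#               {"training": train, "validation": val, "test": test})
```

Stratifying on the label column keeps the class balance roughly the same across the three sets, which matters most when the dataset is small or imbalanced.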
Tip
Because this recipe walks through the steps for generating a synthetic dataset, we can tweak it later to fit our needs. We can decide to make this dataset...