Now that we have completed our discussion of feature engineering, the next step is to obtain a dataset. For some problems, this can be very difficult. For instance, when you are attempting to predict something that no one else has predicted before, or are working in an emerging sector, a training set is much harder to come by than, say, the malicious files were for our previous example.
Another aspect to consider is diversity and how the data is broken out across classes. For instance, consider how you would predict malicious Android applications based on behavioral analysis using the anomaly detection trainer that ML.NET provides. When building your dataset, keep in mind that, I would argue, most Android users do not have a device on which half of the apps are malicious. Therefore, an even (50/50) split of malicious and benign samples in the training and test sets might over-represent malicious applications and skew the model toward flagging them. Figuring out and analyzing the actual representation of what your target users will encounter...
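To make the class-balance point concrete, here is a minimal sketch in plain Python (not ML.NET) of downsampling the malicious class so that it makes up a chosen fraction of the dataset. The function name `resample_to_ratio`, the label encoding, and the 5% figure are all assumptions made for illustration; the real fraction should come from whatever you can measure about your target users.

```python
import random


def resample_to_ratio(benign, malicious, malicious_fraction, seed=0):
    """Downsample `malicious` so it is roughly `malicious_fraction` of the result.

    Hypothetical helper for illustration: returns a shuffled list of
    (sample, label) pairs, with label 0 for benign and 1 for malicious.
    """
    if not 0 < malicious_fraction < 1:
        raise ValueError("malicious_fraction must be strictly between 0 and 1")

    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Solve m / (m + len(benign)) == malicious_fraction for m.
    target = round(len(benign) * malicious_fraction / (1 - malicious_fraction))
    # We can only downsample, so never ask for more than we have.
    target = min(target, len(malicious))

    sampled = rng.sample(malicious, target)
    dataset = [(x, 0) for x in benign] + [(x, 1) for x in sampled]
    rng.shuffle(dataset)
    return dataset


# Start from an even 50/50 pool and reduce it to an assumed 5% malicious rate.
benign_apps = [f"benign_{i}" for i in range(950)]
malicious_apps = [f"malicious_{i}" for i in range(950)]
dataset = resample_to_ratio(benign_apps, malicious_apps, malicious_fraction=0.05)
malicious_count = sum(label for _, label in dataset)
print(len(dataset), malicious_count)
```

Downsampling the over-represented class is only one option; collecting more benign samples achieves the same ratio without discarding data, when that is feasible.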