Summary
In this chapter, you surveyed the universe with the Exoplanet dataset to discover new planets, and potentially new life. You built multiple XGBClassifiers to predict whether stars host exoplanets based on periodic changes in light. With only 37 exoplanet stars and 5,050 non-exoplanet stars, you corrected the imbalanced data by undersampling, oversampling, and tuning XGBoost hyperparameters, including scale_pos_weight.
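As a reminder of where a starting value for scale_pos_weight can come from, the following minimal sketch applies a common heuristic, the ratio of negative to positive cases, to the class counts quoted above. The variable names are illustrative, and the result is a starting point for tuning rather than a value prescribed by this chapter.

```python
# Class counts quoted in the summary above.
n_positive = 37      # exoplanet stars
n_negative = 5050    # non-exoplanet stars

# Common heuristic: weight the positive class by the
# negative-to-positive ratio so both classes contribute
# comparably to the loss.
scale_pos_weight = n_negative / n_positive

print(round(scale_pos_weight, 2))
```

A value computed this way would typically be passed as XGBClassifier(scale_pos_weight=...) and then adjusted alongside the other hyperparameters.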
You analyzed results using the confusion matrix and the classification report. You learned the key differences between classification scoring metrics and why, for the Exoplanet dataset, accuracy is virtually worthless, while high recall is ideal, especially when combined with high precision for a good F1 score. Finally, you realized the limitations of machine learning models when the data is extremely varied and imbalanced.
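To make the point about accuracy concrete, here is a small sketch that scores a hypothetical baseline that labels every star as a non-exoplanet star. It assumes the class counts quoted earlier (37 positive, 5,050 negative) are scored as a whole, and the helper name score_from_counts is invented for this illustration. It is not a result from the chapter's models.

```python
def score_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Define precision and recall as 0.0 when their denominators are zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical baseline: predict "no exoplanet" for every star.
# All 37 exoplanet stars become false negatives.
accuracy, precision, recall, f1 = score_from_counts(tp=0, fp=0, fn=37, tn=5050)

print(f"accuracy={accuracy:.4f} recall={recall:.1f} f1={f1:.1f}")
```

The baseline is correct on more than 99% of stars while finding none of the exoplanet stars, which is why recall and the F1 score are the more informative metrics here.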
After this case study, you have the necessary background and skills to fully analyze imbalanced datasets with XGBoost using scale_pos_weight...