Feature preparation
In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods for feature extraction as well as their implementation on Apache Spark. All the techniques discussed there can be applied to our datasets here, especially those that use time series to create new features.
As mentioned earlier, for this project we have a categorical target variable, student attrition, and a lot of data on demographics, behavior, performance, and interventions. The demographic data is almost ready to use but still needs to be merged. The following table gives a partial list of its features:
| FEATURE NAME | Description |
|---|---|
| | These are the average ACT scores |
| | This is the age |
| | This is the student's county unemployment rate |
| | This is a first-generation student indicator using the "Y/N" options |
| | This is the high school GPA |
| | This is an indicator of the type of high school |
| ... | |
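As a rough illustration of the preparation these features may need, the following plain-Python sketch recodes a "Y/N" indicator as a numeric flag and merges the demographic records with a second table by student. Every name in it (`student_id`, `first_gen`, `hs_gpa`, `term_gpa`, and the helper functions) is a hypothetical stand-in, since the actual column names are not given here, and on Spark the same steps would be written as DataFrame operations rather than Python loops.

```python
# Hypothetical demographic records; all field names are illustrative assumptions.
demographics = [
    {"student_id": 1, "first_gen": "Y", "hs_gpa": 3.4},
    {"student_id": 2, "first_gen": "N", "hs_gpa": 2.9},
    {"student_id": 3, "first_gen": "", "hs_gpa": 3.8},
]

# Hypothetical second table (for example, performance data), keyed by student_id.
performance = {1: {"term_gpa": 3.1}, 2: {"term_gpa": 2.5}}


def encode_yes_no(value):
    """Map 'Y'/'N' (case-insensitive) to 1/0; any other value becomes None."""
    mapping = {"Y": 1, "N": 0}
    return mapping.get(str(value).strip().upper())


def prepare(records):
    """Return copies of the records with the Y/N indicator as a numeric flag."""
    prepared = []
    for row in records:
        new_row = dict(row)  # copy so the input records are left unchanged
        new_row["first_gen"] = encode_yes_no(row.get("first_gen"))
        prepared.append(new_row)
    return prepared


def merge_on_id(records, lookup, key="student_id"):
    """Left join: keep every record and add the matching fields, if any."""
    return [{**row, **lookup.get(row[key], {})} for row in records]


features = prepare(demographics)
merged = merge_on_id(features, performance)
```

Unrecognized indicator values are mapped to `None` rather than guessed, so that missing data stays visible for later treatment. The merge keeps students who have no match in the second table, which is one possible choice; whether such rows should be kept or dropped depends on the project.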