Data and feature preparation
In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods for feature extraction and discussed their implementation in Apache Spark. All of the techniques discussed there can be applied to the risk scoring project here.
For this project, as mentioned earlier, the main concern is to organize everything into workflows for repeatability and, where possible, automation. We will therefore adopt OpenRefine for data and feature preparation, using it within the DataScientistWorkbench environment, where it has already been integrated.
OpenRefine
OpenRefine, formerly Google Refine, is an open source application for data cleaning.
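To give a feel for the kind of cleaning such a tool automates, here is a minimal Python sketch of key-based clustering of inconsistent text values, loosely modeled on the fingerprint idea behind OpenRefine's key collision clustering. It is a simplified illustration, not OpenRefine's actual implementation; the function names `fingerprint` and `cluster` and the sample company names are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a simplified normalization key for a text value.

    Assumed steps (a simplification, not OpenRefine's exact algorithm):
    trim, lowercase, drop ASCII punctuation, then sort the unique tokens.
    """
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a key; keep only groups with real variants."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    # A group is interesting only if it holds more than one distinct spelling.
    return {key: vs for key, vs in groups.items() if len(set(vs)) > 1}


# Hypothetical sample data, made up for illustration only.
names = ["Acme Inc.", "acme inc", "ACME  Inc", "Other Corp"]
clusters = cluster(names)
```

With the sample values above, the three spellings of the first company fall into one group, which could then be merged into a single canonical value. In OpenRefine itself, this kind of operation is performed interactively rather than in code.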
To use OpenRefine, please go to: https://datascientistworkbench.com/
After logging in, you will see the following screen:
Then, please click the OpenRefine button in the upper-right corner of the screen:
Here, you can import datasets from your computer or from a URL address.
Then you can create an OpenRefine...