Getting familiar with your data
Although we strongly discourage it, you can build a model without knowing your data. It will most likely take you longer, and the quality of the resulting model may be less than optimal, but it is doable.
Note
In this section, we will use the dataset we downloaded from http://packages.revolutionanalytics.com/datasets/ccFraud.csv. We did not alter the dataset itself, but we gzipped it and uploaded it to http://tomdrabas.com/data/LearningPySpark/ccFraud.csv.gz. Please download the file first and save it in the same folder that contains your notebook for this chapter.
The head of the dataset looks as follows:
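To inspect those first rows yourself, here is a minimal sketch that reads the gzipped file directly with Python's standard library rather than Spark. It assumes the file was saved as ccFraud.csv.gz next to your notebook; the helper name peek_head and the DATA_PATH constant are our own illustrative choices, not part of any library.

```python
import csv
import gzip
import os
from itertools import islice

# Assumed location: the downloaded file sits next to the notebook.
DATA_PATH = "ccFraud.csv.gz"


def peek_head(path, n=5):
    """Return the header row and the first n data rows of a gzipped CSV.

    Hypothetical helper for illustration only.
    """
    # Text mode ("rt") decompresses on the fly; newline="" is what the
    # csv module expects from the file object it reads.
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)           # first line holds the column names
        rows = list(islice(reader, n))  # read at most n further lines
    return header, rows


# Only attempt the read if the file has actually been downloaded.
if os.path.exists(DATA_PATH):
    header, rows = peek_head(DATA_PATH)
    print(header)
    for row in rows:
        print(row)
```

Because only the first few lines are decompressed, this is quick even for a large file. All values come back as strings, so any numeric interpretation is left to the later modeling steps.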
Any serious data scientist or data modeler will therefore become acquainted with the dataset before starting any modeling. We normally start with some descriptive statistics to get a feel for what we are dealing with.
Descriptive statistics
Descriptive statistics, in the simplest sense, will tell you the basic information about...