Examining data statistics
When Amazon ML created the datasource, it carried out a basic statistical analysis of the different variables. Depending on the type of each variable, it computed the following information:
- Correlation of each attribute to the target
- Number of missing values
- Number of invalid values
- Distribution of numeric variables with a histogram and box plot
- Range, mean, and median for numeric variables
- Most and least frequent categories for categorical variables
- Word counts for text variables
- Percentage of true values for binary variables
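To make these statistics concrete, here is a minimal sketch that reproduces a few of them locally in plain Python. It is not how Amazon ML computes them internally; the toy rows, their values, and the helper names (numeric_summary, categorical_summary, percent_true) are assumptions made up for illustration only.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy rows; the values are made up for illustration only.
rows = [
    {"age": 22.0, "sex": "male", "survived": 0},
    {"age": 38.0, "sex": "female", "survived": 1},
    {"age": None, "sex": "female", "survived": 1},
    {"age": 35.0, "sex": "male", "survived": 0},
    {"age": 54.0, "sex": "male", "survived": 0},
]


def numeric_summary(rows, column):
    """Missing count, range, mean, and median of a numeric column."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


def categorical_summary(rows, column):
    """Most and least frequent categories of a categorical column."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    ranked = counts.most_common()  # sorted from most to least frequent
    return {"most_frequent": ranked[0][0], "least_frequent": ranked[-1][0]}


def percent_true(rows, column):
    """Percentage of true (1) values in a binary column."""
    values = [r[column] for r in rows if r[column] is not None]
    return 100.0 * sum(values) / len(values)


age_stats = numeric_summary(rows, "age")
sex_stats = categorical_summary(rows, "sex")
survived_pct = percent_true(rows, "survived")

print(age_stats)
print(sex_stats)
print(survived_pct)
```

Running the same kind of summary on your own file before uploading it is a quick way to cross-check the figures shown on the data summary page.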
Go to the Datasource dashboard and click on the datasource you just created to access the data summary page. The left-side menu lets you access data statistics for the target and the different attributes, grouped by data type. The following screenshot shows data insights for the Numeric attributes. The age and fare variables are worth looking at more closely:
Two things stand out:
- age has 20% missing values. We should replace these missing values by the mean...
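As an illustration of mean imputation, the following sketch fills missing entries in a list of ages with the mean of the observed values. It assumes the replacement is done in a preprocessing step outside Amazon ML; the sample ages and the impute_mean helper are hypothetical and not taken from the dataset.

```python
from statistics import mean

# Hypothetical ages with one value in five (20%) missing; values are made up.
ages = [22.0, 38.0, None, 26.0, 34.0]


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


filled_ages = impute_mean(ages)
print(filled_ages)
```

Note that the mean is computed from the observed values only, so the imputed entries leave the column mean unchanged.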