Exploratory data analysis
First, we want to see how many individuals of each class we have. This matters because a heavily imbalanced class distribution (1 to 100, for example) makes classification models hard to train: a model can reach high accuracy just by always predicting the majority class. You can access data frame columns using dot notation. For example, df.label returns the label column as a pandas Series, the one-dimensional counterpart of a data frame. Both series and data frames provide many useful methods for calculating summary statistics. The value_counts() method returns the number of occurrences of each unique value in the series:
In []:
df.label.value_counts()
Out[]:
platyhog       520
rabbosaurus    480
Name: label, dtype: int64
The class distribution looks okay for our purposes. Now let's explore the features.
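To put a number on the class balance discussed above, here is a minimal sketch. It assumes a stand-in data frame that contains only a label column, built to match the counts shown in the output (520 and 480). The names counts, shares, and imbalance_ratio are illustrative and do not come from the original dataset or code.

```python
import pandas as pd

# Stand-in for the chapter's data frame (assumption): only the label column,
# constructed to reproduce the counts shown above.
df = pd.DataFrame({"label": ["platyhog"] * 520 + ["rabbosaurus"] * 480})

# Dot notation returns a pandas Series, not a data frame.
labels = df.label

# Absolute counts per class.
counts = labels.value_counts()

# Relative frequencies per class; normalize=True makes them sum to 1.
shares = labels.value_counts(normalize=True)

# Ratio of the largest class to the smallest; values close to 1 mean balanced.
imbalance_ratio = counts.max() / counts.min()

print(type(labels).__name__)
print(shares)
print(round(imbalance_ratio, 2))
```

A ratio close to 1 indicates roughly balanced classes, while a ratio near 100 would correspond to the problematic 1 to 100 case mentioned earlier.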
We need to group our data by class and calculate feature statistics for each group separately to see how the creature classes differ. This can be done using the groupby() method. It takes the label of the column by which you want to group your data:
In [...