Exploring the data with standard statistics
Now that the decision tree has given us a compass, let's explore the data for further insights that might help us filter it better. You can find this content in Chapter11/Exploration.ipynb.
How to do it…
- We start, as usual, with the necessary imports:
    import gzip
    import pickle
    import random
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from pandas.plotting import scatter_matrix
    %matplotlib inline
- Then we load the data. We will use pandas to navigate it:
    fit = np.load(gzip.open('balanced_fit.npy.gz', 'rb'))
    ordered_features = np.load(open('ordered_features', 'rb'))
    num_features = len(ordered_features)
    fit_df = pd.DataFrame(fit, columns=ordered_features + ['pos', 'error'])
    num_samples = 80
    del fit
- Let's ask pandas to show a histogram of all annotations:
    fig, ax = plt.subplots(figsize=(16, 9))
    fit_df.hist(column=ordered_features, ax=ax)
The following histogram is generated:
Histogram of all annotations for a dataset...
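If you do not have the recipe's data files at hand, the load-and-plot steps can be tried on synthetic data. The following is only a minimal sketch: the annotation names (`ann_a`, `ann_b`, `ann_c`), the random values, and the in-memory buffer are hypothetical stand-ins for `ordered_features` and `balanced_fit.npy.gz`, and the `Agg` backend replaces `%matplotlib inline` so that it also runs outside a notebook. It adds a call to `describe()`, which is not part of the recipe, to print the usual summary statistics next to the histograms.

```python
import gzip
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import numpy as np
import pandas as pd

# Hypothetical annotation names standing in for ordered_features.
demo_features = ["ann_a", "ann_b", "ann_c"]

# Random values standing in for the real array: one column per
# annotation plus the two trailing columns, 'pos' and 'error'.
rng = np.random.default_rng(0)
demo_fit = rng.normal(size=(200, len(demo_features) + 2))

# Round-trip through a gzip-compressed .npy stream, mirroring how the
# recipe reads its file (here an in-memory buffer replaces the file).
buffer = io.BytesIO()
with gzip.open(buffer, "wb") as handle:
    np.save(handle, demo_fit)
buffer.seek(0)
with gzip.open(buffer, "rb") as handle:
    loaded = np.load(handle)

demo_df = pd.DataFrame(loaded, columns=demo_features + ["pos", "error"])

# Summary statistics (count, mean, std, quartiles) for each annotation.
summary = demo_df[demo_features].describe()
print(summary)

# One histogram per annotation; pandas creates the figure and the grid.
axes = demo_df.hist(column=demo_features, figsize=(16, 9))
```

Here `figsize` is passed directly to `hist` rather than a pre-built `ax`, so pandas lays out the grid of subplots itself; the recipe's `ax=ax` form gives a similar picture.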