Data representation in scikit-learn
In contrast to the heterogeneous domains and applications of machine learning, the data representation in scikit-learn is less diverse, and the basic format that many algorithms expect is straightforward—a matrix of samples and features.
The underlying data structure is a numpy ndarray. Each row in the matrix corresponds to one sample and each column to the value of one feature.
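As a minimal sketch of this layout, the following builds a small array by hand. The values and the variable names (X, first_sample, second_feature) are made up for illustration and are not taken from any real dataset.

```python
import numpy as np

# Hypothetical toy data: 3 samples (rows), 2 features (columns).
X = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
])

# The shape is (number of samples, number of features).
n_samples, n_features = X.shape

first_sample = X[0]        # all feature values of the first sample
second_feature = X[:, 1]   # the second feature across all samples

print(n_samples, n_features)
print(first_sample)
print(second_feature)
```

Indexing a row returns one sample, while slicing a column returns one feature for every sample.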
There is something like a Hello World in the world of machine learning datasets as well: the Iris dataset, whose origins date back to 1936. With the standard installation of scikit-learn, you already have access to a couple of datasets, including Iris, which consists of 150 samples, each made up of four measurements, taken from three different Iris flower species:
>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
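One way to check the numbers quoted above is to inspect the loaded object. This is a small sketch that assumes scikit-learn is installed; it relies on the data, target, and target_names attributes of the object returned by load_iris, and the names X and y are chosen here only for convenience.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix (samples x features) and integer-coded species labels.
X, y = iris.data, iris.target

print(X.shape)            # (number of samples, number of measurements)
print(np.unique(y))       # the distinct class labels
print(iris.target_names)  # the species names behind those labels
```

The first dimension of the shape is the number of samples and the second is the number of measurements per sample.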
The dataset is packaged as a Bunch, which is only a thin wrapper around a dictionary:
...