This is an introductory post on scikit-learn where we will learn basic terminology and functionality of this amazing Python package. We will also explore basic principles of machine learning and how machine learning can be done with sklearn.
Dependencies
numpy + scipy + sklearn
Installation
Mac - pip install -U numpy scipy scikit-learn
Linux - sudo apt-get install build-essential python-dev python-setuptools python-numpy python-scipy libatlas-dev libatlas3gf-base, followed by pip install -U scikit-learn
After you have installed sklearn and all its dependencies, you are ready to dive further.
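A quick way to check that everything is in place is to import the packages and print their versions:
import sklearn
import numpy
import scipy
print(sklearn.__version__, numpy.__version__, scipy.__version__)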
Input data
Most machine learning algorithms implemented in sklearn expect the input data in the form of a numpy array of shape [nSamples, nFeatures].
nSamples is the number of samples in the data. Each sample is an observation or an instance of the data. A sample can be a text document, a picture, a row in a database or a csv file – anything you can describe with a fixed set of quantitative traits.
nFeatures is the number of features or distinct traits that describe each sample quantitatively. Features can be real-valued, boolean or discrete.
The data can be very high-dimensional, e.g., hundreds of thousands of features, and it can be sparse, meaning that most of the feature values are zero.
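As a toy illustration (with made-up numbers), a dataset of 3 samples, each described by 2 features, say height in cm and weight in kg, would be stored as:
import numpy as np
data = np.array([[170.0, 65.0],
                 [182.0, 80.0],
                 [160.0, 52.0]])
data.shape
>> (3, 2)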
Example
As an example, we will look at the Iris dataset, which comes with sklearn and every other ML package that I know of!
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
Y = iris.target
How many samples and features does this dataset have?
Since the input data is a numpy array, we can access its shape using the following:
nSamples = X.shape[0]
nFeatures = X.shape[1]
>> nSamples = 150
>> nFeatures = 4
This dataset has 150 samples, where each sample has 4 features. Let's look at the names of the target classes:
iris.target_names
>> array(['setosa','versicolor', 'virginica'], dtype='|S10')
To get a better idea of the data, let's look at a sample:
X[0]
>> array([5.1, 3.5, 1.4, 0.2])
Y[0]
>> 0
The data is given as a numpy array of shape (150, 4) which consists of the measurements of physical traits for three species of irises. The features are:
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
The target values {0, 1, 2} denote the three species:
- 0: setosa
- 1: versicolor
- 2: virginica
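We can read the feature names directly off the dataset object:
iris.feature_names
>> ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']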
Here is the basic idea of machine learning: we learn a model from example data and then use that model to make predictions about new, unseen data.
The basic setting for a supervised machine learning model is as follows: we are given labeled training data, i.e., samples together with their target values, and the goal is to learn a mapping from the features of a sample to its target value, so that we can predict the targets of previously unseen samples.
Supervised learning is further broken down into two categories: classification, where the target is a discrete label (e.g., an iris species), and regression, where the target is a continuous value.
There are various machine learning methods that can be used to build a supervised learning model, for example decision trees, k-nearest neighbors, SVMs, linear and logistic regression, random forests, and more. I won't discuss these methods and their differences in this post; instead, I will illustrate predictive modeling with sklearn using a classification model and a regression model.
Iris Example continued (Classification):
We saw that the data is a numpy array of shape (150, 4) consisting of measurements of physical traits for three iris species.
Goal
The task is to build a machine learning model to predict the species of a sample given the values of the features.
We will split the iris set into a training and a test set. The model will be built on a training set and evaluated on the test set. Before we do that, let's look at the general outline of a machine learning model in sklearn.
Outline of sklearn models:
The basic outline of a sklearn model is given by the following pseudocode.
input = labeled data
X_train = input.features
Y_train = input.target
algorithm = sklearn.ClassImplementingTheAlgorithm(parameters of the algorithm)
fitting = algorithm.fit(X_train, Y_train)
X_test = unlabeled set
prediction = algorithm.predict(X_test)
Here, as before, the labeled training data is in the form of a numpy array with X_train as the array of feature values and Y_train as the corresponding target values. In sklearn, different machine learning algorithms are implemented as classes and we will choose the class corresponding to the algorithm we want to use. Each class has a method called fit which fits the input training data to estimate the parameters of the algorithm. Now with these estimated parameters, the predict method computes the estimated value of the target for the test examples.
sklearn model on iris data:
Following the general outline of the sklearn model, we will now build a model on iris data to predict the species.
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
Y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4)
from sklearn.neighbors import KNeighborsClassifier
algorithm = KNeighborsClassifier(n_neighbors=5)
fitting = algorithm.fit(X_train, Y_train)
prediction = algorithm.predict(X_test)
The iris data set is split into a training and a test set using the train_test_split utility from sklearn: 60% of the iris data forms the training set and the remaining 40% forms the test set, with the examples picked at random. (In older versions of sklearn this utility lived in the cross_validation module; it is now in model_selection.) We used the K-nearest neighbors algorithm to build this model. There is no particular reason for choosing this method, other than simplicity. The prediction of the sklearn model is a label from {0, 1, 2} for each test case.
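A quick sanity check on the sizes of the split:
X_train.shape
>> (90, 4)
X_test.shape
>> (60, 4)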
Let's check how well this model performed:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, prediction)
>> 0.97
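Accuracy is a single summary number; for a per-class breakdown we can also print sklearn's classification report (an optional extra check; the exact numbers will vary with the random split):
from sklearn.metrics import classification_report
print(classification_report(Y_test, prediction, target_names=iris.target_names))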
Regression:
We will discuss the simplest example of fitting a line through the data.
# Create some simple data
import numpy as np
np.random.seed(0)
X = np.random.random(size=(20, 1))
y = 3 * X.squeeze() + 2 + np.random.normal(size=20)
# Fit a linear regression to it
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print ("Model coefficient: %.5f, and intercept: %.5f"% (model.coef_, model.intercept_))
>> Model coefficient: 3.93491, and intercept: 1.46229
# model prediction
X_test = np.linspace(0, 1, 100)[:, np.newaxis]
y_test = model.predict(X_test)
Thus we get the predicted values of the target (which are continuous).
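If you have matplotlib installed, it is easy to visualize the fitted line against the data:
import matplotlib.pyplot as plt
plt.scatter(X.squeeze(), y, label='data')             # the noisy samples
plt.plot(X_test.squeeze(), y_test, 'r', label='fit')  # the fitted line
plt.legend()
plt.show()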
We gave simple models based on sklearn's implementations of the K-nearest neighbors algorithm and linear regression. You can try other models: the Python code stays much the same for most methods in sklearn, apart from the name of the algorithm class and its parameters.
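For example, swapping K-nearest neighbors for a decision tree classifier changes only the import and the class name (a sketch, reusing the train/test split and accuracy_score from the classification example above):
from sklearn.tree import DecisionTreeClassifier
algorithm = DecisionTreeClassifier()
fitting = algorithm.fit(X_train, Y_train)
prediction = algorithm.predict(X_test)
accuracy_score(Y_test, prediction)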
Discover more Machine Learning content and tutorials on our dedicated Machine Learning page.
Janu Verma is a Quantitative Researcher at the Buckler Lab, Cornell University, where he works on problems in bioinformatics and genomics. His background is in mathematics and machine learning and he leverages tools from these areas to answer questions in biology.
He holds a Masters in Theoretical Physics from the University of Cambridge in the UK, and he dropped out of the mathematics PhD program (after 3 years) at Kansas State University.
He has held research positions at the Indian Statistical Institute – Delhi, the Tata Institute of Fundamental Research – Mumbai, and the JN Center for Advanced Scientific Research – Bangalore.
He is a voracious reader and an avid traveler. He hangs out at the local coffee shops, which serve as his office away from office. He writes about data science, machine learning and mathematics at Random Inferences.