[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rodolfo Bonnin titled Machine Learning for Developers. This book is a systematic developer’s guide for various machine learning algorithms and techniques to develop more efficient and intelligent applications.[/box]
In this article, we walk you through a simple implementation of a neural network layer by modeling a binary function using basic Python techniques. It is the first step towards solving some of the more complex machine learning problems with neural networks.
Take a look at the following code snippet to implement a single function with a single-layer perceptron:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from pprint import pprint
%matplotlib inline
from sklearn import datasets
The learning properties of a neural network would not be very good with just the help of a univariate linear classifier. Even some mildly complex problems in machine learning involve multiple non-linear variables, so many variants were developed as replacements for the transfer functions of the perceptron.
In order to represent non-linear models, a number of different non-linear functions can be used in the activation function. This implies changes in the way the neurons will react to changes in the input variables. In the following sections, we will define the main different transfer functions and define and represent them via code.
In this section, we will start using some object-oriented programming (OOP) techniques from Python to represent entities of the problem domain. This will allow us to represent concepts in a much clearer way in the examples.
Let's start by creating a TransferFunction class, which will contain the following two methods:
getTransferFunction(x): This method will apply the activation function determined by the class type to the input x
getTransferFunctionDerivative(x): This method will return the derivative of that activation function
For both functions, the input will be a NumPy array and the function will be applied element by element, as follows:
class TransferFunction:
    def getTransferFunction(x):
        raise NotImplementedError
    def getTransferFunctionDerivative(x):
        raise NotImplementedError
Firstly, let's prepare a function that will graph any transfer function together with its derivative over a common range of -2.0 to 2.0, which will allow us to see their main characteristics around the y axis:
def graphTransferFunction(function):
    x = np.arange(-2.0, 2.0, 0.01)
    plt.figure(figsize=(18,8))
    ax = plt.subplot(121)
    ax.set_title(function.__name__)
    plt.plot(x, function.getTransferFunction(x))
    ax = plt.subplot(122)
    ax.set_title('Derivative of ' + function.__name__)
    plt.plot(x, function.getTransferFunctionDerivative(x))
A sigmoid or logistic function is the canonical activation function and is well suited for calculating probabilities in classification problems. The classical formula for the sigmoid is sigmoid(x) = 1 / (1 + e^(-x)), which is implemented as follows:
class Sigmoid(TransferFunction):  # Squashes the output into the (0, 1) range
    def getTransferFunction(x):
        return 1/(1+np.exp(-x))
    def getTransferFunctionDerivative(x):
        # Note: expects the already-activated sigmoid output as its input
        return x*(1-x)
graphTransferFunction(Sigmoid)
Take a look at the following graph:
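Note that getTransferFunctionDerivative expects the already-activated sigmoid output rather than the raw input, a common shortcut in simple backpropagation code, since sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)). (This is also why the derivative panel produced by graphTransferFunction shows x*(1-x) over the raw range rather than the true derivative curve.) A minimal numerical check of this equivalence, added here for illustration and not part of the original listing, could look like this:

z = np.arange(-2.0, 2.0, 0.5)
s = Sigmoid.getTransferFunction(z)                              # the activated output
deriv_from_output = Sigmoid.getTransferFunctionDerivative(s)    # s * (1 - s)
deriv_from_input = np.exp(-z) / np.power(1 + np.exp(-z), 2)     # closed-form derivative in z
print(np.allclose(deriv_from_output, deriv_from_input))         # True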
Next, we will do an exercise to get an idea of how the sigmoid changes when multiplied by a weight and shifted by a bias, which is how the final function is accommodated towards the training goal. Let's vary these parameters for a single sigmoid first and watch it stretch and shift:
ws=np.arange(-1.0, 1.0, 0.2)
bs=np.arange(-2.0, 2.0, 0.2)
xs=np.arange(-4.0, 4.0, 0.1)
plt.figure(figsize=(20,10))
ax=plt.subplot(121)
for i in ws:
    plt.plot(xs, Sigmoid.getTransferFunction(i*xs), label=str(i))
ax.set_title('Sigmoid variants in w')
plt.legend(loc='upper left');
ax=plt.subplot(122)
for i in bs:
    plt.plot(xs, Sigmoid.getTransferFunction(i+xs), label=str(i))
ax.set_title('Sigmoid variants in b')
plt.legend(loc='upper left');
Take a look at the following graph:
The hyperbolic tangent (tanh) is another classical transfer function; it squashes its output into the (-1, 1) range. Let's take a look at the following code snippet:
class Tanh(TransferFunction):  # Squashes the output into the (-1, 1) range
    def getTransferFunction(x):
        return np.tanh(x)
    def getTransferFunctionDerivative(x):
        # The derivative of tanh(x) is 1 - tanh(x)^2
        return 1 - np.power(np.tanh(x), 2)
graphTransferFunction(Tanh)
Let's take a look at the following graph:
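Since we return 1 - tanh(x)^2 as the derivative, we can sanity-check it against a numerical finite-difference approximation. This quick check is an illustrative addition, not part of the original listing:

x = np.arange(-2.0, 2.0, 0.25)
h = 1e-5
analytical = Tanh.getTransferFunctionDerivative(x)
numerical = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)   # central difference
print(np.allclose(analytical, numerical))                 # True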
ReLU stands for rectified linear unit, and one of its main advantages is that it is not affected by the vanishing gradient problem, in which the gradients reaching the first layers of a network tend towards zero, or a tiny epsilon:
class Relu(TransferFunction):
    def getTransferFunction(x):
        return x * (x > 0)
    def getTransferFunctionDerivative(x):
        return 1 * (x > 0)
graphTransferFunction(Relu)
Let's take a look at the following graph:
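To make the vanishing gradient remark more concrete, we can compare gradient magnitudes for inputs far from zero: the sigmoid's derivative collapses towards zero, while ReLU's derivative stays at 1 for any positive input. This small comparison is an illustrative addition, not part of the original listing:

big_inputs = np.array([-10.0, -5.0, 5.0, 10.0])
sigmoid_out = Sigmoid.getTransferFunction(big_inputs)
print(Sigmoid.getTransferFunctionDerivative(sigmoid_out))   # all values close to 0
print(Relu.getTransferFunctionDerivative(big_inputs))       # 0 for negative inputs, 1 for positive ones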
Let's take a look at the following code snippet to understand the linear transfer function:
class Linear(TransferFunction):
    def getTransferFunction(x):
        return x
    def getTransferFunctionDerivative(x):
        return np.ones(len(x))
graphTransferFunction(Linear)
Let's take a look at the following graph:
As with every model in machine learning, we will explore the possible functions that we will use to determine how well our predictions and classification went.
The first type of distinction we will do is between the L1 and L2 error function types.
L1, also known as least absolute deviations (LAD) or least absolute errors (LAE), has very interesting properties, and it simply consists of the absolute difference between the final result of the model and the expected one, summed over all samples: L1 = sum(|y_pred - y|). Its quadratic counterpart, L2 (least squares), sums the squared differences instead: L2 = sum((y_pred - y)^2).
Now it's time to do a head-to-head comparison between the two types of loss functions:
Robustness: L1 is the more robust loss function, where robustness can be understood as resistance to outliers, which a quadratic function projects to very high error values. Thus, in order to choose an L2 function, we would need very stringent data cleaning for it to be efficient.
Stability: The stability property assesses how much the error curve jumps for a large error value. L1 is more unstable, especially for non-normalized datasets (because numbers in the [-1, 1] range diminish when squared).
Solution uniqueness: As can be inferred from its quadratic nature, the L2 function ensures that we will have a unique answer in our search for a minimum. L1, however, can have many solutions, because many piecewise linear paths of minimal total length can exist for our models, compared to the single straight-line distance in the case of L2.
Regarding usage, the sum of the previous properties allows us to use the L2 error type in normal cases, especially because of solution uniqueness, which gives us the required certainty when starting to minimize error values. In the first example, however, we will start with the simpler L1 error function for educational purposes.
Let's explore these two approaches by graphing the error results for a sample L1 and L2 loss function. In the next simple example, we will show you the very different nature of the two errors. The first four samples lie in the normalized [-1, 1] range, while the remaining ones include values outside it.
As you can see, for samples 0 to 3 the error increases steadily and continuously, but with non-normalized data the quadratic error can explode, especially with outliers, as shown in the following code snippet:
sampley_=np.array([.1,.2,.3,-.4, -1, -3, 6, 3])
sampley=np.array([.2,-.2,.6,.10, 2, -1, 3, -1])
plt.figure(figsize=(10,10))
ax=plt.subplot()
plt.plot(np.abs(sampley_ - sampley), label='L1')         # per-sample absolute (L1) error
plt.plot(np.power((sampley_ - sampley),2), label="L2")   # per-sample squared (L2) error
ax.set_title('L1 vs L2 initial comparison')
plt.legend(loc='best')
plt.show()
Let's take a look at the following graph:
Let's define the loss functions in the form of a LossFunction class with a getLoss method, and L1 and L2 subclasses for the two loss function types, each receiving two NumPy arrays as parameters: y_, the estimated function value, and y, the expected value:
class LossFunction:
    def getLoss(y_, y):
        raise NotImplementedError

class L1(LossFunction):
    def getLoss(y_, y):
        # Least absolute deviations: sum of the absolute differences
        return np.sum(np.abs(y_ - y))

class L2(LossFunction):
    def getLoss(y_, y):
        # Least squares: sum of the squared differences
        return np.sum(np.power((y_ - y), 2))
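As a quick usage sketch (with made-up values, not taken from the original text), we can see the robustness property discussed earlier in action: a single outlier inflates the L2 loss far more than the L1 loss:

targets = np.array([0.0, 0.5, 1.0, 0.5])
preds_clean = np.array([0.1, 0.4, 0.9, 0.6])     # predictions close to the targets
preds_outlier = np.array([0.1, 0.4, 0.9, 5.0])   # same predictions, with one outlier
print(L1.getLoss(preds_clean, targets), L2.getLoss(preds_clean, targets))      # both losses are small
print(L1.getLoss(preds_outlier, targets), L2.getLoss(preds_outlier, targets))  # the L2 loss explodes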
Now it's time to define the goal function, which we will define as a simple Boolean. In order to allow faster convergence, it will have a direct relationship between the first input variable and the function's outcome:
# input dataset
X = np.array([ [0,0,1],
[0,1,1],
[1,0,1],
[1,1,1] ])
# output dataset
y = np.array([[0,0,1,1]]).T
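Since the target was designed to depend directly on the first input variable, a quick check (an illustrative addition, not part of the original listing) confirms that y is simply the first column of X:

print(np.array_equal(y, X[:, 0:1]))   # True: the output mirrors the first input column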
The first model we will use is a very minimal neural network: a single neuron with three inputs and one weight per input, without bias, in order to keep the model's complexity to a minimum:
# initialize weights randomly with mean 0
W = 2*np.random.random((3,1)) - 1
print (W)
Take a look at the following output generated by running the preceding code:
[[ 0.52014909]
[-0.25361738]
[ 0.165037 ]]
Then we will define a set of variables to collect the model's error, the weights, and training results progression:
errorlist=np.empty(3)
weighthistory=np.array(0)
resultshistory=np.array(0)
Then it's time to do the iterative error minimization. In this case, it will consist of feeding the whole truth table 100 times through the weights and the neuron's transfer function, adjusting the weights in the direction of the error.
Note that this model doesn't use a learning rate, so it should converge (or diverge) quickly:
for iter in range(100):
    # forward propagation
    l0 = X
    l1 = Sigmoid.getTransferFunction(np.dot(l0, W))
    resultshistory = np.append(resultshistory, l1)

    # Error calculation
    l1_error = y - l1
    errorlist = np.append(errorlist, l1_error)

    # Back propagation 1: Get the deltas
    l1_delta = l1_error * Sigmoid.getTransferFunctionDerivative(l1)

    # update weights
    W += np.dot(l0.T, l1_delta)
    weighthistory = np.append(weighthistory, W)
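As mentioned above, the update applies the full gradient step with no learning rate. If convergence were too abrupt or unstable, a scaled update could be used instead; the following variant is only a sketch (the alpha value and the W_lr/l1_lr names are hypothetical, not from the original listing):

# Hypothetical variant of the training loop, scaling the weight update by a learning rate alpha
alpha = 0.5
W_lr = 2*np.random.random((3,1)) - 1
for iter in range(100):
    l1_lr = Sigmoid.getTransferFunction(np.dot(X, W_lr))
    l1_delta_lr = (y - l1_lr) * Sigmoid.getTransferFunctionDerivative(l1_lr)
    W_lr += alpha * np.dot(X.T, l1_delta_lr)
print(l1_lr)   # converges more gradually than the unscaled version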
Let's simply review the last evaluation step by printing the output values at l1. We can now see that the output closely reflects the original function's values:
print (l1)
Take a look at the following output, which is generated by running the preceding code:
[[ 0.11510625]
[ 0.08929355]
[ 0.92890033]
[ 0.90781468]]
To better understand the process, let's have a look at how the parameters change over time. First, let's graph the neuron's weights. As you can see, they go from a random state to a large positive value for the first column (which always matches the output), a value close to 0 for the second column (which is right only 50% of the time), and about -2 for the third (mainly because it has to force the output to 0 for the first two rows of the table):
plt.figure(figsize=(20,20))
print (W)
plt.imshow(np.reshape(weighthistory[1:],(-1,3))[:40], cmap=plt.cm.gray_r,
           interpolation='nearest');
Take a look at the following output, which is generated by running the preceding code:
[[ 4.62194116]
[-0.28222595]
[-2.04618725]]
Let's take a look at the following screenshot:
Let's also review how our solutions evolved (during the first 40 iterations) until we reached the last iteration; we can clearly see the convergence to the ideal values:
plt.figure(figsize=(20,20))
plt.imshow(np.reshape(resultshistory[1:], (-1,4))[:40],
           cmap=plt.cm.gray_r, interpolation='nearest');
Let's take a look at the following screenshot:
We can see how the error evolves and tends towards zero through the different epochs. In this case, we can observe that it swings from negative to positive, which is possible because we used an L1-style (non-squared) error:
plt.figure(figsize=(10,10))
plt.plot(errorlist);
Let's take a look at the following screenshot:
The above walkthrough of implementing a neural network with a single-layer perceptron shows how to create and experiment with transfer functions, and how to check how accurately the resulting model classifies and predicts the dataset. To learn how classification is generally done on complex and large datasets, you may read our article on multi-layer perceptrons.
To get hands-on with advanced concepts and powerful tools for solving complex computational machine learning problems, do check out this book Machine Learning for Developers and start building smart applications in your machine learning projects.