You're reading from Deep Learning with R Cookbook Over 45 unique recipes to delve into neural network techniques using R 3.5.x

Product type Paperback

Published in Feb 2020

Publisher Packt

ISBN-13 9781789805673

Length 328 pages

Edition 1st Edition

Authors (3):

Swarna Gupta

Rehan Ali Ansari

Dipayan Sarkar

Implementing a single-layer neural network

An artificial neural network is a network of computing entities that can perform various tasks, such as regression, classification, clustering, and feature extraction. Such networks are inspired by the biological neural networks in the human brain. The most fundamental unit of a neural network is called a neuron, or perceptron. A neuron is a simple computing unit that takes in a set of inputs and applies a function to them in order to produce an output.

The following diagram shows a simple neuron:

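In 1957, Frank Rosenblatt proposed the classical perceptron model, in which he associated a weight with each input. He also proposed a method for learning these weights. A perceptron is a simple computing unit with a threshold, $\theta$, which can be defined by the following equation:

$$y = \begin{cases} 1 & \text{if } \sum_{i} w_i x_i \ge \theta \\ 0 & \text{otherwise} \end{cases}$$

Here, $x_i$ are the inputs and $w_i$ are the corresponding weights.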

The following diagram represents a perceptron:

Perceptrons can only deal with linearly separable cases. The neural networks that we use today rely on activation functions rather than the hard threshold used in perceptrons. Unlike perceptrons, neural networks with non-linear activation functions can learn complex non-linear mappings between inputs and outputs, which makes them suitable for more complicated applications such as image recognition, language translation, and speech recognition. The most popular activation functions are sigmoid, tanh, ReLU, and softmax.
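The threshold rule of a perceptron can be sketched in a few lines of base R. This is a minimal illustration rather than Rosenblatt's learning procedure: the function name perceptron and the weights and threshold below are hypothetical values chosen by hand so that the unit behaves like a logical AND, which is a linearly separable problem.

```r
# Perceptron with a hard threshold: outputs 1 when the weighted sum of the
# inputs reaches the threshold theta, and 0 otherwise.
perceptron <- function(x, w, theta) {
  as.integer(sum(w * x) >= theta)
}

# Hypothetical, hand-picked weights and threshold that implement a logical AND
w <- c(1, 1)
theta <- 1.5

# All four combinations of two binary inputs, one per row
inputs <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))
and_output <- apply(inputs, 1, perceptron, w = w, theta = theta)
print(and_output)
```

Only the last row, where both inputs are 1, pushes the weighted sum past the threshold. No single choice of weights and threshold can reproduce XOR in the same way, which is the kind of case that is not linearly separable.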

We can implement various machine learning algorithms, such as simple linear regression and logistic regression, using neural networks. For example, we can think of logistic regression as a single-layer neural network. A logistic regression neural network uses a sigmoid ($\sigma$) activation function. The following diagram shows a logistic regression neural network:

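The output of the network is given as follows:

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the weighted sum of the inputs plus a bias term $b$:

$$z = \sum_{i} w_i x_i + b$$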

While implementing a multinomial logistic regression problem using neural networks, we place a softmax activation function in the output layer. The following equation shows the output of a multinomial logistic regression neural network:

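$$P(y = j \mid x) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

where $z_j$ is the weighted sum of inputs for the $j$-th class and $K$ is the number of classes.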
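To make the two output functions concrete, here is a small base R sketch. The helper names sigmoid and softmax and the input values are our own, chosen for illustration; they are not part of the keras package.

```r
# Sigmoid squashes a single weighted sum into the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Softmax turns a vector of per-class weighted sums into probabilities
softmax <- function(z) {
  e <- exp(z - max(z))  # subtracting max(z) avoids overflow; the result is unchanged
  e / sum(e)
}

# Binary case: one weighted sum, one probability
z_binary <- 0.8
p_binary <- sigmoid(z_binary)
print(p_binary)

# Multi-class case: one weighted sum per class (illustrative values)
z_multi <- c(2.0, 1.0, 0.1)
p_multi <- softmax(z_multi)
print(p_multi)
print(sum(p_multi))
```

The class with the largest weighted sum receives the largest probability, and the softmax outputs always add up to 1.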

In neural networks, the network error is calculated by comparing the model's output to the desired output. This error term is used to guide the training of neural networks. After each training iteration, the error is communicated backward in the network and the weights of the network are updated in order to minimize the error. This process is called backpropagation. In this recipe, we will build a multi-class classification neural network using the keras library in R.
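For a single softmax layer, one round of "compute the error, send it backward, update the weights" can be written out by hand. The following sketch uses plain gradient descent in base R on the built-in iris data. It is an illustration of the idea only: it is not how Keras implements training, it does not use the Adam optimizer from this recipe, and the names X, Y, W, softmax_rows, and cross_entropy, as well as the learning rate of 0.01, are our own assumptions.

```r
# Inputs with a leading column of 1s that plays the role of the bias
X <- cbind(1, as.matrix(iris[, 1:4]))
# One-hot targets: row i has a 1 in the column of its species
Y <- diag(3)[as.integer(iris$Species), ]

# Row-wise softmax for a matrix of weighted sums
softmax_rows <- function(Z) {
  E <- exp(Z - apply(Z, 1, max))
  E / rowSums(E)
}

# Mean categorical cross-entropy between predictions P and targets Y
cross_entropy <- function(P, Y) -mean(rowSums(Y * log(P)))

# Start from all-zero weights, so every class gets probability 1/3
W <- matrix(0, nrow = ncol(X), ncol = 3)
P <- softmax_rows(X %*% W)
loss_before <- cross_entropy(P, Y)

# Gradient of the loss with respect to W for a softmax output layer
grad <- t(X) %*% (P - Y) / nrow(X)

# One gradient descent update with an assumed learning rate of 0.01
W <- W - 0.01 * grad
loss_after <- cross_entropy(softmax_rows(X %*% W), Y)

print(c(before = loss_before, after = loss_after))
```

The term P - Y is the error that is propagated backward, and repeating the update many times is what the training loop does for us.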

Getting ready

We will use the iris dataset in this recipe. It is a multivariate dataset that consists of 150 samples, 50 from each of three species of iris flower: setosa, virginica, and versicolor. Each sample contains four feature measurements; that is, the length and width of the sepals and petals in centimeters. We will use the keras package in order to utilize the deep learning functions for classification and the datasets library to import the iris dataset:

library(keras)
library(datasets)

In the next section, we will look at the data in more detail.

How to do it...

Before doing any transformations in the dataset, we will analyze the properties of the data, such as its dimensions, its variables, and its summary:

Let's start by loading the iris dataset from the datasets library:

data <- datasets::iris

Now, we can view the dimensions of the data:

dim(data)

Here, we can see that there are 150 rows and 5 columns in the data:

Let's display the first six records of the data (head() returns six rows by default):

head(data)

Let's have a glance at the data:

Now, let's have a look at the datatypes of the variables in the dataset:

str(data)

Here, we can see that all the columns except Species are numeric. Species is the response variable for this classification exercise:

Let's also look at a summary of the data to see the distribution of the variables:

summary(data)

We get the following output:

Now, we can work on the data transformation. To work with the keras package, we need to convert the data into an array or a matrix. The matrix data elements should be of the same basic type, but here, we have target values that are of the factor type, so we need to change this:

# Converting the data into a matrix for keras to consume
data[,5] <- as.numeric(data[,5]) -1
data <- as.matrix(data)

# Setting dimnames of data to NULL
dimnames(data) <- NULL
head(data)
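It helps to see which number each species ends up with after this conversion. The following base R sketch works on a copy of the Species column; the variable names species and codes are ours.

```r
# Factor levels are stored in alphabetical order by default
species <- datasets::iris$Species
print(levels(species))

# as.numeric() returns the 1-based level index; subtracting 1 makes it 0-based
codes <- as.numeric(species) - 1

# Cross-tabulate the species against the numeric codes
print(table(species, codes))
```

The codes are zero-based because to_categorical(), which we use later in this recipe, expects class indices that start at 0.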

Now, we need to split the data into training and testing datasets. The seed number is the starting point that's used when generating a sequence of random numbers. Using the same number inside the function ensures that we can reproduce the same data each time the code is run:

set.seed(76)

# Assign each row to group 1 (training) or 2 (testing) with probabilities 0.7 and 0.3
indexes <- sample(2, nrow(data), replace = TRUE, prob = c(0.70, 0.30))

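We divide the data in a ratio of approximately 70:30 for the training and testing datasets, respectively. Because each row is assigned independently, the actual proportions can differ slightly from 70:30: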

# Splitting the predictor variables into training and testing
data.train <- data[indexes==1, 1:4]
data.test <- data[indexes==2, 1:4]

# Splitting the label attribute (response variable) into training and testing
data.trainingtarget <- data[indexes==1, 5]
data.testtarget <- data[indexes==2, 5]
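If an exact 70:30 split is needed, one alternative is to sample row numbers without replacement. This is a sketch of a different technique from the one used in this recipe, and the names n, train_rows, and test_rows are hypothetical.

```r
set.seed(76)

n <- nrow(iris)

# Draw exactly 70% of the row numbers for training, without replacement
train_rows <- sample(n, size = round(0.7 * n))

# Every remaining row goes to the test set
test_rows <- setdiff(seq_len(n), train_rows)

print(length(train_rows))
print(length(test_rows))
```

With this approach, the two sets always have the same sizes from run to run, whereas the membership of each set still depends on the seed.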

Next, we one-hot encode the target column of the training and test data. The to_categorical() function converts a class vector into a binary class matrix:

data.trainLabels <- to_categorical(data.trainingtarget)
data.testLabels <- to_categorical(data.testtarget)
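To see what a binary class matrix looks like, we can build one by hand in base R. The one_hot() function below is a hypothetical helper written for illustration; it mimics the idea behind to_categorical() but is not the keras implementation.

```r
# Build a binary class matrix from zero-based class labels
one_hot <- function(y, num_classes = max(y) + 1) {
  m <- matrix(0, nrow = length(y), ncol = num_classes)
  # For row i, set the column that corresponds to class y[i] to 1
  m[cbind(seq_along(y), y + 1)] <- 1
  m
}

# Four example labels from three classes (0, 1, and 2)
labels <- c(0, 2, 1, 0)
encoded <- one_hot(labels)
print(encoded)
```

Each row contains a single 1, in the column of its class, and 0 everywhere else.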

Now, let's build the model and compile it. First, we need to initialize a Keras sequential model object:

# Initialize a sequential model
model <- keras_model_sequential()

Next, we stack a dense layer. Since this is a single-layer network, we stack one layer:

model %>%
    layer_dense(units = 3, activation = 'softmax', input_shape = ncol(data.train))

This layer is a three-node softmax layer that returns an array of three probability scores that sum to 1. Now, let's have a look at the summary of the model:

summary(model)

The output of the preceding code is as follows:

Compiling the model prepares it for training. When compiling the model, we specify a loss function, an optimizer, and the metrics used to evaluate the model during training and testing:

# Compile the model
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = 'adam',
metrics = 'accuracy'
)

Now, we train the model:

# Fit the model
model %>% fit(data.train,
data.trainLabels,
epochs = 200,
batch_size = 5,
validation_split = 0.2
)

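Let's visualize the metrics of the trained model. The fit() function returns a history object that records the loss and accuracy for each epoch. Note that calling fit() on a model that has already been trained resumes from the current weights, so the following call trains for a further 200 epochs: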

history <- model %>% fit(data.train,
data.trainLabels,
epochs = 200,
batch_size = 5,
validation_split = 0.2
)

# Plotting the model metrics - loss and accuracy
plot(history)

In the following plot, loss and acc indicate the loss and accuracy of the model for the training and validation data:

Now, we generate predictions for the test data. Here, we use the predict_classes() function to predict the classes for the test data. We're using a batch size of 128:

classes <- model %>% predict_classes(data.test, batch_size = 128)

The following code provides us with the confusion matrix, which lets us see the correct and incorrect predictions:

table(data.testtarget, classes)

The following table shows the confusion matrix for the test data:

Finally, let's evaluate the model's performance on the test data:

score <- model %>% evaluate(data.test, data.testLabels, batch_size = 128)

Now, we print the model scores:

print(score)

The following screenshot shows the loss and accuracy of the model on the test data:

We can see that our model's accuracy is about 75%.

How it works...

In step 1, we loaded the iris data from the datasets library in R. It is always advised to be aware of the data and its characteristics before we start building models. Hence, we studied the structure and type of the variables in the data. We saw that apart from Species (our response (target) variable), all the other variables were numeric. Then, we checked the dimensions of our dataset.

The summary() function shows us the distribution of variables and the central tendency metric for each of these variables in the data. The head() function, by default, displays only the first six rows of the dataset.

You can use head() to display any number of records. To do this, you need to pass the number of records as an argument in the head function. If you want to see the records from the end of the data, use the tail() function.

In step 2, we did the required data transformations. To work with the keras package, we need to convert the data into an array or a matrix. In our example, we changed our target column from the factor datatype to numeric and converted the data into a matrix format. From the summary of the dataset, it was clear that we did not need to normalize this data.

If we need to deal with some data that hasn't been normalized, we can use the normalize() function from keras.

Next, in step 3, we divided the data into training and testing sets in a ratio of approximately 70:30. Note that before dividing the data into training and testing, we set the seed by passing an arbitrary integer to set.seed(). The set.seed() function makes R generate the same sequence of random numbers whenever we supply the same number (seed) to it.

While building a multi-class classification model with neural networks, it is recommended to transform the target attribute from a vector that contains values for each class value into a matrix with a boolean value for each class, indicating the presence or absence of that class value in an instance. To achieve this, in step 4, we used the to_categorical() function from the keras library.

In step 5, we built the model. First, we initialized a sequential model using the keras_model_sequential() function. Then, we added layers to the model. The model needs to know what input shape it should expect, so we specified the input shape in the first layer in our sequential model. The number of units is three because the number of output classes in our multi-class classification problem is three. Note that the activation function in this layer is softmax. The softmax activation is used when the output should be a probability for each class, with each value between 0 and 1 and all values summing to 1. Then, we used the summary() function to get a summary of our model. There are a few more functions that can help us investigate the model, such as get_config() and get_layer().
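What the dense softmax layer computes for one sample can be sketched in base R. The weights below are random placeholders, not the values Keras would initialize or learn, and the names n_inputs, n_units, W, b, and probs are ours.

```r
set.seed(1)

n_inputs <- 4  # four iris measurements
n_units  <- 3  # one unit per species

# Hypothetical weights (4 x 3) and biases (one per unit)
W <- matrix(rnorm(n_inputs * n_units, sd = 0.1), nrow = n_inputs)
b <- rep(0, n_units)

# Measurements of the first iris sample
x <- c(5.1, 3.5, 1.4, 0.2)

# Weighted sum for each unit, followed by softmax
z <- as.vector(x %*% W) + b
probs <- exp(z - max(z)) / sum(exp(z - max(z)))
print(probs)

# Trainable parameters: one weight per input-unit pair plus one bias per unit
n_params <- n_inputs * n_units + n_units
print(n_params)
```

The parameter count of 15 (12 weights plus 3 biases) should correspond to the total that summary(model) reports for this layer.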

Once we set up the architecture of our model, we compiled it. To compile the model, we need to provide a few settings:

Loss function: This measures the error of the model during training; that is, how far its predictions are from the target values. We need to minimize this function to reach convergence.
Optimizer: This algorithm updates the model's weights based on the data it sees and its loss function.
Metrics: These are used to evaluate the training and testing steps.

Adam is the optimizer used in this recipe; other popular optimization algorithms include SGD and RMSprop. The choice of loss function depends on the problem you are dealing with. For a multi-class classification problem, we generally use categorical cross-entropy (categorical_crossentropy), while for a binary classification problem, we use the binary_crossentropy loss function.
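The cross-entropy loss for one-hot targets can be computed by hand to see how it rewards confident, correct predictions. This is a base R sketch with made-up probabilities; the function name categorical_crossentropy is borrowed from the loss name for readability, and the clipping constant eps is an assumption rather than the value Keras uses.

```r
# Mean categorical cross-entropy for one-hot targets
categorical_crossentropy <- function(y_true, y_pred, eps = 1e-7) {
  # Clip predictions away from 0 and 1 so that log() stays finite
  y_pred <- pmin(pmax(y_pred, eps), 1 - eps)
  -mean(rowSums(y_true * log(y_pred)))
}

# Two samples: the first belongs to class 0, the second to class 1
y_true <- rbind(c(1, 0, 0), c(0, 1, 0))

# Made-up predictions: confident and correct versus mostly wrong
good <- rbind(c(0.90, 0.05, 0.05), c(0.10, 0.80, 0.10))
bad  <- rbind(c(0.20, 0.50, 0.30), c(0.60, 0.20, 0.20))

loss_good <- categorical_crossentropy(y_true, good)
loss_bad  <- categorical_crossentropy(y_true, bad)
print(c(good = loss_good, bad = loss_bad))
```

Only the probability assigned to the true class contributes to the loss, so the better predictions produce the smaller value.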

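In step 6, we trained the model using the fit() method. An epoch refers to a single pass through the entire training set. The batch size defines the number of samples that are passed through the network before the weights are updated. The validation_split argument holds out a fraction of the training data (here, 20%) for validation.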
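The relationship between these settings can be worked out with simple arithmetic. In the sketch below, the number of training rows is a hypothetical round figure (the real count depends on the random split), and we assume that the validation fraction is removed before the remaining rows are divided into batches.

```r
n_train_rows     <- 100  # hypothetical number of rows passed to fit()
validation_split <- 0.2
batch_size       <- 5
epochs           <- 200

# Rows left for fitting after the validation fraction is held out
n_fit <- floor(n_train_rows * (1 - validation_split))

# Each batch triggers one weight update; a final partial batch still counts
updates_per_epoch <- ceiling(n_fit / batch_size)

# Total number of weight updates over the whole training run
total_updates <- updates_per_epoch * epochs

print(c(n_fit = n_fit,
        updates_per_epoch = updates_per_epoch,
        total_updates = total_updates))
```

A smaller batch size means more weight updates per epoch, which usually makes each epoch slower but the updates more frequent.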

In step 7, we plotted the model's metrics using the plot() function and analyzed the accuracy and loss of the training and validation data.

In the last step, we generated predictions for the test dataset and evaluated our model's performance. Note that since this is a classification model, we used the predict_classes() function to predict the outcomes. In the case of a regression exercise, we use the predict() function. We used the evaluate() function to check the accuracy of our model on the test data. By doing this, we saw that our model's accuracy was around 75.4%.
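Predicting a class amounts to picking the column with the highest probability in each row of the predicted probabilities. The following base R sketch shows that step, together with the accuracy derived from a confusion matrix. The probabilities and labels are made up for illustration. This approach may also be useful on newer installations, where predict_classes() is, to our knowledge, no longer available with TensorFlow 2.6 or later and predict() returns the probabilities instead.

```r
# Made-up predicted probabilities for four samples and three classes
probs <- rbind(
  c(0.80, 0.15, 0.05),
  c(0.10, 0.30, 0.60),
  c(0.20, 0.50, 0.30),
  c(0.05, 0.25, 0.70)
)

# Made-up true classes (zero-based, as in this recipe)
actual <- c(0, 2, 2, 2)

# Column index of the largest probability in each row, shifted to start at 0
predicted <- max.col(probs, ties.method = "first") - 1

# Confusion matrix with all three classes present on both axes
confusion <- table(
  actual    = factor(actual, levels = 0:2),
  predicted = factor(predicted, levels = 0:2)
)
print(confusion)

# Accuracy is the share of samples on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

Fixing the factor levels keeps the confusion matrix square even when a class is never predicted, so the diagonal always lines up with the correct predictions.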

There's more...

Activation functions are used to learn non-linear and complex functional mappings between the inputs and the response variable in an artificial neural network. One thing to keep in mind is that an activation function should be differentiable (at least almost everywhere, as with ReLU) so that backpropagation can compute the gradients of the error (loss) with respect to the weights, which are then used to update the weights and reduce the error. Let's have a look at some of the popular activation functions and their properties:

Sigmoid:
- A sigmoid function ranges between 0 and 1.
- It is usually used in an output layer of a binary classification problem.
- It is better than linear activation because the output of the activation function is in the range of (0,1) compared to (-inf, inf), so the output of the activation is bound. It scales down large negative numbers toward 0 and large positive numbers toward 1.
- Its output is not zero centered, which makes gradient updates go too far in different directions and makes optimization harder.
- It has a vanishing gradient problem.
- It also has slow convergence.

The sigmoid function is defined as follows:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Here is the graph of the sigmoid function:
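A few sample values give the same picture numerically and also show where the vanishing gradient comes from. This is a base R sketch; the helper names sigmoid and sigmoid_grad are ours.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Derivative of the sigmoid, expressed through the sigmoid itself
sigmoid_grad <- function(z) sigmoid(z) * (1 - sigmoid(z))

z <- c(-10, -2, 0, 2, 10)

# Outputs approach 0 on the left and 1 on the right
print(round(sigmoid(z), 5))

# The gradient peaks at z = 0 and shrinks quickly on both sides
print(signif(sigmoid_grad(z), 3))
```

The gradient is never larger than 0.25 and is close to zero for large positive or negative inputs, so very little error signal passes backward through a saturated sigmoid unit.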

Tangent Hyperbolic (tanh):
- The tanh function scales the values between -1 and 1.
- The gradient for tanh is steeper than it is for sigmoid.
- Unlike sigmoid, it is centered around zero, which makes optimization easier.
- It is usually used in hidden layers.
- It has a vanishing gradient problem.

The tanh function is defined as follows:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

The following diagram is the graph of the tanh function:
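The properties listed for tanh can be checked with a short base R sketch. It relies on the built-in tanh() function; the sigmoid helper and the variable names are ours.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-3, -1, 0, 1, 3)

# tanh is symmetric around zero and stays between -1 and 1
tanh_values <- tanh(z)
print(round(tanh_values, 4))

# tanh is a rescaled and shifted sigmoid
rescaled_sigmoid <- 2 * sigmoid(2 * z) - 1
print(all.equal(tanh_values, rescaled_sigmoid))

# Derivative of tanh: 1 at z = 0, compared with 0.25 for the sigmoid
tanh_grad <- 1 - tanh(z)^2
print(round(tanh_grad, 4))
```

The larger gradient around zero is what makes tanh steeper than sigmoid, while the small gradients at the ends show that it can still saturate.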

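Rectified linear units (ReLU):
- It is a non-linear function.
- It ranges from 0 to infinity.
- It does not saturate for positive inputs, so it is much less prone to the vanishing gradient problem than sigmoid and tanh.
- Its convergence is faster than sigmoid and tanh.
- It has a dying ReLU problem: a unit whose input stays negative outputs 0, receives a zero gradient, and stops learning.
- It is used in hidden layers.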

The ReLU function is defined as follows:

$$f(z) = \max(0, z)$$

Here is the graph for the ReLU function:
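ReLU and its gradient are simple enough to write directly in base R. In this sketch, the helper names relu and relu_grad are ours, and the gradient at exactly zero is set to 0 by convention.

```r
# ReLU keeps positive values and replaces negative values with 0
relu <- function(z) pmax(0, z)

# Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 at z = 0 by convention)
relu_grad <- function(z) as.numeric(z > 0)

z <- c(-2, -0.5, 0, 0.5, 2)
print(relu(z))
print(relu_grad(z))
```

The gradient stays at 1 for every positive input, however large, but it is exactly 0 for negative inputs, which is the source of the dying ReLU problem.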

Now, let's look at the variants of ReLU:

Leaky ReLU:
- It doesn't have a dying ReLU problem as it doesn't have zero-slope parts.
- Leaky ReLU learns faster than ReLU.

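Mathematically, the Leaky ReLU function can be defined as follows:

$$f(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{otherwise} \end{cases}$$

Here, $\alpha$ is a small positive constant, such as 0.01.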

Here is a graphical representation of the Leaky ReLU function:

Exponential Linear Unit (ELU):
- It doesn't have the dying ReLU problem
- It saturates for large negative values

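Mathematically, the ELU function can be defined as follows:

$$f(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha \left(e^{z} - 1\right) & \text{otherwise} \end{cases}$$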

Here is a graphical representation of the ELU function:
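Leaky ReLU and ELU differ only in how they treat negative inputs, which a short base R sketch makes visible. The helper names leaky_relu and elu are ours, and the default alpha values (0.01 and 1) are common choices that we assume here for illustration.

```r
# Leaky ReLU: a small slope alpha replaces the flat part of ReLU
leaky_relu <- function(z, alpha = 0.01) ifelse(z > 0, z, alpha * z)

# ELU: negative inputs follow alpha * (exp(z) - 1), which levels off at -alpha
elu <- function(z, alpha = 1) ifelse(z > 0, z, alpha * (exp(z) - 1))

z <- c(-100, -2, 0, 2)

# Leaky ReLU keeps decreasing linearly for negative inputs
print(leaky_relu(z))

# ELU saturates near -alpha for large negative inputs
print(round(elu(z), 4))
```

Both functions behave like ReLU for positive inputs, while for negative inputs they keep a non-zero output and gradient, so a unit can recover instead of dying.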

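Parametric Rectified Linear Unit (PReLU):
- PReLU is a type of leaky ReLU, where the value of alpha is learned by the network itself.

The mathematical definition of the PReLU function is as follows:

$$f(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{otherwise} \end{cases}$$

Here, $\alpha$ is a learnable parameter.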

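Thresholded Rectified Linear Unit (Thresholded ReLU):

The mathematical definition of the Thresholded ReLU function is as follows:

$$f(z) = \begin{cases} z & \text{if } z > \theta \\ 0 & \text{otherwise} \end{cases}$$

Here, $\theta$ is the threshold below which the unit outputs 0.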

Softmax:
- It is non-linear.
- It's usually used in the output layer of a multiclass classification problem.
- It calculates a probability distribution over "n" different events (classes). It outputs a value between 0 and 1 for each class, and the sum of all the probabilities is 1.

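The mathematical definition of the softmax function is as follows:

$$\text{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

Here, $z_j$ is the weighted sum of inputs for the $j$-th class and $K$ is the number of classes.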