Deep feedforward networks, also called feedforward neural networks, are sometimes also referred to as multilayer perceptrons (MLPs). The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a label y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that results in the best function approximation.
This tutorial is an excerpt from the book Neural Network Programming with TensorFlow by Manpreet Singh Ghotra and Rajdeep Dua. With this book, learn how to implement more advanced neural networks such as CNNs, RNNs, GANs, and deep belief networks in TensorFlow.
Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications. Feedforward neural networks are called networks because they are built by composing together many different functions, and this composition is described by a directed acyclic graph.
The model is associated with a directed acyclic graph describing how the functions are composed together. For example, three functions f^(1), f^(2), and f^(3) might be connected in a chain to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model; it is from this terminology that the name deep learning arises. The final layer of a feedforward network is called the output layer. A short code sketch of this kind of composition appears after the following diagram.
Diagram showing various functions activated on input x to form a neural network
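To make the chain structure concrete, here is a minimal NumPy sketch that composes three layer functions in exactly this way. The layer widths and weights are made up for illustration and are not part of the book's example:

import numpy as np

rng = np.random.RandomState(0)

# Illustrative sizes: 4 input features, two hidden layers of width 8, 3 outputs.
W1, W2, W3 = rng.randn(4, 8), rng.randn(8, 8), rng.randn(8, 3)

f1 = lambda x: np.tanh(x @ W1)   # first layer
f2 = lambda h: np.tanh(h @ W2)   # second layer
f3 = lambda h: h @ W3            # output layer

x = rng.randn(1, 4)              # a single input example
print(f3(f2(f1(x))).shape)       # (1, 3) -- f(x) = f3(f2(f1(x)))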
These networks are called neural because they are inspired by neuroscience. Each hidden layer is a vector. The dimensionality of these hidden layers determines the width of the model.
Feedforward networks can be easily implemented using TensorFlow by defining placeholders for hidden layers, computing the activation values, and using them to calculate predictions. Let's take an example of classification with a feedforward network:
X = tf.placeholder("float", shape=[None, x_size])    # input features
y = tf.placeholder("float", shape=[None, y_size])    # target labels
weights_1 = initialize_weights((x_size, hidden_size), stddev)
weights_2 = initialize_weights((hidden_size, y_size), stddev)
sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))     # hidden layer activations
y_pred = tf.matmul(sigmoid, weights_2)               # predicted logits
Once the predicted value tensor has been defined, we calculate the cost function:
cost = tf.reduce_mean(tf.nn.OPERATION_NAME(labels=<actual value>, logits=<predicted value>))
updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
Here, OPERATION_NAME could be one of the following:
tf.nn.sigmoid_cross_entropy_with_logits: Computes sigmoid cross entropy given logits. Its signature is sigmoid_cross_entropy_with_logits(_sentinel=None, labels=None, logits=None, name=None).
_sentinel: Used to prevent positional parameters. Internal, do not use.
labels: A tensor of the same type and shape as logits.
logits: A tensor of type float32 or float64. With x = logits and z = labels, the formula implemented is max(x, 0) - x * z + log(1 + exp(-abs(x))); a quick numerical check of this identity follows the parameter list.
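The following NumPy sketch (with made-up logits and labels) verifies that this numerically stable formula matches the textbook sigmoid cross entropy -z * log(sigmoid(x)) - (1 - z) * log(1 - sigmoid(x)); it is only an illustration, not the TensorFlow implementation:

import numpy as np

x = np.array([-2.0, 0.5, 3.0])   # logits
z = np.array([0.0, 1.0, 1.0])    # labels

# Numerically stable form used by the op
stable = np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

# Textbook cross entropy on sigmoid probabilities
sig = 1.0 / (1.0 + np.exp(-x))
naive = -z * np.log(sig) - (1 - z) * np.log(1 - sig)

print(np.allclose(stable, naive))   # True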
tf.nn.softmax: Computes softmax activations. Its signature is softmax(logits, dim=-1, name=None), and the formula used is softmax = exp(logits) / reduce_sum(exp(logits), dim); a short sketch of this formula follows the parameter list.
logits: A non-empty tensor. Must be one of the following types--half, float32, or float64.
dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension.
name: A name for the operation (optional).
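Here is a minimal NumPy sketch of the softmax formula over the last dimension (the default dim of -1); the logits are illustrative values only:

import numpy as np

logits = np.array([[2.0, 1.0, 0.1]])
softmax = np.exp(logits) / np.sum(np.exp(logits), axis=-1, keepdims=True)
print(softmax)          # approximately [[0.659 0.242 0.099]]
print(softmax.sum())    # 1.0 -- each row is a probability distribution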
tf.nn.log_softmax: Computes the logarithm of the softmax function, log(softmax(logits)). Performing both steps in a single operation is numerically more stable than computing the softmax first and then taking its log; a short sketch of this follows the parameter list.
log_softmax( logits, dim=-1, name=None )
logits: A non-empty tensor. Must be one of the following types--half, float32, or float64.
dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension.
name: A name for the operation (optional).
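The sketch below (plain NumPy, with deliberately extreme logits) shows why the combined operation is preferred; mathematically, the stable form is logits - logsumexp(logits):

import numpy as np

logits = np.array([1000.0, 0.0, -1000.0])   # extreme values to force overflow

# Naive log(softmax): exp(1000) overflows, producing nan / -inf
naive = np.log(np.exp(logits) / np.sum(np.exp(logits)))

# Stable form: subtract the max before exponentiating (log-sum-exp trick)
m = logits.max()
stable = logits - (m + np.log(np.sum(np.exp(logits - m))))

print(naive)    # contains nan / -inf
print(stable)   # [0. -1000. -2000.]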
softmax_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, dim=-1, name=None )
_sentinel: Used to prevent positional parameters. For internal use only.
labels: Each row labels[i] must be a valid probability distribution.
logits: Unscaled log probabilities.
dim: The class dimension. Defaults to -1, which is the last dimension.
name: A name for the operation (optional).
The preceding operation computes softmax cross entropy between logits and labels. While the classes are mutually exclusive, their probabilities need not be; all that is required is that each row of labels is a valid probability distribution. For exclusive labels (where one and only one class is true at a time), use sparse_softmax_cross_entropy_with_logits.
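As a small TensorFlow 1.x-style illustration (the values are made up), note that the first labels row below is a soft distribution rather than one-hot, which this op accepts:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.2]])
# Each row sums to 1; the first row is a soft (non-one-hot) target.
labels = tf.constant([[0.7, 0.2, 0.1],
                      [0.0, 1.0, 0.0]])

loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(loss))   # one cross-entropy value per row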
sparse_softmax_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, name=None )
labels: A tensor of shape [d_0, d_1, ..., d_(r-1)] (where r is the rank of labels and result) and dtype int32 or int64. Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this operation is run on the CPU, and will return NaN for the corresponding loss and gradient rows on the GPU.
logits: Unscaled log probabilities of shape [d_0, d_1, ..., d_(r-1), num_classes] and dtype float32 or float64.
The preceding operation computes sparse softmax cross entropy between logits and labels. The probability of a given label is considered exclusive; soft classes are not allowed, and the labels vector must provide a single specific index for the true class for each row of logits.
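A short sketch (TensorFlow 1.x style, illustrative values) showing the integer-index labels this op expects; the losses match what the dense op would give for the one-hot rows [1, 0, 0] and [0, 1, 0]:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.2]])
labels = tf.constant([0, 1])   # one class index per row, each in [0, num_classes)

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                      logits=logits)

with tf.Session() as sess:
    print(sess.run(loss))   # one loss value per row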
weighted_cross_entropy_with_logits( targets, logits, pos_weight, name=None )
targets: A tensor of the same type and shape as logits.
logits: A tensor of type float32 or float64.
pos_weight: A coefficient to use on the positive examples.
This is similar to sigmoid_cross_entropy_with_logits(), except that pos_weight allows a trade-off of recall and precision by up- or down-weighting the cost of a positive error relative to a negative error.
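With q = pos_weight, the per-element loss is effectively targets * -log(sigmoid(logits)) * q + (1 - targets) * -log(1 - sigmoid(logits)). The following NumPy sketch (made-up values) simply evaluates that expression directly:

import numpy as np

logits = np.array([-1.0, 2.0, 0.5])
targets = np.array([1.0, 0.0, 1.0])
pos_weight = 3.0   # illustrative: penalize errors on positives three times more

sig = 1.0 / (1.0 + np.exp(-logits))
loss = (targets * -np.log(sig) * pos_weight
        + (1 - targets) * -np.log(1 - sig))
print(loss)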
Let's look at a feedforward example using the Iris dataset.
You can download the dataset from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/iris.csv and the target labels from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/target.csv.
The Iris dataset contains 150 rows of data, made up of 50 samples from each of three Iris species: Iris setosa, Iris virginica, and Iris versicolor.
Petal geometry compared across the three Iris species: Iris setosa, Iris virginica, and Iris versicolor.
In the dataset, each row contains data for each flower sample: sepal length, sepal width, petal length, petal width, and flower species. Flower species are stored as integers, with 0 denoting Iris setosa, 1 denoting Iris versicolor, and 2 denoting Iris virginica.
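Since the species column is an integer in {0, 1, 2}, the loading code below converts it into a one-hot target matrix with np.eye. A tiny illustration of that trick (the sample values here are invented):

import numpy as np

target = np.array([0, 1, 2, 1])   # species codes for four example rows
one_hot = np.eye(3)[target]       # one row per sample, one column per class
print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]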
First, we will create a run() function that takes three parameters: the hidden layer size h_size, the standard deviation for the weights stddev, and the step size for stochastic gradient descent sgd_step:
def run(h_size, stddev, sgd_step):
Input data loading is done using the genfromtxt function in NumPy. The Iris data loaded has a shape of (150, 4) and is stored in the all_X variable. Target labels are loaded from target.csv into all_y with a shape of (150, 3):
import numpy as np
from numpy import genfromtxt
# train_test_split comes from scikit-learn; RANDOMSEED is assumed to be
# defined elsewhere in the script.
from sklearn.model_selection import train_test_split

def load_iris_data():
    data = genfromtxt('iris.csv', delimiter=',')
    target = genfromtxt('target.csv', delimiter=',').astype(int)

    # Prepend the column of 1s for bias
    L, W = data.shape
    all_X = np.ones((L, W + 1))
    all_X[:, 1:] = data

    # One-hot encode the target labels
    num_labels = len(np.unique(target))
    all_y = np.eye(num_labels)[target]

    return train_test_split(all_X, all_y, test_size=0.33,
                            random_state=RANDOMSEED)
Once the data is loaded, we initialize the weight matrices based on x_size, y_size, and h_size, with the standard deviation passed to the run() method:
# Size of layers
x_size = train_x.shape[1]   # Input nodes: 4 features and 1 bias
y_size = train_y.shape[1]   # Outcomes (3 iris flowers)

# Variables
X = tf.placeholder("float", shape=[None, x_size])
y = tf.placeholder("float", shape=[None, y_size])
weights_1 = initialize_weights((x_size, h_size), stddev)
weights_2 = initialize_weights((h_size, y_size), stddev)
Next, we make the prediction using sigmoid as the activation function, defined in the forward_propagation() function:
def forward_propagation(X, weights_1, weights_2):
    sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1))
    y = tf.matmul(sigmoid, weights_2)
    return y
First, the sigmoid output is calculated from the input X and weights_1. This is then used to calculate y as a matrix multiplication of sigmoid and weights_2:
y_pred = forward_propagation(X, weights_1, weights_2)
predict = tf.argmax(y_pred, dimension=1)
Next, we define the cost function and optimization using gradient descent. Let's look at the GradientDescentOptimizer being used. It is defined in the tf.train.GradientDescentOptimizer class and implements the gradient descent algorithm.
To construct an instance, we use the following constructor and pass sgd_step as a parameter:
# Constructor for GradientDescentOptimizer
__init__(learning_rate, use_locking=False, name='GradientDescent')
Arguments passed are explained here:
learning_rate: A tensor or a floating point value; the learning rate to use.
use_locking: If True, use locks for the update operations.
name: An optional name prefix for the operations created when applying gradients. Defaults to 'GradientDescent'.
The following code implements the cost function and the gradient descent update:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_pred))
updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
Next, we create a TensorFlow session:
sess = tf.Session()
We store the accuracy for each step in a list so that we can plot a graph later:
init = tf.initialize_all_variables()
steps = 50
sess.run(init)
x = np.arange(steps)
test_acc = []
train_acc = []
print("Step, train accuracy, test accuracy")

for step in range(steps):
    # Train with each example
    for i in range(len(train_x)):
        sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1],
                                         y: train_y[i: i + 1]})

    train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                             sess.run(predict, feed_dict={X: train_x, y: train_y}))
    test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                            sess.run(predict, feed_dict={X: test_x, y: test_y}))

    print("%d, %.2f%%, %.2f%%"
          % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
    test_acc.append(100. * test_accuracy)
    train_acc.append(100. * train_accuracy)
Let's run this code for h_size of 128, standard deviation of 0.1, and sgd_step of 0.01:
def run(h_size, stddev, sgd_step):
    ...

def main():
    run(128, 0.1, 0.01)

if __name__ == '__main__':
    main()
The preceding code outputs the following graph, which plots the steps versus the test and train accuracy:
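The plotting code is not reproduced at this point. A minimal matplotlib sketch that could generate such a plot from the x, train_acc, and test_acc values collected above might look like this (the labels and styling are my own):

import matplotlib.pyplot as plt

# Assumes x, train_acc and test_acc were populated by the training loop above.
plt.plot(x, train_acc, label='train accuracy')
plt.plot(x, test_acc, label='test accuracy')
plt.xlabel('SGD step (epoch)')
plt.ylabel('accuracy (%)')
plt.legend()
plt.show()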
Let's compare the change in SGD step size and its effect on training accuracy. The following code is very similar to the previous example, but we rerun it for multiple SGD step values to see how the step size affects accuracy:
def run(h_size, stddev, sgd_steps):
    ....
    test_accs = []
    train_accs = []
    time_taken_summary = []
    for sgd_step in sgd_steps:
        start_time = time.time()
        updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
        sess = tf.Session()
        init = tf.initialize_all_variables()
        steps = 50
        sess.run(init)
        x = np.arange(steps)
        test_acc = []
        train_acc = []
        print("Step, train accuracy, test accuracy")
        for step in range(steps):
            # Train with each example
            for i in range(len(train_x)):
                sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1],
                                                 y: train_y[i: i + 1]})
            train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                     sess.run(predict,
                                              feed_dict={X: train_x, y: train_y}))
            test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                                    sess.run(predict,
                                             feed_dict={X: test_x, y: test_y}))
            print("%d, %.2f%%, %.2f%%"
                  % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
            test_acc.append(100. * test_accuracy)
            train_acc.append(100. * train_accuracy)
        end_time = time.time()
        diff = end_time - start_time
        time_taken_summary.append((sgd_step, diff))
        t = [np.array(test_acc)]
        t.append(train_acc)
        train_accs.append(train_acc)
The output of the preceding code will be an array with the training and test accuracy for each SGD step value. In our example, we call the run() function with sgd_steps set to [0.01, 0.02, 0.03]:
def main():
    sgd_steps = [0.01, 0.02, 0.03]
    run(128, 0.1, sgd_steps)

if __name__ == '__main__':
    main()
This is the plot showing how training accuracy changes with sgd_steps. For an SGD step value of 0.03, the network reaches higher accuracy faster because the step size is larger.
In this post, we built our first neural network, which was feedforward only, and used it to classify flowers in the Iris dataset.
You enjoyed a tutorial from the book Neural Network Programming with TensorFlow. To implement advanced neural networks such as CNNs, RNNs, GANs, and deep belief networks in TensorFlow, grab your copy today!