Chapter 4: Neural Networks with NLP
Activity 4: Predict the Next Character in a Sequence
Solution
Import the libraries we need to solve the activity:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation, LeakyReLU
import numpy as np
Define the sequence of characters and repeat it 100 times:
char_seq = 'qwertyuiopasdfghjklñzxcvbnm' * 100
char_seq = list(char_seq)
Create a char2id dictionary to relate every character to an integer:
char2id = dict([(char, idx) for idx, char in enumerate(set(char_seq))])
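As an optional sanity check (not part of the original solution), you can print a few entries of the mapping. Because the dictionary is built from set(char_seq), the exact character-to-integer assignment can vary between runs:

# Inspect a few (character, id) pairs; the ids depend on set() ordering
for char, idx in list(char2id.items())[:5]:
    print('{} -> {}'.format(char, idx))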
Divide the sequence of characters into time series. The maximum length of each time series will be five, so we will have vectors of five characters. We also create the vector of upcoming characters, next_char. The y_labels variable is the size of our vocabulary; we will use this variable later:
maxlen = 5
sequences = []
next_char = []
for i in range(0, len(char_seq) - maxlen):
    sequences.append(char_seq[i:i + maxlen])
    next_char.append(char_seq[i + maxlen])
y_labels = len(char2id)
print("5 first sequences: {}".format(sequences[:5]))
print("5 first next characters: {}".format(next_char[:5]))
print("Total sequences: {}".format(len(sequences)))
print("Total output labels: {}".format(y_labels))
So far, we have the sequences variable, which is an array of arrays holding the time series of characters, and next_char, which is an array with the upcoming character for each sequence. Now, we need to encode these vectors, so let's define a method to one-hot encode an array of characters using the information in char2id:
def one_hot_encoder(seq, ids):
    encoded_seq = np.zeros([len(seq), len(ids)])
    for i, s in enumerate(seq):
        encoded_seq[i][ids[s]] = 1
    return encoded_seq
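To see what the encoder produces, here is a small illustrative check (assuming the char2id dictionary built above); each row is a vector of 27 zeroes with a single 1 at the character's index:

# Encode a two-character sequence; expect shape (2, 27)
sample = one_hot_encoder(['q', 'w'], char2id)
print(sample.shape)             # (2, 27)
print(sample[0][char2id['q']])  # 1.0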
Encode the variables into one-hot vectors. The resulting shapes are x = (2695, 5, 27) and y = (2695, 27):
x = np.array([one_hot_encoder(item, char2id) for item in sequences])
y = np.array(one_hot_encoder(next_char, char2id))
x = x.astype(np.int32)
y = y.astype(np.int32)
print("Shape of x: {}".format(x.shape))
print("Shape of y: {}".format(y.shape))
Split the data into train and test sets. To do this, we are going to use the train_test_split method from sklearn, passing shuffle=False to preserve the temporal order of the sequences:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=False)
print('x_train shape: {}'.format(x_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('x_test shape: {}'.format(x_test.shape))
print('y_test shape: {}'.format(y_test.shape))
With the data ready to be fed into the neural network, create a Sequential model with two layers:
First layer: an LSTM with eight neurons (the activation is tanh). input_shape is the maximum length of the sequences and the size of the vocabulary, so, because of the shape of our data, we do not need to reshape anything.
Second layer: Dense with 27 neurons (one per vocabulary label). This is how we successfully complete the activity. Using a LeakyReLU activation will give you a good score. But why? Our one-hot targets are mostly zeroes, so the network could fail by simply returning a vector of zeroes. The small negative slope of LeakyReLU prevents this problem:
model = Sequential()
model.add(LSTM(8, input_shape=(maxlen, y_labels)))
model.add(Dense(y_labels))
model.add(LeakyReLU(alpha=.01))
model.compile(loss='mse', optimizer='rmsprop')
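If you want to verify the layer shapes described above before training, model.summary() prints each layer's output shape and parameter count:

# Confirm the (maxlen, y_labels) input and the 27-unit output
model.summary()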
Train the model with a batch_size of 32 for 25 epochs:
history = model.fit(x_train, y_train, batch_size=32, epochs=25, verbose=1)
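Optionally, you can plot the training loss recorded in the history object to confirm that the MSE decreases over the epochs. This sketch assumes matplotlib is installed:

import matplotlib.pyplot as plt

# history.history['loss'] holds one MSE value per epoch
plt.plot(history.history['loss'])
plt.xlabel('Epoch')
plt.ylabel('Training MSE')
plt.show()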
Compute the error of your model on the test data:
print('MSE: {:.5f}'.format(model.evaluate(x_test, y_test)))
Predict the test data and check the average percentage of hits. With this model, you will obtain a hit average of more than 90%:
prediction = model.predict(x_test)
errors = 0
for pr, res in zip(prediction, y_test):
    if not np.array_equal(np.around(pr), res):
        errors += 1
print("Errors: {}".format(errors))
print("Hits: {}".format(len(prediction) - errors))
print("Hit average: {}".format((len(prediction) - errors)/len(prediction)))
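Note that np.around is a strict criterion: a prediction counts as a hit only if every rounded component matches the label exactly. A common, more lenient variation (not part of the original solution) is to compare the argmax of each prediction with the argmax of its label:

# Count a hit whenever the most likely character matches the label
hits = sum(np.argmax(pr) == np.argmax(res) for pr, res in zip(prediction, y_test))
print("Argmax hit average: {}".format(hits / len(prediction)))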
To end this activity, we need to create a function that accepts a sequence of characters and returns the next predicted value. To decode the prediction of the model, we first code a decode method. This method searches for the highest value in the prediction and takes the corresponding key character from the char2id dictionary:
def decode(vec):
    val = np.argmax(vec)
    return list(char2id.keys())[list(char2id.values()).index(val)]
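An equivalent and arguably clearer approach (an optional variation, not the book's code) is to build the inverse dictionary once instead of searching char2id on every call:

# id2char is the inverse mapping of char2id, built once
id2char = {idx: char for char, idx in char2id.items()}

def decode_v2(vec):
    return id2char[np.argmax(vec)]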
Create a method to predict the next character in a given sentence:
def pred_seq(seq):
    seq = list(seq)
    x = one_hot_encoder(seq, char2id)
    x = np.expand_dims(x, axis=0)
    prediction = model.predict(x, verbose=0)
    return decode(list(prediction[0]))
Finally, pass the sequence 'tyuio' to predict the upcoming character. It will return 'p':
pred_seq('tyuio')
Congratulations! You have finished the activity. You can now predict the next value of a temporal sequence. This is also very important in finance, for example, when predicting future prices or stock movements.
You can change the data and predict whatever you want. If you train on a linguistic corpus, you will generate text from your own RNN language model, so our future conversational agent could generate poems or news text.
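As a pointer toward that idea, here is a minimal generation loop built on the pred_seq function defined above. It repeatedly appends the predicted character and slides the five-character window forward (a sketch on this activity's toy vocabulary, not part of the original solution):

def generate(seed, n_chars):
    # seed must contain at least maxlen (5) characters from the vocabulary
    text = seed
    for _ in range(n_chars):
        text += pred_seq(text[-maxlen:])
    return text

print(generate('tyuio', 10))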