The model is trained on the bot.aiml file from the A.L.I.C.E. Artificial Intelligence Foundation dataset. AIML (Artificial Intelligence Markup Language) is an XML-based dialect in which questions and answers are mapped: for each question pattern there is a corresponding answer. The complete set of .aiml files is available as aiml-en-us-foundation-alice.v1-9 from https://code.google.com/archive/p/aiml-en-us-foundation-alice/downloads. Unzip the archive, open the bot.aiml file in Notepad, and save it as bot.txt so it can be read in Python:
>>> import os
""" First change the following directory link to where all input files do exist """
>>> os.chdir("C:UsersprataDocumentsbook_codesNLP_DL")
>>> import numpy as np
>>> import pandas as pd
# File reading
>>> with open('bot.txt', 'r') as content_file:
... botdata = content_file.read()
>>> Questions = []
>>> Answers = []
AIML files have a unique, XML-like syntax. The <pattern> tag holds the question and the <template> tag holds the answer, so we extract each of them in turn:
>>> for line in botdata.split("</pattern>"):
... if "<pattern>" in line:
... Quesn = line[line.find("<pattern>")+len("<pattern>"):]
... Questions.append(Quesn.lower())
>>> for line in botdata.split("</template>"):
... if "<template>" in line:
... Ans = line[line.find("<template>")+len("<template>"):]
... Answers.append(Ans.lower())
>>> QnAdata = pd.DataFrame(np.column_stack([Questions,Answers]),columns = ["Questions","Answers"])
>>> QnAdata["QnAcomb"] = QnAdata["Questions"]+" "+QnAdata["Answers"]
>>> print(QnAdata.head())
The questions and answers are joined so that we can extract the total vocabulary used in the modeling, since every word/character has to be converted into a numeric representation. The reason is the same as mentioned before: deep learning models cannot read English; everything has to be in numbers for the model.
After extracting the question-and-answer pairs, the following steps are needed to process the data and produce the results.
First, the questions and answers are used to build a word-to-index vocabulary mapping, which will later be used to convert words into vectors:
# Creating Vocabulary
>>> import nltk
>>> nltk.download('punkt')   # tokenizer data required for nltk.word_tokenize
>>> import collections
>>> counter = collections.Counter()
>>> for i in range(len(QnAdata)):
... for word in nltk.word_tokenize(QnAdata["QnAcomb"].iloc[i]):
... counter[word]+=1
>>> word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())}
>>> idx2word = {v:k for k,v in word2idx.items()}
>>> idx2word[0] = "PAD"
>>> vocab_size = len(word2idx)+1
>>> print (vocab_size)
Encoding and decoding functions are used to convert text to indices and indices back to text, respectively. As we know, deep learning models work on numeric values rather than text or character data:
>>> def encode(sentence, maxlen,vocab_size):
... indices = np.zeros((maxlen, vocab_size))
... for i, w in enumerate(nltk.word_tokenize(sentence)):
... if i == maxlen: break
... indices[i, word2idx[w]] = 1
... return indices
>>> def decode(indices, calc_argmax=True):
... if calc_argmax:
... indices = np.argmax(indices, axis=-1)
... return ' '.join(idx2word[x] for x in indices)
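As a quick sanity check (this snippet is an addition, not part of the original recipe), you can round-trip one of the extracted questions through the two functions; unused positions decode to "PAD":
# Round-trip check of encode/decode (illustrative only)
>>> sample_question = Questions[0]
>>> sample_onehot = encode(sample_question, maxlen=10, vocab_size=vocab_size)
>>> print (sample_onehot.shape)   # (10, vocab_size) one-hot matrix
>>> print (decode(sample_onehot)) # original tokens, padded with "PAD"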
The following code vectorizes the questions and answers with a given maximum length for each. The two lengths can differ: in some pairs the question is longer than the answer, and in others it is shorter. Ideally, the questions should be long enough to carry the signal needed to produce the right answers. Unfortunately, in this dataset the questions are much shorter than the answers, which makes it a poor example for developing generative models:
>>> question_maxlen = 10
>>> answer_maxlen = 20
>>> def create_questions(question_maxlen,vocab_size):
... question_idx = np.zeros(shape=(len(Questions),question_maxlen, vocab_size))
... for q in range(len(Questions)):
... question = encode(Questions[q],question_maxlen,vocab_size)
... question_idx[q] = question
... return question_idx
>>> quesns_train = create_questions(question_maxlen=question_maxlen, vocab_size=vocab_size)
>>> def create_answers(answer_maxlen,vocab_size):
... answer_idx = np.zeros(shape=(len(Answers),answer_maxlen, vocab_size))
... for q in range(len(Answers)):
... answer = encode(Answers[q],answer_maxlen,vocab_size)
... answer_idx[q] = answer
... return answer_idx
>>> answs_train = create_answers(answer_maxlen=answer_maxlen,vocab_size= vocab_size)
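Before building the model, it can help to verify the shapes of the vectorized arrays; this quick check is an addition to the original code:
# Shape check (added): both arrays should be (num_pairs, maxlen, vocab_size)
>>> print (quesns_train.shape, answs_train.shape)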
>>> from keras.layers import Input,Dense,Dropout,Activation
>>> from keras.models import Model
>>> from keras.layers.recurrent import LSTM
>>> from keras.layers.wrappers import Bidirectional
>>> from keras.layers import RepeatVector, TimeDistributed, ActivityRegularization
The following code is an important part of the chatbot. Here we use a recurrent network, a repeat vector, and a time-distributed layer. The RepeatVector layer repeats the encoder's output once per output timestep so that the input and output dimensions match, while the TimeDistributed(Dense) layer maps the hidden vector at every timestep to the output vocabulary size:
>>> n_hidden = 128
>>> question_layer = Input(shape=(question_maxlen,vocab_size))
>>> encoder_rnn = LSTM(n_hidden, dropout=0.2, recurrent_dropout=0.2)(question_layer)
>>> repeat_encode = RepeatVector(answer_maxlen)(encoder_rnn)
>>> dense_layer = TimeDistributed(Dense(vocab_size))(repeat_encode)
>>> regularized_layer = ActivityRegularization(l2=1)(dense_layer)
>>> softmax_layer = Activation('softmax')(regularized_layer)
>>> model = Model([question_layer],[softmax_layer])
>>> model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
>>> print (model.summary())
The following model summary describes how the tensor shapes change as data flows through the model: the input layer matches the question's dimensions and the output layer matches the answer's dimensions:
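The actual summary printout is not reproduced here, but given the layers defined above, the shape flow should look roughly as follows (vocab_size depends on your data; parameter counts are omitted):
# Approximate shape flow (sketch, not the actual model.summary() output):
# Input (question_layer):            (None, 10, vocab_size)
# LSTM (encoder_rnn):                (None, 128)
# RepeatVector (repeat_encode):      (None, 20, 128)
# TimeDistributed(Dense):            (None, 20, vocab_size)
# ActivityRegularization/Activation: (None, 20, vocab_size)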
# Model Training
>>> quesns_train_2 = quesns_train.astype('float32')
>>> answs_train_2 = answs_train.astype('float32')
>>> model.fit(quesns_train_2, answs_train_2,batch_size=32,epochs=30, validation_split=0.05)
The results in the following screenshot are a bit tricky to interpret: even though the accuracy is significantly high, the chatbot model can still produce complete nonsense, because most of the target words are padding and the dataset itself contains relatively few words:
# Model prediction
>>> ans_pred = model.predict(quesns_train_2[0:3])
>>> print (decode(ans_pred[0]))
>>> print (decode(ans_pred[1]))
The following screenshot depicts the sample output on test data. The output does not seem to make much sense, which is a common issue with generative models trained on so little data:
Our model did not work well in this case, but there are still some areas of improvement possible for generative chatbot models going forward. Readers can give these a try:
Use a dataset with lengthier questions and answers so the model can catch the signal well
Create a larger deep learning architecture and train it over more iterations (see the sketch after this list)
Make the question-and-answer pairs more generic and conversational rather than factoid-based (for example, knowledge retrieval), where generative models fail miserably
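For instance, a larger architecture could stack LSTM layers and make the encoder bidirectional (the Bidirectional wrapper is already imported above). The following is one possible sketch, not code from the original recipe; the layer size of 256 units and the extra decoder LSTM are illustrative assumptions:
# A possible larger encoder-decoder variant (illustrative sketch)
>>> deep_question_layer = Input(shape=(question_maxlen, vocab_size))
>>> deep_encoder = Bidirectional(LSTM(256, return_sequences=True, dropout=0.2))(deep_question_layer)
>>> deep_encoder = LSTM(256, dropout=0.2)(deep_encoder)
>>> deep_repeat = RepeatVector(answer_maxlen)(deep_encoder)
>>> deep_decoder = LSTM(256, return_sequences=True, dropout=0.2)(deep_repeat)
>>> deep_dense = TimeDistributed(Dense(vocab_size))(deep_decoder)
>>> deep_softmax = Activation('softmax')(deep_dense)
>>> deeper_model = Model([deep_question_layer], [deep_softmax])
>>> deeper_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])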
Here, you saw how to build a chatbot using LSTMs. You can go ahead and try building a generative chatbot of your own using the example above.
If you found this post useful, do check out this book Natural Language Processing with Python Cookbook to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more.