Deep Learning for Natural Language Processing

Chapter 7: Foundations of LSTM

Activity 9: Build a Spam or Ham classifier using a Simple RNN

Solution:

  1. Import required Python packages

    import pandas as pd

    import numpy as np

    from keras.models import Model, Sequential

    from keras.layers import SimpleRNN, Dense,Embedding

    from keras.preprocessing.text import Tokenizer

    from keras.preprocessing import sequence
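
    If the standalone keras package is not available, the same classes can be imported through tf.keras instead; this is a minimal alternative, assuming a TensorFlow 2.x installation (not part of the original solution):

    # Alternative imports via tf.keras (assumes TensorFlow 2.x is installed)
    from tensorflow.keras.models import Model, Sequential
    from tensorflow.keras.layers import SimpleRNN, Dense, Embedding
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing import sequence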

  2. Read the input file, which contains one column with the message text and another column with a label indicating whether each message is spam or ham.

    df = pd.read_csv("drive/spam.csv", encoding="latin")

    df.head()

    The output is as follows:

    Figure 7.35: Input data file
  3. Select the label (v1) and text (v2) columns from the input data.

    df = df[["v1","v2"]]

    df.head()

    The output is as follows:

    Figure 7.36: Labelled input data
  4. Count the spam and ham messages in the v1 column.

    df["v1"].value_counts()

    The output is as follows:

    Figure 7.37: Value counts for spam or ham
  5. Extract X as the feature (the message text) and Y as the target (the numeric label).

    lab_map = {"ham":0, "spam":1}

    X = df["v2"].values

    Y = df["v1"].map(lab_map).values

  6. Convert to sequences and pad the sequences.

    max_words = 100

    mytokenizer = Tokenizer(num_words=max_words, lower=True, split=" ")  # num_words replaces the older nb_words argument

    mytokenizer.fit_on_texts(X)

    text_tokenized = mytokenizer.texts_to_sequences(X)

    text_tokenized

    The output is as follows:

    Figure 7.38: Tokenized data
  7. Pad the tokenized sequences to a fixed length.

    max_len = 50

    sequences = sequence.pad_sequences(text_tokenized,maxlen=max_len)

    sequences

  8. Build the model

    model = Sequential()

    model.add(Embedding(max_words, 20, input_length=max_len))

    model.add(SimpleRNN(64))

    model.add(Dense(1, activation="sigmoid"))

    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    model.fit(sequences, Y, batch_size=128, epochs=10,
              validation_split=0.2)

  9. Predict the mail category on new test data.

    inp_test_seq = "WINNER! U win a 500 prize reward & free entry to FA cup final tickets! Text FA to 34212 to receive award"

    test_sequences = mytokenizer.texts_to_sequences(np.array([inp_test_seq]))

    test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)

    model.predict(test_sequences_matrix)

    The output is as follows:

Figure 7.39: Output for new test data
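
The model outputs a sigmoid probability rather than a class label. As a minimal sketch (the 0.5 threshold and variable names are illustrative, not from the original solution), this probability can be mapped back to a spam/ham label:

    # Illustrative only: convert the sigmoid probability into a label
    prob = model.predict(test_sequences_matrix)[0][0]
    predicted_label = "spam" if prob > 0.5 else "ham"
    print(prob, predicted_label)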

Activity 10: Create a French to English translation model

Solution:

  1. Import the necessary Python packages and classes.

    import os

    import re

    import numpy as np

  2. Read the file in sentence pairs.

    with open("fra.txt", 'r', encoding='utf-8') as f:

    lines = f.read().split('\n')

    num_samples = 20000 # Using only 20000 pairs for this example

    lines_to_use = lines[: min(num_samples, len(lines) - 1)]

  3. Remove the \u202f character and replace digits with the NUMBER_PRESENT token.

    for l in range(len(lines_to_use)):
        lines_to_use[l] = re.sub("\u202f", "", lines_to_use[l])

    for l in range(len(lines_to_use)):
        lines_to_use[l] = re.sub("\d", " NUMBER_PRESENT ", lines_to_use[l])

  4. Append the 'BEGIN_' and '_END' tokens to the target sequences and collect the input and target vocabularies.

    input_texts = []

    target_texts = []

    input_words = set()

    target_words = set()

    for line in lines_to_use:
        target_text, input_text = line.split('\t')
        target_text = 'BEGIN_ ' + target_text + ' _END'
        input_texts.append(input_text)
        target_texts.append(target_text)
        for word in input_text.split():
            if word not in input_words:
                input_words.add(word)
        for word in target_text.split():
            if word not in target_words:
                target_words.add(word)

    max_input_seq_length = max([len(i.split()) for i in input_texts])

    max_target_seq_length = max([len(i.split()) for i in target_texts])

    input_words = sorted(list(input_words))

    target_words = sorted(list(target_words))

    num_encoder_tokens = len(input_words)

    num_decoder_tokens = len(target_words)

  5. Define encoder-decoder inputs.

    input_token_index = dict(
        [(word, i) for i, word in enumerate(input_words)])

    target_token_index = dict(
        [(word, i) for i, word in enumerate(target_words)])

    encoder_input_data = np.zeros(
        (len(input_texts), max_input_seq_length),
        dtype='float32')

    decoder_input_data = np.zeros(
        (len(target_texts), max_target_seq_length),
        dtype='float32')

    decoder_target_data = np.zeros(
        (len(target_texts), max_target_seq_length, num_decoder_tokens),
        dtype='float32')

    for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
        for t, word in enumerate(input_text.split()):
            encoder_input_data[i, t] = input_token_index[word]
        for t, word in enumerate(target_text.split()):
            decoder_input_data[i, t] = target_token_index[word]
            if t > 0:
                # decoder_target_data is ahead of decoder_input_data by one timestep
                decoder_target_data[i, t - 1, target_token_index[word]] = 1.
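
    As a quick sanity check (not part of the original solution), the shapes of these arrays can be printed: the encoder and decoder inputs are 2-D matrices of word indices, while the decoder target is a 3-D one-hot tensor shifted one timestep ahead of the decoder input.

    # Illustrative check of the prepared arrays; exact sizes depend on the loaded data
    print(encoder_input_data.shape)   # (number of pairs, max_input_seq_length)
    print(decoder_input_data.shape)   # (number of pairs, max_target_seq_length)
    print(decoder_target_data.shape)  # (number of pairs, max_target_seq_length, num_decoder_tokens)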

  6. Build the model.

    from keras.layers import Input, LSTM, Embedding, Dense

    from keras.models import Model

    embedding_size = 50

  7. Define the encoder for training.

    encoder_inputs = Input(shape=(None,))

    encoder_after_embedding = Embedding(num_encoder_tokens, embedding_size)(encoder_inputs)

    encoder_lstm = LSTM(50, return_state=True)

    _, state_h, state_c = encoder_lstm(encoder_after_embedding)

    encoder_states = [state_h, state_c]

  8. Define the decoder for training.

    decoder_inputs = Input(shape=(None,))

    decoder_after_embedding = Embedding(num_decoder_tokens, embedding_size)(decoder_inputs)

    decoder_lstm = LSTM(50, return_sequences=True, return_state=True)

    decoder_outputs, _, _ = decoder_lstm(decoder_after_embedding,
                                         initial_state=encoder_states)

    decoder_dense = Dense(num_decoder_tokens, activation='softmax')

    decoder_outputs = decoder_dense(decoder_outputs)

  9. Define the final model.

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

    model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

    model.fit([encoder_input_data, decoder_input_data],
              decoder_target_data,
              batch_size=128,
              epochs=20,
              validation_split=0.05)
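
    Training 20,000 sentence pairs for 20 epochs can take a while, so it may be worth persisting the trained model; this is an optional sketch, and the filename is arbitrary rather than part of the original solution:

    # Optional: save the trained seq2seq model so it can be reloaded for inference later
    model.save("fra_eng_seq2seq.h5")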

  10. Define the inference models for the encoder and decoder.

    # encoder part

    encoder_model = Model(encoder_inputs, encoder_states)

    # decoder part

    decoder_state_input_h = Input(shape=(50,))

    decoder_state_input_c = Input(shape=(50,))

    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

    decoder_outputs_inf, state_h_inf, state_c_inf = decoder_lstm(decoder_after_embedding, initial_state=decoder_states_inputs)

    decoder_states_inf = [state_h_inf, state_c_inf]

    decoder_outputs_inf = decoder_dense(decoder_outputs_inf)

    decoder_model = Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs_inf] + decoder_states_inf)

  11. Build reverse-lookup token indices to decode sequences and start defining the decode_sequence function, whose body is completed in steps 12 to 19.

    reverse_input_word_index = dict(
        (i, word) for word, i in input_token_index.items())

    reverse_target_word_index = dict(
        (i, word) for word, i in target_token_index.items())

    def decode_sequence(input_seq):

  12. Encode input as a state vector

        states_value = encoder_model.predict(input_seq)

  13. Generate empty target sequence of length 1.

        target_seq = np.zeros((1,1))

  14. Populate the first character of target sequence with the start character.

        target_seq[0, 0] = target_token_index['BEGIN_']

  15. Sampling loop for a batch of sequences

        stop_condition = False
        decoded_sentence = ''

        while not stop_condition:
            output_tokens, h, c = decoder_model.predict(
                [target_seq] + states_value)

  16. Sample a token.

            sampled_token_index = np.argmax(output_tokens)
            sampled_word = reverse_target_word_index[sampled_token_index]
            decoded_sentence += ' ' + sampled_word

  17. Exit condition: either hit max length or find stop character.

            if (sampled_word == '_END' or
                    len(decoded_sentence) > 60):
                stop_condition = True

  18. Update the target sequence (of length 1).

            target_seq = np.zeros((1,1))
            target_seq[0, 0] = sampled_token_index

  19. Update states

            states_value = [h, c]

        return decoded_sentence

  20. Run inference on user input: take in a sentence, encode it word by word, and decode the translation.

    text_to_translate = "Où est ma voiture??"

    encoder_input_to_translate = np.zeros(
        (1, max_input_seq_length),
        dtype='float32')

    for t, word in enumerate(text_to_translate.split()):
        encoder_input_to_translate[0, t] = input_token_index[word]

    decode_sequence(encoder_input_to_translate)

    The output is as follows:

Figure 7.47: French to English translator
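
Note that the lookup in step 20 may raise a KeyError for any word that does not appear in the training vocabulary (including words with attached punctuation, such as "voiture??"). Purely as an illustration outside the original solution, a hypothetical helper can skip unknown words before encoding:

    # Illustrative only: encode user input while skipping out-of-vocabulary words
    def encode_user_input(text):
        encoded = np.zeros((1, max_input_seq_length), dtype='float32')
        known_words = [w for w in text.split() if w in input_token_index]
        for t, word in enumerate(known_words[:max_input_seq_length]):
            encoded[0, t] = input_token_index[word]
        return encoded

    decode_sequence(encode_user_input("Où est ma voiture ?"))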