You're reading from Deep Learning for Natural Language Processing: Solve your natural language processing problems with smart deep neural networks

Product type Paperback

Published in Jun 2019

ISBN-13 9781838550295

Length 372 pages

Edition 1st Edition

Authors (4):

Karthiek Reddy Bokka

Monicah Wambugu

Tanuj Jain

Shubhangi Hora

Table of Contents (11) Chapters

About the Book

1. Introduction to Natural Language Processing FREE CHAPTER

2. Applications of Natural Language Processing

3. Introduction to Neural Networks

4. Foundations of Convolutional Neural Network

5. Recurrent Neural Networks

6. Gated Recurrent Units (GRUs)

7. Long Short-Term Memory (LSTM)

8. State-of-the-Art Natural Language Processing

9. A Practical NLP Project Workflow in an Organization

Appendix

Chapter 4: Introduction to convolutional networks

Activity 5: Sentiment Analysis on a real-life dataset

Solution:

Import the necessary classes.
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras import layers
from keras.preprocessing.sequence import pad_sequences
import numpy as np
import pandas as pd
Define your variables and parameters.
epochs = 20
maxlen = 100
embedding_dim = 50
num_filters = 64
kernel_size = 5
batch_size = 32
Import the data.
data = pd.read_csv('data/sentiment labelled sentences/yelp_labelled.txt',
                   names=['sentence', 'label'], sep='\t')
data.head()
Running this in a Jupyter notebook should display the first rows of the dataset:
Figure 4.27: Labelled dataset
Select the 'sentence' and 'label' columns.
sentences = data['sentence'].values
labels = data['label'].values
Split your data into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.30, random_state=1000)
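To make the effect of test_size=0.30 concrete, here is a minimal standard-library sketch of a shuffled hold-out split. The helper name split_indices and the row count of 1,000 are assumptions made for illustration, and the shuffle will not reproduce the exact permutation that scikit-learn produces for random_state=1000.

```python
import math
import random

def split_indices(n_samples, test_size=0.30, seed=1000):
    """Return (train_indices, test_indices) for a shuffled hold-out split.

    Hypothetical helper: it illustrates the idea of a seeded, shuffled
    split, not scikit-learn's actual implementation.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # seeded, so the split is repeatable
    n_test = math.ceil(test_size * n_samples)  # size of the held-out portion
    return indices[n_test:], indices[:n_test]

# Assumed dataset size, for illustration only.
train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters here for the same reason random_state does in the original call: the same rows land in the test set on every run, so accuracy figures stay comparable.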
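Tokenize the sentences. The tokenizer is fitted on the training set only, so the vocabulary is learned without looking at the test sentences.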
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
vocab_size = len(tokenizer.word_index) + 1  # Add 1 because index 0 is reserved
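The tokenization step can be pictured with a small pure-Python sketch. It is a simplified stand-in, not the Keras Tokenizer itself: the helpers simple_split, fit_word_index, and texts_to_ids are hypothetical, the punctuation handling is cruder than the library's, and the num_words cutoff (keep indices below num_words) is written here as an assumption about how the limit is applied.

```python
import string
from collections import Counter

def simple_split(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

def fit_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(simple_split(text))
    # most_common() sorts by count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words=5000):
    """Encode texts, dropping unknown words and indices >= num_words."""
    encoded = []
    for text in texts:
        ids = [word_index[w] for w in simple_split(text) if w in word_index]
        encoded.append([i for i in ids if i < num_words])
    return encoded

train_texts = ["The food was great", "The service was slow", "Great food"]
word_index = fit_word_index(train_texts)
encoded = texts_to_ids(["The food was terrible"], word_index)
vocab_size = len(word_index) + 1  # index 0 is never assigned to a word

print(word_index)
print(encoded)     # "terrible" was not seen during fitting, so it is dropped
print(vocab_size)
```

Because indices start at 1, an embedding table needs one more row than there are words, which is where the extra 1 in vocab_size comes from.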
Pad the sequences so that they all have the same length.
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
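What padding='post' does to each sequence can be sketched in a few lines of plain Python. The helper pad_post is hypothetical. It appends zeros after short sequences and, for sequences longer than maxlen, keeps the last maxlen tokens, which matches the library's default truncating='pre' that the calls above leave unchanged.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end; truncate long sequences from the front."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                        # keep the last maxlen tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

padded = pad_post([[4, 9], [1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(padded)  # [[4, 9, 0, 0, 0], [3, 4, 5, 6, 7]]
```

The padding value 0 is safe to use because the tokenizer never assigns index 0 to a word.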
Create the model. Because this is a binary classification task, the last layer uses a sigmoid activation and the loss is binary cross-entropy.
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(num_filters, kernel_size, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
The above code should yield the following model summary:
Figure 4.28: Model summary
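The parameter counts reported in a summary like this can be checked by hand. The sketch below works through the arithmetic for the layers defined above. The vocabulary size of 2,000 is a made-up value for illustration, since the real figure depends on the training sentences, and count_parameters is a hypothetical helper.

```python
def count_parameters(vocab_size, embedding_dim=50, num_filters=64,
                     kernel_size=5, dense_units=10):
    """Trainable parameters per layer for the model defined above."""
    counts = {
        # One embedding_dim-sized vector per vocabulary index.
        "embedding": vocab_size * embedding_dim,
        # Each filter spans kernel_size positions x embedding_dim channels, plus a bias.
        "conv1d": (kernel_size * embedding_dim + 1) * num_filters,
        # Global max pooling has no weights; it leaves num_filters values.
        "dense_hidden": (num_filters + 1) * dense_units,
        # A single sigmoid unit: dense_units weights plus a bias.
        "dense_output": dense_units + 1,
    }
    counts["total"] = sum(counts.values())
    return counts

counts = count_parameters(vocab_size=2000)  # assumed vocabulary size
conv_output_length = 100 - 5 + 1            # maxlen - kernel_size + 1, no padding
print(counts)
print(conv_output_length)
```

Only the embedding row depends on the data. The convolution and dense counts are fixed by the hyperparameters set at the start of the activity.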
The model can also be visualized as follows:
Figure 4.29: Model visualization
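The two central layers of this architecture, the 1D convolution and the global max pooling, can be illustrated with a small pure-Python sketch for a single filter. The helper names and the toy numbers are assumptions. The real layer applies 64 such filters to 50-dimensional embeddings and learns the kernel values during training.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of vectors (no padding), then apply ReLU.

    sequence: list of T vectors, each of dimension D
    kernel:   list of k vectors, each of dimension D
    Returns a list of T - k + 1 activations.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        total = bias
        for offset in range(k):
            window_vec = sequence[start + offset]
            total += sum(x * w for x, w in zip(window_vec, kernel[offset]))
        outputs.append(max(0.0, total))  # ReLU
    return outputs

def global_max_pool(feature_map):
    """Keep only the strongest activation of the filter over all positions."""
    return max(feature_map)

# Toy input: 6 positions, 2-dimensional "embeddings"; kernel size 3.
sequence = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1], [1, 0]]
kernel = [[1, 0], [0, 1], [-1, 0]]

feature_map = conv1d_valid(sequence, kernel)
pooled = global_max_pool(feature_map)
print(feature_map)  # [1.0, 1.0, 0.0, 0.0]
print(pooled)       # 1.0
```

Pooling over the whole sequence reduces each filter to one number, so the dense layers see a fixed-size vector regardless of where in the sentence a pattern occurred.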
Train and test the model.
model.fit(X_train, y_train,
          epochs=epochs,
          verbose=False,
          validation_data=(X_test, y_test),
          batch_size=batch_size)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
The accuracy output should be as follows: