You're reading from Python Deep Learning Cookbook

Product type Book

Published in Oct 2017

Publisher Packt

ISBN-13 9781787125193

Pages 330 pages

Edition 1st Edition

Languages

Python

Concepts

Deep Learning

Author (1):

Indra den Bakker

Identifying speakers with voice recognition

Next to speech recognition, there is more we can do with sound fragments. While speech recognition focuses on converting speech (spoken words) to digital data, we can also use sound fragments to identify the person who is speaking. This is also known as voice recognition. Every individual has different characteristics when speaking, caused by differences in anatomy and behavioral patterns. Speaker verification and speaker identification are getting more attention in this digital age. For example, a home digital assistant can automatically detect which person is speaking.

In the following recipe, we'll be using the same data as in the previous recipe, where we implemented a speech recognition pipeline. However, this time, we will be classifying the speakers of the spoken numbers.

How to do it...

In this recipe, we start by importing all libraries:

import glob
import numpy as np
import random
import librosa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

import keras
from keras.layers import LSTM, Dense, Dropout, Flatten
from keras.models import Sequential
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint

Let's set SEED and the location of the .wav files:

SEED = 2017
DATA_DIR = 'Data/spoken_numbers_pcm/'

Let's split the .wav files into a training set and a validation set with scikit-learn's train_test_split function:

files = glob.glob(DATA_DIR + "*.wav")
X_train, X_val = train_test_split(files, test_size=0.2, random_state=SEED)

print('# Training examples: {}'.format(len(X_train)))
print('# Validation examples: {}'.format(len(X_val)))

To extract and print all unique labels, we use the following code:

labels = []
for i in range(len(X_train)):
    label = X_train[i].split('/')[-1].split('_')[1]
    if label not in labels:
        labels.append(label)
print(labels)
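
The label extraction above depends on how the files are named. As a minimal sketch, assuming file names of the form digit_speaker_rate.wav, the same parsing can be written with os.path.basename so that it follows the platform's path separator. The example paths and the helper name speaker_from_path are hypothetical and only serve as an illustration:

```python
import os

def speaker_from_path(path):
    """Return the second '_'-separated field of the file name (assumed to be the speaker)."""
    file_name = os.path.basename(path)
    return file_name.split('_')[1]

# Hypothetical example paths that follow the assumed naming pattern
example_files = [
    'Data/spoken_numbers_pcm/5_Alice_260.wav',
    'Data/spoken_numbers_pcm/3_Bob_140.wav',
    'Data/spoken_numbers_pcm/9_Alice_100.wav',
]

# A sorted set gives the unique labels in a reproducible order
unique_speakers = sorted({speaker_from_path(f) for f in example_files})
print(unique_speakers)
```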

We can now define our one_hot_encode function as follows:

label_binarizer = LabelBinarizer()
label_binarizer.fit(list(set(labels)))

def one_hot_encode(x): return label_binarizer.transform(x)
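
To make clear what the encoder produces, here is a small pure-Python sketch of one-hot encoding. It is not scikit-learn's implementation: the function one_hot and the speaker names are made up for illustration. It does mirror one property of LabelBinarizer, namely that the columns follow the sorted order of the class labels:

```python
def one_hot(label, classes):
    """Return a one-hot list for label, with columns in sorted class order."""
    ordered = sorted(classes)
    return [1 if c == label else 0 for c in ordered]

speakers = ['Carol', 'Alice', 'Bob']  # hypothetical speaker names
encoded = one_hot('Bob', speakers)
print(encoded)  # [0, 1, 0]
```

Note that with exactly two classes, LabelBinarizer returns a single column instead of two, so a softmax output with n_classes units assumes that there are at least three speakers.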

Before we can feed the data to our network, some preprocessing needs to be done. We use the following settings:

n_features = 20
max_length = 80
n_classes = len(labels)

We can now define our batch generator. The generator handles all preprocessing tasks, such as reading a .wav file and transforming it into usable input:

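def batch_generator(data, batch_size=16):
    while True:
        # Reshuffle so that every batch is a fresh random sample of the files
        random.shuffle(data)
        X, y = [], []
        for i in range(batch_size):
            wav = data[i]
            wave, sr = librosa.load(wav, mono=True)
            # The speaker is the second '_'-separated field of the file name
            label = wav.split('/')[-1].split('_')[1]
            # transform() expects a list and returns a 2D array; keep its single row
            y.append(one_hot_encode([label])[0])
            mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_features)
            # Zero-pad the time axis to max_length frames
            # (this assumes that no clip is longer than max_length frames)
            mfcc = np.pad(mfcc, ((0, 0), (0, max_length - len(mfcc[0]))),
                          mode='constant', constant_values=0)
            X.append(np.array(mfcc))
        yield np.array(X), np.array(y)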

Note

Please note the difference in our batch generator compared to the previous recipe.
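
The padding step in the generator only works when a clip has at most max_length frames, because np.pad rejects a negative pad width. The following sketch shows one way to force every feature matrix to the same width by truncating longer clips. The helper pad_or_truncate is hypothetical, and arrays of ones stand in for real MFCC matrices:

```python
import numpy as np

def pad_or_truncate(features, max_length):
    """Force a (n_features, n_frames) array to exactly max_length frames."""
    n_frames = features.shape[1]
    if n_frames >= max_length:
        # Keep only the first max_length frames
        return features[:, :max_length]
    # Append zero-valued frames on the right of the time axis
    return np.pad(features, ((0, 0), (0, max_length - n_frames)),
                  mode='constant', constant_values=0)

short_clip = np.ones((20, 35))   # stand-in for the MFCCs of a short clip
long_clip = np.ones((20, 120))   # stand-in for a clip longer than max_length

padded = pad_or_truncate(short_clip, 80)
truncated = pad_or_truncate(long_clip, 80)
print(padded.shape, truncated.shape)
```

Whether truncation is acceptable depends on the data. For short spoken digits it is likely harmless, but that is an assumption and not something the recipe states.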

Let's define the hyperparameters before defining our network architecture:

learning_rate = 0.001
batch_size = 64
n_epochs = 50
dropout = 0.5

input_shape = (n_features, max_length)
steps_per_epoch = 50
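
Because the data comes from a generator, an epoch is defined by the number of steps and not by the size of the dataset. As a quick check of the arithmetic, using the validation batch size of 32 and the 5 validation steps that are passed to the training call later in this recipe:

```python
batch_size = 64
steps_per_epoch = 50
validation_batch_size = 32
validation_steps = 5

# Number of examples drawn per epoch for training and validation
samples_per_epoch = batch_size * steps_per_epoch
validation_samples = validation_batch_size * validation_steps
print(samples_per_epoch, validation_samples)  # 3200 160
```

These are draws, not unique files: the generator reshuffles the complete file list before every batch, so the same file can appear more than once within an epoch.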

The network architecture we will use is quite straightforward. We will stack a dense layer on top of an LSTM layer, as follows:

model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=input_shape,
               dropout=dropout))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(dropout))
model.add(Dense(n_classes, activation='softmax'))
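
As a sanity check against the model summary, the parameter counts of the first layers can be worked out by hand. The sketch below assumes a standard LSTM layer with bias terms, and the helper names lstm_params and dense_params are introduced only for illustration. With input_shape = (20, 80), the LSTM reads 20 time steps of 80 values each:

```python
def lstm_params(units, input_dim):
    # Four weight blocks (three gates and the cell candidate), each with
    # input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(n_inputs, n_outputs):
    # One weight per input-output pair plus one bias per output
    return n_inputs * n_outputs + n_outputs

n_features, max_length = 20, 80

lstm = lstm_params(units=256, input_dim=max_length)
flat = n_features * 256   # return_sequences=True keeps all 20 time steps
dense = dense_params(flat, 128)
print(lstm, flat, dense)  # 345088 5120 655488
```

The size of the output layer depends on n_classes, which is only known after the labels have been extracted from the data.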

Next, we set the optimizer and loss function, compile the model, and print a summary of our model:

opt = Adam(lr=learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])
model.summary()

To prevent overfitting, we will be using early stopping and automatically store the model that has the highest validation accuracy:

callbacks = [ModelCheckpoint('checkpoints/voice_recognition_best_model_{epoch:02d}.hdf5',
                             monitor='val_acc', save_best_only=True),
             EarlyStopping(monitor='val_acc', patience=2)]
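
The effect of patience=2 can be illustrated with a simplified pure-Python sketch of the stopping rule. This is not the Keras implementation (it ignores options such as min_delta), and the accuracy values are made up for the example:

```python
def early_stopping_epoch(val_accuracies, patience=2):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float('-inf')
    wait = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            # Improvement: remember the new best value and reset the counter
            best = acc
            wait = 0
        else:
            # No improvement: stop once this has happened `patience` times in a row
            wait += 1
            if wait >= patience:
                return epoch
    return None

made_up_history = [0.50, 0.60, 0.58, 0.59, 0.70]
stop_epoch = early_stopping_epoch(made_up_history)
print(stop_epoch)  # 3
```

In this made-up run, training stops before the jump to 0.70 is ever reached, which shows that a small patience value can end training during a temporary plateau.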

We are ready to start training and we will store the results in history:

history = model.fit_generator(
    generator=batch_generator(X_train, batch_size),
    steps_per_epoch=steps_per_epoch,
    epochs=n_epochs,
    verbose=1,
    validation_data=batch_generator(X_val, 32),
    validation_steps=5,
    callbacks=callbacks
)

In the following figure, the training accuracy and validation accuracy are plotted against the epochs: