Python Deep Learning Cookbook


Product type Book
Published in Oct 2017
Publisher Packt
ISBN-13 9781787125193
Pages 330 pages
Edition 1st Edition
Author: Indra den Bakker

Identifying speakers with voice recognition


Besides speech recognition, there is more we can do with sound fragments. While speech recognition focuses on converting speech (spoken words) to digital data, we can also use sound fragments to identify the person who is speaking. This is also known as voice recognition. Every individual has different characteristics when speaking, caused by differences in anatomy and behavioral patterns. Speaker verification and speaker identification are getting more attention in this digital age. For example, a home digital assistant can automatically detect which person is speaking.

In the following recipe, we'll be using the same data as in the previous recipe, where we implemented a speech recognition pipeline. However, this time, we will be classifying the speakers of the spoken numbers. 

How to do it...

  1. In this recipe, we start by importing all libraries:
import glob
import numpy as np
import random
import librosa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

import keras
from keras.layers import LSTM, Dense, Dropout, Flatten
from keras.models import Sequential
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
  1. Let's set SEED and the location of the .wav files:
SEED = 2017
DATA_DIR = 'Data/spoken_numbers_pcm/' 
  1. Let's split the .wav files into a training set and a validation set with scikit-learn's train_test_split function:
files = glob.glob(DATA_DIR + "*.wav")
X_train, X_val = train_test_split(files, test_size=0.2, random_state=SEED)

print('# Training examples: {}'.format(len(X_train)))
print('# Validation examples: {}'.format(len(X_val)))
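To make the split concrete, here is a minimal pure-Python sketch of a seeded 80/20 shuffle split. It is an illustration only and not scikit-learn's implementation; the function name simple_split and the file names are hypothetical.

```python
import math
import random


def simple_split(items, test_size=0.2, seed=2017):
    """Shuffle a copy of `items` with a fixed seed and cut off a validation part."""
    shuffled = list(items)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # same seed -> same order on every run
    n_val = math.ceil(test_size * len(shuffled))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical file names, only for demonstration
files = ['clip_{:02d}.wav'.format(i) for i in range(10)]
train, val = simple_split(files)
print(len(train), len(val))  # 8 2
```

Because the seed is fixed, running the split twice gives the same training and validation files, which is what random_state=SEED achieves in the recipe.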
  1. To extract and print all unique labels, we use the following code:
labels = []
for i in range(len(X_train)):
    label = X_train[i].split('/')[-1].split('_')[1]
    if label not in labels:
        labels.append(label)
print(labels)
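The label extraction above relies on the file names. The following sketch shows the same idea in isolation, assuming the files follow a <digit>_<speaker>_<rate>.wav pattern (an assumption about the dataset, as are the example paths). The helper speaker_from_path is hypothetical and uses os.path so that it does not depend on '/' as the path separator.

```python
import os


def speaker_from_path(path):
    """Return the second underscore-separated field of the file name."""
    filename = os.path.basename(path)      # drops the directory part
    stem = os.path.splitext(filename)[0]   # drops the '.wav' extension
    return stem.split('_')[1]


# Hypothetical paths following the assumed <digit>_<speaker>_<rate>.wav pattern
paths = ['Data/spoken_numbers_pcm/5_Agnes_240.wav',
         'Data/spoken_numbers_pcm/0_Bruce_100.wav',
         'Data/spoken_numbers_pcm/9_Agnes_160.wav']
labels = sorted({speaker_from_path(p) for p in paths})
print(labels)  # ['Agnes', 'Bruce']
```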
  1. We can now define our one_hot_encode function as follows:
label_binarizer = LabelBinarizer()
label_binarizer.fit(list(set(labels)))

def one_hot_encode(x): return label_binarizer.transform(x)
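To see what one-hot encoding produces, here is a minimal sketch in plain Python. It is not LabelBinarizer itself; the helper one_hot and the speaker names are hypothetical.

```python
def one_hot(label, classes):
    """Return a list with a 1 at the position of `label` in `classes`."""
    if label not in classes:
        raise ValueError('unknown label: {}'.format(label))
    return [1 if c == label else 0 for c in classes]


# Hypothetical speaker names; LabelBinarizer also keeps its classes sorted
classes = sorted(['Bruce', 'Agnes', 'Victoria'])
print(one_hot('Bruce', classes))  # [0, 1, 0]
```

One difference to keep in mind: with exactly two classes, LabelBinarizer returns a single column instead of two, so a softmax output layer with n_classes units only matches its output when there are three or more speakers.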
  1. Before we can feed the data to our network, some preprocessing needs to be done. We use the following settings:
n_features = 20
max_length = 80
n_classes = len(labels)
  1. We can now define our batch generator. The generator handles all preprocessing tasks, such as reading a .wav file and transforming it into usable input:
def batch_generator(data, batch_size=16):
    while 1:
        random.shuffle(data)
        X, y = [], []
        for i in range(batch_size):
            wav = data[i]
            wave, sr = librosa.load(wav, mono=True)
            label = wav.split('/')[-1].split('_')[1]
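            # transform expects a sequence, so wrap the label and take the single row
            y.append(one_hot_encode([label])[0])
            # n_mfcc must match n_features; keyword arguments work across librosa versions
            mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_features)
            # Zero-pad the time axis on the right up to max_length frames
            mfcc = np.pad(mfcc, ((0, 0), (0, max_length - len(mfcc[0]))),
                          mode='constant', constant_values=0)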
            X.append(np.array(mfcc))
        yield np.array(X), np.array(y)

Note

Please note the difference in our batch generator compared to the previous recipe.
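The padding step in the generator assumes that no clip has more than max_length frames; a longer clip would make the pad width negative and np.pad would raise an error. The sketch below shows one way to handle both cases. The helper pad_or_truncate and the truncation of long clips are assumptions for illustration and not part of the recipe, and the arrays of ones merely stand in for real MFCC matrices.

```python
import numpy as np


def pad_or_truncate(features, max_length=80):
    """Return `features` with exactly `max_length` columns (time frames)."""
    n_frames = features.shape[1]
    if n_frames >= max_length:
        return features[:, :max_length]               # cut off extra frames
    pad_width = ((0, 0), (0, max_length - n_frames))  # pad only on the right
    return np.pad(features, pad_width, mode='constant', constant_values=0)


short = np.ones((20, 35))    # stand-in for an MFCC matrix of a short clip
long = np.ones((20, 120))    # stand-in for a clip longer than max_length
print(pad_or_truncate(short).shape, pad_or_truncate(long).shape)  # (20, 80) (20, 80)
```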

  1. Let's define the hyperparameters before defining our network architecture:
learning_rate = 0.001
batch_size = 64
n_epochs = 50
dropout = 0.5

input_shape = (n_features, max_length)
steps_per_epoch = 50
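As a quick check of what these settings mean, the following sketch computes how many samples the generator yields per epoch. Note that the generator in this recipe reshuffles the data and takes the first batch_size files on every step, so an epoch is a number of random batches, not a strict pass over the training set. The helper steps_for_one_pass and the training-set size of 1920 are hypothetical.

```python
import math

batch_size = 64
steps_per_epoch = 50

# Samples the generator yields in one epoch with the recipe's settings
samples_per_epoch = batch_size * steps_per_epoch
print(samples_per_epoch)  # 3200


def steps_for_one_pass(n_examples, batch_size):
    """Number of batches needed to cover n_examples once."""
    return math.ceil(n_examples / batch_size)


# 1920 is a hypothetical training-set size, not a value from the recipe
print(steps_for_one_pass(1920, batch_size))  # 30
```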
  1. The network architecture we will use is quite straightforward. We will stack dense layers on top of an LSTM layer, as follows:
model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=input_shape,
               dropout=dropout))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(dropout))
model.add(Dense(n_classes, activation='softmax'))
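To get a feeling for the size of this network, we can work out the number of trainable parameters by hand. The sketch below assumes a standard LSTM layer with bias terms and uses the shapes from this recipe: with input_shape = (20, 80), the LSTM sees 20 timesteps of 80 values each. The helper functions and the value n_classes = 15 are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(n_inputs, n_outputs):
    """Weight matrix plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs


timesteps, input_dim = 20, 80      # input_shape = (n_features, max_length)
units = 256

flattened = timesteps * units      # return_sequences=True keeps all timesteps
print(lstm_params(input_dim, units))    # 345088
print(flattened)                        # 5120
print(dense_params(flattened, 128))     # 655488

n_classes = 15   # hypothetical number of speakers
print(dense_params(128, n_classes))     # 1935
```

Most of the parameters sit in the first dense layer, because Flatten turns the 20 x 256 LSTM output into a vector of 5,120 values.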
  1. Next, we set the optimizer and loss function, compile the model, and print a summary of our model:
opt = Adam(lr=learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])
model.summary()
  1. To prevent overfitting, we will be using early stopping and automatically store the model that has the highest validation accuracy:
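# Both callbacks monitor validation accuracy ('val_acc' here; newer Keras versions log 'val_accuracy')
callbacks = [ModelCheckpoint('checkpoints/voice_recognition_best_model_{epoch:02d}.hdf5',
                             monitor='val_acc', save_best_only=True),
             EarlyStopping(monitor='val_acc', patience=2)]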
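The idea behind early stopping with patience can be sketched in a few lines of plain Python. This is a simplified illustration and not Keras's exact implementation, whose details differ between versions; the function early_stopping_epoch and the accuracy values are hypothetical.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return (stop_epoch, best_score) for a metric where higher is better.

    Training stops once `patience` consecutive epochs fail to beat the best
    score seen so far. If that never happens, stop_epoch is None.
    """
    best = float('-inf')
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, wait = score, 0      # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best
    return None, best


# Hypothetical validation accuracies per epoch (epochs counted from 0)
scores = [0.50, 0.60, 0.58, 0.59, 0.70]
print(early_stopping_epoch(scores))  # (3, 0.6)
```

In this example, training stops before the improvement in the last epoch is ever reached, which shows the trade-off of a small patience value.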
  1. We are ready to start training and we will store the results in history:
history = model.fit_generator(
    generator=batch_generator(X_train, batch_size),
    steps_per_epoch=steps_per_epoch,
    epochs=n_epochs,
    verbose=1,
    validation_data=batch_generator(X_val, 32),
    validation_steps=5,
    callbacks=callbacks
)

In the following figure, the training accuracy and validation accuracy are plotted against the epochs:

Figure 9.1: Training and validation accuracy 
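The values behind such a plot are stored in history.history, a dictionary with one list per metric. As a minimal sketch, the hypothetical helper below looks up the epoch with the highest validation accuracy. The numbers are made up for illustration and are not results from this recipe.

```python
def best_epoch(history_dict):
    """Return (epoch, value) of the highest validation accuracy."""
    # Older Keras versions log 'val_acc'; newer ones log 'val_accuracy'
    key = 'val_acc' if 'val_acc' in history_dict else 'val_accuracy'
    values = history_dict[key]
    epoch = max(range(len(values)), key=values.__getitem__)
    return epoch, values[epoch]


# Hypothetical numbers standing in for history.history
example = {'acc': [0.41, 0.63, 0.78], 'val_acc': [0.45, 0.66, 0.64]}
print(best_epoch(example))  # (1, 0.66)
```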
