Audio Processing

This section will demonstrate how to load audio data in batches and how to process it so that it can be used to train machine learning models. Some advanced signal processing is performed to preprocess the audio files. Some of these steps are optional, but they are presented to provide a comprehensive approach to processing audio data. Since each audio file can be hundreds of KB, you will utilize batch processing, as you did when processing image data. Batch processing can be achieved by creating a dataset object. A generic method for creating a dataset object from raw data is TensorFlow's from_tensor_slices function. This function generates a dataset object by slicing a tensor along its first dimension. It can be used as follows:

dataset = tf.data.Dataset\
            .from_tensor_slices([1, 2, 3, 4, 5])
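
Iterating over the resulting dataset confirms the slicing behavior; each element is a scalar tensor corresponding to one slice of the original tensor:

# Each iteration yields one slice along the first dimension
for element in dataset:
    print(element.numpy())  # prints 1, 2, 3, 4 and 5, one value per iteration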

Loading audio data into a Python environment can be achieved using TensorFlow by reading the file into memory using the read_file function, then decoding the file using the decode_wav function. When using the decode_wav function, the desired number of samples (here set to the sample rate, which represents how many data points comprise 1 second of data) and the desired channels to use must be passed in as arguments. For example, if a value of -1 is passed for the desired channels, then all the audio channels will be decoded. Importing the audio file can be achieved as follows:

sample_rate = 44100
audio_data = tf.io.read_file('path/to/file')
audio, sample_rate = tf.audio.decode_wav\
                     (audio_data,\
                      desired_channels=-1,\
                      desired_samples=sample_rate)

As with text data, you must preprocess the data so that every resulting numerical tensor has the same size. This is achieved by sampling the audio file after converting the data into the frequency domain. Sampling the audio can be thought of as splitting the audio file into chunks that are always the same size. For example, a 30-second audio file can be split into 30 1-second non-overlapping audio samples, and in the same way, a 15-second audio file can be split into 15 1-second non-overlapping samples. Thus, your result is 45 equally sized audio samples.
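
As an illustration of the chunking idea (not a step used later in this section), a waveform can be split into fixed-length, non-overlapping chunks with TensorFlow's tf.signal.frame function. The following is a minimal sketch that assumes audio is a 1D (mono) waveform tensor sampled at 44.1 kHz:

# Minimal sketch: split a mono waveform into non-overlapping 1-second chunks.
# Assumes `audio` is a 1D tensor of samples recorded at 44,100 Hz
sample_rate = 44100
chunks = tf.signal.frame(audio, \
                         frame_length=sample_rate, \
                         frame_step=sample_rate)
# `chunks` has shape (num_chunks, sample_rate); a trailing partial second
# is dropped because pad_end defaults to False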

Another common preprocessing step that can be performed on audio data is to convert the audio sample from the time domain into the frequency domain. Interpreting the data in the time domain is useful for understanding the intensity or volume of the audio, whereas the frequency domain can help you discover which frequencies are present. This is useful for classifying sounds since different objects have different characteristic sounds that will be present in the frequency domain. Audio data can be converted from the time domain into the frequency domain using the stft function.

This function takes the short-time Fourier transform of the input data. The arguments to the function include the frame length, an integer value that indicates the window length in samples; the frame step, an integer value that describes the number of samples to step between successive windows; and the Fast Fourier Transform (FFT) length, an integer value that indicates the length of the FFT to apply. The spectrogram is the absolute value of the short-time Fourier transform and is useful for visual interpretation. The short-time Fourier transform and spectrogram can be created as follows:

stfts = tf.signal.stft(audio, frame_length=1024,\
                       frame_step=256,\
                       fft_length=1024)
spectrograms = tf.abs(stfts)

Another optional preprocessing step is to generate the Mel-Frequency Cepstral Coefficients (MFCCs). As the name suggests, MFCCs are the coefficients of the mel-frequency cepstrum. The cepstrum is a representation of the short-term power spectrum of an audio signal. MFCCs are commonly used in applications for speech recognition and music information retrieval. As such, it may not be important to understand each step of how the MFCCs are generated, but it is beneficial to understand that they can be applied as a preprocessing step to increase the information density of the audio data pipeline.

MFCCs are generated by creating a matrix to warp the linear scale to the mel scale. This matrix can be created using linear_to_mel_weight_matrix and by passing in the number of bands in the resulting mel spectrum, the number of bins in the source spectrogram, the sample rate, and the lower and upper frequencies to be included in the mel spectrum. Once the linear-to-mel weight matrix has been created, it is contracted with the spectrograms along the spectrograms' last axis using the tensordot function.

Following this, the log of the values is taken to generate the log mel spectrograms. Finally, the mfccs_from_log_mel_spectrograms function can be applied to generate the MFCCs by passing in the log mel spectrograms. These steps can be applied as follows:

# The number of spectrogram bins is the size of the STFT's last axis
num_spectrogram_bins = stfts.shape[-1]
# Keep the first 13 coefficients
num_mfccs = 13
lower_edge_hertz, upper_edge_hertz, num_mel_bins \
    = 80.0, 7600.0, 80
linear_to_mel_weight_matrix \
    = tf.signal.linear_to_mel_weight_matrix\
      (num_mel_bins, num_spectrogram_bins, sample_rate, \
       lower_edge_hertz, upper_edge_hertz)
mel_spectrograms = tf.tensordot\
                   (spectrograms, \
                    linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape\
    (spectrograms.shape[:-1].concatenate\
    (linear_to_mel_weight_matrix.shape[-1:]))
log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
mfccs = tf.signal.mfccs_from_log_mel_spectrograms\
        (log_mel_spectrograms)[..., :num_mfccs]

In the following exercise, you will understand how audio data can be processed. In a similar manner to what you did in Exercise 2.03, Loading Image Data for Batch Processing, and Exercise 2.04, Loading Text Data for TensorFlow Models, you will load the data in batches for efficient and scalable training. You will load in the audio files using TensorFlow's generic read_file function, then decode the audio data using TensorFlow's decode_wav function. You will then create a function that will generate the MFCCs from each audio sample. Finally, a dataset object will be generated that can be passed into a TensorFlow model for training. The dataset that you will be utilizing is Google's speech commands dataset, which consists of 1-second-long utterances of words.

Note

The dataset can be found here: https://packt.link/Byurf.

Exercise 2.05: Loading Audio Data for TensorFlow Models

In this exercise, you'll learn how to load in audio data for batch processing. The dataset, data_speech_commands_v0.02, contains speech samples of people speaking the word zero for exactly 1 second with a sample rate of 44.1 kHz, meaning that for every second, there are 44,100 data points. You will apply some common audio preprocessing techniques, including converting the data into the Fourier domain, sampling the data to ensure every sample has the same size, and generating MFCCs for each audio sample. This will produce a preprocessed dataset object that can be input into a TensorFlow model for training.

Perform the following steps:

  1. Open a new Jupyter notebook to implement this exercise. Save the file as Exercise2-05.ipynb.
  2. In a new Jupyter Notebook cell, import the tensorflow and os libraries:
    import tensorflow as tf
    import os
  3. Create a function that will load an audio file using TensorFlow's read_file function and decode_wav function, respectively. Return the transpose of the resultant tensor:
    def load_audio(file_path, sample_rate=44100):
        # Load audio at 44.1kHz sample-rate
        audio = tf.io.read_file(file_path)
        audio, sample_rate = tf.audio.decode_wav\
                             (audio,\
                              desired_channels=-1,\
                              desired_samples=sample_rate)
        return tf.transpose(audio)
  4. Load in the paths to the audio data as a list using os.listdir:
    prefix = "../Datasets/data_speech_commands_v0.02"\
             "/zero/"
    paths = [os.path.join(prefix, path) for path in \
             os.listdir(prefix)]
  5. Test the function by loading in the first audio file from the list and plotting it:
    import matplotlib.pyplot as plt
    audio = load_audio(paths[0])
    plt.plot(audio.numpy().T)
    plt.xlabel('Sample')
    plt.ylabel('Value')

    The output will be as follows:

    Figure 2.16: A visual representation of an audio file

    The figure shows the waveform of the speech sample. The amplitude at a given time corresponds to the volume of the sound; high amplitude relates to high volume.

  6. Create a function to generate the MFCCs from the audio data. First, apply the short-time Fourier transform, passing in the audio signal as the first argument, the frame length set to 1024 as the second argument, the frame step set to 256 as the third argument, and the FFT length set to 1024 as the fourth argument. Then, take the absolute value of the result to compute the spectrograms. The number of spectrogram bins is given by the length along the last axis of the short-time Fourier transform. Next, define the lower and upper frequency bounds of the mel weight matrix as 80 and 7,600 respectively and the number of mel bins as 80. Then, compute the mel weight matrix using linear_to_mel_weight_matrix from TensorFlow's signal package. Next, compute the mel spectrograms via tensor contraction using TensorFlow's tensordot function along axis 1 of the spectrograms with the mel weight matrix. Then, take the log of the mel spectrograms before finally computing the MFCCs using TensorFlow's mfccs_from_log_mel_spectrograms function. Then, return the MFCCs from the function:
    def apply_mfccs(audio, sample_rate=44100, num_mfccs=13):
        stfts = tf.signal.stft(audio, frame_length=1024, \
                               frame_step=256, \
                               fft_length=1024)
        spectrograms = tf.abs(stfts)
        num_spectrogram_bins = stfts.shape[-1]
        lower_edge_hertz, upper_edge_hertz, \
        num_mel_bins = 80.0, 7600.0, 80
        linear_to_mel_weight_matrix = \
          tf.signal.linear_to_mel_weight_matrix\
          (num_mel_bins, num_spectrogram_bins, \
           sample_rate, lower_edge_hertz, upper_edge_hertz)
        mel_spectrograms = tf.tensordot\
                           (spectrograms, \
                            linear_to_mel_weight_matrix, 1)
        mel_spectrograms.set_shape\
        (spectrograms.shape[:-1].concatenate\
        (linear_to_mel_weight_matrix.shape[-1:]))
        log_mel_spectrograms = tf.math.log\
                               (mel_spectrograms + 1e-6)
        #Compute MFCCs from log_mel_spectrograms
        mfccs = tf.signal.mfccs_from_log_mel_spectrograms\
                (log_mel_spectrograms)[..., :num_mfccs]
        return mfccs
  7. Apply the function to generate the MFCCs for the audio data you loaded in Step 5:
    mfcc = apply_mfccs(audio)
    plt.pcolor(mfcc.numpy()[0])
    plt.xlabel('MFCC log coefficient')
    plt.ylabel('Sample Value')

    The output will be as follows:

    Figure 2.17: A visual representation of the MFCCs of an audio file

    The preceding plot shows the MFCC values on the x axis and various points of the audio sample on the y axis. MFCCs are a different representation of the raw audio signal displayed in Step 5, one that has proven useful in applications related to speech recognition.

  8. Load AUTOTUNE so that you can use all the available threads of the CPU. Create a function that will take a dataset object, shuffle it, load the audio using the function you created in Step 3, generate the MFCCs using the function you created in Step 6, repeat the dataset object, batch it, and prefetch it. Use AUTOTUNE to prefetch with a buffer size based on your available CPU:
    AUTOTUNE = tf.data.experimental.AUTOTUNE
    def prep_ds(ds, shuffle_buffer_size=1024, \
                batch_size=64):
        # Randomly shuffle (file_path, label) dataset
        ds = ds.shuffle(buffer_size=shuffle_buffer_size)
        # Load and decode audio from file paths
        ds = ds.map(load_audio, num_parallel_calls=AUTOTUNE)
        # generate MFCCs from the audio data
        ds = ds.map(apply_mfccs)
        # Repeat dataset forever
        ds = ds.repeat()
        # Prepare batches
        ds = ds.batch(batch_size)
        # Prefetch
        ds = ds.prefetch(buffer_size=AUTOTUNE)
        return ds
  9. Generate the training dataset using the function you created in Step 8. To do this, create a dataset object using TensorFlow's from_tensor_slices function and pass in the paths to the audio files. After that, you can use the function you created in Step 8:
    ds = tf.data.Dataset.from_tensor_slices(paths)
    train_ds = prep_ds(ds)
  10. Take the first batch of the dataset and print it out:
    for x in train_ds.take(1):
        print(x)

    The output will be as follows:

    Figure 2.18: A batch of the audio data after the MFCCs have been generated

The output shows the first batch of MFCC spectrum values in tensor form.

In this exercise, you imported audio data. You processed and batched the dataset so that it is appropriate for large-scale training. This was a comprehensive approach in which the data was loaded and converted into the frequency domain, spectrograms were generated, and, finally, the MFCCs were generated.

In the next activity, you will load in audio data and take the absolute value of the input, followed by scaling the values logarithmically. This will ensure that there are no negative values in the dataset. You will use the same audio dataset that you used in Exercise 2.05, Loading Audio Data for TensorFlow Models, that is, Google's speech commands dataset. This dataset consists of 1-second-long utterances of words.
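
The core transformation itself is brief. The following is a minimal sketch of just that scaling step, assuming audio is a decoded waveform tensor; the rest of the pipeline is left for the activity:

# Sketch of the activity's scaling step: absolute value, then log(1 + x).
# Assumes `audio` is a decoded waveform tensor
audio_prep = tf.math.log1p(tf.abs(audio))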

Activity 2.03: Loading Audio Data for Batch Processing

In this activity, you will load audio data for batch processing. The audio preprocessing techniques that will be performed include taking the absolute value and using the logarithm of 1 plus the value. This will ensure the resulting values are non-negative and logarithmically scaled. The result will be a preprocessed dataset object that can be input into a TensorFlow model for training.

The steps for this activity are as follows:

  1. Open a new Jupyter notebook to implement this activity.
  2. Import the TensorFlow and os libraries.
  3. Create a function that will load and then decode an audio file using TensorFlow's read_file function followed by the decode_wav function, respectively. Return the transpose of the resultant tensor from the function.
  4. Load the paths to the audio data as a list using os.listdir.
  5. Create a function that takes a dataset object, shuffles it, loads the audio using the function you created in Step 3, and applies the absolute value and the log1p function to the dataset. This function adds 1 to each value in the dataset and then applies the logarithm to the result. Next, repeat the dataset object, batch it, and prefetch it with a buffer size equal to the batch size.
  6. Create a dataset object using TensorFlow's from_tensor_slices function and pass in the paths to the audio files. Then, apply the function you created in Step 5 to this dataset.
  7. Take the first batch of the dataset and print it out.
  8. Plot the first audio file from the batch.

    The output will look as follows:

    Figure 2.19: Expected output of Activity 2.03

Note

The solution to this activity can be found via this link.

In this activity, you learned how to load and preprocess audio data in batches. You used most of the functions that you used in Exercise 2.05, Loading Audio Data for TensorFlow Models, to load in the data and decode the raw data. The difference between Exercise 2.05, Loading Audio Data for TensorFlow Models, and Activity 2.03, Loading Audio Data for Batch Processing, is the preprocessing steps; Exercise 2.05, Loading Audio Data for TensorFlow Models, involved generating MFCCs for the audio data, whereas Activity 2.03, Loading Audio Data for Batch Processing, involved scaling the data logarithmically. Both demonstrate common preprocessing techniques that can be used for all applications involving modeling on audio data.

In this section, you have explored how audio data can be loaded in batches for TensorFlow modeling. The comprehensive approach demonstrated many advanced signal processing techniques that should provide practitioners who wish to use audio data for their own applications with a good starting point.
