Audio Processing
This section demonstrates how to load audio data in batches and how to process it so that it can be used to train machine learning models. Some advanced signal processing takes place to preprocess audio files. Some of these steps are optional, but they are presented to provide a comprehensive approach to processing audio data. Since each audio file can be hundreds of KB, you will utilize batch processing, as you did when processing image data. Batch processing can be achieved by creating a dataset object. A generic method for creating a dataset object from raw data is to use TensorFlow's from_tensor_slices function. This function generates a dataset object by slicing a tensor along its first dimension. It can be used as follows:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
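Each element of the resulting dataset is one slice along the first dimension of the input tensor. As a quick illustration, iterating over the dataset created above prints each element in turn:

# Each element is a scalar tensor taken from the first dimension
for element in dataset:
    print(element.numpy())  # prints 1, 2, 3, 4, and 5 in turn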
Loading audio data into a Python environment can be achieved using TensorFlow by reading the file into memory with the read_file function and then decoding the file with the decode_wav function. When using the decode_wav function, the desired number of samples and the desired channels must be passed in as arguments. Since the sample rate represents how many data points comprise 1 second of data, passing the sample rate as the desired number of samples returns 1 second of audio. If a value of -1 is passed for the desired channels, all the audio channels will be decoded. Importing the audio file can be achieved as follows:
sample_rate = 44100
audio_data = tf.io.read_file('path/to/file')
audio, sample_rate = tf.audio.decode_wav(audio_data,
                                         desired_channels=-1,
                                         desired_samples=sample_rate)
As with text data, you must preprocess the data so that all the resulting numerical tensors have the same size. This is achieved by sampling the audio files into equally sized chunks. Sampling the audio can be thought of as splitting the audio file into chunks that are always the same size. For example, a 30-second audio file can be split into 30 non-overlapping 1-second audio samples, and, in the same way, a 15-second audio file can be split into 15 non-overlapping 1-second samples. Thus, your result is 45 equally sized audio samples.
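A minimal sketch of this splitting can be written with TensorFlow's tf.signal.frame function, which divides a signal into fixed-length windows; the 30-second waveform of zeros below is purely a stand-in for real audio:

sample_rate = 44100  # data points per second
# Hypothetical 30-second mono waveform (zeros as a stand-in for real audio)
waveform = tf.zeros([30 * sample_rate])
# Non-overlapping 1-second chunks: the frame length and the frame step
# are both one second's worth of samples
chunks = tf.signal.frame(waveform, frame_length=sample_rate,
                         frame_step=sample_rate)
print(chunks.shape)  # (30, 44100): 30 equally sized samples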
Another common preprocessing step that can be performed on audio data is to convert the audio sample from the time domain into the frequency domain. Interpreting the data in the time domain is useful for understanding the intensity or volume of the audio, whereas the frequency domain can help you discover which frequencies are present. This is useful for classifying sounds since different objects have different characteristic sounds that will be present in the frequency domain. Audio data can be converted from the time domain into the frequency domain using the stft function.
This function takes the short-time Fourier transform of the input data. The arguments to the function include the frame length, which is an integer value that indicates the window length in samples; the frame step, which is an integer value that describes the number of samples to step between successive windows; and the Fast Fourier Transform (FFT) length, which is an integer value that indicates the length of the FFT to apply. A spectrogram is the absolute value (magnitude) of the short-time Fourier transform and is useful for visual interpretation. The short-time Fourier transform and spectrogram can be created as follows:
stfts = tf.signal.stft(audio, frame_length=1024,
                       frame_step=256,
                       fft_length=1024)
spectrograms = tf.abs(stfts)
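With these settings, the output shape can be worked out directly; the short sketch below assumes a 1-second, 44,100-sample signal and the default framing behavior of tf.signal.stft (no end padding):

frame_length, frame_step, fft_length = 1024, 256, 1024
num_samples = 44100  # 1 second of audio at 44.1 kHz
# Number of windows that fit without padding the end of the signal
num_frames = 1 + (num_samples - frame_length) // frame_step  # 169 frames
# Each frame produces fft_length // 2 + 1 frequency bins
num_bins = fft_length // 2 + 1  # 513 bins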
Another optional preprocessing step is to generate the Mel-Frequency Cepstral Coefficients (MFCCs). As the name suggests, the MFCCs are the coefficients of the mel-frequency cepstrum. The cepstrum is a representation of the short-term power spectrum of an audio signal. MFCCs are commonly used in applications for speech recognition and music information retrieval. As such, it may not be important to understand each step of how the MFCCs are generated, but it is beneficial to understand that they can be applied as a preprocessing step to increase the information density of the audio data pipeline.
MFCCs are generated by creating a matrix to warp the linear scale to the mel scale. This matrix can be created using the linear_to_mel_weight_matrix function by passing in the number of bands in the resulting mel spectrum, the number of bins in the source spectrogram, the sample rate, and the lower and upper frequencies to be included in the mel spectrum. Once the linear-to-mel weight matrix has been created, a tensor contraction with the spectrograms is applied over a single axis using the tensordot function. Following this, the log of the values is taken to generate the log mel spectrograms. Finally, the mfccs_from_log_mel_spectrograms function can be applied to the log mel spectrograms to generate the MFCCs. These steps can be applied as follows:
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sample_rate,
    lower_edge_hertz, upper_edge_hertz)
# Warp the linear-scale spectrograms onto the mel scale
mel_spectrograms = tf.tensordot(spectrograms,
                                linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(
    spectrograms.shape[:-1].concatenate(
        linear_to_mel_weight_matrix.shape[-1:]))
# The small offset avoids taking the log of zero
log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
mfccs = tf.signal.mfccs_from_log_mel_spectrograms(
    log_mel_spectrograms)[..., :num_mfccs]
In the following exercise, you will understand how audio data can be processed. In a similar manner to what you did in Exercise 2.03, Loading Image Data for Batch Processing, and Exercise 2.04, Loading Text Data for TensorFlow Models, you will load the data in batches for efficient and scalable training. You will load in the audio files using TensorFlow's generic read_file function, then decode the audio data using TensorFlow's decode_wav function. You will then create a function that will generate the MFCCs from each audio sample. Finally, a dataset object will be generated that can be passed into a TensorFlow model for training. The dataset that you will be utilizing is Google's speech commands dataset, which consists of 1-second-long utterances of words.
Note
The dataset can be found here: https://packt.link/Byurf.
Exercise 2.05: Loading Audio Data for TensorFlow Models
In this exercise, you'll learn how to load in audio data for batch processing. The dataset, data_speech_commands_v0.02, contains speech samples of people speaking the word zero for exactly 1 second with a sample rate of 44.1 kHz, meaning that for every second, there are 44,100 data points. You will apply some common audio preprocessing techniques, including converting the data into the Fourier domain, sampling the data to ensure that every sample is the same size expected by the model, and generating MFCCs for each audio sample. This will generate a preprocessed dataset object that can be input into a TensorFlow model for training.
Perform the following steps:
- Open a new Jupyter notebook to implement this exercise. Save the file as Exercise2-05.ipynb.
- In a new Jupyter Notebook cell, import the tensorflow and os libraries:
import tensorflow as tf
import os
- Create a function that will load an audio file using TensorFlow's read_file function and decode it using the decode_wav function. Return the transpose of the resultant tensor:
def load_audio(file_path, sample_rate=44100):
    # Load audio at a 44.1 kHz sample rate
    audio = tf.io.read_file(file_path)
    audio, sample_rate = tf.audio.decode_wav(audio,
                                             desired_channels=-1,
                                             desired_samples=sample_rate)
    return tf.transpose(audio)
- Load in the paths to the audio data as a list using os.listdir:
prefix = "../Datasets/data_speech_commands_v0.02/zero/"
paths = [os.path.join(prefix, path) for path in os.listdir(prefix)]
- Test the function by loading in the first audio file from the list and plotting it:
import matplotlib.pyplot as plt
audio = load_audio(paths[0])
plt.plot(audio.numpy().T)
plt.xlabel('Sample')
plt.ylabel('Value')
The output will be as follows:
The figure shows the waveform of the speech sample. The amplitude at a given time corresponds to the volume of the sound; high amplitude relates to high volume.
- Create a function to generate the MFCCs from the audio data. First, apply the short-time Fourier transform, passing in the audio signal as the first argument, the frame length set to 1024 as the second argument, the frame step set to 256 as the third argument, and the FFT length set to 1024 as the fourth argument. Then, take the absolute value of the result to compute the spectrograms. The number of spectrogram bins is given by the length along the last axis of the short-time Fourier transform. Next, define the lower and upper bounds of the mel weight matrix as 80 and 7600 respectively and the number of mel bins as 80. Then, compute the mel weight matrix using linear_to_mel_weight_matrix from TensorFlow's signal package. Next, compute the mel spectrograms via tensor contraction using TensorFlow's tensordot function along axis 1 of the spectrograms with the mel weight matrix. Then, take the log of the mel spectrograms before finally computing the MFCCs using TensorFlow's mfccs_from_log_mel_spectrograms function. Then, return the MFCCs from the function:
def apply_mfccs(audio, sample_rate=44100, num_mfccs=13):
    # Short-time Fourier transform of the audio signal
    stfts = tf.signal.stft(audio, frame_length=1024,
                           frame_step=256,
                           fft_length=1024)
    # Spectrograms are the magnitude of the STFT
    spectrograms = tf.abs(stfts)
    num_spectrogram_bins = stfts.shape[-1]
    lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
    linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins, num_spectrogram_bins,
        sample_rate, lower_edge_hertz, upper_edge_hertz)
    # Warp the spectrograms onto the mel scale
    mel_spectrograms = tf.tensordot(spectrograms,
                                    linear_to_mel_weight_matrix, 1)
    mel_spectrograms.set_shape(
        spectrograms.shape[:-1].concatenate(
            linear_to_mel_weight_matrix.shape[-1:]))
    log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
    # Compute MFCCs from the log mel spectrograms
    mfccs = tf.signal.mfccs_from_log_mel_spectrograms(
        log_mel_spectrograms)[..., :num_mfccs]
    return mfccs
- Apply the function to generate the MFCCs for the audio data you loaded in Step 5:
mfcc = apply_mfccs(audio)
plt.pcolor(mfcc.numpy()[0])
plt.xlabel('MFCC log coefficient')
plt.ylabel('Sample Value')
The output will be as follows:
The preceding plot shows the MFCC values on the x axis and various points of the audio sample on the y axis. MFCCs are a different representation of the raw audio signal displayed in Step 5 and have proven useful in applications related to speech recognition.
- Load AUTOTUNE so that you can use all the available threads of the CPU. Create a function that will take a dataset object, shuffle it, load the audio using the function you created in Step 3, generate the MFCCs using the function you created in Step 6, repeat the dataset object, batch it, and prefetch it. Use AUTOTUNE to prefetch with a buffer size based on your available CPU:
AUTOTUNE = tf.data.experimental.AUTOTUNE
def prep_ds(ds, shuffle_buffer_size=1024, batch_size=64):
    # Randomly shuffle the dataset of file paths
    ds = ds.shuffle(buffer_size=shuffle_buffer_size)
    # Load and decode audio from the file paths
    ds = ds.map(load_audio, num_parallel_calls=AUTOTUNE)
    # Generate MFCCs from the audio data
    ds = ds.map(apply_mfccs)
    # Repeat the dataset indefinitely
    ds = ds.repeat()
    # Prepare batches
    ds = ds.batch(batch_size)
    # Prefetch upcoming batches while the current one is consumed
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds
- Generate the training dataset using the function you created in Step 8. To do this, create a dataset object using TensorFlow's from_tensor_slices function and pass in the paths to the audio files. After that, you can use the function you created in Step 8:
ds = tf.data.Dataset.from_tensor_slices(paths)
train_ds = prep_ds(ds)
- Take the first batch of the dataset and print it out:
for x in train_ds.take(1):
    print(x)
The output will be as follows:
The output shows the first batch of MFCC spectrum values in tensor form.
In this exercise, you imported audio data, then processed and batched the dataset so that it is appropriate for large-scale training. This was a comprehensive approach in which the data was loaded and converted into the frequency domain, spectrograms were generated, and, finally, the MFCCs were generated.
In the next activity, you will load in audio data and take the absolute value of the input, followed by scaling the values logarithmically. This will ensure that there are no negative values in the dataset. You will use the same audio dataset that you used in Exercise 2.05, Loading Audio Data for TensorFlow Models, that is, Google's speech commands dataset. This dataset consists of 1-second-long utterances of words.
Activity 2.03: Loading Audio Data for Batch Processing
In this activity, you will load audio data for batch processing. The audio preprocessing techniques that will be performed include taking the absolute value and using the logarithm of 1 plus the value. This will ensure the resulting values are non-negative and logarithmically scaled. The result will be a preprocessed dataset object that can be input into a TensorFlow model for training.
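The transformation itself can be sketched in a few lines; the helper name scale_audio below is illustrative only, and the input is assumed to be a decoded waveform tensor such as the one returned by the loading function from Exercise 2.05:

def scale_audio(audio):
    # Absolute value ensures there are no negative values
    audio = tf.abs(audio)
    # log1p adds 1 to each value and then applies the logarithm
    return tf.math.log1p(audio)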
The steps for this activity are as follows:
- Open a new Jupyter notebook to implement this activity.
- Import the TensorFlow and os libraries.
- Create a function that will load and then decode an audio file using TensorFlow's read_file function followed by the decode_wav function. Return the transpose of the resultant tensor from the function.
- Load the file paths to the audio data as a list using os.listdir.
- Create a function that takes a dataset object, shuffles it, loads the audio using the function you created in step 2, and applies the absolute value and the log1p function to the dataset. This function adds 1 to each value in the dataset and then applies the logarithm to the result. Next, repeat the dataset object, batch it, and prefetch it with a buffer size equal to the batch size.
- Create a dataset object using TensorFlow's from_tensor_slices function and pass in the paths to the audio files. Then, apply the function you created in Step 4 to the dataset created in Step 5.
- Take the first batch of the dataset and print it out.
- Plot the first audio file from the batch.
The output will look as follows:
Note
The solution to this activity can be found via this link.
In this activity, you learned how to load and preprocess audio data in batches. You used most of the functions from Exercise 2.05, Loading Audio Data for TensorFlow Models, to load in and decode the raw data. The difference between Exercise 2.05, Loading Audio Data for TensorFlow Models, and Activity 2.03, Loading Audio Data for Batch Processing, is the preprocessing steps; Exercise 2.05, Loading Audio Data for TensorFlow Models, involved generating MFCCs for the audio data, whereas Activity 2.03, Loading Audio Data for Batch Processing, involved scaling the data logarithmically. Both demonstrate common preprocessing techniques that can be used in applications involving modeling on audio data.
In this section, you have explored how audio data can be loaded in batches for TensorFlow modeling. The comprehensive approach demonstrated many advanced signal processing techniques that should provide practitioners who wish to use audio data for their own applications with a good starting point.