Milestone 4 – Transforming raw speech data into Mel spectrogram features
A speech signal can be represented as a one-dimensional array of values over time, with each value giving the amplitude (loudness) of the sound at that instant. To understand speech, we need to capture its frequency content and other acoustic features, which can be derived by analyzing how the amplitude changes over time.
However, speech is a continuous sound stream, and a computer can only store a finite number of values. So, we must convert this continuous stream into a series of discrete values by sampling the speech at regular intervals. The sampling rate is measured in samples per second, or hertz (Hz). The higher the sampling rate, the more accurately the samples capture the speech, but the more data there is to store for every second of audio.
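To make the trade-off concrete, here is a minimal sketch that samples one second of a pure tone at two different rates and compares the amount of data produced. The 440 Hz tone, the 8 kHz and 16 kHz rates, the 16-bit sample size, and the helper name sample_sine are all assumptions chosen for illustration; real speech is far more complex than a single sine wave.

```python
import math


def sample_sine(freq_hz, duration_s, sampling_rate):
    """Sample a sine tone at regular intervals of 1 / sampling_rate seconds.

    Hypothetical helper for illustration: returns a one-dimensional list
    of amplitudes, one value per sampling instant.
    """
    num_samples = int(round(duration_s * sampling_rate))
    return [
        math.sin(2 * math.pi * freq_hz * n / sampling_rate)
        for n in range(num_samples)
    ]


# One second of a 440 Hz tone at two assumed sampling rates.
low = sample_sine(440.0, 1.0, 8_000)
high = sample_sine(440.0, 1.0, 16_000)

# Doubling the sampling rate doubles the number of samples per second.
print(len(low))   # 8000
print(len(high))  # 16000

# Storage cost, assuming mono audio with 16-bit (2-byte) samples.
BYTES_PER_SAMPLE = 2
print(len(low) * BYTES_PER_SAMPLE)   # 16000 bytes for one second
print(len(high) * BYTES_PER_SAMPLE)  # 32000 bytes for one second
```

Both lists describe the same one-second tone. The 16 kHz version simply measures the amplitude twice as often, which is why it needs twice the storage.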
It’s important to ensure that the sampling rate of the audio matches what the speech recognition model expects. If the rates don’t match, it can lead to errors. For example, playing a sound sampled at 16 kHz at 8...