Augmenting Whisper with speaker diarization
Speaker diarization, the task of partitioning an audio stream into segments according to speaker identity, is a powerful feature in multispeaker speech processing. It answers the question "who spoke when?" in a given audio clip and is crucial to enhancing the functionality and usability of ASR systems. The origins of speaker diarization trace back to the 1990s, when the foundational work for clustering-based diarization paradigms was laid. These early studies focused on radio broadcast news and communications applications, primarily aiming to improve ASR performance. The features used in these early systems were mainly handcrafted, with Mel-frequency cepstral coefficients (MFCCs) being a common choice.
Over time, the field of speaker diarization has seen significant advancements, particularly with the emergence of deep learning technology. Modern diarization systems often leverage neural networks and large-scale GPU computing...
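In practice, augmenting Whisper with a diarizer comes down to aligning two timestamped streams: the ASR transcript segments and the diarizer's speaker turns. A common heuristic is to label each transcript segment with the speaker whose turns overlap it most in time. The sketch below illustrates that merge step on dummy data; the function and field names are illustrative assumptions, not part of Whisper's or any diarization library's API.

```python
# Illustrative sketch: assign speaker labels to ASR segments by maximum
# temporal overlap. All names and data here are hypothetical; a real
# pipeline would feed in Whisper's segments and a diarizer's turns.

def overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(asr_segments, speaker_turns):
    """Label each ASR segment with the speaker whose turns overlap it most.

    asr_segments:  list of dicts with "start", "end", "text" (seconds).
    speaker_turns: list of (start, end, speaker_label) tuples (seconds).
    """
    labeled = []
    for seg in asr_segments:
        # Accumulate overlap per speaker across all of that speaker's turns.
        totals = {}
        for t_start, t_end, spk in speaker_turns:
            ov = overlap(seg["start"], seg["end"], t_start, t_end)
            if ov > 0:
                totals[spk] = totals.get(spk, 0.0) + ov
        best = max(totals, key=totals.get) if totals else "unknown"
        labeled.append({**seg, "speaker": best})
    return labeled

if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 2.5, "text": "Hello, how are you?"},
        {"start": 2.5, "end": 5.0, "text": "I'm fine, thanks."},
    ]
    turns = [(0.0, 2.4, "SPEAKER_00"), (2.4, 5.0, "SPEAKER_01")]
    for seg in assign_speakers(segments, turns):
        print(f'{seg["speaker"]}: {seg["text"]}')
```

This overlap-voting heuristic is deliberately simple; segments that straddle a speaker change are assigned to whoever holds the majority of the segment, which is usually acceptable at sentence-level granularity but can be refined with word-level timestamps.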