Making OpenAI Whisper better

Photo by Google DeepMind on Unsplash

We already looked at ways to make the original OpenAI Whisper model faster and came across two projects that aim to deliver the best possible transcription speed. But Automatic Speech Recognition is not primarily about speed; what matters most is quality and functionality. The overall quality of Whisper is high, but what about its functionality?

All about transcriptions

The original Whisper model from OpenAI already offers a particularly interesting set of functionalities. You can simply transcribe or translate audio and write the result to a text file. For this, the base model should be enough.

But sometimes it is necessary to know the exact time at which a word or sentence was spoken. For these cases, we need timestamps. Timestamp generation is already part of the original Whisper model: when using the CLI version of Whisper, files with timestamps (srt and vtt) are generated in addition to the text file. Even the API has an option to return timestamps (although the official documentation is fairly quiet about this feature). However, their accuracy and precision still leave room for improvement, and our next use case in particular needs a certain level of precision.
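As a quick illustration, here is a minimal sketch using the openai-whisper Python package that prints the segment-level timestamps behind those srt/vtt files (the file name and model size are placeholders):

import whisper

# Load one of the official model sizes ("base" is just a placeholder here)
model = whisper.load_model("base")
result = model.transcribe("audio.wav")

# Every segment carries start and end timestamps in seconds
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text'].strip()}")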

Detect the speaker

Basic transcription fails when it comes to multi-speaker audio. It is usually very difficult to tell the different speakers apart from a raw transcript, and Whisper does not have the ability to recognize and separate speakers. Nor is this a task that can be quickly solved by a small script.

In the end, another machine learning model is needed. Of course, this is not a completely new problem, so there are already several projects trying to solve it. One of the best projects in this area is speaker diarization by pyannote.audio. Its model is only available on HuggingFace, a platform for Transformer models. Speaker diarization is more or less a pipeline, because several models are used: pyannote/speaker-diarization and pyannote/segmentation are both needed, and the terms and conditions of both models must be accepted. A HuggingFace account is therefore required to agree to the terms, and the model can only be downloaded if a HuggingFace token is provided. The token can be obtained in the account settings.

The following code will download and run the model:

from pyannote.audio import Pipeline

# Download the diarization pipeline from HuggingFace (requires an access token)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1", use_auth_token="ACCESS_TOKEN_GOES_HERE")

# Run the pipeline on a local audio file
diarization = pipeline("audio.wav")

# Print who speaks when
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

After running the script, we get the following output:

start=0.2s stop=1.5s speaker_0
start=1.8s stop=3.9s speaker_1
start=4.2s stop=5.7s speaker_0

Speakers can now be separated. However, this output still has to be merged with the regular transcription, which can get a bit complex; a naive version of such a merge is sketched below. Is there a better way?
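For illustration, here is what such a hand-rolled merge could look like. This is only a sketch: it assumes transcript_segments is the segments list returned by Whisper and diarization is the pyannote object from above, and it simply gives each transcript segment the speaker it overlaps with the most.

def assign_speakers(transcript_segments, diarization):
    # Naive merge: pick the diarization turn with the largest time overlap per segment
    labelled = []
    for seg in transcript_segments:
        best_speaker, best_overlap = None, 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append({**seg, "speaker": best_speaker})
    return labelled

Getting this right for overlapping speech, or down to the word level, quickly becomes tedious.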

WhisperX

Let’s introduce WhisperX. The WhisperX project tries to deliver a solution for both problems.

Picture provided by WhisperX

Firstly, WhisperX uses faster-whisper for transcription, which brings the same high speed and low resource consumption that I described in the last article. On top of faster-whisper, it uses Voice Activity Detection (VAD) to find regions in which no speech occurs; this also enables better batching of audio clips and therefore faster processing. In addition, it tries to improve the quality of the transcription timestamps with phoneme-based ASR, which recognizes the smallest sound units of a word.

This is supported by forced alignment: the written transcript is synchronized with the corresponding audio recording by matching the text, down to individual words and phonemes (the smallest units of sound), to exact start and end times within the audio sample. Finally, speaker diarization is performed with pyannote.audio, as described above.
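For orientation, here is a sketch of how these steps can be chained with WhisperX's Python API, following the pattern from the project's README; exact function names and signatures may differ between versions, and the file name and token are placeholders.

import whisperx

device = "cuda"
audio = whisperx.load_audio("sample01.wav")

# 1. Transcription with faster-whisper plus VAD-based batching
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Forced alignment with a phoneme-based ASR model for precise word timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Speaker diarization via pyannote.audio and speaker assignment
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF-TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)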

The reality

All of this sounds like a huge package, but surprisingly, WhisperX has requirements similar to faster-whisper: transcription with the large-v2 model is still possible on an NVIDIA RTX 2060 with 6 GB of VRAM. The minimum requirements have risen, though; 4-5 GB of VRAM are now needed because of the additional models. Still, approximately 1 hour of content can be transcribed in about 10 minutes, which is an impressive performance. (It is helpful to have the file in WAV format, because speaker diarization with pyannote can have issues with MP3 files.)
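If your recording is an MP3, a quick conversion with ffmpeg to 16 kHz mono WAV (the format Whisper resamples to internally) avoids these issues; the file names are placeholders:

ffmpeg -i sample01.mp3 -ar 16000 -ac 1 sample01.wav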

Additionally, using WhisperX is not difficult at all. A simple CLI is all that is required.

whisperx sample01.wav

For speaker diarization, you just need to add:

whisperx sample01.wav --diarize --hf_token HF-TOKEN

Conclusion

This is already my third blog article about Whisper. The first article reflected a time when there was no easy and cost-effective way to host Whisper; that changed dramatically with the official API, although I still found some arguments for self-hosting. In the second article, the speed of a self-hosted Whisper instance was dramatically improved with faster-whisper and Whisper JAX. This time, it was all about functionality and quality. The takeaway: if you give a tool to the community, someone out there will use it in ways that surprise everyone. Just as AlexNet made GPUs popular in AI, open models like Whisper, LLaMA and Falcon will bring new and interesting innovations.

