Pushing the performance of Whisper to the limits
OpenAI's Whisper models are best in class for automatic speech recognition in terms of transcription quality. However, transcribing audio with these models still takes time. Is there a way to reduce it?
Of course, it is always possible to upgrade the hardware. However, it is wise to start with the software. This brings us to faster-whisper, a project that reimplements the OpenAI Whisper model in CTranslate2, a library for efficient inference with Transformer models. The speedup is achieved by applying various efficiency techniques such as weight quantization, layer fusion, and batch reordering.
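To make the idea of weight quantization concrete, here is a minimal, illustrative sketch (not CTranslate2's actual implementation): each floating-point weight is mapped to an 8-bit integer plus one shared scale factor, cutting storage per weight from 4 bytes to 1.

```python
# Symmetric int8 quantization sketch (illustrative only, not CTranslate2's code)
weights = [0.31, -1.20, 0.05, 0.90]  # pretend float32 model weights

# One shared scale maps the largest magnitude onto the int8 range [-127, 127]
scale = max(abs(w) for w in weights) / 127

quantized = [round(w / scale) for w in weights]   # stored as int8: 1 byte each
dequantized = [q * scale for q in quantized]      # approximations used at inference

print(quantized)     # [33, -127, 5, 95]
print(dequantized)   # close to the original float values
```

The weights are recovered only approximately, which is why quantization trades a small amount of accuracy for a large reduction in memory and bandwidth.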
With faster-whisper, the performance boost is substantial. Let’s start with the GPU:
The original large-v2 Whisper model takes 4 minutes and 30 seconds to transcribe 13 minutes of audio on an NVIDIA Tesla V100S, while faster-whisper needs only 54 seconds. The required VRAM also drops dramatically: the original model consumes around 11.3 GB, while faster-whisper reduces this to 4.7 GB. And we can do even better: switching from FP16 to INT8 precision brings VRAM usage down to 3.1 GB, almost 4x less than the original model.
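Plugging the author's reported numbers into a quick calculation makes the gains easy to verify:

```python
# Speedup and memory reduction from the reported large-v2 benchmark numbers
baseline_s = 4 * 60 + 30   # OpenAI Whisper: 4 min 30 s for 13 min of audio
faster_s = 54              # faster-whisper, FP16

print(f"speedup: {baseline_s / faster_s:.1f}x")  # 5.0x

vram_original_gb = 11.3    # original model
vram_int8_gb = 3.1         # faster-whisper with INT8

print(f"VRAM reduction: {vram_original_gb / vram_int8_gb:.1f}x")  # 3.6x, "almost 4x"
```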
Even on the CPU, a noticeable boost can be observed. In the previous article, I somewhat neglected the CPU version because of its drastic performance gap to the GPU; in most cases, a CPU is not recommended for inference. Still, the CPU results are worth a look: the original OpenAI Whisper small model transcribes 13 minutes of audio in 10 minutes and 31 seconds on an Intel(R) Xeon(R) Gold 6226R, while faster-whisper handles the same file in 2 minutes and 44 seconds.
The numbers above were provided by the author of the package. Let’s verify them with the same audio file from my previous article and compare the results. Please note that I used an NVIDIA T4 GPU.
|                     | Time in seconds |
|---------------------|-----------------|
| OpenAI Whisper (T4) | …               |
| faster-whisper (T4) | …               |
We got a 4.4x decrease in inference time on the same GPU, roughly in line with the author’s own benchmarks.
The best thing about faster-whisper is the easy installation: it is available via pip (pip install faster-whisper) and can be run with just a few lines of code.
```python
from faster_whisper import WhisperModel

model_size = "large-v2"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
It is important to note that the text is transcribed lazily inside the loop rather than returned all at once: segments is a generator, not a string, so the actual decoding work only happens as you iterate over it.
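This lazy behaviour is easy to reproduce with a stand-in generator (no model involved, purely illustrative): nothing is produced until the loop pulls the next segment.

```python
def fake_transcribe(chunks):
    """Stand-in for model.transcribe: yields one 'segment' at a time."""
    for chunk in chunks:
        # In faster-whisper, the real decoding work happens here, on demand
        yield f"text for {chunk}"

segments = fake_transcribe(["chunk-1", "chunk-2"])
print(type(segments).__name__)  # generator -- no work has been done yet

full_text = " ".join(segments)  # iterating is what drives the transcription
print(full_text)                # text for chunk-1 text for chunk-2
```

If you need the full transcript as one string, join the segments exactly like this instead of expecting transcribe to return text directly.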
This relatively simple performance boost is already very helpful. But can it be even faster?
Yes, it can. Enter Whisper JAX, a reimplementation of Whisper with the goal of running the model as fast as possible on a TPU. To achieve this, the model has been implemented in JAX. But what are JAX and TPUs?
JAX and TPU?
The open source library JAX provides NumPy-like APIs and functions while enabling them to leverage Just-In-Time (JIT) compilation provided by XLA (Accelerated Linear Algebra), a domain-specific compiler. This allows for better performance and scalability, especially when working with large-scale data and models.
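As a small taste of JAX (this is not Whisper JAX itself, just an illustration assuming jax is installed): jax.jit traces a NumPy-style function once and hands it to XLA, after which subsequent calls run the compiled version.

```python
import jax
import jax.numpy as jnp


@jax.jit  # traced and compiled by XLA on the first call
def normalize(x):
    # Plain NumPy-style code: subtract the mean, divide by the std deviation
    return (x - x.mean()) / x.std()


x = jnp.arange(6.0)
y = normalize(x)   # first call compiles; later calls reuse the compiled kernel
print(y.shape)     # (6,)
```

On a TPU or GPU, the compiled kernel runs on the accelerator without further Python overhead, which is the mechanism Whisper JAX exploits.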
A TPU (Tensor Processing Unit) is a special type of microchip developed by Google and optimized for machine learning. While GPUs already offer huge gains for mathematical computations, specialized hardware can extend this by focusing solely on the operations common in machine learning. TPUs perform large amounts of tensor operations quickly and efficiently, speeding up both training and inference of neural networks. The downside is that TPUs are only available through Google’s own cloud offerings.
As a result, we get another significant performance boost. A 1-hour-long audio file can be transcribed in 13.8 seconds on a TPUv4-8, instead of 1001.0 seconds on an Nvidia A100 40 GB server GPU.
The numbers above are also provided by the author of the package. Let’s check them with the same audio file from my previous article and compare the results. I also added the data for the official OpenAI Whisper API.
|                        | Time in seconds |
|------------------------|-----------------|
| OpenAI Whisper (T4)    | …               |
| OpenAI Whisper (API)   | …               |
| Whisper JAX (TPU v4-8) | …               |
I guess the results speak for themselves: Whisper JAX is 17x faster than the official API, and the gap to the T4 numbers is even more drastic. However, a fair comparison between faster-whisper and Whisper JAX cannot be made, because different hardware was used in the tests.
In summary, faster-whisper significantly improves the performance of the OpenAI Whisper model by implementing it in CTranslate2, reducing both transcription time and VRAM consumption. Whisper JAX goes a step further by leveraging the JAX library and TPUs to dramatically increase transcription speed for large-scale audio processing tasks.