OpenAI’s Whisper is one of the best speech-to-text models currently available. Its transcriptions are highly accurate and include information such as pauses and emphases. In terms of quality, it outperforms major competitors. It can also process a variety of languages. In addition to automatic transcription, it can translate speech from any supported language into English. Furthermore, the model is available as an open-source download under the MIT license.
The model comes in several versions: tiny, base, small, medium, and large. Each larger version improves the quality of the transcript at the cost of longer inference times. However, how much does it cost to implement Whisper in the real world?
When OpenAI Whisper was released in September 2022, there was no official API from OpenAI. If you wanted to use the model, you had to host it yourself. This is much easier said than done because, as with any larger neural network nowadays, a GPU is more or less mandatory to keep the inference time from stretching into eternity. So, a $5 instance from some run-of-the-mill cloud provider is not enough for hosting. Of course, there are services like Replicate or Hugging Face that specialize in hosting open-source machine learning models. And on the other side, there is the elephant in the room: self-hosting.
On March 1, 2023, OpenAI announced an API endpoint for the Whisper model at a price of only $0.006 per transcribed minute. Additionally, this API runs on OpenAI’s highly optimized GPU cluster, which ensures very fast response times. So, have all the previous problems been solved? Not really.
The Whisper API?
First, what is the Whisper API about? It is based on the large-v2 model, which is currently the best variant in the Whisper family. The API can handle audio recordings in various formats such as MP3, WAV, and FLAC. The only restriction is the file size, which is limited to 25 MB.
For our experiment, I used an episode of my brother’s German IT security podcast, Risikozone. Because the podcast file exceeded the 25 MB limit, it had to be split into chunks. The test chunk was 24.2 MB in size and 26.3 minutes long.
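The splitting step can be estimated up front. The following is only a rough sketch: it assumes a roughly constant bitrate (so file size grows linearly with duration), and both the helper name `plan_chunks` and the 60-minute episode length are made up for illustration. Only the 24.2 MB / 26.3 minutes ratio comes from the test chunk above.

```python
import math

API_LIMIT_MB = 25.0  # file size limit of the Whisper API, as stated above


def plan_chunks(total_minutes, mb_per_minute, limit_mb=API_LIMIT_MB):
    """Return (number_of_chunks, minutes_per_chunk) so that every
    equally sized chunk stays at or below the size limit.

    Assumes a constant bitrate, i.e. size = minutes * mb_per_minute.
    """
    max_minutes_per_chunk = limit_mb / mb_per_minute
    n_chunks = math.ceil(total_minutes / max_minutes_per_chunk)
    return n_chunks, total_minutes / n_chunks


# Bitrate derived from the test chunk: 24.2 MB for 26.3 minutes.
mb_per_minute = 24.2 / 26.3

# Hypothetical 60-minute episode (not the real episode length).
chunks, minutes_each = plan_chunks(60.0, mb_per_minute)
print(chunks, round(minutes_each, 1), round(minutes_each * mb_per_minute, 1))
```

For the hypothetical 60-minute episode, this yields three chunks of 20 minutes each, comfortably below the limit. Cutting at silence rather than at fixed offsets would avoid splitting words, but that is beyond this sketch.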
Our audio file needed around 80 seconds (≈ 1.3 minutes) to be transcribed, which is impressive for such a complex task. Altogether, transcribing this test chunk costs only $0.16.
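The price follows directly from the audio length. As a quick check, here is a small sketch that assumes billing is strictly proportional to the audio duration; the function name `api_cost` is made up for illustration.

```python
API_USD_PER_AUDIO_MINUTE = 0.006  # price quoted in the announcement


def api_cost(audio_minutes, usd_per_minute=API_USD_PER_AUDIO_MINUTE):
    """Cost of transcribing audio via the API, assuming linear billing."""
    return audio_minutes * usd_per_minute


chunk_cost = api_cost(26.3)  # the 26.3-minute test chunk
print(f"${chunk_cost:.2f}")  # prints $0.16
```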
Our own instance of Whisper
The alternative would be to set up our own endpoint, and its cost is not as simple to calculate. First, the transcription speed depends heavily on the hardware being used. For our example, I am using an NVIDIA T4. Of course, this data center GPU is one of the weaker ones, but it is available almost everywhere. Transcribing our file with the large-v2 variant of the Whisper model takes 763 seconds (12.7 minutes). This is just under half the length of the actual audio file, but still about 10 times slower than OpenAI’s API. We also have to set up and maintain the API endpoint on the internet ourselves. But how much does it cost?
This question is not easy to answer either. For a service that is always available, the instance would have to run constantly, which means that it incurs costs even when it is not actively used. Additionally, prices vary between cloud providers. For this example, I use the prices of Google Cloud. Here, running such an instance for a full month costs $275.94 (n1-standard-4 + 1x T4). So it is slower and much more expensive. A clear win for OpenAI?
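To compare the two speeds on the same scale, one can express them as a real-time factor, meaning processing time divided by audio length. This is a small sketch using only the measurements quoted in this article; the helper name `real_time_factor` is made up for illustration.

```python
AUDIO_SECONDS = 26.3 * 60  # length of the test chunk


def real_time_factor(processing_seconds, audio_seconds=AUDIO_SECONDS):
    """Processing time per second of audio; values below 1 are
    faster than real time."""
    return processing_seconds / audio_seconds


rtf_t4 = real_time_factor(763)   # large-v2 on the T4
rtf_api = real_time_factor(80)   # OpenAI's API
slowdown = 763 / 80              # how much slower the T4 is than the API

print(round(rtf_t4, 2), round(rtf_api, 2), round(slowdown, 1))
```

The T4 ends up at a factor of about 0.48 and the API at about 0.05, which puts the T4 roughly 9.5 times behind.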
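How much audio would an always-on instance have to process before it matches the API price? The following back-of-the-envelope sketch uses the figures from this article. The 730 hours per month are my own assumption, and the capacity estimate assumes that the T4 keeps the throughput measured on the single test file.

```python
MONTHLY_INSTANCE_USD = 275.94      # n1-standard-4 + 1x T4, running all month
API_USD_PER_AUDIO_MINUTE = 0.006   # OpenAI's price
HOURS_PER_MONTH = 730              # assumption: average month length

AUDIO_SECONDS = 26.3 * 60          # test chunk
T4_SECONDS = 763                   # large-v2 processing time on the T4

# Audio minutes per month at which both options cost the same.
break_even_minutes = MONTHLY_INSTANCE_USD / API_USD_PER_AUDIO_MINUTE

# Audio minutes the T4 could process if it were busy the whole month.
capacity_minutes = HOURS_PER_MONTH * 60 * AUDIO_SECONDS / T4_SECONDS

# Share of the month the GPU would have to be busy to break even.
utilization = break_even_minutes / capacity_minutes

print(round(break_even_minutes), round(capacity_minutes), round(utilization, 2))
```

Under these assumptions, the break-even point is around 46,000 audio minutes per month, which would keep a single T4 busy about half of the time.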
Oh, you can shut down computers?
For a service that needs transcription only a few times a month, the costs of an always-on self-hosted instance are simply too high. Fast transcription is off the table as well: to achieve it, one would have to look for stronger GPUs, which drive up the price even further. But what about asynchronous processing?
First, let’s break down the costs by hour. In this case, the instance costs $0.54 per hour (n1-standard-4 + 1x T4), with the GPU itself costing only $0.35 per hour. This translates to about $0.0058 per minute for the GPU alone, or $0.009 per minute for the whole instance. So, are we close to OpenAI’s prices? Let’s not confuse minute with minute here. On our instance, we calculate in minutes of active VM usage, not minutes of transcribed audio. Our instance took 12.7 minutes for our audio file, which translates to about $0.074 at the GPU rate, or about $0.11 for the whole instance. That is between roughly half and three quarters of what OpenAI charges.
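The distinction between VM minutes and audio minutes can be made concrete with a small calculation. This is only a sketch using the prices quoted above; the helper name `job_cost` is made up for illustration, and it ignores startup time, storage, and network costs. It evaluates both the GPU-only rate and the rate for the whole instance, since the two lead to noticeably different totals.

```python
def job_cost(processing_seconds, hourly_rate_usd):
    """Cost of one transcription job when the VM is billed only
    for the time it is actively running."""
    return processing_seconds / 3600 * hourly_rate_usd


T4_SECONDS = 763  # large-v2 on the T4 for the 26.3-minute test chunk

gpu_only = job_cost(T4_SECONDS, 0.35)       # GPU price alone
full_instance = job_cost(T4_SECONDS, 0.54)  # n1-standard-4 + 1x T4
api = 26.3 * 0.006                          # billed per audio minute

print(round(gpu_only, 3), round(full_instance, 3), round(api, 3))
```

The GPU-only rate gives about $0.074, the whole instance about $0.114, and the API about $0.158 for the same file.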
And there’s still room for improvement. With our own instance, we can use the smaller and weaker variants of Whisper to get faster results. Medium is considerably faster than large-v2: our audio file took only 457 seconds here instead of 763, which would cost about $0.044 at the GPU-only rate of $0.35 per hour (about $0.07 for the whole instance at $0.54 per hour). Also, with our own instance, we have no limitations on file size.
So, what does this mean for our project or business? To put it simply, in 80 percent of cases, I would recommend the API from OpenAI. It is just the simpler option in terms of speed and general availability, especially when you have to handle strongly fluctuating user requests. But when you have more control over your incoming data flow and the processing time, you should take a look at the hosting options. It could make a big difference.
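Switching variants on a self-hosted instance is a one-line change. The sketch below assumes the open-source `openai-whisper` Python package (plus ffmpeg) is installed on the instance; the wrapper name `transcribe_locally` and the file name in the usage comment are placeholders of mine, not part of the original setup.

```python
import time


def transcribe_locally(path, model_name="medium"):
    """Transcribe an audio file with a local Whisper model and
    return (text, elapsed_seconds).

    Assumes the openai-whisper package is installed; it is imported
    lazily so that this module loads even without it.
    """
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_name)  # e.g. "medium" or "large-v2"
    start = time.perf_counter()
    result = model.transcribe(path)         # returns a dict with a "text" key
    elapsed = time.perf_counter() - start
    return result["text"], elapsed


# Usage (placeholder file name):
# text, seconds = transcribe_locally("chunk.mp3", model_name="medium")
```

Timing only the `transcribe` call, not the model loading, keeps the measurement comparable between variants.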
Generally, this article has a bit of a journey behind it. The first drafts were written during a time when OpenAI did not provide an API for Whisper. So, originally it was a comparison between different vendors. However, even with OpenAI’s API, the primary direction of the article has not changed.
We live in a very interesting time. The AI technology space is currently moving at a speed not seen for years. Nearly every 24 hours, something new and mind-blowing enters the market. At the same time, the general directions of this technology are slowly emerging. OpenAI is currently the gold standard for LLMs, as competitors like Google’s Bard have fallen flat. I wouldn’t be surprised if OpenAI shifts to ChatGPT-centric technology instead of open models or integrable developer APIs. With Plugins, OpenAI made a ton of applications and integrations obsolete while locking customers and developers into its own platform.
On the other side, the open-source community has made it easier than ever to integrate LLMs. At the same time, open-source LLMs are catching up with OpenAI. Currently, this area is seeing some of the biggest breakthroughs I have ever seen in such a short time. I still think that one day we will have LLMs in the capability range between GPT-3 and GPT-3.5 (maybe up to GPT-4) running on our mobile devices. I don’t think that the current idea of handing all (internal) knowledge to a single vendor is very future-proof. Just from a business perspective, not having full control over your model and data is not a good starting point.
At the same time, the field of artificial intelligence is no stranger to technology winters. In the 1980s, nearly every major U.S. corporation had its own AI group and was either using or investigating expert systems. These systems had knowledge manually programmed in and tried to replicate the decision-making process of experts. However, they did not become commercially viable and were only an intermediate step on the way to today’s deep learning. So, there is still the possibility that, as fast as the technology went public, it could vanish after hitting some big technological limits (for example, if training and inference become too expensive). And then, there is still the huge issue of alignment, as well as the implications for society and its reaction to such a big technological shift, but this is something for its own article.
So, let’s see what the future holds for us.