To coincide with the ChatGPT API rollout, OpenAI today launched the Whisper API, a hosted version of the open source Whisper speech-to-text model that the company released in September.
Whisper, priced at $0.006 per minute, is an automatic speech recognition system that OpenAI says enables “robust” transcription in multiple languages, as well as translation from those languages to English. It takes files in various formats, including M4A, MP3, MP4, MPEG, MPGA, WAV, and WEBM.
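To make the pricing and format list concrete, here is a minimal Python sketch. It assumes billing is simply proportional to audio length at the quoted $0.006 per minute; OpenAI's actual rounding rules are not stated here. The helper names `is_supported_format` and `estimate_cost_usd` are hypothetical and not part of any OpenAI library.

```python
from pathlib import Path

# Price and formats as quoted in the article; everything else is illustrative.
PRICE_PER_MINUTE_USD = 0.006
SUPPORTED_EXTENSIONS = {"m4a", "mp3", "mp4", "mpeg", "mpga", "wav", "webm"}


def is_supported_format(filename: str) -> bool:
    """Return True if the file extension is one of the listed formats."""
    return Path(filename).suffix.lower().lstrip(".") in SUPPORTED_EXTENSIONS


def estimate_cost_usd(duration_seconds: float) -> float:
    """Estimate cost, assuming billing is proportional to audio length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return duration_seconds / 60 * PRICE_PER_MINUTE_USD


if __name__ == "__main__":
    print(is_supported_format("interview.MP3"))   # extension check is case-insensitive
    print(f"${estimate_cost_usd(3600):.2f}")      # one hour of audio
```

Under that proportional-billing assumption, an hour of audio works out to roughly $0.36.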
Countless organizations have developed highly capable speech recognition systems, which are at the core of software and services from tech giants such as Google, Amazon and Meta. But what makes Whisper different is that it’s been trained on 680,000 hours of multilingual and “multitask” data collected from the web, according to OpenAI president and chairman Greg Brockman, leading to improved recognition of unique accents, background noise and technical jargon.
“We released a model, but it wasn’t really enough to get the whole developer ecosystem built around it,” Brockman said in a video call with AapkaDost yesterday afternoon. “The Whisper API is the same big model you can get open source, but we’ve optimized it to the max. It is much, much faster and extremely convenient.”
According to Brockman, companies face plenty of obstacles in adopting speech transcription technology. A 2020 Statista survey found that companies cite accuracy, accent- or dialect-related recognition issues, and cost as the top reasons they haven’t embraced technology like speech-to-text.
However, Whisper has its limitations, particularly in next-word prediction. Because the system was trained on a large amount of noisy data, OpenAI warns that Whisper may include words in its transcripts that weren’t actually spoken, possibly because it is trying both to predict the next word in the audio and to transcribe the audio itself. In addition, Whisper does not perform equally well across languages: it has a higher error rate for speakers of languages that are not well represented in the training data.
The latter is unfortunately nothing new in the world of speech recognition. Bias has long plagued even the best systems: a 2020 Stanford study found that systems from Amazon, Apple, Google, IBM, and Microsoft made far fewer errors with white users, an error rate of about 19%, than with Black users.
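Error rates in speech recognition are commonly reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of reference words. The sketch below is a generic illustration of that metric. It is not OpenAI's or the study's evaluation code, and the lowercase-and-split normalization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from transcript
            insertion = curr[j - 1] + 1   # extra word that was never spoken
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "the cat sat on the mat"
    print(word_error_rate(spoken, "the cat sat on the mat today"))
```

In the usage example, the transcript contains one word that was never spoken, so it counts as a single insertion against six reference words.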
Despite this, OpenAI sees Whisper’s transcription capabilities being used to improve existing apps, services, products, and tools. The AI-powered language learning app Speak already uses the Whisper API to power a new in-app virtual companion.
If OpenAI can break into the text-to-speech market in a major way, it could be quite profitable for the Microsoft-backed company. According to Allied Market Research, the segment could be worth $12.5 billion by 2031, compared to $2.8 billion in 2021.
“Our view is that we really want to be this universal intelligence,” Brockman said. “We want to be really, really flexible, be able to take whatever kind of data you have, whatever kind of task you want to accomplish, and be a power multiplier on that attention.”