OpenAI’s New Voice AI Can Apologize Like It Actually Means It

According to TechCrunch, OpenAI is launching upgraded transcription and voice-generating AI models in its API, which the company claims improve on its previous releases. The release fits OpenAI’s broader aim of building automated systems that can carry out tasks independently on users’ behalf.

The new text-to-speech model, “gpt-4o-mini-tts,” produces more nuanced and realistic-sounding speech and, according to OpenAI, is more “steerable” than earlier speech-synthesis models. Developers can instruct gpt-4o-mini-tts to adjust its delivery to the context, for example by telling it to “speak like a mad scientist” or to adopt a serene tone, like a mindfulness teacher.

Jeff Harris, a member of OpenAI’s product staff, stated that the objective is to allow developers to customize both the voice experience and context. “In different contexts, you don’t just want a flat, monotonous voice,” he explained. For instance, in a customer support scenario where an apology is warranted, developers can configure the voice to convey that emotion. Harris emphasized that developers and users should have substantial control over both the content and manner of spoken outputs.
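
For developers, that kind of steering could look roughly like the Python sketch below. It only assembles a request body and never contacts the API. The field names (`model`, `input`, `voice`, `instructions`), the voice name, and the helper `build_speech_request` are assumptions made for illustration rather than details reported in the article, so they should be checked against OpenAI’s API reference before use.

```python
import json


def build_speech_request(text, style, voice="alloy"):
    """Assemble a text-to-speech request body (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "model": "gpt-4o-mini-tts",  # model name reported in the article
        "input": text,               # what the voice says
        "voice": voice,              # assumed voice name
        "instructions": style,       # how it is said (assumed field name)
    }


# The customer support case: the words and the tone are set separately.
apology = build_speech_request(
    "I'm sorry about the delay with your order. Let me fix that right now.",
    "Speak in a warm, sincere, apologetic tone.",
)
print(json.dumps(apology, indent=2))
```

Keeping the spoken text and the delivery style in separate fields mirrors Harris’s point that developers should control both the content and the manner of the output.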


The new speech-to-text models, “gpt-4o-transcribe” and “gpt-4o-mini-transcribe,” replace OpenAI’s earlier Whisper transcription model. Trained on diverse, high-quality audio datasets, they are designed to capture varied speech more accurately, even in noisy environments.

They also produce far fewer inaccuracies, according to Harris. Whisper was known to generate false transcriptions, including fabricated words and incorrect content. “These models are much improved versus Whisper on that front,” Harris said, adding that accurate speech recognition is essential to a reliable voice experience.

Transcription accuracy varies by language, however. According to OpenAI’s internal benchmarks, gpt-4o-transcribe, the more accurate of the two transcription models, has a “word error rate” approaching 30% for Indic and Dravidian languages such as Tamil, Telugu, Malayalam, and Kannada. In other words, roughly three out of every ten words the model produces in these languages differ from a human transcription.
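
To make the metric concrete, here is a minimal Python sketch of how a word error rate is conventionally computed: the number of word substitutions, deletions, and insertions needed to turn the model’s output into the human reference, divided by the number of words in the reference. The function name and the sample sentences are invented for illustration, and this is the textbook definition rather than a description of OpenAI’s benchmark.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the output
                curr[j - 1] + 1,     # extra word in the output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: a 10-word reference with two wrong words and one dropped.
reference = "send the invoice to my office before noon on friday"
hypothesis = "send the voice to my office for noon friday"
print(word_error_rate(reference, hypothesis))  # → 0.3
```

Three errors against a ten-word reference give a rate of 0.3, which is the “three out of every ten words” reading of a 30% word error rate.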

In a departure from past practice, OpenAI does not plan to release the new transcription models under an open-source license. Historically, new versions of Whisper were made available for commercial use under an MIT license. According to Harris, gpt-4o-transcribe and gpt-4o-mini-transcribe are significantly larger than Whisper, which makes running them locally on users’ devices impractical. He noted, “[They’re] not the kind of model that you can just run locally on your laptop, like Whisper.”

Harris added that OpenAI wants to be deliberate about what it releases as open source, reserving such releases for models that have been honed for a specific need.

Featured image credit: Zac Wolff/Unsplash