On Friday, Meta announced a product called “Voicebox” as part of its ongoing push into the increasingly competitive AI sector. However, Meta Voicebox will not be publicly available anytime soon.
According to Meta, Voicebox can perform speech-generation tasks it was not specifically trained on, unlike earlier voice-generator platforms. Given text input and a brief audio clip for context, the AI program can generate new, potentially convincing speech that sounds like the speaker in the source clip.
“In the future, multipurpose generative AI models like Voicebox could give natural-sounding voices to virtual assistants and non-player characters in the metaverse. They could allow visually impaired people to hear written messages from friends read by AI in their voices, give creators new tools to easily create and edit audio tracks for videos, and much more,” said Meta in its announcement.
Introducing Voicebox, a new breakthrough generative speech system based on Flow Matching, a new method proposed by Meta AI. It can synthesize speech across six languages, perform noise removal, edit content, transfer audio style & more.
More details on this work & examples ⬇️
— Meta AI (@MetaAI) June 16, 2023
What is Meta Voicebox?
Through in-context learning, Meta Voicebox can edit, sample, and stylize speech even though it was not specifically trained to do so.
Voicebox can create high-quality audio clips and edit pre-recorded audio while preserving the style and content of the original recording. For example, it can remove car horns or a barking dog. The model is multilingual and can speak six different languages.
Rather than releasing the tool in a fully operational form, Meta has shared audio samples and a research paper to help other researchers understand its potential.
“Voicebox is an important step forward in our generative AI research, and we look forward to continuing our exploration in the audio space and seeing how other researchers build on our work,” the company added.
Meta text-to-speech AI
Meta Voicebox can generate speech from text using an audio sample as short as two seconds, matching the sample’s audio style.
Voicebox can replace misspoken words or reconstruct a section of speech that was interrupted by noise without the speaker having to re-record the whole thing. For instance, you can cut out a speech segment that was interrupted by a dog barking and tell Voicebox to generate a new version of that segment, so that it acts as an eraser for audio editing.
Given a sample of someone’s speech and a passage of text in English, French, German, Spanish, Polish, or Portuguese, Meta Voicebox can read the text aloud in any of those languages, even if the sample speech and the text are in different languages. In the future, this capability could let people who do not speak the same language converse naturally and authentically.
Because it learned from a variety of data, it can produce speech that more closely resembles how people talk in the real world in the six languages mentioned above.
Since OpenAI released ChatGPT in November 2022, artificial intelligence technologies, notably chatbots, have become increasingly widespread. Because the technology is advancing so quickly, however, world leaders are raising concerns about its potential for abuse.
Meta Voicebox is yet another artificial intelligence tool that is open to misuse, and people might use it to deceive others.
Possible AI risks
Deepfakes, meaning fabricated audio or video content, are one of the most prevalent ways AI is exploited for fraud. To create them, deep learning algorithms construct realistic images or sounds that mimic the appearance or voice of a real person.
For instance, a fraudster could produce a bogus voicemail message or impersonate someone else on the phone by using a voice changer or voice generator. A phony video of someone saying or doing something they never did might also be made using video editing software or a face-swapping app.
Deepfakes can have detrimental effects on a person’s or an organization’s reputation, credibility, or privacy. For instance, a deepfake might be used to distribute untrue or harmful information about someone on social media or to extort someone by threatening to reveal a compromising video of them.
They may also be used to fool individuals into thinking they are speaking to somebody they know and trust, like a relative, friend, or coworker, in order to get them to divulge personal information or give money.
Featured image credit: Dima Solomin on Unsplash