In a breakthrough for generative AI in speech, Meta has created Voicebox, a state-of-the-art AI model that can perform speech-generation tasks it was not specifically trained for, such as editing, sampling, and stylizing.
Voicebox can edit and produce high-quality audio clips while preserving the content and style of the original recording. It can remove unwanted sounds such as a dog barking or a car horn, for instance. The model is multilingual, supporting six languages.
Versatile generative AI models like Voicebox could one day give virtual assistants and non-player characters in the metaverse natural-sounding voices. They could give creators new tools to quickly build and edit audio tracks for videos, allow visually impaired people to hear written messages from friends read aloud by AI, and much more.
Voicebox’s versatility enables a range of tasks, including:
In-context text-to-speech synthesis: Given an audio sample as short as two seconds, Voicebox can match its style and generate speech from text.
Speech editing and noise removal: Voicebox can recreate a portion of speech interrupted by noise, or replace misspoken words, without requiring the entire speech to be re-recorded. For instance, you can identify a segment of speech interrupted by a dog barking and have Voicebox regenerate that segment, like an eraser for audio editing.
Cross-lingual style transfer: Given a sample of someone’s speech and a passage of text in English, French, German, Spanish, Polish, or Portuguese, Voicebox can produce a reading of the text in any of those languages, even when the sample speech and the text are in different languages.
In the future, this capability could help people who don’t speak the same language communicate with one another naturally and authentically.
Diverse speech sampling: Having learned from varied data, Voicebox can generate speech that better reflects how people talk in the real world, across the six languages listed above.
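Voicebox itself is not publicly available, so the speech-editing task described above can only be sketched with a toy stand-in: mask out the corrupted samples (the dog bark) and fill the gap from the surrounding context. Here a simple linear crossfade plays the role of the model's regeneration step; `infill_segment` and all of its parameters are illustrative and not part of any real Voicebox API.

```python
import numpy as np

def infill_segment(audio, start, end, context=160):
    """Toy mask-and-infill editing: discard the samples in [start, end)
    (e.g. a dog bark) and bridge the gap from the surrounding context.
    A model like Voicebox would regenerate matching speech here; this
    sketch merely interpolates between the mask boundaries."""
    edited = audio.copy()
    left = edited[max(0, start - context):start]   # context before the mask
    right = edited[end:end + context]              # context after the mask
    # Linear crossfade between the boundary values of the two contexts.
    edited[start:end] = np.linspace(left[-1], right[0], end - start)
    return edited

# Example: a sine-wave "voice" corrupted by a burst of noise in the middle.
t = np.linspace(0, 1, 8000)
clean = np.sin(2 * np.pi * 220 * t)
noisy = clean.copy()
noisy[3000:3400] += np.random.default_rng(0).normal(0, 2.0, 400)

fixed = infill_segment(noisy, 3000, 3400)
```

The key design point this illustrates is that editing is framed as infilling: everything outside the mask is left untouched, and only the masked region is synthesized to be consistent with its neighbors.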
Voicebox represents a significant advancement in our generative AI research, and we look forward to continuing our exploration of the audio domain and seeing how other researchers build on our work.