Voicebox by Meta AI

Voicebox is a generative AI model for speech developed by Meta AI. Its primary functions include audio editing, sampling, and stylization. Notably, through in-context learning, Voicebox can perform speech generation tasks it was not specifically trained for, which demonstrates its versatility and adaptability.

Important: Meta has not yet released the model, because it considers the risk of misuse too high at present.

One of its key features is generating high-quality audio clips and editing pre-recorded audio. For example, it can remove unwanted noises such as car horns or barking dogs from a recording while preserving the content and style of the speech. Voicebox is also multilingual and can produce speech in six languages.

Voicebox also offers several new capabilities:

  • In-context text-to-speech synthesis: Given an audio sample as short as two seconds, Voicebox can match that sample's style and use it for text-to-speech generation.
  • Speech editing and noise reduction: It can regenerate a portion of speech that has been interrupted by noise, or replace misspoken words, without the entire passage having to be re-recorded.
  • Cross-lingual style transfer: Given a sample of someone’s speech and a passage of text in one of six languages (English, French, German, Spanish, Polish, or Portuguese), Voicebox can produce a reading of the text in any of those languages, even when the sample speech and the text are in different languages.
  • Diverse speech sampling: Because Voicebox was trained on diverse data, it can generate speech that better reflects how people actually talk in each of the six languages listed above.

In the future, technologies like Voicebox could help creators edit audio tracks easily, let visually impaired people hear written messages from friends in those friends' voices, and enable people to speak any foreign language in their own voice. They could also give natural-sounding voices to virtual assistants and non-player characters in the metaverse. Meta AI presents Voicebox as a significant step forward in generative AI research and expects further exploration of the audio space, including by other researchers building on its work [1].