Meta has created a new system that it says can generate convincing speech in a variety of styles – but will not release it for fear of the risks.
The new tool is called “Voicebox” and can produce speech in different styles, creating new voices either from scratch or from a sample. It generates speech in six languages and can also perform a variety of other tasks, such as noise removal.
Meta says that it is a major advance on previous speech systems, which required specific training for each task. Instead, Voicebox can just be given raw audio and a transcription, and then be used to modify an audio sample.
It is far more effective than its competitors, Meta claimed in its announcement. It can generate words with a 1.9 per cent error rate compared to 5.9 per cent from competitor Vall-E, for instance, and do so as much as 20 times more quickly.
Meta said that it had been built on the foundation of a new model it called “Flow Matching”. That allows the system to learn from speech that has not been carefully labelled, so that it can be trained on a larger and more diverse body of data.
Voicebox was trained on 50,000 hours of speech and transcripts that came from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese, Meta said. Now that it has been trained, it can be given an audio recording and fill in the speech from the context, Meta said.
That could be used to create a realistic-sounding voice from just two seconds of speech, for instance, potentially being used to bring voices to people who cannot speak or to add people’s voices into games. It could also be used to translate a passage of speech from one language to another in a way that keeps the style, Meta said, allowing people to talk to each other authentically even if they don’t speak the same language.
It could also be useful in more technical scenarios, such as audio editing, where it can be used to replace words that were not properly recorded, for instance.
But Meta said that the risks were such that it would not be releasing the model. It did not point to specific harms, but said that “as with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm”.
Numerous reports have warned that such systems could be used to copy people’s voices without their consent and in ways that could be harmful, such as creating fake videos of news events or using people’s voices to pose as them during scam calls.
“There are many exciting use cases for generative speech models, but because of the potential risks of misuse, we are not making the Voicebox model or code publicly available at this time,” Meta said in a statement. “While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness with responsibility.”
It also pointed to a separate paper, published on its website, in which it detailed how it had built a “highly effective” system that can distinguish between authentic speech and audio that had been generated with Voicebox.