Microsoft unveils AI that can simulate your voice from just 3 seconds of audio

VALL-E language model can even imitate the original speaker’s emotional tone using artificial intelligence

Anthony Cuthbertson
Tuesday 10 January 2023 20:17 GMT

Microsoft has unveiled an AI voice simulator capable of accurately imitating a person’s voice after listening to them speak for just three seconds.

The VALL-E language model was trained using 60,000 hours of English speech from 7,000 different speakers in order to synthesize “high-quality personalised speech” from any unseen speaker.

Once the artificial intelligence system has a recording of a person’s voice, it can make it sound as if that person is saying anything. It can even imitate the original speaker’s emotional tone and acoustic environment.
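The zero-shot setup described above can be pictured as a simple interface: a short “acoustic prompt” from an unseen speaker, plus arbitrary text, yields speech in that speaker’s voice. The sketch below is a purely illustrative stub, not Microsoft’s actual API; every name and class in it is a hypothetical stand-in.

```python
from dataclasses import dataclass


@dataclass
class AcousticPrompt:
    """A short enrolment clip; VALL-E was demonstrated with ~3-second prompts."""
    speaker_id: str
    duration_s: float


def synthesize(prompt: AcousticPrompt, text: str) -> dict:
    """Stub of the zero-shot TTS contract: a real model would condition on
    the prompt's speaker characteristics (timbre, emotion, room acoustics)
    and generate audio for `text`. Here we only return metadata describing
    that contract."""
    if prompt.duration_s < 3.0:
        raise ValueError("prompt shorter than the ~3 s demonstrated for VALL-E")
    return {"speaker": prompt.speaker_id, "text": text, "zero_shot": True}


clip = AcousticPrompt(speaker_id="unseen-speaker", duration_s=3.0)
out = synthesize(clip, "Any sentence the speaker never actually said.")
```

The key point the stub captures is that the speaker is never seen during training: only the three-second prompt at inference time identifies the voice.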

“Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot text to speech synthesis (TTS) system in terms of speech naturalness and speaker similarity,” a paper describing the system stated.

“In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

Potential applications include authors reading entire audiobooks from just a sample recording, videos with natural language voiceovers, and filling in speech for a film actor if the original recording was corrupted.

As with other deepfake technology that imitates a person’s visual likeness in videos, there is the potential for misuse.

The VALL-E software used to generate the fake speech is currently not available for public use, with Microsoft citing “potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker”.

Microsoft said it would also abide by its Responsible AI Principles as it continues to develop VALL-E, as well as consider possible ways to detect synthesized speech in order to mitigate such risks.

Microsoft trained VALL-E using voice recordings in the public domain, mostly from LibriVox audiobooks, while the speakers who were imitated took part in the experiments willingly.

“When the model is generalised to unseen speakers, relevant components should be accompanied by speech editing models, including the protocol to ensure that the speaker agrees to execute the modification and the system to detect the edited speech,” Microsoft researchers said in an ethics statement.
