Microsoft unveils AI that can simulate your voice from just 3 seconds of audio

VALL-E language model can even imitate the original speaker’s emotional tone using artificial intelligence

Tuesday 10 January 2023 20:17 GMT




Microsoft has unveiled an AI voice simulator capable of accurately imitating a person’s voice after listening to them speak for just three seconds.

The VALL-E language model was trained using 60,000 hours of English speech from 7,000 different speakers in order to synthesize “high-quality personalised speech” from any unseen speaker.

Once the artificial intelligence system has a recording of a person’s voice, it can make that person sound as though they are saying anything. It is even able to imitate the original speaker’s emotional tone and acoustic environment.

“Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot text to speech synthesis (TTS) system in terms of speech naturalness and speaker similarity,” a paper describing the system stated.

“In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

Potential applications include authors reading entire audiobooks from just a sample recording, videos with natural language voiceovers, and filling in speech for a film actor if the original recording was corrupted.

As with other deepfake technology that imitates a person’s visual likeness in videos, there is the potential for misuse.

The VALL-E software used to generate the fake speech is currently not available for public use, with Microsoft citing “potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker”.

Microsoft said it would also abide by its Responsible AI Principles as it continues to develop VALL-E, as well as consider possible ways to detect synthesized speech in order to mitigate such risks.

Microsoft trained VALL-E using voice recordings in the public domain, mostly from LibriVox audiobooks, while the speakers who were imitated took part in the experiments willingly.

“When the model is generalised to unseen speakers, relevant components should be accompanied by speech editing models, including the protocol to ensure that the speaker agrees to execute the modification and the system to detect the edited speech,” Microsoft researchers said in an ethics statement.
