Three seconds of input is enough: Microsoft's AI Vall-E imitates human speech

With Vall-E, Microsoft has introduced an AI that can imitate human speech from extremely short audio inputs. To imitate a speaker, the text-to-speech (TTS) model needs only a three-second recording of that person. It can then read any text aloud in that person's voice.

Microsoft describes its new AI model Vall-E as a “neural codec language model”. According to Microsoft's project page on GitHub, the AI generates “high-quality, personalized speech” which, according to initial experiments, “clearly surpasses the most modern zero-shot TTS system in terms of the naturalness of the speech and the similarity of the speakers.” Beyond reading texts in a neutral tone, the AI can also convey emotions, and during speech synthesis it can pick up and faithfully reproduce audio artifacts such as the poor voice quality of telephone calls.

Impressive sample files

As the many audio examples in the GitHub demo show, this already works very well, at least in places. In addition to the text to be spoken, the demo files contain the “speaker prompt”, i.e. the three-second sample of the individual speaker. Under “Ground Truth” you can hear the text as it was actually read by the person, while “Baseline” gives the result of a conventional TTS synthesis model. Finally, in the column on the far right, you can hear Vall-E’s result, which sometimes resembles “Ground Truth” more, sometimes less.

Technically, Vall-E uses tokens to break down the three-second audio sample into specific speech characteristics. From these tokens, and drawing on its training data, the speech AI derives how the speaker's voice would sound when saying other words. Over 60,000 hours of English-language audio were used to pre-train Vall-E. The training data comes from Meta’s LibriLight dataset and consists mainly of freely accessible audiobooks. At present, the researchers add, Vall-E mainly produces realistic results for voices that resemble one of the speakers in the training data.
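To give a rough sense of how little data a three-second prompt amounts to once it is tokenized, here is a minimal back-of-the-envelope sketch. The frame rate (75 frames per second) and the number of codebooks (8 tokens per frame) are assumptions chosen for illustration and are not stated in the article; the function name `prompt_token_count` is likewise hypothetical.

```python
# Illustrative arithmetic only. The default frame rate and codebook count
# are ASSUMPTIONS for this sketch; the article does not give these values.

def prompt_token_count(seconds: float,
                       frames_per_second: int = 75,
                       codebooks: int = 8) -> int:
    """Estimate how many discrete tokens represent an audio prompt.

    A codec-style tokenizer is assumed to emit one frame at a fixed rate,
    with each frame described by several codebook indices (tokens).
    """
    frames = round(seconds * frames_per_second)
    return frames * codebooks


if __name__ == "__main__":
    # Three-second speaker prompt, as described in the article.
    print(prompt_token_count(3.0))
```

Under these assumed settings, the entire speaker prompt is on the order of a couple of thousand tokens, which the model then has to generalize from using what it learned in pre-training.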

Great potential for abuse

Vall-E could be used for conventional TTS tasks, but also on speeches by public figures, whose spoken words could be altered after the fact with the help of the AI. Microsoft’s AI is by no means the first speech synthesis system based on natural speech, but what is new about Vall-E is the extremely short audio input it requires.

Vall-E is thus the latest development in a field of artificial intelligence that has been advancing rapidly of late: see Vall-E's namesake, the image AI Dall-E, or the AI ChatGPT, which has recently been dominating the headlines.

Microsoft is probably well aware that the AI could be abused. Vall-E’s code is not currently available to the public. In addition, the end of the demo page states that there is “a potential risk of misusing the model, such as tricking voice recognition or imitating a specific speaker.” The researchers therefore also want to build a detection model that can recognize whether an audio file is an original voice or just a Vall-E copy.