Microsoft’s new language model Vall-E is reportedly able to imitate any voice using just a three-second sample recording.
The recently announced AI tool was trained on 60,000 hours of English speech data. Microsoft researchers said in a paper posted to arXiv, the preprint server hosted by Cornell University, that it could replicate the emotions and tone of a speaker.
That reportedly held true even when the model generated a recording of words that the original speaker never actually said.
“Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot [text to speech] system in terms of speech naturalness and speaker similarity,” the authors wrote. “In addition, we find Vall-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”
Microsoft Corporation booth signage is displayed at CES 2023 at the Las Vegas Convention Center on January 6, 2023, in Las Vegas, Nevada. (Photo by David Becker/Getty Images)
The Vall-E samples shared on GitHub sound eerily similar to the speaker prompts, although they vary in quality.
In one synthesized sentence from the E