Stephen Fry spoke at the CogX Festival in London last week. During that talk, he played a clip from a WWII documentary narrated by his voice. The voice heard in the clip is in fact not his; it's a clone of his voice, trained on the seven Harry Potter audiobooks that he patiently and painstakingly narrated in the past. The cloned voice was then used to narrate the documentary without Fry's knowledge or permission.
The voice is so good that you can’t tell the difference; even German words like Hauptsturmführer and Dutch place names are pronounced flawlessly. In his talk, Fry explains:
“This is not the result of a mash-up; there are plenty of those and they are completely obvious. This is from a flexible, artificial voice where the words are modulated to fit the meaning of each sentence uniquely. It could therefore have me read anything from a call to storm parliament, to hard porn, to product endorsements. All without my knowledge.”
He is, of course, right, and all of this is made possible by major leaps in the field of speech synthesis: new technology that is being developed and commercialized by companies like ElevenLabs, Speechify, and Resemble.ai, and that allows you to clone a voice with sometimes as little as a few minutes' worth of voice data.
It's a profound and terrifying development. Who owns our voice if anyone can create a copy of it for a few dollars and make it say whatever they want? And what else do we stand to lose if we can no longer distinguish a real voice from a fake one?
Computer-generated voices have been around for a while, but they were always pretty shit. The technology spliced together pre-recorded words and phrases to match the desired output, an approach known as concatenative synthesis.
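To make that concrete, here's a minimal sketch of the old concatenative approach, assuming a folder of pre-recorded word clips. The file names are placeholders of my own, and pydub is just one common audio library you could use for this:

```python
# A minimal sketch of concatenative synthesis: glue pre-recorded
# word clips together to "speak" a sentence. File paths are
# hypothetical; pydub handles the audio splicing.
from pydub import AudioSegment

# One pre-recorded WAV file per word (placeholder paths).
CLIPS = {
    "hello": "clips/hello.wav",
    "world": "clips/world.wav",
}

def speak(sentence: str, out_path: str = "output.wav") -> None:
    output = AudioSegment.empty()
    for word in sentence.lower().split():
        clip = AudioSegment.from_wav(CLIPS[word])  # KeyError for unknown words
        output += clip + AudioSegment.silent(duration=80)  # short gap between words
    output.export(out_path, format="wav")

speak("hello world")
```

The result sounds exactly as robotic as you'd expect: every word carries the prosody it was recorded with, regardless of where it lands in the sentence.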
Deep learning turned this on its head. Machine learning models learn from pre-recorded audio and pick up on the speech patterns that make a voice unique, such as rhythm, pace, intonation, and pronunciation. As of today, synthetic voices have achieved unprecedented sophistication: they can drop a pause at the right moment, imitate um's and ah's, and have even mastered nonverbal sounds like yawning, sighing, and chuckling.
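For a sense of how little code this takes today, here is a sketch using the open-source Coqui TTS library, whose XTTS model can clone a voice from a short reference recording. The model name and arguments follow Coqui's documented API at the time of writing; the reference file is a placeholder:

```python
# A sketch of modern neural voice cloning with the open-source
# Coqui TTS library. The model learns a voice from a short
# reference clip instead of splicing pre-recorded words.
from TTS.api import TTS

# Load a multilingual voice-cloning model (XTTS v2).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Synthesize new speech in the cloned voice.
# reference.wav is a placeholder: a few seconds of real speech.
tts.tts_to_file(
    text="This sentence was never recorded by the original speaker.",
    speaker_wav="reference.wav",  # sample of the voice to clone
    language="en",
    file_path="cloned_output.wav",
)
```

Under the hood, systems like this typically predict a spectrogram from the text and then run a neural vocoder to turn it into a waveform, which is what lets them modulate each sentence uniquely, exactly as Fry describes.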
These are lifelike voices that are all but indistinguishable from the real thing to the human ear. The technology is fast, readily available, and relatively cheap, and it will only get cheaper as the cost of compute keeps falling year after year, roughly in line with Moore's law.
Of course, there’s a great deal of mundane utility to be extracted. Think audiobooks, voice dubbing, voice assistants, social media content, podcasts, and video games. Last week, I created a YouTube Short (as a little experiment and promotion for the newsletter) and used a pre-made synthetic voice to read out a script that I wrote; the voiceover was generated in mere seconds.
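That workflow amounts to a single HTTP request against a commercial API. Here's a rough sketch against ElevenLabs' public text-to-speech endpoint; the API key and voice ID are placeholders, and the model name reflects their docs at the time of writing:

```python
# A sketch of generating a voiceover via a commercial TTS API.
# The endpoint shape follows ElevenLabs' public REST API; the
# key, voice ID, and script below are placeholders.
import requests

API_KEY = "your-api-key"    # placeholder
VOICE_ID = "your-voice-id"  # placeholder: a pre-made voice

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": "Welcome back to the channel. Today we look at voice cloning.",
        "model_id": "eleven_multilingual_v2",
    },
)
response.raise_for_status()

# The response body is the rendered audio (MP3 by default).
with open("voiceover.mp3", "wb") as f:
    f.write(response.content)
```

That's the whole pipeline: a script goes in, an audio file comes out, in seconds.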
However, there's a dark side to all of this.