
This Voice Doesn’t Exist – Generative Voice AI by goleary
Lately it seems everybody is talking about generative AI. Deep learning-powered large language and text-to-image models like ChatGPT, Stable Diffusion, DALL-E and Midjourney have caused a stir in the tech world and beyond. Many count them among the most significant recent developments in AI. Whether or not you agree, the general sentiment is that something very powerful has arrived. In 2023 we’ll hear more about models that can help you draw or create videos. Much as we ask which smartphone is the latest and greatest, we’ll soon be asking the same about foundation models.
Yet for all this excitement, we feel there’s one area within generative media that’s still severely underhyped: voice AI. It’s also the area in which we seek to become leaders. At Eleven, we rely every day on the potential unlocked by deep learning techniques to power our lifelike text-to-speech and voice cloning tools. And now we’re also deploying our own generative model, which lets you design entirely new synthetic voices from scratch.
Voice Generator – design a voice
Our users take to the platform daily to bring their characters to life, be it for audiobooks, games or fan fiction. We realized our current speaker bank is too small for every user to find voices that match their content needs and still keep those voices exclusive to them. Our solution was to let you design entirely new synthetic voices.
The idea for how to do this came as we unpacked the methods we currently use for speech synthesis and voice cloning. Both processes require a way of encoding the characteristics of a particular voice. Speaker embeddings carry this identity: each one is a vector representation of a speaker’s voice. We realized that by training a dedicated model on the distribution of speaker embeddings, we could sample from that distribution to create infinitely many new voices.
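To make the sampling idea concrete, here is a minimal sketch in Python. It is not our production model: it swaps in a simple diagonal Gaussian fitted per dimension as a stand-in for the dedicated generative model, and the four-dimensional embeddings, the function names `fit_diagonal_gaussian` and `sample_voice`, and all the numbers are hypothetical, chosen only for illustration.

```python
import random
import statistics


def fit_diagonal_gaussian(embeddings):
    """Estimate a mean and standard deviation for each embedding dimension."""
    dims = list(zip(*embeddings))  # transpose: one tuple of values per dimension
    means = [statistics.fmean(d) for d in dims]
    stdevs = [statistics.stdev(d) for d in dims]
    return means, stdevs


def sample_voice(means, stdevs, rng):
    """Draw one new embedding from the fitted per-dimension Gaussians."""
    return [rng.gauss(m, s) for m, s in zip(means, stdevs)]


# Toy data: made-up 4-dimensional embeddings for five known speakers.
# Real speaker embeddings have far more dimensions.
known_speakers = [
    [0.12, -0.40, 0.88, 0.05],
    [0.30, -0.10, 0.75, -0.20],
    [-0.05, -0.55, 0.92, 0.15],
    [0.22, -0.35, 0.60, 0.00],
    [0.08, -0.25, 0.81, -0.10],
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
means, stdevs = fit_diagonal_gaussian(known_speakers)

# Each sample is a new point in embedding space, i.e. a voice that
# matches none of the known speakers exactly.
new_voices = [sample_voice(means, stdevs, rng) for _ in range(3)]
for voice in new_voices:
    print([round(x, 3) for x in voice])
```

Because every draw is a fresh point from a continuous distribution, there is no fixed limit on how many distinct embeddings can be generated. A learned generative model plays the same role as the Gaussian here, but can capture much richer structure in the space of voices.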
Since our users mostly look for specific speech characteristics, we needed to add a degree of control over the process. We expanded our model with conditioning to generate voices based on their characteristics. The model now lets you set certain basic parameters which establish the new voice’s core identity: gender, age, accent, pitch and speaking style. In other words, ev