Imagine a professional musician being able to explore new compositions without having to play a single note on an instrument. Or an indie game developer populating virtual worlds with realistic sound effects and ambient noise on a shoestring budget. Or a small business owner adding a soundtrack to their latest Instagram post with ease. That’s the promise of AudioCraft — our simple framework that generates high-quality, realistic audio and music from text-based user inputs after training on raw audio signals as opposed to MIDI or piano rolls.
-
Introducing CM3leon, a more efficient, state-of-the-art generative model for text and images
-
Introducing Voicebox: The first generative AI model for speech to generalize across tasks with state-of-the-art performance
-
Introducing speech-to-text, text-to-speech, and more for 1,100+ languages
RECOMMENDED READS
AudioCraft consists of three models: MusicGen, AudioGen, and EnCodec. MusicGen, which was trained with Meta-owned and specifically licensed music, generates music from text-based user inputs, while AudioGen, which was trained on public sound effects, generates audio from text-based user inputs. Today, we’re excited to release an improved version of our EnCodec decoder, which allows for higher quality music generation with fewer artifacts; our pre-trained AudioGen model, which lets you generate environmental sounds and sound effects like a dog barking, cars honking, or footsteps on a wooden floor; and all of the AudioCraft model weights and code. The models are available for research purposes and to further people’s understanding of the technology. We’re excited to give researchers and practitioners access so they can train their own models with their own datasets for the first time and help advance the state of the art.
From text to audio with ease
In recent years, generative AI models including language models have made huge strides and shown exceptional abilities: from the generation of a wide-variety of images and video from text descriptions exhibiting spatial understanding to text and speech models that perform machine translation or even text or speech dialogue agents. Yet while we’ve seen a lot of excitement around generative AI for images, video, and text, audio has always seemed to lag a bit behind. There’s some work out there, but it’s highly complicated and not very open, so people aren’t able to readily play with it.
Generating high-fidelity audio of any kind requires modeling complex signals and patterns at varying scales. Music is arguably the most challenging type of audio to generate because it’s composed of local and long-range patterns, from a suite of notes to a global musical structure with multiple instruments. Generating coherent music with AI has often been addressed through the use of symbolic representations like MIDI or piano rolls. However, these approaches are unable to fully grasp the expressive nuances and stylistic elements found in music. More recent advances leverage self-supervised audio representation learning and a number of hierarchical or cascaded models to generate music, feeding the raw audio into a complex system in order to capture long-range structures in the signal while generating quality audio. But we knew that more could be done in this field.
The AudioCraft family of models is capable of producing high-quality audio with long-term consistency, and it can be easily interacted with through a natural interface. With AudioCraft, we simplify the overall design of generative models for audio compared to prior work in the field — giving people the full recipe to play with the existing models that Meta has been developing over the past several years while also empowering them to push the limits and develop their own models.
AudioCraft works for music and sound generation and compression — all in the same place. Because it’s easy to build on and reuse, people who want to build better sound generators, compression algorithms, or music generators can do it all in the same code ba