AudioX: Diffusion Transformer for Anything-to-Audio Generation by gnabgib
HKUST
Abstract
Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture.
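The masked-training idea is straightforward to sketch: embed each conditioning modality, then randomly replace some of them with a learned placeholder so the model must generate audio from whatever remains. Below is a minimal, hypothetical PyTorch sketch of that idea; the module, dimensions, and mask probability are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of multi-modal masked conditioning (not the authors' code).
# Each conditioning modality (text, video, audio) is embedded elsewhere; here we
# randomly drop whole modalities per sample, substituting a learned "masked" token.
import torch
import torch.nn as nn

class MaskedMultiModalConditioner(nn.Module):
    def __init__(self, dim=512, p_mask=0.3):
        super().__init__()
        self.p_mask = p_mask
        # one learned placeholder embedding per modality, used when that modality is masked
        self.mask_tokens = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(1, 1, dim)) for m in ("text", "video", "audio")
        })

    def forward(self, feats):
        # feats: dict of modality name -> (batch, seq_len, dim) embeddings
        out = []
        for name, x in feats.items():
            b = x.shape[0]
            # per-sample Bernoulli mask: 1 = drop this modality for this sample
            drop = (torch.rand(b, 1, 1, device=x.device) < self.p_mask).float()
            token = self.mask_tokens[name].expand_as(x)
            out.append(drop * token + (1.0 - drop) * x)
        # concatenate the (possibly masked) streams into one conditioning sequence
        return torch.cat(out, dim=1)

# usage: build the conditioning sequence for one diffusion-transformer training step
cond = MaskedMultiModalConditioner(dim=512, p_mask=0.3)
feats = {
    "text": torch.randn(4, 16, 512),
    "video": torch.randn(4, 32, 512),
    "audio": torch.randn(4, 64, 512),
}
c = cond(feats)  # (4, 112, 512) conditioning sequence with some modalities masked out
```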
Text-to-Audio Generation
Prompt:
Thunder and rain during a sad piano solo
6 Comments
Fauntleroy
The video to audio examples are really impressive! The video featuring the band showcases some of the obvious shortcomings of this method (humans will have very precise expectations about the kinds of sounds 5 trombones will make)—but the tennis example shows its strengths (decent timing of hit sounds, eerily accurate acoustics for the large internal space). I'm very excited to see how this improves a few more papers down the line!
oezi
Audio, but not Speech, right?
gigel82
That "pseudo-human laughter" gave me some real chills; didn't realize uncanny valley for audio is a real thing but damn…
darkwater
The toilet flushing one is full of weird, unrelated noises.
The tennis video, as others commented, is good, but there is a noticeable delay between the action and the sound.
And the "loving couple holding AI hands and then dancing", well, the input is already cringe enough.
For all these diffusion models, it looks like we are 90% there; now we just need the final 90%.
kristopolous
Really, the next big leap is something that gives me more meaningful artistic control over these systems.
It's usually "generate a few, one of them is not terrible, none are exactly what I wanted", then modify the prompt, wait an hour or so…
The workflow reminds me of programming 30 years ago – you did something, waited for the compile, saw if it worked, tried something else…
All you've got are a few crude tools and a bit of grit and patience.
On the i2v tools I've found that if I modify the input to make the contrast sharper, the shapes more discrete, the object easier to segment, then I get better results. I wonder if there are hacks like that here.
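For illustration, a minimal sketch of that kind of input preprocessing, assuming Pillow; the function name and enhancement factors are arbitrary, untested guesses rather than a recommended recipe.

```python
# Hypothetical preprocessing along those lines: boost contrast and sharpen edges
# before handing the frame to an i2v (or video-to-audio) model.
from PIL import Image, ImageEnhance, ImageFilter

def preprocess_for_i2v(path, contrast=1.4, sharpness=2.0):
    img = Image.open(path).convert("RGB")
    img = ImageEnhance.Contrast(img).enhance(contrast)    # make shapes more distinct
    img = ImageEnhance.Sharpness(img).enhance(sharpness)  # crisper object boundaries
    img = img.filter(ImageFilter.EDGE_ENHANCE)            # emphasize segmentable edges
    return img

# preprocess_for_i2v("frame.png").save("frame_preprocessed.png")
```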