
AudioX: Diffusion Transformer for Anything-to-Audio Generation by gnabgib

6 Comments

  • Fauntleroy
    Posted April 14, 2025 at 6:16 pm

    The video to audio examples are really impressive! The video featuring the band showcases some of the obvious shortcomings of this method (humans will have very precise expectations about the kinds of sounds 5 trombones will make)—but the tennis example shows its strengths (decent timing of hit sounds, eerily accurate acoustics for the large internal space). I'm very excited to see how this improves a few more papers down the line!

  • oezi
    Posted April 14, 2025 at 6:30 pm

    Audio, but not Speech, right?

  • teeklp
    Posted April 14, 2025 at 6:36 pm

    [dead]

  • gigel82
    Posted April 14, 2025 at 6:55 pm

    That "pseudo-human laughter" gave me some real chills; didn't realize uncanny valley for audio is a real thing but damn…

  • darkwater
    Posted April 14, 2025 at 9:04 pm

    The toilet flushing one is full of weird, unrelated noises.

    The tennis video, as others commented, is good, but there is a noticeable delay between the action and the sound.
    And the "loving couple holding AI hands and then dancing", well, the input is already cringe enough.

    For all these diffusion models, it looks like we are 90% there; now we just need the final 90%.

  • kristopolous
    Posted April 14, 2025 at 9:12 pm

    Really, the next big leap is something that gives me more meaningful artistic control over these systems.

    It's usually "generate a few, one of them is not terrible, none are exactly what I wanted" then modify the prompt, wait an hour or so …

    The workflow reminds me of programming 30 years ago: you did something, waited for the compile, saw if it worked, tried something else…

    All you've got are a few crude tools and a bit of grit and patience.

    On the i2v tools I've found that if I modify the input to make the contrast sharper, the shapes more discrete, and the object easier to segment, then I get better results. I wonder if there are hacks like that here.
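A minimal sketch of the kind of input preprocessing kristopolous describes, assuming a Pillow-based pipeline feeding a still image into an i2v (or here, video-to-audio) model. The function name and enhancement factors are hypothetical starting points, not anything from AudioX or this thread:

```python
# Rough illustration of "sharper contrast, more discrete shapes, easier to segment".
# Factors below are arbitrary; tune per input.
from PIL import Image, ImageEnhance, ImageOps

def preprocess_for_i2v(in_path: str, out_path: str) -> None:
    img = Image.open(in_path).convert("RGB")
    # Sharper contrast: spread mid-tones so edges read more clearly.
    img = ImageEnhance.Contrast(img).enhance(1.5)
    # More discrete shapes: posterize to fewer tonal levels per channel.
    img = ImageOps.posterize(img, bits=4)
    # Mild sharpening to help objects stand out for downstream segmentation.
    img = ImageEnhance.Sharpness(img).enhance(2.0)
    img.save(out_path)

preprocess_for_i2v("input_frame.png", "input_frame_preprocessed.png")
```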
