LLMs can see and hear without any training by T-A

16 Comments

  • Post Author
    underdeserver
    Posted April 26, 2025 at 1:47 pm
  • Post Author
    jagged-chisel
    Posted April 26, 2025 at 1:50 pm

    Computers can receive input without any programming. Not sure what’s interesting here.

  • Post Author
    scribu
    Posted April 26, 2025 at 1:53 pm

    This seems to be a system to generate better prompts to be fed into a base multimodal model.

    Interesting, but title is definitely clickbait.

  • Post Author
    EncomLab
    Posted April 26, 2025 at 2:01 pm

    My photoresistor nightlight can "see" that it is dark and it "knows" to turn on the light – not only does it not have training, it does not have any code!

    And if you think that is amazing, my bi-metallic strip thermostat "feels" the temperature and then modifies the environment because it "knows" if it's hot to turn on the A/C, and if it's cold to turn on the heat – no training or code!

    All of this AI stuff is just unbelievably incredible – what a brave new world (of word games)!

  • Post Author
    viraptor
    Posted April 26, 2025 at 2:01 pm

    That looks like a classic Actor/Critic setup, yet it's not mentioned even once in the paper. Am I missing some large difference here?

  • Post Author
    lngnmn2
    Posted April 26, 2025 at 2:06 pm

    [dead]

  • Post Author
    JoBrad
    Posted April 26, 2025 at 2:19 pm

    Exactly how little training is "without any"? I'm assuming that companies haven't been spending billions trying to train LLMs to better understand things when they can do it without any training.

  • Post Author
    blogabegonija
    Posted April 26, 2025 at 2:27 pm

    [dead]

  • Post Author
    3rdworldeng
    Posted April 26, 2025 at 2:35 pm

    Find me Jose Monkey will do that too :-)

  • Post Author
    sega_sai
    Posted April 26, 2025 at 3:48 pm

    The paper certainly contradicts my expectation from the title. I.e. it does not present an LLM that can generate images without any access to images before.

  • Post Author
    vessenes
    Posted April 26, 2025 at 4:12 pm

    I’ve read the paper and the skeptical comments here, to wit: it’s just an actor/critic pipeline by another name.

    I’ll bite and say this is actually interesting — and the paper title is misleading.

    What they’ve done here is hooked up a text-only LLM to multimodal critics, given it (mostly) an image diffusion generation task, and asked it to improve its prompting of the multimodal generation by getting a set of scores back.

    This definitely works, based on their outputs. Which is to say, LLMs can, zero shot, with outside tool feedback, iteratively improve their prompting using only that tooling feedback.

    Why is this interesting? Well, this did not work in the GPT-3 era; it seems to do so now. I see this as an interesting line to be added in the ‘model capabilities’ box as our models get larger and more sophisticated — the LLMs can perform some sort of internally guided search against a black box generator and use a black box scorer to improve at inference time.

    That’s pretty cool. It’s also generalizable, and I think is worth keeping in mind on the stack of possible approaches for, say agentic coding, that you can use a critic to not just ‘improve’ generated output, but most likely do some guided search through output space.
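    The loop vessenes describes — an LLM iteratively rewriting its prompt and keeping whichever candidate a black-box critic scores highest — can be sketched roughly as follows. Everything here is a toy stand-in (the "generator" just echoes the prompt, the "critic" counts target words, and `propose_edit` is a random mutation rather than an LLM rewrite); it illustrates the greedy search structure, not the paper's actual pipeline:

```python
import random

TARGET = "a red fox in the snow"  # stand-in for whatever the critic rewards

def generate(prompt):
    # Stand-in for the black-box image generator: here it just echoes the prompt.
    return prompt

def score(output):
    # Stand-in for the multimodal critic: fraction of target words present.
    words = set(output.split())
    target_words = TARGET.split()
    return sum(w in words for w in target_words) / len(target_words)

def propose_edit(prompt, vocabulary):
    # Stand-in for the LLM's rewrite step: append a candidate word.
    return prompt + " " + random.choice(vocabulary)

def refine_prompt(prompt, steps=50, seed=0):
    random.seed(seed)
    vocab = TARGET.split()
    best, best_score = prompt, score(generate(prompt))
    for _ in range(steps):
        candidate = propose_edit(best, vocab)
        s = score(generate(candidate))
        if s > best_score:  # keep only improvements: greedy inference-time search
            best, best_score = candidate, s
    return best, best_score

best_prompt, final_score = refine_prompt("a fox")
```

    The key property, as the comment notes, is that nothing here updates model weights — the "improvement" happens entirely at inference time, guided by scores from an external tool.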

  • Post Author
    nico
    Posted April 26, 2025 at 4:29 pm

    To people curious or skeptical if this could be called “seeing” or “hearing”, I recommend listening to the Batman podcast episode on NPR (https://www.npr.org/2015/01/23/379134306/batman-pt-1)

    Through the story and experience of a blind man, they end up getting into the question of what it means to see.

    The podcast is pretty straightforward, but it does end up showing that defining “seeing” is a philosophical question, rather than one with a simple, obvious answer.

  • Post Author
    TheCoreh
    Posted April 26, 2025 at 5:42 pm

    Is the LLM essentially playing "Wordle" with an external system that rates the quality of its output, gradually climbing the score ladder until it produces good results?

  • Post Author
    robocop_legacy
    Posted April 26, 2025 at 6:51 pm

    I think there is potentially a powerful method here. Specifically, the optimal context for a given task can be saved, and a meta-learner can be trained to map the task to the context. This would allow fine-tuning a model for some specific task without retraining the LLM. For example, generating an SEM image of some material with a specified porosity and grain size.
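    The save-and-retrieve idea in this comment could be sketched as a context cache keyed by task description. The nearest-task lookup below uses plain string similarity as a crude stand-in for the proposed meta-learner (which would be trained to do this mapping); all names and example tasks are hypothetical:

```python
from difflib import SequenceMatcher

# task description -> best prompt/context found by an earlier refinement run
context_cache = {}

def save_context(task, context):
    context_cache[task] = context

def retrieve_context(task):
    # Stand-in for the meta-learner: return the context of the most
    # similar saved task, by string similarity.
    if not context_cache:
        return None
    nearest = max(
        context_cache,
        key=lambda t: SequenceMatcher(None, t, task).ratio(),
    )
    return context_cache[nearest]

save_context("SEM image, high porosity", "prompt A tuned for porous SEM")
save_context("portrait photo, soft light", "prompt B tuned for portraits")

# A new-but-related task reuses the closest cached context.
result = retrieve_context("SEM image, low porosity")
```

    The design choice worth noting: the expensive search happens once per task family, and new tasks amortize it through retrieval rather than re-running the critic loop from scratch.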

  • Post Author
    v-rt
    Posted April 26, 2025 at 7:33 pm

    "without training" describes transfer learning

  • Post Author
    v01rt
    Posted April 26, 2025 at 7:34 pm

    "without training" describes transfer learning with an actor / critic approach


© 2025 HackTech.info. All Rights Reserved.
