
LLMs can see and hear without any training (by T-A)
Official implementation of the paper LLMs can see and hear without any training.
Install the conda environment using
conda env create -f environment.yml
conda activate MILS
Download the following datasets, annotations, and checkpoints
MS-COCO: Download the MS-COCO validation dataset from the official website here. Also, download the 5,000-sample test split used in Karpathy et al., Deep visual-semantic alignments for generating image descriptions, CVPR 2015.
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip val2014.zip
unzip annotations_trainval2014.zip
Clotho: Download the Clotho dataset from the official website here. We use the test split of this dataset for our benchmarking.
wget https://zenodo.org/records/3490684/files/clotho_audio_evaluation.7z
pip3 install dtrx
wget https://www.7-zip.org/a/7z2107-linux-x64.tar.xz
tar xf 7z2107-linux-x64.tar.xz
./7zz e clotho_audio_evaluation.7z
wget https://zenodo.org/records/3490684/files/clotho_captions_evaluation.csv
MSR-VTT: Download the dataset from here. We use the test split of this dataset.
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
unzip MSRVTT.zip
ViClip-InternVid-10M-FLT.pth: Download from here and set the correct path in task_utils/video/viclip.py.
Update the variables in paths.py to set the dataset directory, and the output folder.
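As a rough sketch of what paths.py needs to point at (the variable names other than OUTPUT_DIR are hypothetical placeholders here; keep the names the file already defines):

# paths.py, illustrative values only
COCO_DIR = "/path/to/coco/val2014"           # hypothetical name: MS-COCO validation images
CLOTHO_DIR = "/path/to/clotho/evaluation"    # hypothetical name: Clotho evaluation audio
MSRVTT_DIR = "/path/to/MSRVTT"               # hypothetical name: MSR-VTT videos
OUTPUT_DIR = "/path/to/outputs"              # where generated captions are written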
MILS is an inference-only method that can be run on a single A100 GPU. We run the experiments on eight A100 GPUs, and the code below can be adjusted for any number of GPUs.
Image captioning (MS-COCO): generate captions using
CUDA_VISIBLE_DEVICES=0 python main_image_captioning.py --process 0 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=1 python main_image_captioning.py --process 1 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=2 python main_image_captioning.py --process 2 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=3 python main_image_captioning.py --process 3 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=4 python main_image_captioning.py --process 4 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=5 python main_image_captioning.py --process 5 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=6 python main_image_captioning.py --process 6 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=7 python main_image_captioning.py --process 7 --num_processes 8 --batch_size 32 &
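If you prefer not to copy the command once per GPU, a small launcher along these lines (not part of the repo, shown as a sketch) spawns one shard per device and generalizes to any GPU count:

# launch_captioning.py, hypothetical helper equivalent to the commands above
import os
import subprocess

NUM_GPUS = 8  # set to the number of available GPUs

procs = []
for gpu in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["python", "main_image_captioning.py",
         "--process", str(gpu),
         "--num_processes", str(NUM_GPUS),
         "--batch_size", "32"],
        env=env,
    ))
for p in procs:
    p.wait()  # block until all shards finish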
The captions are saved in OUTPUT_DIR. Specify this path in the ours_result_path variable in eval/image_captioning.py, and then obtain the captioning metrics with
python eval/image_captioning.py
Audio captioning (Clotho): generate captions using
CUDA_VISIBLE_DEVICES=0 python main_audio_captioning.py --process 0 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=1 python main_audio_captioning.py --process 1 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=2 python main_audio_captioning.py --process 2 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=3 python main_audio_captioning.py --process 3 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=4 python main_audio_captioning.py --process 4 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=5 python main_audio_captioning.py --process 5 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=6 python main_audio_captioning.py --process 6 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=7 python main_audio_captioning.py --process 7 --num_processes 8 --batch_size 32 &
The captions are saved in OUTPUT_DIR. Specify this path in the address variable in eval/audio_captioning.py, and then obtain the captioning metrics with
python eval/audio_captioning.py
Video captioning (MSR-VTT): generate captions using
CUDA_VISIBLE_DEVICES=0 python main_video_captioning.py --process 0 --num_processes 8 --batch_size 8 &
CUDA_VISIBLE_DEVICES=1 python main_video_captioning.py --process 1 --num_processes 8 --batch_size 8 &
CUDA_VISIBLE_DEVICES=2 python main_video_captioning.py --process 2 --num_processes 8 --batch_size 8 &
CUDA_VISIBLE_DEVICES=3 python main_video_captioning.py --process 3 --num_processes 8 --batch_size 8 &
CUDA_VISIBLE_DEVICES=4 python main_video_captioning.py --process 4 --num_processes 8 --batch_size 8 &
CUDA_VISIBLE_DEVICES=5 python main_video_captioning.py --process 5 --num_processes 8 --batch_size 8 &
CUDA_VISIBLE_DEVICES=6 python main_video_captioning.py --process 6 --num_processes 8 --batch_size 8 &
CUDA_VISIBLE_DEVICES=7 python main_video_captioning.py --process 7 --num_processes 8 --batch_size 8 &
16 Comments
underdeserver
Paper: https://arxiv.org/pdf/2501.18096
jagged-chisel
Computers can receive input without any programming. Not sure what’s interesting here.
scribu
This seems to be a system to generate better prompts to be fed into a base multimodal model.
Interesting, but title is definitely clickbait.
EncomLab
My photoresistor nightlight can "see" that it is dark and it "knows" to turn on the light – not only does it not have training, it does not have any code!
And if you think that is amazing, my bi-metallic strip thermostat "feels" the temperature and then modifies the environment because it "knows" if it's hot to turn on the A/C, and if it's cold to turn on the heat – no training or code!
All of this AI stuff is just unbelievably incredible – what a brave new world (of word games)!
viraptor
That looks like a classic Actor/Critic setup, yet it's not mentioned even once in the paper. Am I missing some large difference here?
JoBrad
Exactly how little training is "without any"? I'm assuming that companies haven't been spending billions trying to train LLMs to better understand things when they can do it without any training.
3rdworldeng
Find me Jose Monkey will do that too :-)
sega_sai
The paper certainly contradicts my expectation from the title. I.e. it does not present an LLM that can generate images without any access to images before.
vessenes
I’ve read the paper and the skeptical comments here, to wit: it’s just an actor/critic pipeline by another name.
I’ll bite and say this is actually interesting — and the paper title is misleading.
What they’ve done here is hooked up a text-only LLM to multimodal critics, given it (mostly) an image diffusion generation task, and asked it to improve its prompting of the multimodal generation by getting a set of scores back.
This definitely works, based on their outputs. Which is to say, LLMs can, zero shot, with outside tool feedback, iteratively improve their prompting using only that tooling feedback.
Why is this interesting? Well, this did not work in the GPT-3 era; it seems to do so now. I see this as an interesting line to be added in the ‘model capabilities’ box as our models get larger and more sophisticated — the LLMs can perform some sort of internally guided search against a black box generator and use a black box scorer to improve at inference time.
That’s pretty cool. It’s also generalizable, and I think it’s worth keeping in mind on the stack of possible approaches for, say, agentic coding: you can use a critic not just to ‘improve’ generated output, but most likely to do some guided search through output space.
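A rough sketch of the loop being described, in Python with hypothetical function names (my reading of the comment, not the repo's actual API):

def optimize_prompt(llm, generator, scorer, task, steps=10, k=8):
    # The text-only LLM never sees the generated media, only (score, prompt) pairs.
    candidates = llm.propose(task, feedback=None, n=k)    # initial zero-shot guesses
    best = None
    for _ in range(steps):
        outputs = [generator(p) for p in candidates]       # e.g. an image diffusion model
        scores = [scorer(task, o) for o in outputs]        # multimodal critic, e.g. CLIP similarity
        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        best = ranked[0]
        # LLM rewrites its prompts conditioned only on the scored feedback
        candidates = llm.propose(task, feedback=ranked[:k // 2], n=k)
    return best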
nico
To people curious or skeptical if this could be called “seeing” or “hearing”, I recommend listening to the Batman podcast episode on NPR (https://www.npr.org/2015/01/23/379134306/batman-pt-1)
Through the story and experience of a blind man, they end up getting into the question of what does it mean to see
The podcast is pretty straightforward, but it does end up showing that defining “seeing” is a philosophical question, rather than a simple obvious answer
TheCoreh
Is the LLM essentially playing "Wordle" with an external system that rates the quality of its output, gradually climbing the score ladder until it produces good results?
robocop_legacy
I think there is potentially a powerful method here. Specifically, the optimal context for a given task can be saved, and a meta-learner can be trained to map the task to the context. This would allow fine-tuning a model for some specific task without retraining the LLM. For example, generating an SEM image of some material with a specified porosity and grain size.
v-rt
"without training" describes transfer learning
v01rt
"without training" describes transfer learning with an actor / critic approach