Vision Now Available in Llama.cpp by redman25

ByHackTech May 10, 2025

16Comments

Share This Article

Sed ut perspiciatis unde.

Send to HN

0Likes

Written by

HackTech

View all posts by HackTech

Show comments (16)

16 Comments

Post Author

simonw

Posted May 10, 2025 at 4:17 am

This is the most useful documentation I've found so far to help understand how this works: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd…

0Likes Log in to Reply
Post Author

gryfft

Posted May 10, 2025 at 4:20 am

Seems like another step change. The first time I ran a local LLM on my phone and carried on a fairly coherent conversation, I imagined edge inference would take off really quickly at least with e.g. personal assistant/"digital waifu" business cases. I wonder what the next wave of apps built on Llama.cpp and its downstream technologies will do to the global economy in the next three months.

0Likes Log in to Reply
Post Author

nico

Posted May 10, 2025 at 4:48 am

How does this compare to using a multimodal model like gemma3 via ollama?

Any benefit on a Mac with apple silicon? Any experiences someone could share?

0Likes Log in to Reply
Post Author

behnamoh

Posted May 10, 2025 at 5:07 am

didn't llama.cpp use to have vision support last year or so?

0Likes Log in to Reply
Post Author

danielhanchen

Posted May 10, 2025 at 5:10 am

It works super well!

You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.

I made some quants with vision support – literally run:

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1

./llama.cpp/llama-mtmd-cli -hf unsloth/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1

Then load the image with /image image.png inside the chat, and chat away!

EDIT: -ngl -1 is not needed anymore for Metal backends (CUDA still yes) (llama.cpp will auto offload to the GPU by default!). -1 means all GPU layers offloaded to the GPU.

0Likes Log in to Reply
Post Author

banana_giraffe

Posted May 10, 2025 at 5:40 am

I used this to create keywords and descriptions on a bunch of photos from a trip recently using Gemma3 4b. Works impressively well, including going doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken.

Very nice for something that's self hosted.

0Likes Log in to Reply
Post Author

nurettin

Posted May 10, 2025 at 6:23 am

Didn't we already have vision via llava?

0Likes Log in to Reply
Post Author

gitroom

Posted May 10, 2025 at 6:29 am

Man, the ngl abbreviation gets me every time too. Kinda cool seeing all the tweaks folks do to make this stuff run faster on their Macs. You think models hitting these speed boosts will mean more people start playing with vision stuff at home?

0Likes Log in to Reply
Post Author

buyucu

Posted May 10, 2025 at 6:30 am

It was really sad when vision was removed back a while ago. It's great to see it restored. Many thanks to everyone involved!

0Likes Log in to Reply
Post Author

simonw

Posted May 10, 2025 at 6:31 am
llama.cpp offers compiled releases for multiple platforms. This release has the new vision features: https://github.com/ggml-org/llama.cpp/releases/tag/b5332

On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:

unzip llama-b5332-bin-macos-arm64.zip cd build/bin sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib

Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370R)

./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99

Or start the localhost 8080 web server (with a UI and API) like this:

./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99

I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/
0Likes Log in to Reply

Post Author

ngxson

Posted May 10, 2025 at 6:51 am

We also support SmolVLM series which delivers light-speed response thanks to its mini size!

This is perfect for real-time home video surveillance system. That's one of the ideas for my next hobby project!

    llama-server -hf ggml-org/SmolVLM-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-256M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
    llama-server -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF

0Likes

Post Author

dust42

Posted May 10, 2025 at 6:55 am
To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get

25t/s prompt processing 63t/s token generation

Overall processing time per image is ~15secs, no matter what size the image is. The small 4B has already very decent output, describing different images pretty well.

Steps to reproduce:

git clone https://github.com/ggml-org/llama.cpp.git cmake -B build cmake --build build --config Release -j 12 --clean-first # download model and mmproj files... build/bin/llama-server --model gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf

Then open http://127.0.0.1:8080/ for the web interface

Note: if you are not using -hf, you must include the –mmproj switch or otherwise the web interface gives an error message that multimodal is not supported by the model.

I have used the official ggml-org/gemma-3-4b-it-GGUF quants, I expect the unsloth quants from danielhanchen to be a bit faster.
0Likes Log in to Reply
Post Author

mrs6969

Posted May 10, 2025 at 7:15 am

so image processing there but image generation isn't ?

just trying to understand, awesome work so far.

0Likes Log in to Reply
Post Author

bsaul

Posted May 10, 2025 at 7:41 am

great news !
sidenote : Does vision include the ability to read a pdf ?

0Likes Log in to Reply
Post Author

a_e_k

Posted May 10, 2025 at 8:06 am

This is excellent. I've been pulling and rebuilding periodically, and watching the commit notes as they (mostly ngxson, I think) first added more vision models, each with their own CLI program, then unified those under a single CLI program and deprecated the standalone one, while bug fixing and improving the image processing. I'd been hoping that meant they'd eventually add support to the server again, and now it's here! Thanks!

0Likes Log in to Reply
Post Author

nikolayasdf123

Posted May 10, 2025 at 8:22 am

finally! very important use-case! glad they added it!

0Likes Log in to Reply

Vision Now Available in Llama.cpp by redman25

Vision Now Available in Llama.cpp by redman25

Share This Article

Newsletter

HackTech

16 Comments

simonw

gryfft

nico

behnamoh

danielhanchen

banana_giraffe

nurettin

gitroom

buyucu

simonw

ngxson

dust42

mrs6969

bsaul

a_e_k

nikolayasdf123

Leave a comment Cancel reply

Editor's Choice

Vision Now Available in Llama.cpp by redman25

Vision Now Available in Llama.cpp by redman25

Share This Article

Newsletter

16 Comments

Leave a comment Cancel reply

Editor's Choice

Sign Up to Our Newsletter