Seems like another step change. The first time I ran a local LLM on my phone and carried on a fairly coherent conversation, I imagined edge inference would take off really quickly at least with e.g. personal assistant/"digital waifu" business cases. I wonder what the next wave of apps built on Llama.cpp and its downstream technologies will do to the global economy in the next three months.
Then load the image with /image image.png inside the chat, and chat away!
EDIT: -ngl -1 is not needed anymore for Metal backends (CUDA still yes) (llama.cpp will auto offload to the GPU by default!). -1 means all GPU layers offloaded to the GPU.
I used this to create keywords and descriptions on a bunch of photos from a trip recently using Gemma3 4b. Works impressively well, including going doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken.
Man, the ngl abbreviation gets me every time too. Kinda cool seeing all the tweaks folks do to make this stuff run faster on their Macs. You think models hitting these speed boosts will mean more people start playing with vision stuff at home?
To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get
25t/s prompt processing
63t/s token generation
Overall processing time per image is ~15secs, no matter what size the image is. The small 4B has already very decent output, describing different images pretty well.
Note: if you are not using -hf, you must include the –mmproj switch or otherwise the web interface gives an error message that multimodal is not supported by the model.
I have used the official ggml-org/gemma-3-4b-it-GGUF quants, I expect the unsloth quants from danielhanchen to be a bit faster.
This is excellent. I've been pulling and rebuilding periodically, and watching the commit notes as they (mostly ngxson, I think) first added more vision models, each with their own CLI program, then unified those under a single CLI program and deprecated the standalone one, while bug fixing and improving the image processing. I'd been hoping that meant they'd eventually add support to the server again, and now it's here! Thanks!
Whoops, you're not connected to Mailchimp. You need to enter a valid Mailchimp API key.
Our site uses cookies. Learn more about our use of cookies: cookie policyACCEPTREJECT
Privacy & Cookies Policy
Privacy Overview
This website uses cookies to improve your experience while you navigate through the website. Out of these cookies, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may have an effect on your browsing experience.
16 Comments
simonw
This is the most useful documentation I've found so far to help understand how this works: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd…
gryfft
Seems like another step change. The first time I ran a local LLM on my phone and carried on a fairly coherent conversation, I imagined edge inference would take off really quickly at least with e.g. personal assistant/"digital waifu" business cases. I wonder what the next wave of apps built on Llama.cpp and its downstream technologies will do to the global economy in the next three months.
nico
How does this compare to using a multimodal model like gemma3 via ollama?
Any benefit on a Mac with apple silicon? Any experiences someone could share?
behnamoh
didn't llama.cpp use to have vision support last year or so?
danielhanchen
It works super well!
You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.
I made some quants with vision support – literally run:
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl -1
./llama.cpp/llama-mtmd-cli -hf unsloth/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl -1
Then load the image with /image image.png inside the chat, and chat away!
EDIT: -ngl -1 is not needed anymore for Metal backends (CUDA still yes) (llama.cpp will auto offload to the GPU by default!). -1 means all GPU layers offloaded to the GPU.
banana_giraffe
I used this to create keywords and descriptions on a bunch of photos from a trip recently using Gemma3 4b. Works impressively well, including going doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken.
Very nice for something that's self hosted.
nurettin
Didn't we already have vision via llava?
gitroom
Man, the ngl abbreviation gets me every time too. Kinda cool seeing all the tweaks folks do to make this stuff run faster on their Macs. You think models hitting these speed boosts will mean more people start playing with vision stuff at home?
buyucu
It was really sad when vision was removed back a while ago. It's great to see it restored. Many thanks to everyone involved!
simonw
llama.cpp offers compiled releases for multiple platforms. This release has the new vision features: https://github.com/ggml-org/llama.cpp/releases/tag/b5332
On macOS I downloaded the llama-b5332-bin-macos-arm64.zip file and then had to run this to get it to work:
Then I could run the interactive terminal (with a 3.2GB model download) like this (borrowing from https://news.ycombinator.com/item?id=43943370R)
Or start the localhost 8080 web server (with a UI and API) like this:
I wrote up some more detailed notes here: https://simonwillison.net/2025/May/10/llama-cpp-vision/
ngxson
We also support SmolVLM series which delivers light-speed response thanks to its mini size!
This is perfect for real-time home video surveillance system. That's one of the ideas for my next hobby project!
dust42
To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get
Overall processing time per image is ~15secs, no matter what size the image is. The small 4B has already very decent output, describing different images pretty well.
Steps to reproduce:
Then open http://127.0.0.1:8080/ for the web interface
Note: if you are not using -hf, you must include the –mmproj switch or otherwise the web interface gives an error message that multimodal is not supported by the model.
I have used the official ggml-org/gemma-3-4b-it-GGUF quants, I expect the unsloth quants from danielhanchen to be a bit faster.
mrs6969
so image processing there but image generation isn't ?
just trying to understand, awesome work so far.
bsaul
great news !
sidenote : Does vision include the ability to read a pdf ?
a_e_k
This is excellent. I've been pulling and rebuilding periodically, and watching the commit notes as they (mostly ngxson, I think) first added more vision models, each with their own CLI program, then unified those under a single CLI program and deprecated the standalone one, while bug fixing and improving the image processing. I'd been hoping that meant they'd eventually add support to the server again, and now it's here! Thanks!
nikolayasdf123
finally! very important use-case! glad they added it!