Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon by dipampaul17

8 Comments

  • nico
    Posted May 16, 2025 at 8:34 pm

    Great work. This seems very interesting, but I need something slightly more high-level to relate to it.

    Will it just allow me to run, let's say, a model with a 2048-token context window at a 4-6k context window? Or a 128k model (like gemma3) with a 256k+ context window?

    What’s the ideal use case for local models?

    Thank you
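
For a rough sense of where the headline 2-3x figure comes from, here is a back-of-the-envelope sketch of KV-cache memory per token at FP16 versus K8V4 (8-bit keys, 4-bit values). The model shape (32 layers, 32 KV heads, head dimension 128, roughly a 7B model without grouped-query attention) is an illustrative assumption, not a number taken from the KVSplit repo; the per-element sizes follow ggml's q8_0 and q4_0 block layouts.

```python
# Back-of-the-envelope KV-cache sizing; the model shape is illustrative, not measured.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, k_bytes_per_elem, v_bytes_per_elem):
    """One K vector and one V vector are cached per layer for every token."""
    return n_layers * n_kv_heads * head_dim * (k_bytes_per_elem + v_bytes_per_elem)

# Effective bytes per element for the cache types:
# fp16 = 2 bytes; q8_0 packs 32 values into a 34-byte block; q4_0 into an 18-byte block.
FP16, Q8_0, Q4_0 = 2.0, 34 / 32, 18 / 32

fp16_cache = kv_bytes_per_token(32, 32, 128, FP16, FP16)  # ~512 KiB per token
k8v4_cache = kv_bytes_per_token(32, 32, 128, Q8_0, Q4_0)  # ~208 KiB per token

print(f"FP16 KV cache : {fp16_cache / 1024:.0f} KiB/token")
print(f"K8V4 KV cache : {k8v4_cache / 1024:.0f} KiB/token")
print(f"Same memory budget holds ~{fp16_cache / k8v4_cache:.1f}x more tokens")
```

On that arithmetic the same memory budget holds roughly 2.5x more tokens, so a context that previously fit about 2k tokens can fit around 5k. Whether a model trained for 2048 or 128k tokens stays coherent beyond its trained window is a separate question; cache quantization stretches memory, not the model's trained context length.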

  • badmonster
    Posted May 16, 2025 at 8:40 pm

    I'm curious: is it possible to apply differentiated KV quantization (like K8V4) to models after they're already converted to .gguf format, or does this require rebuilding the model with special support? If it's compatible with any .gguf file, are there any limitations on model types (e.g. Mistral, Phi-3, etc.) or tokenizer configs?
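
On the .gguf question: in stock llama.cpp the KV-cache types are a runtime setting on the context rather than something baked into the converted model, which is why the --cache-type-k / --cache-type-v flags mentioned further down the thread work with any .gguf (Mistral, Phi-3, etc.). Below is a minimal sketch of launching a model that way; the binary path and model filename are assumptions for illustration, and whether KVSplit's patch changes anything about this loading path is a question for the author.

```python
# Minimal sketch: differentiated K/V cache types as runtime flags in stock llama.cpp.
# The binary location and model file below are assumed for illustration.
import subprocess

subprocess.run([
    "./llama.cpp/build/bin/llama-cli",        # assumed build location
    "-m", "models/mistral-7b-instruct.gguf",  # any converted .gguf
    "-c", "8192",                             # requested context length
    "-fa",                                    # flash attention; recent builds want it for a quantized V cache
    "--cache-type-k", "q8_0",                 # 8-bit keys
    "--cache-type-v", "q4_0",                 # 4-bit values
    "-p", "Summarize the KVSplit README.",
], check=True)
```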

  • entrepy123
    Posted May 16, 2025 at 8:49 pm

    Are these significantly faster/better on 64GB or 128GB Apple silicon (over 36GB or 48GB)?

    I've been reading that large contexts and large models are just painfully slow, even on the fastest and largest Apple silicon that money can buy.

    So I wonder if this helps make more use of greater memory, or if really smallish models are still where it's at for Apple silicon, practically speaking.

  • behnamoh
    Posted May 16, 2025 at 8:52 pm

    Is it possible to do this patch on MLX? I'm getting better speeds on MLX. That, combined with your approach, would finally let Mac users have long conversations at usable speeds.

  • matheist
    Posted May 16, 2025 at 8:53 pm

    Looks interesting! Is there any intuition for why this should be the case? Did you discover it via that intuition, or just random experimentation?

    A note: your install script appears to still have a placeholder at the "apply patch" step. A suggestion: it might be more user-friendly to fork llama.cpp and include that fork as a git submodule rather than make it a "git clone and apply patch" step.

    A further note: everyone and their dog has a different local Python setup, so it might be nice to let people separate the llama.cpp stuff from the Python stuff rather than bake in a dependence on Homebrew Python.

  • ondra
    Posted May 16, 2025 at 8:58 pm

    Is this any different from using --cache-type-k and --cache-type-v?

  • segmondy
    Posted May 16, 2025 at 9:07 pm

    [flagged]

  • smcleod
    Posted May 16, 2025 at 9:31 pm

    +0.86% perplexity is quite a bit at such a small context size though, isn't it? How is it at more reasonable context sizes like 64-128k?
