
Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon by dipampaul17
Run larger context windows and heavier LLMs on your Mac by applying different quantization precision to keys vs values in the attention mechanism’s KV cache. KVSplit enables you to:
- Reduce memory usage by up to 72% with minimal quality loss
- Run 2-3x longer contexts in the same memory budget
- Maintain or improve inference speed compared to FP16
- Optimize for Apple Silicon with full Metal support
Configuration | VRAM @ 8K tokens | Tokens/sec | Perplexity Change |
---|---|---|---|
FP16 (base) | 176.00 MB (100%) | 54,360 | — |
K8V8 (8-bit) | 93.50 MB (53%) | 51,503 | +0.03% |
K8V4 | 71.50 MB (41%) | 57,438 | +0.86% |
K4V8 | 71.50 MB (41%) | 58,690 | +6.06% |
K4V4 (4-bit) | 49.50 MB (28%) | 55,193 | +6.15% |
Configuration | 128 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
---|---|---|---|---|
FP16 (baseline) | 5.50 MB | 44.00 MB | 88.00 MB | 176.00 MB |
K8V8 (8-bit) | 2.92 MB | 23.38 MB | 46.75 MB | 93.50 MB |
K8V4 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
K4V8 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
K4V4 (4-bit) | 1.55 MB | 12.38 MB | 24.75 MB | 49.50 MB |
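These numbers scale linearly with context length and with the total K+V bits per element, so you can estimate the cache for any model from its hyperparameters. The sketch below is a back-of-the-envelope calculator, not the repo's measurement code; the layer/head/dimension values are assumptions for a hypothetical 7B-class GQA model (the tables above were measured on a much smaller test model), and per-block quantization scale overhead is ignored.

```python
# Back-of-the-envelope KV cache sizing. NOT the repo's measurement code.
# The hyperparameters are assumptions for a hypothetical 7B-class GQA model;
# substitute your model's real values (per-block scale overhead is ignored).

def kv_cache_mib(n_tokens, k_bits, v_bits,
                 n_layers=32, n_kv_heads=8, head_dim=128):
    """Approximate KV cache size in MiB for a given context length."""
    bytes_per_token = n_layers * n_kv_heads * head_dim * (k_bits + v_bits) / 8
    return n_tokens * bytes_per_token / (1024 ** 2)

for name, kb, vb in [("FP16", 16, 16), ("K8V8", 8, 8),
                     ("K8V4", 8, 4), ("K4V8", 4, 8), ("K4V4", 4, 4)]:
    print(f"{name}: {kv_cache_mib(8192, kb, vb):7.1f} MiB @ 8K tokens")
```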
- Independent quantization of keys and values in the KV cache
- Optimized for Apple Silicon with Metal support
- Comprehensive benchmarking suite with perplexity measurement
- Memory usage and performance analysis tools
- Publication-quality visualization tools
- Easy setup and usage
- macOS (tested on Apple Silicon)
- Homebrew package manager
- Xcode Command Line Tools
```bash
# Clone the repository
git clone https://github.com/dipampaul17/KVSplit.git
cd KVSplit

# Run the installer script
chmod +x scripts/install_kvsplit.sh
./scripts/install_kvsplit.sh
```
The installer will:
- Set up the project structure
- Clone and build llama.cpp with Metal support
- Configure for differentiated KV cache quantization
- Download a small test model (optional)
- Set up Python environment for visualization
Want to see the benefits immediately? Run a quick comparison with your model:
```bash
# Run quick comparison with different configurations
python scripts/quick_compare.py --model models/your-model.gguf
```
This will show you a side-by-side comparison of FP16, K8V8, K8V4, K4V8, and K4V4 with memory usage, speed, and quality metrics.
Configuration | VRAM @ 8K tokens | Memory Savings | Quality Impact |
---|---|---|---|
FP16 (base) | 176.00 MB | — | — |
K8V8 (8-bit) | 93.50 MB | 47% | +0.03% |
K8V4 | 71.50 MB | 59% | +0.86% |
K4V8 | 71.50 MB | 59% | +6.06% |
K4V4 (4-bit) | 49.50 MB | 72% | +6.15% |
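This table is also the arithmetic behind the 2-3x longer-context claim: at a fixed KV cache budget, K8V4 fits roughly 176 / 71.5 ≈ 2.5x as many tokens as FP16, and K4V4 roughly 176 / 49.5 ≈ 3.6x, before accounting for model weights and other overheads.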
Using KVSplit doesn't just save memory: the mixed K8V4 and K4V8 configurations also ran about 6-8% faster than FP16 in the 8K-context benchmark below.
Configuration | Tokens/sec (8K ctx) | Speedup vs FP16 |
---|---|---|
FP16 | 54,360 | — |
K8V8 | 51,503 | -5.3% |
K8V4 | 57,438 | +5.7% |
K4V8 | 58,690 | +8.0% |
K4V4 | 55,193 | +1.5% |
```
kvsplit/
├── llama.cpp/                   # Optimized llama.cpp build
├── models/                      # LLM model files
├── scripts/                     # Utility scripts
│   ├── benchmark_kvsplit.py     # Comprehensive benchmark tool
│   ├── install_kvsplit.sh       # One-command installer
│   ├── quick_compare.py         # Quick comparison utility
│   ├── capture_memory.sh        # GIF creation for memory visualization
│   └── visualize_results.py     # Generate publication-quality plots
├── results/                     # Benchmark results (CSV/JSON)
├── plots/                       # Generated visualizations
└── README.md                    # This file
```
The KV cache stores a key vector and a value vector for every token in the context, so at long context lengths it dominates memory use. Our benchmarks reveal a critical insight: keys are significantly more sensitive to quantization than values.
- Asymmetric Impact: Keys require higher precision than values to maintain quality
- Sweet Spot: K8V4 (8-bit keys, 4-bit values) provides the best balance:
  - Only 0.86% perplexity degradation vs. FP16
  - 59% memory reduction
  - Faster inference than FP16
- Confirmation: K4V8 shows roughly 7x more quality degradation than K8V4, despite using the same total number of bits
This asymmetry allows for more efficient memory usage without compromising model quality, enabling longer context windows and larger models on consumer hardware.
```bash
# Baseline (FP16)
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" -t 8 --flash-attn

# ⭐ RECOMMENDED: 8-bit keys, 4-bit values (K8V4)
# Best balance of quality and memory savings
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" -t 8 --flash-attn --kvq 8

# 4-bit keys, 8-bit values (K4V8)
# Shows why key precision matters more than value precision
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" -t 8 --flash-attn --kvq-key 4 --kvq-val 8

# 4-bit keys and values (K4V4)
# Maximum memory savings (72% reduction) with acceptable quality
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" -t 8 --flash-attn --kvq 4
```
```bash
# Run with a 32K context (would require ~1.4GB in FP16, only ~400MB with K8V4)
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf \
  -c 32768 -n 4096 -t 8 --flash-attn --kvq 8 \
  -f your-long-document.txt
```
Flag | Description | Recommendation |
---|---|---|
-t 8 | Number of threads | 8 is optimal for most Apple Silicon chips |
--flash-attn | Enables optimized attention | Recommended for Apple Silicon |
--kvq N | Sets both key and value bits to N | Use --kvq 8 for the K8V4 configuration |
--kvq-key N | Sets key bits only | Key precision has a major quality impact |
--kvq-val N | Sets value bits only | Value precision has a minor quality impact |
-c N | Context size in tokens | Longer contexts benefit more from KVSplit |
-n N | Number of tokens to generate | Adjust based on your needs |
-f FILE | Input file | For processing documents |
-m MODEL | Model path | Path to your .gguf model file |
For comprehensive performance analysis, use our full benchmark suite:
```bash
# Run the full benchmark suite (all configurations and sequence lengths)
python scripts/benchmark_kvsplit.py

# Run a specific configuration test
python scripts/benchmark_kvsplit.py --config K8V4 --seq-len 4096

# Generate publication-quality visualizations
python scripts/visualize_results.py
```
The benchmarking script provides thorough measurements of memory usage, inference speed, and perplexity across configurations and sequence lengths.
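If you just want a quick programmatic sweep before committing to the full suite, the documented flags can be driven directly from a few lines of Python. This is a rough sketch rather than scripts/benchmark_kvsplit.py: it only measures wall-clock time for a short generation, assumes the binary and model paths from the layout above, and leaves memory and perplexity measurement to the real tooling.

```python
# Rough wall-clock sweep over KV precision configs.
# A sketch, NOT scripts/benchmark_kvsplit.py -- it measures neither memory nor perplexity.
import subprocess
import time

LLAMA_CLI = "./llama.cpp/build/bin/llama-cli"  # path from the project layout above
MODEL = "models/your-model.gguf"               # substitute your model file

CONFIGS = {
    "FP16": [],
    "K8V8": ["--kvq-key", "8", "--kvq-val", "8"],
    "K8V4": ["--kvq-key", "8", "--kvq-val", "4"],
    "K4V8": ["--kvq-key", "4", "--kvq-val", "8"],
    "K4V4": ["--kvq-key", "4", "--kvq-val", "4"],
}

for name, extra in CONFIGS.items():
    cmd = [LLAMA_CLI, "-m", MODEL, "-p", "Summarize the benefits of unit testing.",
           "-n", "256", "-t", "8", "--flash-attn", *extra]
    start = time.time()
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"{name}: {time.time() - start:6.1f}s for a 256-token generation")
```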
8 Comments
nico
Great work. This seems very interesting, but I need something slightly more high level to relate to it
Will it just allow me to run let’s say a model with a 2048 token context window with a 4-6k context window? Or a 128k model (like gemma3) with a 256k+ context window?
What’s the ideal use case for local models?
Thank you
badmonster
I'm curious: is it possible to apply differentiated KV quantization (like K8V4) to models after they're already converted to .gguf format, or does this require rebuilding the model with special support? If it's compatible with any .gguf file, are there any limitations on model types (e.g. Mistral, Phi-3, etc.) or tokenizer configs?
entrepy123
Are these significantly faster/better on 64GB or 128GB Apple silicon (over 36GB or 48GB)?
I've been reading that large contexts and large models are just painfully slow, even on the fastest and largest Apple silicon that money can buy.
So I wonder if this helps make more use of greater memory, or if really smallish models are still where it's at for Apple silicon, practically speaking.
behnamoh
Is this patch possible to do on MLX? I'm getting better speeds on MLX. That, combined with your approach, would finally let Mac users have long conversations at usable speeds.
matheist
Looks interesting! Is there any intuition for why this should be the case? Did you discover it via that intuition, or just random experimentation?
A note, your install script appears to still have a placeholder at the "apply patch" step. A suggestion, might be more user-friendly to fork llama.cpp and then include that as a git submodule rather than make it a "git clone and apply patch" step.
A further note, everyone and their dog has a different local python set-up, might be nice to let people separate the llama.cpp stuff from the python stuff rather than bake in a dependence on homebrew python.
ondra
Is this any different from using --cache-type-k and --cache-type-v?
segmondy
[flagged]
smcleod
+0.86% perplexity it's quite a bit at such a small context size though isn't it? How is it at more reasonable context sizes like 64-128k?