< English | 中文 >
Important
You can now run DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon using the latest llama.cpp Portable Zip.
This guide demonstrates how to use the llama.cpp portable zip to run llama.cpp directly on Intel GPU with ipex-llm
(without the need for manual installation).
Note
llama.cpp portable zip has been verified on:
- Intel Core Ultra processors
- Intel Core 11th – 14th gen processors
- Intel Arc A-Series GPU
- Intel Arc B-Series GPU
Check your GPU driver version, and update it if needed:
- For Intel Core Ultra processors (Series 2) or Intel Arc B-Series GPU, we recommend updating your GPU driver to the latest version
- For other Intel iGPU/dGPU, we recommend using GPU driver version 32.0.101.6078
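If you are unsure which driver version is installed, a quick way to check it on Windows is the command below (an illustrative one-liner, not part of the original guide); the same information is also shown under Display adapters in Device Manager:
wmic path win32_VideoController get Name,DriverVersion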
For Windows users, download the IPEX-LLM llama.cpp portable zip from the link.
Then, extract the zip file to a folder.
- Open “Command Prompt” (cmd), and enter the extracted folder through
cd /d PATH\TO\EXTRACTED\FOLDER
- To use GPU acceleration, several environment variables are required or recommended before running llama.cpp:
set SYCL_CACHE_PERSISTENT=1
- For multi-GPU users, go to Tips for how to select a specific GPU (a consolidated setup example follows this list).
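Putting the Windows setup steps together, a minimal Command Prompt session might look like the sketch below; the folder path and the GPU index are placeholder assumptions for illustration, and ONEAPI_DEVICE_SELECTOR (the standard oneAPI device-selection variable) is only needed if you want to pin llama.cpp to one GPU on a multi-GPU machine:
cd /d C:\llama-cpp-ipex-llm-portable
set SYCL_CACHE_PERSISTENT=1
rem Optional, multi-GPU machines only; the index 0 is an assumption, adjust to your system
set ONEAPI_DEVICE_SELECTOR=level_zero:0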
Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM.
Before running, you should download or copy a community GGUF model to your local directory, for instance, DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF.
Please change PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf to your model path before you run the command below.
llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant:" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0
Part of the output:
For Linux users, download the IPEX-LLM llama.cpp portable tgz from the link.
Then, extract the tgz file to a folder.
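As an illustrative sketch (the archive name below is a placeholder; use the file you actually downloaded), the extraction step could look like:
mkdir -p ~/llama-cpp-ipex-llm
tar -xzf llama-cpp-ipex-llm-portable.tgz -C ~/llama-cpp-ipex-llm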
- Open a “Terminal”, and enter the extracted folder through
cd /PATH/TO/EXTRACTED/FOLDER
- To use GPU acceleration, several environment variables are required or recommended before running llama.cpp:
export SYCL_CACHE_PERSISTENT=1
- For multi-GPU users, go to Tips for how to select a specific GPU (a consolidated setup example follows this list).
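Putting the Linux setup steps together, a minimal terminal session might look like the sketch below; the folder path and the GPU index are placeholder assumptions, and ONEAPI_DEVICE_SELECTOR (the standard oneAPI device-selection variable) is only needed if you want to pin llama.cpp to one GPU on a multi-GPU machine:
cd /PATH/TO/EXTRACTED/FOLDER
export SYCL_CACHE_PERSISTENT=1
# Optional, multi-GPU machines only; the index 0 is an assumption, adjust to your system
export ONEAPI_DEVICE_SELECTOR=level_zero:0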
Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM.
Before running, you should download or copy a community GGUF model to your local directory, for instance, DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF.
Please change /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf to your model path before you run the command below.
./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant:" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0
Part of the output:
FlashMoE is a command-line tool built on llama.cpp, optimized for mixture-of-experts (MoE) models such as DeepSeek V3/R1. Now, it’s available for Linux platforms.
Tested MoE GGUF Models (other MoE GGUF models are also supported):
Requirements:
- 380GB CPU Memory
- 1-8 ARC A770
- 500GB Disk
Note:
- Larger models and other precisions may require more resources.
- For a single Arc A770 platform, please reduce the context length (e.g., 1024) to avoid OOM: add the option -c 1024 at the end of the command below.
Before running, you should download or copy a community GGUF model to your local directory.
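As a hedged sketch only, assuming the FlashMoE binary shipped in the portable tgz is named flash-moe and that it accepts the usual llama.cpp-style flags (the binary name, model filename, and prompt below are placeholders; check the files in your extracted package), a single Arc A770 run with the reduced context length from the note above might look like:
export SYCL_CACHE_PERSISTENT=1
# flash-moe binary name and model path are assumptions for illustration; -c 1024 follows the single-A770 note above
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M.gguf -p "What is an MoE model?" -n 512 -c 1024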
10 Comments
superkuh
No… this headline is incorrect. You can't do that. I think they've confused it with the performance of running one of the small distills to existing smaller models. Two Arc cards cannot fit a 4-bit k-quant of a 671b model.
But a portable (no install) way to run llama.cpp on intel GPUs is really cool.
ryao
Where is the benchmark data?
zamadatix
Since the Xeon alone could run the model in this setup, it'd be more interesting if they compared the performance uplift when using 0/1/2/.../8 Arc A770 GPUs.
Also, it's probably better to link straight to the relevant section https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic…
colorant
https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic…
Requirements (>8 token/s):
380GB CPU Memory
1-8 ARC A770
500GB Disk
jamesy0ung
What exactly does the Xeon do in this situation? Is there a reason you couldn't use any other x86 processor?
7speter
I've been following the progress of Intel Arc support in PyTorch, at least on Linux, and it seems like, if things stay on track, we may see the first version of PyTorch with full Xe/Arc support by around June. I think I'm just going to wait until then instead of dealing with anything ipex or openvino.
CamperBob2
Article could stand to include a bit more information. Why are all the TPS figures x'ed out? What kind of performance can be expected from this setup (and how does it compare to the dual Epyc workstation recipe that was popularized recently?)
yongjik
Did DeepSeek learn how to name their models from OpenAI?
anacrolix
Now we just need a model that can actually code
chriscappuccio
Better to run the Q8 model on an Epyc pair with 768GB; you'll get the same performance.