< English | 中文 >
Important
You can now run DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon using the latest llama.cpp Portable Zip.
This guide demonstrates how to use the llama.cpp portable zip to run llama.cpp directly on Intel GPU with ipex-llm
(without the need for manual installation).
Note
llama.cpp portable zip has been verified on:
- Intel Core Ultra processors
- Intel Core 11th – 14th gen processors
- Intel Arc A-Series GPU
- Intel Arc B-Series GPU
Check your GPU driver version, and update it if needed:
- For Intel Core Ultra processors (Series 2) or Intel Arc B-Series GPU, we recommend updating your GPU driver to the latest version
- For other Intel iGPU/dGPU, we recommend using GPU driver version 32.0.101.6078
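If you are unsure which driver version is installed, a quick way to check it on Windows is the command below (an illustrative one-liner, not part of the original guide); the same information is also shown under Display adapters in Device Manager:
wmic path win32_VideoController get Name,DriverVersion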
For Windows users, download the IPEX-LLM llama.cpp portable zip from the link.
Then, extract the zip file to a folder.
- Open “Command Prompt” (cmd), and enter the extracted folder through
cd /d PATH\TO\EXTRACTED\FOLDER
- To use GPU acceleration, several environment variables are required or recommended before running llama.cpp:
set SYCL_CACHE_PERSISTENT=1
- For multi-GPU users, go to Tips for how to select a specific GPU (a consolidated setup example follows this list).
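Putting the Windows setup steps together, a minimal Command Prompt session might look like the sketch below; the folder path and the GPU index are placeholder assumptions for illustration, and ONEAPI_DEVICE_SELECTOR (the standard oneAPI device-selection variable) is only needed if you want to pin llama.cpp to one GPU on a multi-GPU machine:
cd /d C:\llama-cpp-ipex-llm-portable
set SYCL_CACHE_PERSISTENT=1
rem Optional, multi-GPU machines only; the index 0 is an assumption, adjust to your system
set ONEAPI_DEVICE_SELECTOR=level_zero:0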
Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM.
Before running, you should download or copy a community GGUF model to your local directory, for instance, DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF.
Please change PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf to your model path before you run the command below.
llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant:" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0
Part of the output:
For Linux users, download the IPEX-LLM llama.cpp portable tgz from the link.
Then, extract the tgz file to a folder.
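As an illustrative sketch (the archive name below is a placeholder; use the file you actually downloaded), the extraction step could look like:
mkdir -p ~/llama-cpp-ipex-llm
tar -xzf llama-cpp-ipex-llm-portable.tgz -C ~/llama-cpp-ipex-llm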
- Open a “Terminal”, and enter the extracted folder through
cd /PATH/TO/EXTRACTED/FOLDER
- To use GPU acceleration, several environment variables are required or recommended before running llama.cpp:
export SYCL_CACHE_PERSISTENT=1
- For multi-GPU users, go to Tips for how to select a specific GPU (a consolidated setup example follows this list).
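Putting the Linux setup steps together, a minimal terminal session might look like the sketch below; the folder path and the GPU index are placeholder assumptions, and ONEAPI_DEVICE_SELECTOR (the standard oneAPI device-selection variable) is only needed if you want to pin llama.cpp to one GPU on a multi-GPU machine:
cd /PATH/TO/EXTRACTED/FOLDER
export SYCL_CACHE_PERSISTENT=1
# Optional, multi-GPU machines only; the index 0 is an assumption, adjust to your system
export ONEAPI_DEVICE_SELECTOR=level_zero:0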
Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM.
Before running, you should download or copy a community GGUF model to your local directory, for instance, DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF.
Please change /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf to your model path before you run the command below.
./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant:" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0
Part of the output:
FlashMoE is a command-line tool built on llama.cpp, optimized for mixture-of-experts (MoE) models such as DeepSeek V3/R1. Now, it’s available for Linux platforms.
Tested MoE GGUF Models (other MoE GGUF models are also supported):
Requirements:
- 380GB CPU Memory
- 1-8 ARC A770
- 500GB Disk
Note:
- Larger models and other precisions may require more resources.
- For a single Arc A770 platform, please reduce the context length (e.g., 1024) to avoid OOM: add the option -c 1024 at the end of the command below.
Before running, you should download or copy a community GGUF model to your local directory.
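As a hedged sketch only, assuming the FlashMoE binary shipped in the portable tgz is named flash-moe and that it accepts the usual llama.cpp-style flags (the binary name, model filename, and prompt below are placeholders; check the files in your extracted package), a single Arc A770 run with the reduced context length from the note above might look like:
export SYCL_CACHE_PERSISTENT=1
# flash-moe binary name and model path are assumptions for illustration; -c 1024 follows the single-A770 note above
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M.gguf -p "What is an MoE model?" -n 512 -c 1024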
10 Comments
superkuh
No… this headline is incorrect. You can't do that. I think they've confused it with the performance of running one of the small distills to existing smaller models. Two Arc cards cannot fit a 4-bit k-quant of a 671b model.
But a portable (no install) way to run llama.cpp on intel GPUs is really cool.
ryao
Where is the benchmark data?
zamadatix
Since the Xeon alone could run the model in this setup, it'd be more interesting if they compared the performance uplift when using 0/1/2/.../8 Arc A770 GPUs.
Also, it's probably better to link straight to the relevant section https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic…
colorant
https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic…
Requirements (>8 token/s):
380GB CPU Memory
1-8 ARC A770
500GB Disk
jamesy0ung
What exactly does the Xeon do in this situation? Is there a reason you couldn't use any other x86 processor?
7speter
I've been following the progress of Intel Arc support in PyTorch, at least on Linux, and it seems like, if things stay on track, we may see the first version of PyTorch with full Xe/Arc support by around June. I think I'm just going to wait until then instead of dealing with anything ipex or openvino.
CamperBob2
Article could stand to include a bit more information. Why are all the TPS figures x'ed out? What kind of performance can be expected from this setup (and how does it compare to the dual Epyc workstation recipe that was popularized recently?)
yongjik
Did DeepSeek learn how to name their models from OpenAI?
anacrolix
Now we just need a model that can actually code
chriscappuccio
Better to run the Q8 model on an Epyc pair with 768GB; you'll get the same performance.