To fit within GPU memory, the first layer is not loaded onto the root node's GPU; instead, it is kept in the root node's RAM. For this, I used a new argument: --gpu-segments.
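For intuition only, here is a tiny sketch (not distributed-llama's actual code) of what a range value like `1:99999` could mean if --gpu-segments places the segments whose index falls inside the range on the GPU and leaves the rest, here segment 0 (the first layer), in CPU RAM. The helper name `pickDevice` and the `[from, to)` convention are my assumptions for illustration; only the flag value 1:99999 comes from the command below.

```cpp
#include <cstdio>

// Hypothetical helper: segments with index in [from, to) go to the GPU heap,
// everything else stays in CPU RAM. The [from, to) convention and the name
// pickDevice are assumptions; only the value 1:99999 comes from the run below.
static const char* pickDevice(int segmentIndex, int from, int to) {
    return (segmentIndex >= from && segmentIndex < to) ? "GPU" : "CPU RAM";
}

int main() {
    const int from = 1, to = 99999;   // --gpu-segments 1:99999
    for (int i = 0; i < 3; i++)       // segment 0 -> CPU RAM, segments 1 and 2 -> GPU
        std::printf("segment %d -> %s\n", i, pickDevice(i, from, to));
    return 0;
}
```

The motivation is visible in the log below: the root node reports RequiredMemory: 13724387 kB, which is more than the RTX 3060's Heap[0] of 12288 MB, so presumably at least one segment has to stay off the GPU.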
(main) root@C.19722726:/workspace/distributed-llama$ ./dllama inference --prompt "Tensor parallelism is all you need" --steps 128 --model models/llama3_3_70b_instruct_q40/dllama_model_llama3_3_70b_instruct_q40.m --tokenizer models/llama3_3_70b_instruct_q40/dllama_tokenizer_llama3_3_70b_instruct_q40.t --nthreads 1 --buffer-float-type q80 --max-seq-len 256 --gpu-index 0 --workers 127.0.0.1:9999 127.0.0.1:9998 127.0.0.1:9997 --gpu-segments 1:99999
📄 BosId: 128000 (<|begin_of_text|>)
📄 EosId: 128001 (<|end_of_text|>) 128009 (<|eot_id|>)
📄 RegularVocabSize: 128000
📄 SpecialVocabSize: 256
💡 Arch: Llama
💡 HiddenAct: Silu
💡 Dim: 8192
💡 KvDim: 1024
💡 HiddenDim: 28672
💡 VocabSize: 128256
💡 nLayers: 80
💡 nHeads: 64
💡 nKvHeads: 8
💡 OrigSeqLen: 131072
💡 SeqLen: 256
💡 NormEpsilon: 0.000010
💡 RopeType: Llama3.1
💡 RopeTheta: 500000
💡 RopeScaling: f=8.0, l=1.0, h=4.0, o=8192
📀 RequiredMemory: 13724387 kB
⭕ Socket[0]: connecting to 127.0.0.1:9999 worker
⭕ Socket[0]: connected
⭕ Socket[1]: connecting to 127.0.0.1:9998 worker
⭕ Socket[1]: connected
⭕ Socket[2]: connecting to 127.0.0.1:9997 worker
⭕ Socket[2]: connected
⭕ Network is initialized
🌋 Device: NVIDIA GeForce RTX 3060
🌋 DeviceApiVersion: 1.3.289
🌋 MaxComputeSharedMemory: 48 kB
🌋 Heap[0]: 12288 MB
🌋 Heap[2]: 246 MB
🧠 CPU: avx2
💿 Loading weights…
💿 Weights loaded
🚁 Network is in non-blocking mode
Tensor parallelism is all you need
🔷️ Eval 950 ms Sync 237 ms | Sent 29232 kB Recv 31190 kB | (7 tokens)
🔶 Pred 367 ms Sync 30 ms | Sent 4176 kB Recv 4455 kB | Anna
🔶 Pred 290 ms Sync 35 ms | Sent 4176 kB Recv 4455 kB | were
🔶 Pred 272 ms Sync 21 ms | Sent 4176 kB Recv 4455 kB | now
🔶 Pred 265 ms Sync 71 ms | Sent 4176 kB Recv 4455 kB | named
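As a rough reading of the 🔶 Pred lines (my own arithmetic, not something dllama prints here), assuming each generated token costs roughly its Pred time plus its Sync time:

```cpp
#include <cstdio>

int main() {
    // Pred and Sync times (ms) copied from the three steady-state 🔶 lines above;
    // treating Pred + Sync as the per-token latency is an assumption.
    const double predMs[] = {290, 272, 265};
    const double syncMs[] = {35, 21, 71};
    double totalMs = 0;
    for (int i = 0; i < 3; i++) totalMs += predMs[i] + syncMs[i];
    const double msPerToken = totalMs / 3;  // ~318 ms per token
    std::printf("~%.0f ms/token -> ~%.1f tokens/s\n", msPerToken, 1000.0 / msPerToken);
    return 0;
}
```

So on the few steps captured here, this root-plus-three-workers setup lands at roughly 3 tokens/s for the 70B Q40 model.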