To fit within GPU memory, the first layer is not loaded onto the root node's GPU; instead, it is kept in the root node's RAM. For this, I used a new argument: --gpu-segments.
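For intuition only, here is a tiny sketch (not distributed-llama's actual code) of what a range value like `1:99999` could mean if --gpu-segments places the segments whose index falls inside the range on the GPU and leaves the rest, here segment 0 (the first layer), in CPU RAM. The helper name `pickDevice` and the `[from, to)` convention are my assumptions for illustration; only the flag value 1:99999 comes from the command below.

```cpp
#include <cstdio>

// Hypothetical helper: segments with index in [from, to) go to the GPU heap,
// everything else stays in CPU RAM. The [from, to) convention and the name
// pickDevice are assumptions; only the value 1:99999 comes from the run below.
static const char* pickDevice(int segmentIndex, int from, int to) {
    return (segmentIndex >= from && segmentIndex < to) ? "GPU" : "CPU RAM";
}

int main() {
    const int from = 1, to = 99999;   // --gpu-segments 1:99999
    for (int i = 0; i < 3; i++)       // segment 0 -> CPU RAM, segments 1 and 2 -> GPU
        std::printf("segment %d -> %s\n", i, pickDevice(i, from, to));
    return 0;
}
```

The motivation is visible in the log below: the root node reports RequiredMemory: 13724387 kB, which is more than the RTX 3060's Heap[0] of 12288 MB, so presumably at least one segment has to stay off the GPU.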
(main) root@C.19722726:/workspace/distributed-llama$ ./dllama inference --prompt "Tensor parallelism is all you need" --steps 128 --model models/llama3_3_70b_instruct_q40/dllama_model_llama3_3_70b_instruct_q40.m --tokenizer models/llama3_3_70b_instruct_q40/dllama_tokenizer_llama3_3_70b_instruct_q40.t --nthreads 1 --buffer-float-type q80 --max-seq-len 256 --gpu-index 0 --workers 127.0.0.1:9999 127.0.0.1:9998 127.0.0.1:9997 --gpu-segments 1:99999
📄 BosId: 128000 (<|begin_of_text|>)
📄 EosId: 128001 (<|end_of_text|>) 128009 (<|eot_id|>)
📄 RegularVocabSize: 128000
📄 SpecialVocabSize: 256
💡 Arch: Llama
💡 HiddenAct: Silu
💡 Dim: 8192
💡 KvDim: 1024
💡 HiddenDim: 28672
💡 VocabSize: 128256
💡 nLayers: 80
💡 nHeads: 64
💡 nKvHeads: 8
💡 OrigSeqLen: 131072
💡 SeqLen: 256
💡 NormEpsilon: 0.000010
💡 RopeType: Llama3.1
💡 RopeTheta: 500000
💡 RopeScaling: f=8.0, l=1.0, h=4.0, o=8192
📀 RequiredMemory: 13724387 kB
⭕ Socket[0]: connecting to 127.0.0.1:9999 worker
⭕ Socket[0]: connected
⭕ Socket[1]: connecting to 127.0.0.1:9998 worker
⭕ Socket[1]: connected
⭕ Socket[2]: connecting to 127.0.0.1:9997 worker
⭕ Socket[2]: connected
⭕ Network is initialized
🌋 Device: NVIDIA GeForce RTX 3060
🌋 DeviceApiVersion: 1.3.289
🌋 MaxComputeSharedMemory: 48 kB
🌋 Heap[0]: 12288 MB
🌋 Heap[2]: 246 MB
🧠 CPU: avx2
💿 Loading weights…
💿 Weights loaded
🚁 Network is in non-blocking mode
Tensor parallelism is all you need
🔷️ Eval 950 ms Sync 237 ms | Sent 29232 kB Recv 31190 kB | (7 tokens)
🔶 Pred 367 ms Sync 30 ms | Sent 4176 kB Recv 4455 kB | Anna
🔶 Pred 290 ms Sync 35 ms | Sent 4176 kB Recv 4455 kB | were
🔶 Pred 272 ms Sync 21 ms | Sent 4176 kB Recv 4455 kB | now
🔶 Pred 265 ms Sync 71 ms | Sent 4176 kB Recv 4455 kB | named
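As a rough reading of the 🔶 Pred lines (my own arithmetic, not something dllama prints here), assuming each generated token costs roughly its Pred time plus its Sync time:

```cpp
#include <cstdio>

int main() {
    // Pred and Sync times (ms) copied from the three steady-state 🔶 lines above;
    // treating Pred + Sync as the per-token latency is an assumption.
    const double predMs[] = {290, 272, 265};
    const double syncMs[] = {35, 21, 71};
    double totalMs = 0;
    for (int i = 0; i < 3; i++) totalMs += predMs[i] + syncMs[i];
    const double msPerToken = totalMs / 3;  // ~318 ms per token
    std::printf("~%.0f ms/token -> ~%.1f tokens/s\n", msPerToken, 1000.0 / msPerToken);
    return 0;
}
```

So on the few steps captured here, this root-plus-three-workers setup lands at roughly 3 tokens/s for the 70B Q40 model.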