On January 20th, 2025, DeepSeek released their latest open-weights reasoning model, DeepSeek-R1, which is on par with OpenAI’s o1 in benchmark performance. The release has generated a significant amount of controversy, most notably about the possibility that DeepSeek might have underreported or misrepresented the training cost of their model. I find this claim implausible for reasons that I will explore in this issue.
Aside from the point about the model’s training cost, I also want to clarify what we actually know about the model’s architecture, training process, performance, and pricing.
Architecture
DeepSeek R1’s architecture is identical to that of DeepSeek v3, an earlier model that the company released in December 2024. I covered the key architectural details of that model in a Gradient Updates issue from two weeks ago, so I will only provide a brief high-level summary here.
Overall, the model is a very sparse mixture-of-experts, with 671 billion total parameters but only 37 billion active per token. The experts are divided into two classes: one “shared expert” that every token is always routed to, and 256 “routed experts”, of which 8 are active for any given token, with training encouraging balanced routing across them. Most parameters are MoE parameters belonging to the routed experts, and we can confirm this with the following sanity check based on the model configuration file from HuggingFace:
Routed MoE params = (MoE blocks) * (routed experts) * (tensors per expert) * (MoE intermediate dim) * (model hidden dim)
= 58 * 256 * 3 * 2048 * 7168 ≈ 654 billion
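As a cross-check, here is the same calculation as a short Python snippet. The constants simply restate the figures above, which come from the HuggingFace configuration file; the breakdown of the 58 MoE blocks as 61 transformer layers minus 3 initial dense layers is my reading of that config.

# Sanity check: routed-expert parameter count for DeepSeek v3,
# using the figures quoted above from the HuggingFace config.
moe_blocks = 58               # 61 transformer layers minus 3 initial dense layers
routed_experts = 256          # routed experts per MoE block
tensors_per_expert = 3        # gate, up and down projection matrices per expert
moe_intermediate_dim = 2048   # hidden width of each expert MLP
hidden_dim = 7168             # model hidden dimension

routed_moe_params = (moe_blocks * routed_experts * tensors_per_expert
                     * moe_intermediate_dim * hidden_dim)
print(f"Routed MoE params: {routed_moe_params / 1e9:.0f} billion")  # -> 654 billion

This accounts for roughly 654 of the 671 billion total parameters, with the remainder coming from attention, the shared experts, the initial dense layers, and embeddings.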
DeepSeek v3 also uses a novel mechanism called multi-head latent attention (MLA) to cut down the size of the KV cache without the performance loss associated with other popular methods such as grouped-query and multi-query attention. This comes at the expense of increasing the arithmetic cost of attention during decoding, making DeepSeek v3 unusual among language models in being arithmetic-bound rather than memory-bound during long-context inference. The arithmetic cost of attention becomes comparable to the parameter multiply-accumulates around a past context length of 5,000 tokens, compared to e.g. Llama 3 70B, where this only happens around a context length of 50,000 tokens.
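To make the KV cache savings concrete, here is a rough per-token cache size comparison. The dimensions below are my reading of the public model configs rather than figures from the v3 report: MLA caches a 512-dimensional latent plus a 64-dimensional decoupled RoPE key per layer, while Llama 3 70B’s grouped-query attention caches keys and values for 8 KV heads of dimension 128 per layer. Treat the exact numbers as assumptions; the point is the size of the gap.

# Rough KV cache footprint per token at bf16 (2 bytes per element).
# Dimensions are my reading of the public configs and may be slightly off.
BYTES_PER_ELEMENT = 2

# DeepSeek v3 with MLA: each of 61 layers caches one 512-dim compressed
# latent plus a 64-dim decoupled RoPE key, instead of full per-head K/V.
mla_bytes_per_token = 61 * (512 + 64) * BYTES_PER_ELEMENT

# Llama 3 70B with GQA: each of 80 layers caches keys and values for
# 8 KV heads of dimension 128.
gqa_bytes_per_token = 80 * (2 * 8 * 128) * BYTES_PER_ELEMENT

print(f"MLA: {mla_bytes_per_token / 1024:.0f} KiB per token of context")  # ~69 KiB
print(f"GQA: {gqa_bytes_per_token / 1024:.0f} KiB per token of context")  # ~320 KiB

The much smaller cache also means less memory traffic per decoded token, which is part of why decoding ends up arithmetic-bound rather than memory-bound.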
The ablation experiments from the DeepSeek v2 paper, which introduced the method, show significant performance gains from using MLA over GQA and MQA. It’s unclear how much we can trust these results, but if they hold up, MLA may be part of why the base model achieves high quality at a relatively low inference cost.

Figure 1: Results of the ablation experiments for attention mechanisms from the DeepSeek v2 paper.
Many of these innovations are quite old: MLA, for example, was introduced in the v2 paper, which came out in May 2024. The real improvement of R1 over v3 is the use of reinforcement learning to improve reasoning performance, which I will cover in the next section. However, the architecture of the base model still matters, because reinforcement learning works much better on base models that already have high intrinsic performance, as this reduces the sparsity of the initial reward signals during RL. So understanding how DeepSeek was able to build a performant base model on which to train their reasoner is still important.
Training
I’ve seen the public discussion about R1 frequently confuse the pre-training and reinforcement learning phases of the model’s training, so I want to draw a clear distinction between these two phases here.
Pre-training
The pre-training run behind DeepSeek R1 was that of DeepSeek v3; R1 is built on top of the v3 base model. The technical report for v3 gives a remarkable amount of detail about how they trained the model: they used mixed FP8 precision training on a cluster of 2048 H800 GPUs, and processing each trillion tokens of training data took them 3.7 days on this cluster, or about 180,000 H800 hours. They also say that their total training dataset size was 14.8 trillion tokens, implying a training cost of around 14.8 * 180,000 = 2.66 million H800 hours, or around $5.3M if we price an H800 hour at $2.
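The cost arithmetic is easy to reproduce. Here is a minimal sketch using the figures from the v3 technical report and the assumed $2 per H800 hour price; the small differences from the numbers in the text come from not rounding 182,000 GPU-hours per trillion tokens down to 180,000.

# Reproducing the pre-training cost estimate from the v3 technical report
# figures; the $2/hour H800 rental price is an assumption, not a reported cost.
gpus = 2048                      # H800 cluster size
days_per_trillion_tokens = 3.7   # reported training throughput
dataset_trillions = 14.8         # reported dataset size in trillions of tokens
price_per_gpu_hour = 2.0         # assumed H800 rental price in USD

gpu_hours_per_trillion = gpus * days_per_trillion_tokens * 24
total_gpu_hours = gpu_hours_per_trillion * dataset_trillions
total_cost = total_gpu_hours * price_per_gpu_hour

print(f"{gpu_hours_per_trillion:,.0f} H800 hours per trillion tokens")  # ~182,000
print(f"{total_gpu_hours / 1e6:.2f} million H800 hours in total")       # ~2.69 million
print(f"~${total_cost / 1e6:.1f}M at ${price_per_gpu_hour:.0f}/hour")   # ~$5.4M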
There’s been a surprising amount of skepticism about these numbers, but they are, if anything, on the high side for a model with this architecture trained in this way. A dataset size of 14.8 trillion tokens is reasonable and in line with other models of this scale. Taking it at face value, pre-training would have required 6 * (37 billion) * (14.8 trillion) ≈ 3.3e24 FLOP. If we assume DeepSeek’s training cluster consists of H800s with the PCIe form factor, each capable of 1.5e15 FP8 FLOP per second, then the implied model FLOP utilization (MFU) of DeepSeek v3’s 55-day training run ends up being around 23%.
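The corresponding FLOP and utilization arithmetic, under the same assumptions stated above (37 billion active parameters, 14.8 trillion tokens, 2048 H800s at a dense FP8 throughput of 1.5e15 FLOP per second, and a roughly 55-day run):

# FLOP count and implied model FLOP utilization (MFU) for the v3
# pre-training run, using the assumptions from the text above.
active_params = 37e9        # active parameters per token
dataset_tokens = 14.8e12    # training tokens
training_flop = 6 * active_params * dataset_tokens   # standard 6*N*D estimate

gpus = 2048
peak_fp8_flop_per_s = 1.5e15        # assumed per-GPU dense FP8 throughput (H800 PCIe)
training_seconds = 55 * 24 * 3600   # 14.8 trillion tokens * 3.7 days per trillion

available_flop = gpus * peak_fp8_flop_per_s * training_seconds
mfu = training_flop / available_flop

print(f"Training compute: {training_flop:.2e} FLOP")  # ~3.3e24
print(f"Implied MFU: {mfu:.0%}")                      # ~23%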
One reason people have been doubting this compute estimate is that the model’s performance seems out of line with other models trained with a comparable amount of resources. For example, Llama 3 70B and its later iterations took around twice the compute to train, yet they significantly underperform DeepSeek v3 on benchmarks. The reason behind the difference in performance is algorithmic progress: we know Llama 3 70B lacks many of the key architectural innovations that went into creating DeepSeek v3, so it’s no surprise that it would be less compute-efficient. We know less about what has happened on the data quality side, but I would not be surprised if DeepSeek also improved on the Llama series in this respect.