DeepSeek-R1 is a big step forward for the open model ecosystem, with a model that competes with OpenAI’s o1 on a variety of benchmarks. There is a lot of hype, and a lot of noise, around the fact that they achieved this with much less money and compute.

Source: https://x.com/karpathy/status/1872362712958906460
Instead of learning about it from AI influencer* threads hyping up the release, I decided to make a reading list that links to a lot of the fundamental research papers. This list is meant to be slowly digested one paper at a time with a cup of hot coffee or tea next to a cozy fireplace, not while scrolling social media.
* not you Andrej, we love your threads
If you have been keeping up with the field, R1 doesn’t come as much of a surprise. It was the natural progression of the research, and it is amazing that they decided to spend all that compute just to give the model weights to the community for free.
We have already covered a bunch of these topics in our research paper club that gathers on Fridays over Zoom. We go deep and don’t shy away from the math, but you will walk away having learned something. I try to break it all down in as plain a way as possible. If you want to join our learning journey, feel free to check out our events calendar below!
Oxen.ai · Events Calendar
At its core, DeepSeek is built on the Transformer neural network architecture. If you aren’t familiar with Transformers, I’d start with some of these foundational papers from Google, OpenAI, Meta, and Anthropic.
Attention Is All You Need
This paper introduced the Transformer architecture in the context of machine translation back in 2017, and kicked off the scaling trends that led to GPT-2, GPT-3, ChatGPT, and now the DeepSeek models. A minimal sketch of the attention mechanism at its core follows the abstract below.
Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
arXiv.org · Ashish Vaswani et al.
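To make the core operation concrete, here is a minimal NumPy sketch of the paper’s scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. This is a toy illustration under simplified assumptions (single head, no projections, no masking), not the paper’s actual implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys, and values.
    d_k = Q.shape[-1]
    # How well each query matches each key, scaled to keep the softmax stable.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns scores into attention weights per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Toy self-attention over 4 tokens with 8-dimensional embeddings.
x = np.random.randn(4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```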
Language Models are Unsupervised Multitask Learners (GPT-2)
This paper showed the generalization power of larger-scale pre-training with a suite of models that we would consider small today. At the time, this was a big deal: it showed that we no longer had to train specialized models for each task, and that this “unsupervised” learning approach could allow a single model to “multitask”.
There is also the GPT-3 paper (Language Models are Few-Shot Learners), which introduced the idea of prompting LLMs with a few examples of a task instead of fine-tuning. Beyond that, the paper mainly covers how they scaled up the data and compute.
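To see what few-shot prompting looks like in practice, here is a prompt in the style of the GPT-3 paper’s figures: a task description, a few completed examples, and a final unfinished example the model is expected to complete. No gradient updates happen; the “learning” is entirely in-context.

```
Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>
```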
Training Language Models to Follow Instructions (InstructGPT)
The InstructGPT paper shows how OpenAI went from a pre-trained GPT-3 model to a ChatGPT-like model. They don’t explicitly call it ChatGPT in this paper, but reading between the lines, this was either GPT-3.5 or an early ChatGPT. The core insight was collecting human preference data to train a reward model, then using reinforcement learning to turn the raw pre-trained model into a useful chatbot that follows instructions. A sketch of the reward-model loss follows the abstract below.
Training language models to follow instructions with human feedback
Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
arXiv.org · Long Ouyang et al.
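To make the reward-modeling step concrete, here is a minimal PyTorch sketch of the pairwise ranking loss InstructGPT uses to train its reward model: given a human-preferred response and a rejected response to the same prompt, push the reward of the preferred one above the other. The random tensors below are stand-ins for the scalar scores a real reward model would produce.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # Pairwise loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    # Minimizing it pushes the reward model to score preferred responses higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Stand-in scores for a batch of 16 (prompt, chosen, rejected) triples.
r_chosen = torch.randn(16)
r_rejected = torch.randn(16)
print(reward_model_loss(r_chosen, r_rejected))
```

The trained reward model then provides the reward signal for the reinforcement learning stage (PPO in the paper) that fine-tunes the policy model.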
The Llama 3 Herd of Models
The Llama 3 Herd of Models paper from Meta accompanied the first big open-weights release that competed with GPT-4. They released a 405B model and a suite of smaller models, along with a technical report demystifying the inner workings of the training pipelines.
The Llama 3 Herd of Models
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
arXiv.org · Aaron Grattafiori et al.
A Mathematical Framework for Transformer Circuits
Anthropic’s blog posts and papers are great for understanding the inner workings of Transformers. This paper dives into the mechanisms that make a Transformer work, starting with the smallest possible “circuit” and working up from there. The posts are long and very detailed, but well worth the read.
A Mathematical Framework for Transformer Circuits
Nelson Elhage et al. (Anthropic)
DeepSeek’s R1 and OpenAI’s o1 both rely on “thought” tokens that contain the model’s internal reasoning. This behavior can be prompted for, and trained into, a model. Using these extra tokens as a scratch pad, models have been shown to solve multi-step problems and tackle more complex tasks. The following papers are good background on how chain-of-thought reasoning research has progressed over the past few years.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
This paper shows that with prompting alone you can get models to generate intermediate reasoning steps before committing to a final answer. This improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks, surpassing the (at the time) state-of-the-art fine-tuned GPT-3 model. A toy example of such a prompt follows the abstract below.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
We explore how generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
arXiv.org · Jason Wei et al.
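Concretely, chain-of-thought prompting just bakes worked reasoning into the few-shot exemplars. Here is a toy prompt adapted from the paper’s running example; given the exemplar, the model tends to produce its own reasoning steps before the final answer.

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A:
```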
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Language models produce text from left to right, token by token, so when they make a mistake it is hard for them to backtrack or correct course. The Tree of Thoughts paper lets the model consider multiple possible reasoning paths while self-evaluating its choices to determine the next course of action. The technique is more expensive, because it requires many generations and many verifications, but it lets the model solve three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. A simplified sketch of the search follows the abstract below.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models’ problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
arXiv.org · Shunyu Yao et al.
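For intuition, here is a heavily simplified Python sketch of the breadth-first flavor of the search: from each partial solution, propose several candidate next thoughts, score them by model self-evaluation, and keep only the most promising paths. generate_thoughts and evaluate are hypothetical stand-ins for prompted LLM calls, not functions from the paper’s code repo.

```python
def tree_of_thoughts(problem, generate_thoughts, evaluate,
                     branch=5, beam_width=3, depth=3):
    # generate_thoughts(state, n): n candidate next reasoning steps (an LLM call).
    # evaluate(state): score for how promising a partial solution looks (an LLM call).
    frontier = [problem]  # each state = the problem plus the thoughts so far
    for _ in range(depth):
        candidates = [
            state + "\n" + thought
            for state in frontier
            for thought in generate_thoughts(state, branch)
        ]
        # Self-evaluate and keep only the best paths; dropping a path is
        # effectively the "backtracking" a left-to-right decoder can't do.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    # Return the highest-scoring chain of thoughts after the search.
    return frontier[0]
```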
The Prompt Report
This paper has a good survey of the different “of Thought” papers, as well as many other prompting techniques. You could collate all the prompts and techniques from this paper to create some very interesting synthetic datasets for training better and better models... just sayin’. A toy sketch follows the abstract below.
The Prompt Report: A Systematic Survey of Prompting Techniques
Generative Artificial Intelligence (GenAI) systems are increasingly being deployed across diverse industries and research domains. Developers and end-users interact with these systems through the use of prompting and prompt engineering. Although prompt engineering is a widely adopted and extensively researched area, it suffers from conflicting terminology and a fragmented ontological understanding of what constitutes an effective prompt due to its relatively recent emergence. We establish a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications. We present a vocabulary of 33 terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities. Additionally, we provide best practices and guidelines for prompt engineering.
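As a toy sketch of the synthetic-data idea above (our own illustration, not anything from the paper): take a library of prompting templates from the survey, apply each to a set of seed questions, and keep the model’s completions as training rows.

```python
# A couple of template styles catalogued in prompting surveys; the exact
# wording here is ours and purely illustrative.
TEMPLATES = [
    "{question}",                             # direct prompting
    "{question}\nLet's think step by step.",  # zero-shot chain of thought
]

def build_synthetic_dataset(questions, ask_model):
    # ask_model(prompt) -> completion string; a stand-in for any LLM API.
    rows = []
    for q in questions:
        for template in TEMPLATES:
            prompt = template.format(question=q)
            rows.append({"prompt": prompt, "response": ask_model(prompt)})
    return rows
```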