DeepSeek-R1 is a big step forward for the open model ecosystem, with a model that competes with OpenAI’s o1 on a variety of benchmarks. There is a lot of hype, and a lot of noise, around the fact that they achieved this with much less money and compute.

Source: https://x.com/karpathy/status/1872362712958906460
Instead of learning about it from AI influencer* threads hyping up the release, I decided to make a reading list that links to a lot of the fundamental research papers. This list is meant to be slowly digested one paper at a time with a cup of hot coffee or tea next to a cozy fireplace, not while scrolling social media.
* not you Andrej, we love your threads
If you have been keeping up with the field, R1 doesn’t come as much of a surprise. It was the natural progression of the research, and it is amazing that they decided to spend all that compute just to give the model weights to the community for free.
We have already covered a bunch of these topics in our research paper club that gathers on Fridays over Zoom. We go deep and don’t shy away from the math, but you will walk away having learned something. I try to break it all down in as plain a way as possible. If you want to join our learning journey, feel free to check out our events calendar below!
Oxen.ai · Events Calendar
At its core, DeepSeek is built on the Transformer neural network architecture. If you aren’t familiar with Transformers, I’d start with some of these foundational papers from Google, OpenAI, Meta, and Anthropic.
Attention Is All You Need
This paper introduced the Transformer architecture in the context of machine translation back in 2017, and kicked off the scaling trends that led to GPT-2, GPT-3, ChatGPT, and now the DeepSeek models. A minimal sketch of the attention mechanism at its core follows the abstract below.
Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
arXiv.org · Ashish Vaswani et al.
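To make the core operation concrete, here is a minimal NumPy sketch of the paper’s scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. This is a toy illustration under simplified assumptions (single head, no projections, no masking), not the paper’s actual implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys, and values.
    d_k = Q.shape[-1]
    # How well each query matches each key, scaled to keep the softmax stable.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns scores into attention weights per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Toy self-attention over 4 tokens with 8-dimensional embeddings.
x = np.random.randn(4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```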
Language Models are Unsupervised Multitask Learners (GPT-2)
This paper showed the generalization power of larger-scale pre-training with a suite of models that we would consider small today. At the time, this was a big deal: it showed that we no longer had to train specialized models for each task, and that this “unsupervised” learning approach could allow a single model to “multitask”.
There is also the GPT-3 paper (Language Models are Few-Shot Learners), which introduced the idea of prompting LLMs with a few examples of a task instead of fine-tuning. Beyond that, the paper mainly covers how they scaled up the data and compute.
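To see what few-shot prompting looks like in practice, here is a prompt in the style of the GPT-3 paper’s figures: a task description, a few completed examples, and a final unfinished example the model is expected to complete. No gradient updates happen; the “learning” is entirely in-context.

```
Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>
```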
Training Language Models to Follow Instructions (InstructGPT)
The InstructGPT paper shows how OpenAI went from a pre-trained GPT-3 model to a ChatGPT-like model. They don’t explicitly call it ChatGPT in this paper, but reading between the lines, this was either GPT-3.5 or an early ChatGPT. The core insight was collecting human preference data to train a reward model, then using reinforcement learning to turn the raw pre-trained model into a useful chatbot that follows instructions. A sketch of the reward-model loss follows the abstract below.
Training language models to follow instructions with human feedback
Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
arXiv.org · Long Ouyang et al.
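To make the reward-modeling step concrete, here is a minimal PyTorch sketch of the pairwise ranking loss InstructGPT uses to train its reward model: given a human-preferred response and a rejected response to the same prompt, push the reward of the preferred one above the other. The random tensors below are stand-ins for the scalar scores a real reward model would produce.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # Pairwise loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    # Minimizing it pushes the reward model to score preferred responses higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Stand-in scores for a batch of 16 (prompt, chosen, rejected) triples.
r_chosen = torch.randn(16)
r_rejected = torch.randn(16)
print(reward_model_loss(r_chosen, r_rejected))
```

The trained reward model then provides the reward signal for the reinforcement learning stage (PPO in the paper) that fine-tunes the policy model.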
The Llama 3 Herd of Models
The Llama 3 Herd of Models paper from Meta accompanied the first big open-weights release that competed with GPT-4. They released a 405B model and a suite of smaller models, along with a technical report demystifying the inner workings of the training pipelines.
The Llama 3 Herd of Models
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
arXiv.org · Aaron Grattafiori et al.
A Mathematical Framework for Transformer Circuits
Anthropic’s blog posts and papers are great for understanding the inner workings of Transformers. This paper dives into the mechanisms that make a Transformer work, starting with the smallest possible “circuit” and working up from there. The posts are long and very detailed, but well worth the read.
A Mathematical Framework for Transformer Circuits
Nelson Elhage et al. (Anthropic)
DeepSeek’s R1 and OpenAI’s o1 both rely on “thought” tokens that contain the model’s internal reasoning. This behavior can be prompted for, and trained into, a model. Using these extra tokens as a scratch pad, models have been shown to solve multi-step problems and tackle more complex tasks. The following papers are good background on how chain-of-thought reasoning research has progressed over the past few years.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
This paper shows that with prompting alone you can get models to generate intermediate reasoning steps before committing to a final answer. This improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks, surpassing the (at the time) state-of-the-art fine-tuned GPT-3 model. A toy example of such a prompt follows the abstract below.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
We explore how generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
arXiv.org · Jason Wei et al.
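Concretely, chain-of-thought prompting just bakes worked reasoning into the few-shot exemplars. Here is a toy prompt adapted from the paper’s running example; given the exemplar, the model tends to produce its own reasoning steps before the final answer.

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A:
```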
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Language models produce text from left to right, token by token, so when they make a mistake it is hard for them to backtrack or correct course. The Tree of Thoughts paper lets the model consider multiple possible reasoning paths while self-evaluating its choices to determine the next course of action. The technique is more expensive, because it requires many generations and many verifications, but it lets the model solve three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. A simplified sketch of the search follows the abstract below.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models’ problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
arXiv.org · Shunyu Yao et al.
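For intuition, here is a heavily simplified Python sketch of the breadth-first flavor of the search: from each partial solution, propose several candidate next thoughts, score them by model self-evaluation, and keep only the most promising paths. generate_thoughts and evaluate are hypothetical stand-ins for prompted LLM calls, not functions from the paper’s code repo.

```python
def tree_of_thoughts(problem, generate_thoughts, evaluate,
                     branch=5, beam_width=3, depth=3):
    # generate_thoughts(state, n): n candidate next reasoning steps (an LLM call).
    # evaluate(state): score for how promising a partial solution looks (an LLM call).
    frontier = [problem]  # each state = the problem plus the thoughts so far
    for _ in range(depth):
        candidates = [
            state + "\n" + thought
            for state in frontier
            for thought in generate_thoughts(state, branch)
        ]
        # Self-evaluate and keep only the best paths; dropping a path is
        # effectively the "backtracking" a left-to-right decoder can't do.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    # Return the highest-scoring chain of thoughts after the search.
    return frontier[0]
```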
The Prompt Report
This paper has a good survey of the different “of Thought” papers, as well as many other prompting techniques. You could collate all the prompts and techniques from this paper to create some very interesting synthetic datasets for training better and better models... just sayin’. A toy sketch follows the abstract below.
The Prompt Report: A Systematic Survey of Prompting Techniques
Generative Artificial Intelligence (GenAI) systems are increasingly being deployed across diverse industries and research domains. Developers and end-users interact with these systems through the use of prompting and prompt engineering. Although prompt engineering is a widely adopted and extensively researched area, it suffers from conflicting terminology and a fragmented ontological understanding of what constitutes an effective prompt due to its relatively recent emergence. We establish a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications. We present a vocabulary of 33 terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities. Additionally, we provide best practices and guidelines for prompt engineering.
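As a toy sketch of the synthetic-data idea above (our own illustration, not anything from the paper): take a library of prompting templates from the survey, apply each to a set of seed questions, and keep the model’s completions as training rows.

```python
# A couple of template styles catalogued in prompting surveys; the exact
# wording here is ours and purely illustrative.
TEMPLATES = [
    "{question}",                             # direct prompting
    "{question}\nLet's think step by step.",  # zero-shot chain of thought
]

def build_synthetic_dataset(questions, ask_model):
    # ask_model(prompt) -> completion string; a stand-in for any LLM API.
    rows = []
    for q in questions:
        for template in TEMPLATES:
            prompt = template.format(question=q)
            rows.append({"prompt": prompt, "response": ask_model(prompt)})
    return rows
```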