
Mini-R1: Reproduce DeepSeek R1 “Aha Moment”
The release of DeepSeek R1 shocked the industry. Why? DeepSeek-R1 is an open model that rivals OpenAI’s o1 on complex reasoning tasks and was trained using Group Relative Policy Optimization (GRPO) and an RL-focused multi-stage training approach. The team not only released the model but also published a research paper describing how they built it.
In the paper they describe an “aha moment” that emerged when training the model with pure RL. During this phase, DeepSeek-R1-Zero (the pure-RL precursor of DeepSeek-R1) learns to allocate more thinking time to a problem by reevaluating its initial approach, without any human feedback or data describing how to do it. They describe this “aha moment” as follows:
This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.
In this blog post we want to recreate a small “aha moment” of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown Game. We will train an open model with reinforcement learning, trying to get it to develop self-verification and search abilities on its own in order to solve the Countdown Game.
The Countdown game is a numbers puzzle where players use a set of randomly drawn numbers and basic arithmetic operations (+, -, ×, ÷) to reach or get as close as possible to a target number.
Target Number: 952
Available Numbers: 25, 50, 75, 100, 3, 6
(100 × (3 × 3)) + (50 + 6 / 3) = 952
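To make the task concrete, below is a minimal sketch of how a candidate solution could be checked. The helper name verify_equation is hypothetical (it is not the reward function used later in the post), and it assumes the common variant of the game where each drawn number may be used at most once and the target must be matched exactly:

import re
from collections import Counter

def verify_equation(equation: str, numbers: list[int], target: int) -> bool:
    # Allow only digits, basic arithmetic operators, parentheses, dots and whitespace.
    if not re.fullmatch(r"[\d+\-*/()\s.]+", equation):
        return False
    # Every number in the expression must come from the drawn numbers,
    # and none may be used more often than it was drawn.
    used = Counter(int(n) for n in re.findall(r"\d+", equation))
    if used - Counter(numbers):
        return False
    try:
        return abs(eval(equation) - target) < 1e-6
    except Exception:
        return False

print(verify_equation("50 * (25 - 6)", [25, 50, 75, 100, 3, 6], 950))  # True
print(verify_equation("3 * 3 * 100", [25, 50, 75, 100, 3, 6], 900))    # False, 3 is used twice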
The blog post includes interactive code that you can run in a Jupyter Notebook to train a model using GRPO and Q-LoRA. This is a great way to learn how to use TRL and GRPO, but it is very slow and requires a lot of compute. Additionally, I added a script and instructions to run the training on a node with multiple GPUs or on a SLURM cluster.
- Set up the development environment
- Generate training samples with reasoning prefix from the Countdown Game
- Train the model using GRPO (Educational part)
- Distributed Training example for GRPO using Deepspeed and vLLM
- Results and Training Observations
Note: This blog post is inspired by Jiayi Pan, who initially explored the idea and proved it with a small model.
Before we start, let’s take a look at Group Relative Policy Optimization (GRPO) and understand how it works.
Group Relative Policy Optimization (GRPO)
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for improving the reasoning capabilities of LLMs. It was introduced in the DeepSeekMath paper in the context of mathematical reasoning. GRPO modifies traditional Proximal Policy Optimization (PPO) by eliminating the need for a value function model. Instead, it estimates the baseline from group scores, reducing memory usage and computational overhead. GRPO, now also used by the Qwen team, can be used with rule-based/binary rewards as well as general reward models to improve models on helpfulness. At a high level, GRPO works as follows:
- Sampling: Generate multiple outputs for each prompt using the current policy
- Reward Scoring: Each generation is scored using a reward function, which can be rule-based or outcome-based
- Advantage Calculation: The average reward of the generated outputs is used as a baseline. The advantage of each solution within the group is then computed relative to this baseline, i.e. the reward is normalized within the group (see the short sketch after this list).
- Policy Optimization: The policy tries to maximize the GRPO objective, which includes the calculated advantages and a KL divergence term. This differs from PPO, which implements the KL term within the reward.
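To make the advantage step concrete, here is a minimal sketch of how group-relative advantages can be computed for a single prompt. This illustrates the idea, not TRL’s implementation; the reward values and group size are made up:

import torch

# Hypothetical rewards for G = 6 completions sampled for the same prompt,
# e.g. 1.0 if the completion solves the Countdown task and 0.0 otherwise.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])

# GRPO uses the group mean as baseline instead of a learned value function.
baseline = rewards.mean()

# Group-relative advantages: rewards are normalized within the group.
advantages = (rewards - baseline) / (rewards.std() + 1e-4)

# Completions that beat the group average get a positive advantage and are
# reinforced when the policy maximizes the clipped GRPO objective, which also
# includes a KL penalty towards the reference model.
print(advantages)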
1. Set up the development environment
Our first step is to install PyTorch and the Hugging Face libraries: vllm, trl, transformers and datasets. If you haven’t heard of trl yet, don’t worry. It is a library on top of transformers and datasets, which makes it easier to fine-tune and align open LLMs with methods like RLHF.
# Install Pytorch & other libraries, make sure to match your GPU driver version
%pip install "torch==2.5.1" tensorboard "setuptools<71.0.0" --index-url h