Scaling Reinforcement Learning (RL) has the potential to enhance model performance beyond conventional pretraining and post-training methods. Recent studies have demonstrated that RL can significantly improve the reasoning capabilities of models. For instance, DeepSeek-R1 has achieved state-of-the-art performance by integrating cold-start data and multi-stage training, enabling deep thinking and complex reasoning.
Our research explores the scalability of Reinforcement Learning (RL) and its impact on enhancing the intelligence of large language models. We are excited to introduce QwQ-32B, a model with 32 billion parameters that achieves performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated). This remarkable outcome underscores the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge. Furthermore, we have integrated agent-related capabilities into the reasoning model, enabling it to think critically while utilizing tools and adapting its reasoning based on environmental feedback. These advancements not only demonstrate the transformative potential of RL but also pave the way for further innovations in the pursuit of artificial general intelligence.
QwQ-32B is open-weight, available on Hugging Face and ModelScope under the Apache 2.0 license, and is accessible via Qwen Chat.
Performance
QwQ-32B is evaluated across a range of benchmarks designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. The results below highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.

Reinforcement Learning
We began with a cold-start checkpoint and implemented a reinforcement learning (RL) scaling approach driven by outcome-based rewards. In the initial stage, we scaled RL specifically for math and coding tasks. Rather than relying on traditional reward models, we utilized an accuracy verifier for math problems to ensure the correctness of final solutions, and a code execution server to assess whether the generated code successfully passes predefined test cases. As training episodes progressed, performance in both domains showed continuous improvement. After the first stage, we added a second stage of RL for general capabilities, trained with rewards from a general reward model and some rule-based verifiers. We found that this stage of RL training, with only a small number of steps, can increase performance on other general capabilities, such as instruction following, alignment with human preference, and agent performance, without a significant performance drop in math and coding.
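To make the outcome-based rewards concrete, the sketch below shows one way such verifiers could be expressed in Python. It is purely illustrative: the function names, the test-case format, and the exact-match comparison are simplifying assumptions, not the actual training infrastructure.

import subprocess
import tempfile

def math_reward(final_answer: str, reference: str) -> float:
    # Accuracy verifier: binary reward when the final answer matches the reference.
    # A real verifier would normalize equivalent forms (e.g. "1/2" vs "0.5").
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def code_reward(generated_code: str, test_cases: list[dict]) -> float:
    # Execution-based reward: fraction of predefined test cases the generated program passes.
    passed = 0
    for case in test_cases:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code)
            path = f.name
        try:
            result = subprocess.run(
                ["python", path],
                input=case["stdin"],
                capture_output=True,
                text=True,
                timeout=5,
            )
            if result.stdout.strip() == case["expected_stdout"].strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # non-terminating programs earn no credit for this case
    return passed / len(test_cases)

In either case the reward is attached to a whole sampled completion, which is what makes the signal outcome-based rather than a learned preference score over intermediate reasoning steps.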
Use QwQ-32B
Below are brief examples demonstrating how to use QwQ-32B via Hugging Face Transformers and the Alibaba Cloud DashScope API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

# Load the model and tokenizer; device_map="auto" places weights on available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
# Build the prompt string with the chat template and an open assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Reasoning traces can be long, so allow a generous generation budget.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# Keep only the newly generated tokens, then decode them.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
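QwQ-32B can also be called through the Alibaba Cloud DashScope API. The sketch below assumes DashScope's OpenAI-compatible mode with the model name "qwq-32b" and the base URL shown; check the DashScope documentation for the values that apply to your account and region. Streaming is used because reasoning models can emit a long chain of thought before the final answer.

import os
from openai import OpenAI

# The base_url and model name are assumptions for DashScope's OpenAI-compatible mode;
# verify them against the DashScope documentation before use.
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

stream = client.chat.completions.create(
    model="qwq-32b",
    messages=[{"role": "user", "content": "How many r's are in the word \"strawberry\""}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Some OpenAI-compatible endpoints expose the chain of thought as reasoning_content;
    # fall back to the regular content field otherwise.
    piece = getattr(delta, "reasoning_content", None) or delta.content
    if piece:
        print(piece, end="", flush=True)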
19 Comments
iamronaldo
This is insane. Matching deepseek but 20x smaller?
myky22
Not bad.
I have tried it in a current project (an online course) where Deepseek and Gemini have done a good job with a "stable" prompt, and my impression is:
-Somewhat simplified but original answers
We will have to keep an eye on it.
gagan2020
The Chinese strategy is to open-source the software part and earn on the robotics part. And they are already ahead of everyone in that game.
These things are pretty interesting as they develop. What will the US do to retain its power?
BTW I am Indian and we are not even in the race as a country. :(
Alex-Programs
This is ridiculous. 32B and beating deepseek and o1. And yet I'm trying it out and, yeah, it seems pretty intelligent…
Remember when models this size could just about maintain a conversation?
Leary
To test: https://chat.qwen.ai/ and select Qwen2.5-plus, then toggle QWQ.
jaggs
Nice. Hard to tell whether it's really on a par with o1 or R1, but it's definitely very impressive for a 32B model.
wbakst
actually insane how small the model is. they are only going to get better AND smaller. wild times
bearjaws
Available on ollama now as well.
nycdatasci
Wasn't this released in Nov 2024 as a "preview" with similarly impressive performance?
https://qwenlm.github.io/blog/qwq-32b-preview/
kelsey98765431
first thoughts: wow, this is a real reasoning model, not just a llama variant with an SFT. the chain of thought actually will go for a very long time on a seemingly simple question like writing a pi calculation in C. very interesting.
manmal
I guess I won’t be needing that 512GB M3 Ultra after all.
esafak
Impressive output but slow. I'd still pick Claude but ask QwQ for a second opinion.
antirez
Note the massive context length (130k tokens). Also because it would be kinda pointless to generate a long CoT without enough context to contain it and the reply.
EDIT: Here we are. My first prompt created a CoT so long that it catastrophically forgot the task (but I don't believe I was near 130k — using ollama with fp16 model). I asked one of my test questions with a coding question totally unrelated to what it says:
<QwQ output>
But the problem is in this question. Wait perhaps I'm getting ahead of
myself.
Wait the user hasn't actually provided a specific task yet. Let me check
again.
The initial instruction says:
"Please act as an AI agent that can perform tasks… When responding,
first output a YAML data structure with your proposed action, then wait
for feedback before proceeding."
But perhaps this is part of a system prompt? Wait the user input here
seems to be just "You will be given a problem. Please reason step by
step…" followed by a possible task?
</QwQ>
Note: Ollama "/show info" shows that the context size set is correct.
rvz
The AI race to zero continues to accelerate, with downloadable free AI models that have already won the race and are destroying closed-source frontier AI models.
They are once again getting squeezed in the middle and this is even before Meta releases Llama 4.
dr_dshiv
I love that emphasizing math learning and coding leads to general reasoning skills. Probably works the same in humans, too.
20x smaller than DeepSeek! How small can these go? What kind of hardware can run this?
samstave
>>In the initial stage, we scale RL specifically for math and coding tasks. Rather than relying on traditional reward models, we utilized an accuracy verifier for math problems to ensure the correctness of final solutions and a code execution server to assess whether the generated codes successfully pass predefined test cases
—
They should call this the siphon/sifter model of RL.
You siphon only the initial domains, then sift to the solution….
daemonologist
It says "wait" (as in "wait, no, I should do X") so much while reasoning it's almost comical. I also ran into the "catastrophic forgetting" issue that others have reported – it sometimes loses the plot after producing a lot of reasoning tokens.
Overall though quite impressive if you're not in a hurry.
TheArcane
chat.qwenlm.ai has quickly become the preferred choice for all my LLM needs. As accurate as Deepseek v3, but without the server issues.
This makes it even better!
dulakian
My informal testing puts it just under Deepseek-R1. Very impressive for 32B. It maybe thinks a bit too much for my taste. In some of my tests the thinking tokens were 10x the size of the final answer. I am eager to test it with function calling over the weekend.