Scaling Reinforcement Learning (RL) has the potential to enhance model performance beyond conventional pretraining and post-training methods. Recent studies have demonstrated that RL can significantly improve the reasoning capabilities of models. For instance, DeepSeek-R1 has achieved state-of-the-art performance by integrating cold-start data and multi-stage training, enabling deep thinking and complex reasoning.
Our research explores the scalability of Reinforcement Learning (RL) and its impact on enhancing the intelligence of large language models. We are excited to introduce QwQ-32B, a model with 32 billion parameters that achieves performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated). This remarkable outcome underscores the effectiveness of RL when applied to robust foundation models pretrained on extensive world knowledge. Furthermore, we have integrated agent-related capabilities into the reasoning model, enabling it to think critically while utilizing tools and adapting its reasoning based on environmental feedback. These advancements not only demonstrate the transformative potential of RL but also pave the way for further innovations in the pursuit of artificial general intelligence.
QwQ-32B is open-weight, available on Hugging Face and ModelScope under the Apache 2.0 license, and is accessible via Qwen Chat.
Performance
QwQ-32B is evaluated across a range of benchmarks designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. The results below highlight QwQ-32B’s performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.

Reinforcement Learning
We began with a cold-start checkpoint and implemented a reinforcement learning (RL) scaling approach driven by outcome-based rewards. In the initial stage, we scaled RL specifically for math and coding tasks. Rather than relying on traditional reward models, we utilized an accuracy verifier for math problems to ensure the correctness of final solutions, and a code execution server to assess whether the generated code successfully passes predefined test cases. As training episodes progressed, performance in both domains showed continuous improvement. After the first stage, we added a second stage of RL for general capabilities, trained with rewards from a general reward model and some rule-based verifiers. We found that this stage of RL training, with only a small number of steps, can increase performance on other general capabilities, such as instruction following, alignment with human preference, and agent performance, without a significant performance drop in math and coding.
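To make the outcome-based rewards concrete, the sketch below shows one way such verifiers could be expressed in Python. It is purely illustrative: the function names, the test-case format, and the exact-match comparison are simplifying assumptions, not the actual training infrastructure.

import subprocess
import tempfile

def math_reward(final_answer: str, reference: str) -> float:
    # Accuracy verifier: binary reward when the final answer matches the reference.
    # A real verifier would normalize equivalent forms (e.g. "1/2" vs "0.5").
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def code_reward(generated_code: str, test_cases: list[dict]) -> float:
    # Execution-based reward: fraction of predefined test cases the generated program passes.
    passed = 0
    for case in test_cases:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code)
            path = f.name
        try:
            result = subprocess.run(
                ["python", path],
                input=case["stdin"],
                capture_output=True,
                text=True,
                timeout=5,
            )
            if result.stdout.strip() == case["expected_stdout"].strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # non-terminating programs earn no credit for this case
    return passed / len(test_cases)

In either case the reward is attached to a whole sampled completion, which is what makes the signal outcome-based rather than a learned preference score over intermediate reasoning steps.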
Use QwQ-32B
Below are brief examples demonstrating how to use QwQ-32B via Hugging Face Transformers and the Alibaba Cloud DashScope API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

# Load the model and tokenizer; device_map="auto" places weights on available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
# Build the prompt string with the chat template and an open assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Reasoning traces can be long, so allow a generous generation budget.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# Keep only the newly generated tokens, then decode them.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
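QwQ-32B can also be called through the Alibaba Cloud DashScope API. The sketch below assumes DashScope's OpenAI-compatible mode with the model name "qwq-32b" and the base URL shown; check the DashScope documentation for the values that apply to your account and region. Streaming is used because reasoning models can emit a long chain of thought before the final answer.

import os
from openai import OpenAI

# The base_url and model name are assumptions for DashScope's OpenAI-compatible mode;
# verify them against the DashScope documentation before use.
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

stream = client.chat.completions.create(
    model="qwq-32b",
    messages=[{"role": "user", "content": "How many r's are in the word \"strawberry\""}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Some OpenAI-compatible endpoints expose the chain of thought as reasoning_content;
    # fall back to the regular content field otherwise.
    piece = getattr(delta, "reasoning_content", None) or delta.content
    if piece:
        print(piece, end="", flush=True)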
19 Comments
iamronaldo
This is insane. Matching deepseek but 20x smaller?
myky22
Not bad.
I have tried it in a current project (an online course) where Deepseek and Gemini have done a good job with a "stable" prompt, and my impression is:
-Somewhat simplified but original answers
We will have to keep an eye on it.
gagan2020
The Chinese strategy is to open-source the software part and earn on the robotics part. And they are already ahead of everyone in that game.
These things are pretty interesting as they develop. What will the US do to retain its power?
BTW I am Indian and we are not even in the race as a country. :(
Alex-Programs
This is ridiculous. 32B and beating deepseek and o1. And yet I'm trying it out and, yeah, it seems pretty intelligent…
Remember when models this size could just about maintain a conversation?
Leary
To test: https://chat.qwen.ai/ and select Qwen2.5-plus, then toggle QWQ.
jaggs
Nice. Hard to tell whether it's really on a par with o1 or R1, but it's definitely very impressive for a 32B model.
wbakst
actually insane how small the model is. they are only going to get better AND smaller. wild times
bearjaws
Available on ollama now as well.
nycdatasci
Wasn't this released in Nov 2024 as a "preview" with similarly impressive performance?
https://qwenlm.github.io/blog/qwq-32b-preview/
kelsey98765431
first thoughts: wow, this is a real reasoning model, not just a llama variant with an SFT. the chain of thought actually will go for a very long time on a seemingly simple question like writing a pi calculation in C. very interesting.
manmal
I guess I won’t be needing that 512GB M3 Ultra after all.
esafak
Impressive output but slow. I'd still pick Claude but ask QwQ for a second opinion.
antirez
Note the massive context length (130k tokens). Also because it would be kinda pointless to generate a long CoT without enough context to contain it and the reply.
EDIT: Here we are. My first prompt created a CoT so long that it catastrophically forgot the task (but I don't believe I was near 130k — using ollama with fp16 model). I asked one of my test questions with a coding question totally unrelated to what it says:
<QwQ output>
But the problem is in this question. Wait perhaps I'm getting ahead of
myself.
Wait the user hasn't actually provided a specific task yet. Let me check
again.
The initial instruction says:
"Please act as an AI agent that can perform tasks… When responding,
first output a YAML data structure with your proposed action, then wait
for feedback before proceeding."
But perhaps this is part of a system prompt? Wait the user input here
seems to be just "You will be given a problem. Please reason step by
step…" followed by a possible task?
</QwQ>
Note: Ollama "/show info" shows that the context size set is correct.
rvz
The AI race to zero continues to accelerate, with downloadable free AI models that have already won the race and are destroying closed-source frontier AI models.
They are once again getting squeezed in the middle and this is even before Meta releases Llama 4.
dr_dshiv
I love that emphasizing math learning and coding leads to general reasoning skills. Probably works the same in humans, too.
20x smaller than DeepSeek! How small can these go? What kind of hardware can run this?
samstave
>>In the initial stage, we scale RL specifically for math and coding tasks. Rather than relying on traditional reward models, we utilized an accuracy verifier for math problems to ensure the correctness of final solutions and a code execution server to assess whether the generated codes successfully pass predefined test cases
—
They should call this the siphon/sifter model of RL.
You siphon only the initial domains, then sift to the solution….
daemonologist
It says "wait" (as in "wait, no, I should do X") so much while reasoning it's almost comical. I also ran into the "catastrophic forgetting" issue that others have reported – it sometimes loses the plot after producing a lot of reasoning tokens.
Overall though quite impressive if you're not in a hurry.
TheArcane
chat.qwenlm.ai has quickly become the preferred choice for all my LLM needs. As accurate as Deepseek v3, but without the server issues.
This makes it even better!
dulakian
My informal testing puts it just under Deepseek-R1. Very impressive for 32B. It maybe thinks a bit too much for my taste. In some of my tests the thinking tokens were 10x the size of the final answer. I am eager to test it with function calling over the weekend.