Introduction
Last December, we launched QVQ-72B-Preview as an exploratory model, but it had many issues. Today, we are officially releasing the first version of QVQ-Max, our visual reasoning model. This model can not only “understand” the content in images and videos but also analyze and reason with this information to provide solutions. From math problems to everyday questions, from programming code to artistic creation, QVQ-Max has demonstrated impressive capabilities. Though this is just our first version, its potential is already eye-catching.

MathVision is a benchmark that aggregates challenging multimodal mathematical problems, and we use a model’s performance on it to gauge how well the model solves complex math. As shown in the figure, as we increase the maximum length of the model’s thinking process, its accuracy on MathVision improves continuously, demonstrating the potential of test-time scaling for visual reasoning.
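This scaling behavior is straightforward to probe. Below is a minimal sketch, assuming an OpenAI-compatible endpoint: the model identifier, the base_url, the placeholder dataset, and the substring answer check are illustrative assumptions rather than the official evaluation harness, and the generation cap here simply stands in for the thinking-length limit.

```python
# Sketch: sweep the maximum thinking budget and measure accuracy on
# MathVision-style problems through an OpenAI-compatible API.
# Model name, endpoint, dataset, and answer check are assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

# Tiny stand-in dataset; the real MathVision set is much larger.
mathvision_subset = [
    {"question": "What is the measure of the marked angle?",
     "image_url": "https://example.com/problem_001.png",
     "answer": "45"},
]

def solve(question, image_url, max_thinking_tokens):
    """Ask the model to reason about one multimodal problem."""
    resp = client.chat.completions.create(
        model="qvq-max",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        max_tokens=max_thinking_tokens,  # caps the reasoning + answer length
    )
    return resp.choices[0].message.content

def accuracy(dataset, max_thinking_tokens):
    """Fraction of problems whose output contains the reference answer."""
    correct = 0
    for item in dataset:
        prediction = solve(item["question"], item["image_url"], max_thinking_tokens)
        correct += item["answer"] in prediction  # naive answer check
    return correct / len(dataset)

# Longer thinking budgets should yield higher accuracy.
for budget in (2_000, 4_000, 8_000, 16_000):
    print(budget, accuracy(mathvision_subset, budget))
```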
In the following sections, we will discuss the design philosophy behind QVQ-Max, its actual capabilities, and what it can do for you.
Why Do We Need Visual Reasoning?
Traditional AI models mostly rely on text input for tasks such as answering questions, writing articles, or generating code. However, in real life, much of the information isn’t expressed through words but rather through images, charts, or even videos. A single image can contain rich details like colors, shapes, and spatial relationships. These elements are often more intuitive, but also more complex, than text.
For example, if you want to determine whether an architectural blueprint is reasonable, a description in words alone is hardly enough; you need to look at the drawing itself.