
Phi-4 Reasoning Models by meetpateltech
Microsoft continues to add to the conversation by unveiling its newest models, Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning.
A new era of AI
One year ago, Microsoft introduced small language models (SLMs) to customers with the release of Phi-3 on Azure AI Foundry, leveraging research on SLMs to expand the range of efficient AI models and tools available to customers.
Today, we are excited to introduce Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning—marking a new era for small language models and once again redefining what is possible with small and efficient AI.
Reasoning models, the next step forward
Reasoning models are trained to leverage inference-time scaling to perform complex tasks that demand multi-step decomposition and internal reflection. They excel in mathematical reasoning and are emerging as the backbone of agentic applications with complex, multi-faceted tasks. Such capabilities are typically found only in large frontier models. Phi-reasoning models introduce a new category of small language models. Using distillation, reinforcement learning, and high-quality data, these models balance size and performance. They are small enough for low-latency environments yet maintain strong reasoning capabilities that rival much bigger models. This blend allows even resource-limited devices to perform complex reasoning tasks efficiently.
Phi-4-reasoning and Phi-4-reasoning-plus
Phi-4-reasoning is a 14-billion parameter open-weight reasoning model that rivals much larger models on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated reasoning demonstrations from OpenAI o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage additional inference-time compute. The model demonstrates that meticulous data curation and high-quality synthetic datasets allow smaller models to compete with larger counterparts.
Phi-4-reasoning-plus builds on Phi-4-reasoning and is further trained with reinforcement learning to use more inference-time compute, consuming 1.5x more tokens than Phi-4-reasoning to deliver higher accuracy.
Despite their significantly smaller size, both models achieve better performance than OpenAI o1-mini and DeepSeek-R1-Distill-Llama-70B on most benchmarks, including mathematical reasoning and Ph.D.-level science questions. They also outperform the full DeepSeek-R1 model (with 671 billion parameters) on the AIME 2025 test, the 2025 qualifier for the USA Math Olympiad. Both models are available on Azure AI Foundry and Hugging Face.
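For readers who want to try the models from Python, here is a minimal sketch using Hugging Face transformers; the repo id "microsoft/Phi-4-reasoning" and the generation settings are assumptions based on this announcement, so check the model card for the exact details.

    # Minimal sketch, assuming the model is published as "microsoft/Phi-4-reasoning"
    # on Hugging Face; consult the model card for the exact repo id and settings.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-4-reasoning"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    messages = [{"role": "user", "content": "How many primes are less than 100?"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Reasoning models spend extra inference-time tokens on an internal
    # chain of thought, so leave generous headroom for generation.
    outputs = model.generate(inputs, max_new_tokens=4096)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))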

9 Comments
refulgentis
These look quite incredible. I work on a llama.cpp GUI wrapper and it's quite surprising to see how well Microsoft's Phi-4 releases set them apart as the only competition below ~7B; it'll probably take a year for the FOSS community to implement and digest it completely (it can do multimodal! TTS! STT! Conversation!)
gthompson512
Sorry if this comment is outdated or ill-informed, but it is hard to follow the current news. Do the Phi models still have issues with training on the test set, or have they fixed that?
csdvrx
Is anyone here using phi-4 multimodal for image-to-text tasks?
The phi models often punch above their weight, and I got curious about the vision models after reading the finetuning stories at https://unsloth.ai/blog/phi4
Since lmarena.ai only has the phi-4 text model, I've tried "phi-4 multimodal instruct" from openrouter.ai.
However, the results I get are far below what I would have expected.
Is there any "Microsoft validated" source (like https://chat.qwen.ai/c/guest for qwen) to easily try phi4 vision?
adt
https://lifearchitect.ai/models-table/
danielhanchen
We uploaded GGUFs for anyone who wants to run them locally.
[EDIT] – I fixed all chat templates, so no need for --jinja as of 10:00PM SF time.
Phi-4-mini-reasoning GGUF: https://huggingface.co/unsloth/Phi-4-mini-reasoning-GGUF
Phi-4-reasoning-plus-GGUF: https://huggingface.co/unsloth/Phi-4-reasoning-plus-GGUF
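As a rough sketch (not from the original comment), one way to run a downloaded quant locally is via llama-cpp-python; the GGUF filename below is a placeholder for whichever quant you grab from the repos above.

    # Rough sketch using llama-cpp-python; the filename is a placeholder for
    # whichever quant you download from the repos linked above.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Phi-4-mini-reasoning-Q4_K_M.gguf",  # placeholder filename
        n_ctx=8192,        # reasoning traces can run long
        n_gpu_layers=-1,   # offload all layers to GPU if one is available
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
        max_tokens=2048,
    )
    print(out["choices"][0]["message"]["content"])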
justanotheratom
is there a well-established tool-chain for finetuning these models?
behnamoh
what's Microsoft's plan here? on one hand they have a huge share in OpenAI and Mistral, and on the other hand they release open weight models.
gitroom
Honestly the Phi-4 stuff is starting to get real interesting for me. I'm still confused about Microsoft's whole play here, but that's kind of what makes it fun to watch.
mrbonner
The example prompt for reasoning models that never fails to amuse me: "How many letter 'r's in the word 'strrawberrry'?"
Phi-4-mini-reasoning: thought for 2 min 3 sec
<think>
Okay, let's see here. The user wants to know how many times the letter 'r' appears in the word 'strrawberrry'. Hmm, first I need to make sure I'm spelling the word correctly. Let me write it out slowly: S-T-R-A-W-B-E-R-R-Y.
…
But I'm overcomplicating this. Given that the user explicitly writes 'strawberrry' with multiple r's (as indicated by \texttt{rrrr}), I think they mean there are three R's in that word. Therefore, counting them would give an answer of 3.
</think>
The word "strrawberrry" contains *5* occurrences of the letter 'r'.
To determine this:
1. Break down the spelling: S-T-R-A-W-B-E-R-R-Y.
2. Identify each 'R':
– The first R is in the "straw" part (position 3).
– Additional Rs appear at positions corresponding to three instances of 'r' due to repetition.
*Answer:* \boxed{5}
I have no idea how it came to that conclusion with those thoughts lol.
The full transcript: https://pastebin.com/tyF5aSb9
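For what it's worth, a quick sanity check in Python agrees with the model's final answer of 5, even though the reasoning trace wandered:

    # Count the r's in the prompt's word directly.
    word = "strrawberrry"
    print(word.count("r"))                              # 5
    print([i for i, c in enumerate(word) if c == "r"])  # [2, 3, 8, 9, 10]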