ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high.
63 Comments
thm
I'm starting to be reminded of the razor blade business.
testfrequency
As a consumer, it is so exhausting keeping up with what model I should or can be using for the task I want to accomplish.
brap
Where's the comparison with Gemini 2.5 Pro?
jdross
The pace of notable releases across the industry right now is unlike any time I remember since I started doing this in the early 2000s. And it feels like it's accelerating.
typs
I’m not sure I fully understand the rationale for having newer mini versions (e.g. o3-mini, o4-mini) when previous thinking models (e.g. o1) and smart non-thinking models (e.g. gpt-4.1) exist. Does anyone here use these for anything?
morkalork
If the AI is smart, why not have it choose the model for the user?
firejake308
Not sure what the goal is with Codex CLI. It's not running a local LLM, right? Just a CLI that makes API calls from the terminal?
originalvichy
Is there a non-obvious reason why having LLMs use something like Python for queries that require calculation wasn't done from day one?
ksylvest
Are these available via the API? I'm getting back 'model_not_found' when testing.
planb
What is wrong with OpenAI? The naming of their models seems like it is intentionally confusing – maybe to distract from a lack of progress? Honestly, I have no idea which model to use for simple everyday tasks anymore.
falleng0d
Maybe they should ask the new models to generate a better name for themselves. It's getting quite confusing.
zapnuk
Surprisingly, they didn't provide a comparison to Sonnet 3.7 or Gemini 2.5 Pro. Probably because, while the new models are impressive, they're only slightly better by comparison.
Let's see what the pricing looks like.
burke
It's pretty frustrating to see a press release with "Try on ChatGPT" and then not see the models available even though I'm paying them $200/mo.
andrethegiant
Buried in the article, a new CLI for coding:
> Codex CLI is fully open-source at https://github.com/openai/codex today.
behnamoh
OpenAI be like:
xqcgrek2
Underwhelming. Cancelled my subscription in favor of Gemini 2.5 Pro.
georgewsinger
Very impressive! But on arguably the most important benchmark, SWE-bench Verified (real-world coding tasks), Claude 3.7 remains the champion.[1]
Incredible how resilient the Claude models have been at holding the best-in-class spot for coding.
[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).
oofbaroomf
Finally, a new SOTA model on SWE-bench. Love to see this progress, and nice to see OpenAI finally catching up in the coding domain.
rahimnathwani
I subscribe to pro but don't yet see the new models (either in the Android app or on the web version).
ApolloFortyNine
Maybe OpenAI needs an easy mode for all these people saying that five model choices (and that's only if you pay) is simply too confusing for them.
They even provide a description in the UI of each before you select it, and it defaults to a model for you.
If you just want an answer of what you should use and can't be bothered to research them, just use o3(4)-mini and call it a day.
davidkunz
I wish companies would adhere to a consistent naming scheme, like <name>-<params>-<cut-off-month>.
meetpateltech
o3 is cheaper than o1 (per 1M tokens):
• o3 pricing: […]
• o1 pricing: […]
o4-mini pricing remains the same as o3-mini.
ben_w
4o and o4 at the same time. Excellent work on the product naming, whoever did that.
evaneykelen
A suggestion for OpenAI to create more meaningful model names:
{Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty}
Where:
* Size is XS/S/M/L/XL/XXL to indicate overall capability level
* Quarter/Year like Q2-25
* Speed/Accuracy indicated as Fast/Balanced/Precise
* Optional specialty tag like Code/Vision/Science/etc
Example model names:
* L-Q2-25-Fast-Code (Large model from Q2 2025, optimized for speed, specializes in coding)
* M-Q4-24-Balanced (Medium model from Q4 2024, balanced speed/accuracy)
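As a rough illustration of how mechanical such a scheme would be to check (the specialty list below is just the examples from this comment), a small Python sketch:

    import re

    # Matches names like L-Q2-25-Fast-Code per the proposed
    # {Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty} scheme.
    PATTERN = re.compile(
        r"^(XS|S|M|L|XL|XXL)"            # size / capability level
        r"-Q([1-4])-(\d{2})"             # quarter and two-digit year, e.g. Q2-25
        r"-(Fast|Balanced|Precise)"      # speed/accuracy trade-off
        r"(?:-(Code|Vision|Science))?$"  # optional specialty tag
    )

    for name in ["L-Q2-25-Fast-Code", "M-Q4-24-Balanced", "o4-mini-high"]:
        print(name, "->", "valid" if PATTERN.match(name) else "not in scheme")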
_fat_santa
So at this point OpenAI has 6 reasoning models, 4 flagship chat models, and 7 cost-optimized models. That's 17 models in total, and that's not even counting their older models and more specialized ones. Compare this with Anthropic, which has 7 models in total and 2 main ones that they promote.
This is just getting to be a bit much; it seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things, and released it as an entirely new model rather than updating the existing ones. In fact, based on some of the other comments here, it sounds like these are just updates to their existing models, but they release them as new models to create more media buzz.
mentalgear
I have doubts whether the live stream was really live.
During the live-stream the subtitles are shown line by line.
When subtitles are auto-generated, they pop up word by word, which I assume would need to happen during a real live stream.
Line-by-line subtitles are shown when the uploader provides their own captions for an existing video, and the only way OpenAI could provide captions ahead of time is if the "live-stream" isn't actually live.
EcommerceFlow
A very subtle mention of o3-pro, which I'd imagine is now the most capable programming model. Excited to see when I get access to that.
Good thing I stopped working a few hours ago
EDIT: Altman tweeted o3-pro is coming out in a few weeks, looks like that guy misspoke :(
Workaccount2
o4-mini, not to be confused with 4o-mini.
spencersolberg
The Codex CLI looks nice, but it's a shame I have to bring my own API key when I already subscribe to ChatGPT Plus
oofbaroomf
When are they going to release o3-high? I don't think it's in the API, and I certainly don't see it in the web app (Pro).
carlita_express
> we’ve observed that large-scale reinforcement learning exhibits the same “more compute = better performance” trend observed in GPT‑series pretraining.
Didn’t the pivot to RL from pretraining happen because the scaling “law” didn’t deliver the expected gains? (Or at least because O(log) increases in model performance became unreasonably costly?) I see they’ve finally resigned themselves to calling these trends, not laws, but trends are often fleeting. Why should we expect this one to hold for much longer?
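For reference, the pretraining result being alluded to is usually summarized as a power law in compute (e.g. Kaplan et al., 2020), roughly:

    L(C) ≈ (C_c / C)^{α_C},   with α_C on the order of 0.05

With an exponent that small, every constant-factor reduction in loss costs a multiplicative increase in compute, which is why the gains read as roughly logarithmic and why "law" was always a generous word for an empirical trend.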
jcynix
To plan a visit to a dark sky place, I used duck.ai (DuckDuckGo's experimental AI chat feature) to ask five different AIs what date the new moon falls on in August 2025.
GPT-4o mini: The new moon in August 2025 will occur on August 12.
Llama 3.3 70B: The new moon in August 2025 is expected to occur on August 16, 2025.
Claude 3 Haiku: The new moon in August 2025 will occur on August 23, 2025.
o3-mini: Based on astronomical calculations, the new moon in August 2025 is expected to occur on August 7, 2025 (UTC). […]
Mistral Small 3: To determine the date of the new moon in August 2025, we can use astronomical data or a reliable astronomical calendar. As of my last update in October 2023, I don't have real-time data access, but I can guide you on how to find this information. […]
I got different answers, mostly wrong. My calendars (both paper and app versions) show August 23 as the date.
And btw, when I asked those AIs which entries in a robots.txt file would block most Chinese search engines, one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."
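For what it's worth, this is the kind of question that's trivial to answer locally with an ephemeris library instead of a chat model; a minimal check with PyEphem (assuming pip install ephem):

    import ephem  # PyEphem

    # First new moon on or after 1 August 2025 (UTC).
    print(ephem.next_new_moon('2025/08/01'))
    # -> 2025/8/23 ..., matching the calendar date above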
kumarm
Anyone got Codex working? After installing it and setting up my API key, I get this error:
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
bratao
Oh god. I'm Brazilian and can't get the "Verification" using my passport or ID. This is a very frightening future.
oofbaroomf
Still a knowledge cutoff of August 2023. That is a significant bottleneck to devs using it for AI stuff.
pton_xd
This reminds me of keeping up with all the latest JavaScript framework trivia circa the 2010s.
WhitneyLand
So it looks like no increase in context window size since it’s not mentioned anywhere.
I assume this announcement is all 256k, while the base model 4.1 just shot up this week to a million.
basisword
The user experience needs to be massively improved when it comes to model choice. How are average users supposed to know which model to pick? Why shouldn't I just always pick the newest or most powerful one? Why should I have to choose at all? I say this from the perspective of a ChatGPT user – I understand the different pricing on the API side helps people make decisions.
iamronaldo
Tyler Cowen seems convinced:
https://marginalrevolution.com/marginalrevolution/2025/04/o3…
jawiggins
In the examples they demonstrate tool use in the reasoning loop. The models pretty impressively recognize when they need some external data, and either run a web search or write and execute Python to solve intermediate steps.
To the extent that reasoning is noisy and models can go astray during it, this helps inject truth back into the reasoning loop.
Is there some well-known equivalent to Moore's Law for token use? We're headed in a direction where LLM control loops can run 24/7, generating tokens to reason about live sensor data and calling tools to act on it.
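To make the shape of that loop concrete, here is a minimal sketch; call_model, web_search, and run_python are placeholders for whatever model API and tools you'd actually wire in, not any specific vendor interface:

    import json

    def agent_loop(task, call_model, tools, max_steps=20):
        """Run a model in a loop, executing any tool calls it requests.
        call_model(messages) is assumed to return either {"answer": ...}
        or {"tool": name, "args": {...}}; tools maps names to callables
        such as web_search or run_python."""
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_model(messages)
            if "answer" in reply:  # model decided it has enough information
                return reply["answer"]
            # e.g. run a search or execute Python for an intermediate step
            result = tools[reply["tool"]](**reply["args"])
            # inject ground truth back into the reasoning loop
            messages.append({"role": "tool",
                             "content": json.dumps({"tool": reply["tool"], "result": result})})
        raise RuntimeError("agent did not finish within max_steps")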
eric-p7
Babe wake up a new LLM just dropped.
simianwords
I feel like the only reason o3 is better than o1 is the tool usage. With tool use, o1 could be similar to o3.
Topfi
I have barely found time to gauge 4.1's capabilities, so at this stage I'd rather focus on the ever-worsening names these companies bestow upon their models. To say that the USB-IF has found its match would be an understatement.
neya
The most annoying part of all this is that they replaced o1 with o3 without any notice or warning. This is why I hate proprietary models.
I_am_tiberius
What is the advantage of Pro over Plus subscriptions again?
djohnston
Any quick impressions of o3 vs o1? We've got one inference in our product that only o1 has seemed to handle well, wondering if o3 can replace it.
osigurdson
I have a very basic / stupid "Turing test", which is just to write a base 62 converter in C#. I would think this exact thing would be on GitHub somewhere (and thus in the weights), but it has always failed for me in the past (non-scientific / didn't try every single model).
Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today, this test passed which is cool.
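For reference, the conversion itself is only a handful of lines; here is the same test sketched in Python rather than C# (the C# version is structurally identical):

    ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

    def to_base62(n: int) -> str:
        if n == 0:
            return ALPHABET[0]
        digits = []
        while n > 0:
            n, r = divmod(n, 62)
            digits.append(ALPHABET[r])
        return "".join(reversed(digits))

    def from_base62(s: str) -> int:
        n = 0
        for ch in s:
            n = n * 62 + ALPHABET.index(ch)
        return n

    assert from_base62(to_base62(123456789)) == 123456789  # round trip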
pcdoodle
It seems to be getting better. I used to use my custom "Turbo Chad" GPT based on 4o and now the default models are similar. Is it learning from my previous annoyances?
It has been getting better IMO.
fpgaminer
On the vision side of things: I ran my torture test through it, and while it performed "well", about the same level as 4o and o1, it still fails to handle spatial relationships well, and did hallucinate some details. OCR is a little better it seems, but a more thorough OCR focused test would be needed to know for sure. My torture tests are more focused on accurately describing the content of images.
Both seem to be better at prompt following and have more up to date knowledge.
But honestly, if o3 was only at the same level as o1, it'd still be an upgrade since it's cheaper. o1 is difficult to justify in the API due to cost.
taytus
This is a mess. I do follow AI news, and I still do not know if this is "better/faster/cheaper" than 4.1.
Why are they doing this?
rsanheim
`ETOOMANYMODELS`
Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for which models to use, in particular for development? Not just OpenAI, but across the main cloud offerings and feasible local models?
I know there are the benchmarks, and directories like huggingface, and you can get a 'feel' for things by scanning threads here or other forums.
I'm thinking more of something that provides use-case tailored "top 3" choices by collecting and summarizing different data points. For example:
* agent & tool based dev (cloud) – [top 3 models]
* agent & tool based dev (local) – m1, m2, m,3
* code review / high level analysis – …
* general tech questions – …
* technical writing (ADRs, needs assessments, etc) – …
Part of the problem is how quickly the landscape changes every day, and also that relying on benchmarks alone isn't enough: it ignores cost, and more importantly it ignores actual user experience (which I realize is incredibly hard to aggregate & quantify).
Sol-
Interesting that using tools to zoom around the image is useful for the model. I was kind of assuming that these models were beyond such things and could attend to all aspects of the image simultaneously anyway, but perhaps their input is still limited in resolution? Very cool, in any case; spooky progress as always.
sbochins
So far, on the random coding/design question that I asked o1 last week, o3 did substantially better. It's more like a mid-level engineer and less like an intern.
tymscar
Gave Codex a go with o4-mini and it's disappointing…
Here you can see my tries. It fully fails on something a mid-level engineer could do after getting used to the tools:
https://xcancel.com/Tymscar/status/1912578655378628847
dr_kiszonka
I want to be excited about this, but after chatting with 4.1 about a simple app screenshot and watching it continuously forget and hallucinate, I am increasingly sceptical of OpenAI's announcements. (No coding involved, so the context window was likely < 10% full.)
iandanforth
o3 failed the first test I gave it. I wanted it to create a bar chart in Python of the first 10 Fibonacci numbers (it did this easily), and then use that image as input to generate an infographic of the chart with an animal theme. It failed in two ways. It didn't have access to the visual output from Python, and when I gave it a screenshot of that output, it failed in standard GenAI fashion by producing poor / incomplete text and not adhering exactly to the bar heights, which were critical in this case.
So: one failure that could be resolved with better integration on the back end, and one open problem with image generation in general.
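For context, the first step (the part it handled easily) is just a few lines of matplotlib; the failure was in getting the rendered image back into the loop, not in producing it:

    import matplotlib.pyplot as plt

    fibs = [1, 1]
    while len(fibs) < 10:
        fibs.append(fibs[-1] + fibs[-2])  # first 10 Fibonacci numbers

    plt.bar(range(1, 11), fibs)
    plt.title("First 10 Fibonacci numbers")
    plt.savefig("fib_chart.png")  # this rendered PNG is what the model never actually "sees"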
erikw
Interesting… I asked o3 for help writing a flake so I could install the latest WebStorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the WebStorm package, wrote the flake, calculated the SHA hash that NixOS needs, and wrote a test suite. The test suite indicates that it even did GUI testing, though I'm not sure whether that is a hallucination. Nevertheless, it one-shotted the installation instructions for me, and I don't see how it could have calculated the package hash without downloading the package, so I think this indicates some very interesting new capabilities. Highly impressive.
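On the hash point: a source hash in Nix is a function of the downloaded bytes (Nix uses SRI/NAR formats, but the principle is the same), so something equivalent to the following has to have run somewhere; the URL below is a made-up placeholder:

    import hashlib
    import urllib.request

    url = "https://download.jetbrains.com/webstorm/WebStorm-XXXX.Y.tar.gz"  # hypothetical URL
    data = urllib.request.urlopen(url).read()
    print(hashlib.sha256(data).hexdigest())  # can't be produced without fetching the actual bytes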
AcerbicZero
I can't even get ChatGPT to tell me which chatgpt to use.
lubitelpospat
Sooo… are any of these (or their distillations) getting open-sourced / open-weighted?
croemer
I wonder where o3 and o4-mini will land on the LMarena leaderboard. When might we see them there?
jumploops
The big step function here seems to be RL on tool calling.
Claude 3.7/3.5 are the only models that seem to handle "pure agent" use cases well (agent in a loop, not in an agentic workflow scaffold[0]).
OpenAI has made a bet on reasoning models as the core of a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked together a Claude Code workaround[1]).
o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.
My hunch is that these models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7.
tl;dr – GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions", which led to GPT-3.5/GPT-4 and ultimately the success of ChatGPT. This new agent paradigm requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input (roughly sketched below).
[0]https://www.anthropic.com/engineering/building-effective-age…
[1]https://github.com/1rgs/claude-code-proxy
[2]https://openai.com/index/openai-codex/
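A rough illustration of the three data shapes described in the tl;dr above (field names are purely illustrative, not any particular API):

    # 1. Completions era: one string in, predict the continuation.
    completion_example = {"prompt": "The capital of France is", "completion": " Paris."}

    # 2. Chat era: fine-tuned on role-tagged human/assistant turns.
    chat_example = [
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]

    # 3. Agent era: trajectories interleaving the model's own reasoning and tool I/O,
    #    with no human turns in the middle.
    agent_example = [
        {"role": "user", "content": "What's 37 * 91? Use the calculator tool."},
        {"role": "assistant", "thinking": "Arithmetic; call the calculator.",
         "tool_call": {"name": "calculator", "args": {"expr": "37 * 91"}}},
        {"role": "tool", "content": "3367"},
        {"role": "assistant", "content": "37 * 91 = 3367."},
    ]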
siva7
So what are they selling with the $200 subscription? Only a model that has now caught up with a competitor that sells for 1/10 of the price?