
OpenAI o3 and o4-mini by maheshrijal

63 Comments

  • Post Author
    thm
    Posted April 16, 2025 at 5:04 pm

    I'm starting to be reminded of the razor blade business.

  • Post Author
    testfrequency
    Posted April 16, 2025 at 5:11 pm

    As a consumer, it is so exhausting keeping up with what model I should or can be using for the task I want to accomplish.

  • Post Author
    brap
    Posted April 16, 2025 at 5:11 pm

    Where's the comparison with Gemini 2.5 Pro?

  • Post Author
    jdross
    Posted April 16, 2025 at 5:12 pm

    The pace of notable releases across the industry right now is unlike any time I remember since I started doing this in the early 2000s. And it feels like it's accelerating.

  • Post Author
    typs
    Posted April 16, 2025 at 5:13 pm

    I’m not sure I fully understand the rationale of having newer mini versions (e.g. o3-mini, o4-mini) when previous thinking models (e.g. o1) and smart non-thinking models (e.g. gpt-4.1) exist. Does anyone here use these for anything?

  • Post Author
    morkalork
    Posted April 16, 2025 at 5:13 pm

    If the AI is smart, why not have it choose the model for the user?

  • Post Author
    firejake308
    Posted April 16, 2025 at 5:14 pm

    Not sure what the goal is with Codex CLI. It's not running a local LLM, right? Just a CLI to make API calls from the terminal?

  • Post Author
    originalvichy
    Posted April 16, 2025 at 5:15 pm

    Is there a non-obvious reason using something like Python to solve queries requiring calculations was not used from day one with LLMs?

  • Post Author
    ksylvest
    Posted April 16, 2025 at 5:15 pm

    Are these available via the API? I'm getting back 'model_not_found' when testing.

  • Post Author
    planb
    Posted April 16, 2025 at 5:15 pm

    What is wrong with OpenAI? The naming of their models seems like it is intentionally confusing – maybe to distract from the lack of progress? Honestly, I have no idea which model to use for simple everyday tasks anymore.

  • Post Author
    falleng0d
    Posted April 16, 2025 at 5:15 pm

    Maybe they should ask the new models to generate a better name for themselves. It's getting quite confusing.

  • Post Author
    zapnuk
    Posted April 16, 2025 at 5:16 pm

    Surprisingly, they didn't provide a comparison to Sonnet 3.7 or Gemini Pro 2.5—probably because, while the new models are impressive, they're only slightly better in comparison.

    Let's see what the pricing looks like.

  • Post Author
    burke
    Posted April 16, 2025 at 5:16 pm

    It's pretty frustrating to see a press release with "Try on ChatGPT" and then not see the models available even though I'm paying them $200/mo.

  • Post Author
    andrethegiant
    Posted April 16, 2025 at 5:17 pm

    Buried in the article, a new CLI for coding:

    > Codex CLI is fully open-source at https://github.com/openai/codex today.

  • Post Author
    behnamoh
    Posted April 16, 2025 at 5:19 pm

    OpenAI be like:

        o1, o1-mini,
        o1-pro, o3,
        o4-mini, gpt-4,
        gpt-4o, gpt-4-turbo,
        gpt-4.5, gpt-4.1,
        gpt-4o-mini, gpt-4.1-mini,
        gpt-4.1-nano, gpt-3.5-turbo

  • Post Author
    xqcgrek2
    Posted April 16, 2025 at 5:19 pm

    Underwhelming. Cancelled my subscription in favor of Gemini Pro 2.5

  • Post Author
    georgewsinger
    Posted April 16, 2025 at 5:20 pm

    Very impressive! But under arguably the most important benchmark — SWE-bench verified for real-world coding tasks — Claude 3.7 still remains the champion.[1]

    Incredible how resilient Claude models have been for best-in-coding class.

    [1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

  • Post Author
    oofbaroomf
    Posted April 16, 2025 at 5:20 pm

    Finally, a new SOTA model on SWE-bench. Love to see this progress, and nice to see OpenAI finally catching up in the coding domain.

  • Post Author
    rahimnathwani
    Posted April 16, 2025 at 5:20 pm

      ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high.
    

    I subscribe to pro but don't yet see the new models (either in the Android app or on the web version).

  • Post Author
    ApolloFortyNine
    Posted April 16, 2025 at 5:21 pm

    Maybe OpenAI needs an easy mode for all these people saying 5 choices of models (and that's only if you pay) is simply too confusing for them.

    They even provide a description in the UI of each before you select it, and it defaults to a model for you.

    If you just want an answer of what you should use and can't be bothered to research them, just use o3(4)-mini and call it a day.

  • Post Author
    davidkunz
    Posted April 16, 2025 at 5:21 pm

    I wish companies would adhere to a consistent naming scheme, like <name>-<params>-<cut-off-month>.

  • Post Author
    meetpateltech
    Posted April 16, 2025 at 5:22 pm

    o3 is cheaper than o1. (per 1M tokens)

    • o3 Pricing:
      - Input: $10.00
      - Cached Input: $2.50
      - Output: $40.00

    • o1 Pricing:
      - Input: $15.00
      - Cached Input: $7.50
      - Output: $60.00

    o4-mini pricing remains the same as o3-mini.
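
    As a rough back-of-the-envelope illustration of that gap at the listed prices (the 10k-input / 2k-output request size is just an assumption for the example):

        # Hypothetical request: 10,000 input tokens, 2,000 output tokens, no cached input.
        PRICES = {  # USD per 1M tokens, from the list above
            "o3": {"input": 10.00, "output": 40.00},
            "o1": {"input": 15.00, "output": 60.00},
        }

        for model, p in PRICES.items():
            cost = 10_000 / 1e6 * p["input"] + 2_000 / 1e6 * p["output"]
            print(f"{model}: ${cost:.2f}")  # o3: $0.18, o1: $0.27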

  • Post Author
    ben_w
    Posted April 16, 2025 at 5:22 pm

    4o and o4 at the same time. Excellent work on the product naming, whoever did that.

  • Post Author
    evaneykelen
    Posted April 16, 2025 at 5:22 pm

    A suggestion for OpenAI to create more meaningful model names:

    {Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty}

    Where:

    * Size is XS/S/M/L/XL/XXL to indicate overall capability level

    * Quarter/Year like Q2-25

    * Speed/Accuracy indicated as Fast/Balanced/Precise

    * Optional specialty tag like Code/Vision/Science/etc

    Example model names:

    * L-Q2-25-Fast-Code (Large model from Q2 2025, optimized for speed, specializes in coding)

    * M-Q4-24-Balanced (Medium model from Q4 2024, balanced speed/accuracy)

  • Post Author
    _fat_santa
    Posted April 16, 2025 at 5:24 pm

    So at this point OpenAI has 6 reasoning models, 4 flagship chat models, and 7 cost optimized models. So that's 17 models in total and that's not even counting their older models and more specialized ones. Compare this with Anthropic that has 7 models in total and 2 main ones that they promote.

    This is just getting to be a bit much; it seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things, and released it as an entirely new model rather than updating the existing ones. In fact, based on some of the other comments here, it sounds like these are just updates to their existing models, but they release them as new models to create more media buzz.

  • Post Author
    mentalgear
    Posted April 16, 2025 at 5:27 pm

    I have doubts whether the live stream was really live.

    During the live-stream the subtitles are shown line by line.

    When subtitles are auto-generated, they pop up word by word, which I assume would need to happen during a real live stream.

    Line-by-line subtitles are shown when the uploader provides captions themselves for an existing video, so the only way OpenAI could provide captions ahead of time is if the "live-stream" isn't actually live.

  • Post Author
    EcommerceFlow
    Posted April 16, 2025 at 5:29 pm

    A very subtle mention of o3-pro, which I'd imagine is now the most capable programming model. Excited to see when I get access to that.

    Good thing I stopped working a few hours ago

    EDIT: Altman tweeted o3-pro is coming out in a few weeks, looks like that guy misspoke :(

  • Post Author
    Workaccount2
    Posted April 16, 2025 at 5:29 pm

    o4-mini, not to be confused with 4o-mini.

  • Post Author
    spencersolberg
    Posted April 16, 2025 at 5:32 pm

    The Codex CLI looks nice, but it's a shame I have to bring my own API key when I already subscribe to ChatGPT Plus

  • Post Author
    oofbaroomf
    Posted April 16, 2025 at 5:33 pm

    When are they going to release o3-high? I don't think it's in the API, and I certainly don't see it in the web app (Pro).

  • Post Author
    carlita_express
    Posted April 16, 2025 at 5:33 pm

    > we’ve observed that large-scale reinforcement learning exhibits the same “more compute = better performance” trend observed in GPT‑series pretraining.

    Didn’t the pivot to RL from pretraining happen because the scaling “law” didn’t deliver the expected gains? (Or at least because O(log) increases in model performance became unreasonably costly?) I see they’ve finally resigned themselves to calling these trends, not laws, but trends are often fleeting. Why should we expect this one to hold for much longer?

  • Post Author
    jcynix
    Posted April 16, 2025 at 5:34 pm

    To plan a visit to a dark sky place, I used duck.ai (Duckduckgo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025.

    GPT-4o mini: The new moon in August 2025 will occur on August 12.

    Llama 3.3 70B: The new moon in August 2025 is expected to occur on August 16, 2025.

    Claude 3 Haiku: The new moon in August 2025 will occur on August 23, 2025.

    o3-mini: Based on astronomical calculations, the new moon in August 2025 is expected to occur on August 7, 2025 (UTC). […]

    Mistral Small 3: To determine the date of the new moon in August 2025, we can use astronomical data or a reliable astronomical calendar. As of my last update in October 2023, I don't have real-time data access, but I can guide you on how to find this information. […]

    I got different answers, mostly wrong. My calendars (both paper and app versions) show 23 August as the date.

    And btw, when I asked those AIs which entries in a robots.txt file would block most Chinese search engines, one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."

  • Post Author
    kumarm
    Posted April 16, 2025 at 5:39 pm

    Anyone got Codex working? After installing it and setting up my API key, I get this error:

        system
          OpenAI rejected the request (request ID: req_06727eaf1c5d1e3f900760d10ca565a7). Please verify your settings and try again.


  • Post Author
    bratao
    Posted April 16, 2025 at 5:39 pm

    Oh god. I'm Brazilian and can't get the "Verification" using my passport or ID. This is a very frightening future.

  • Post Author
    oofbaroomf
    Posted April 16, 2025 at 5:40 pm

    Still a knowledge cutoff of August 2023. That is a significant bottleneck to devs using it for AI stuff.

  • Post Author
    pton_xd
    Posted April 16, 2025 at 5:41 pm

    This reminds me of keeping up with all the latest JavaScript framework trivia circa the ~2010s

  • Post Author
    WhitneyLand
    Posted April 16, 2025 at 5:42 pm

    So it looks like there's no increase in context window size, since it's not mentioned anywhere.

    I assume this announcement is all 256k, while the base model 4.1 just shot up this week to a million.

  • Post Author
    basisword
    Posted April 16, 2025 at 5:42 pm

    The user experience needs to be massively improved when it comes to model choice. How are average users supposed to know which model to pick? Why shouldn't I just always pick the newest or most powerful one? Why should I have to choose at all? I say this from the perspective of a ChatGPT user – I understand the different pricing on the API side helps people make decisions.

  • Post Author
    iamronaldo
    Posted April 16, 2025 at 5:47 pm

  • Post Author
    jawiggins
    Posted April 16, 2025 at 5:58 pm

    In the examples they demonstrate tool use in the reasoning loop. The models pretty impressively recognize they need some external data, and either complete a web search, or write and execute python to solve intermediate steps.

    To the extent that reasoning is noisy and models can go astray during it, this helps inject truth back into the reasoning loop.

    Is there some well-known equivalent to Moore's Law for token use? We're headed in a direction where LLM control loops can run 24/7, generating tokens to reason about live sensor data and calling tools to act on it.
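
    A minimal sketch of what such a control loop looks like (the model call is stubbed out; the tool name, message format, and dispatch logic are illustrative assumptions, not OpenAI's actual API):

        def run_python(code: str) -> str:
            # Illustrative "tool": evaluate a small Python expression and return the result.
            return str(eval(code, {"__builtins__": {}}, {"sum": sum, "range": range}))

        def call_model(messages: list) -> dict:
            # Stub for the LLM call; a real loop would hit the provider's API here.
            # This stand-in requests one calculation, then answers.
            if not any(m["role"] == "tool" for m in messages):
                return {"tool": "run_python", "arguments": {"code": "sum(range(1, 101))"}}
            return {"answer": f"The sum of 1..100 is {messages[-1]['content']}."}

        def agent_loop(user_prompt: str) -> str:
            messages = [{"role": "user", "content": user_prompt}]
            while True:
                reply = call_model(messages)
                if "tool" not in reply:  # the model decided it is done reasoning
                    return reply["answer"]
                # Execute the requested tool and feed the real result back in,
                # grounding the next reasoning step in actual data.
                result = run_python(reply["arguments"]["code"])
                messages.append({"role": "tool", "content": result})

        print(agent_loop("What is the sum of the integers from 1 to 100?"))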

  • Post Author
    eric-p7
    Posted April 16, 2025 at 6:00 pm

    Babe wake up a new LLM just dropped.

  • Post Author
    simianwords
    Posted April 16, 2025 at 6:00 pm

    I feel like the only reason o3 is better than o1 is the tool usage. With tool use, o1 could be similar to o3.

  • Post Author
    Topfi
    Posted April 16, 2025 at 6:08 pm

    I have barely found time to gauge 4.1's capabilities, so at this stage, I’d rather focus on the ever-worsening names these companies bestow upon their models. To say that the USB-IF has found its match would be an understatement.

  • Post Author
    neya
    Posted April 16, 2025 at 6:19 pm

    The most annoying part of all this is that they replaced o1 with o3 without any notice or warning. This is why I hate proprietary models.

  • Post Author
    I_am_tiberius
    Posted April 16, 2025 at 6:20 pm

    What is the advantage of Pro over Plus subscriptions again?

  • Post Author
    djohnston
    Posted April 16, 2025 at 6:21 pm

    Any quick impressions of o3 vs o1? We've got one inference in our product that only o1 has seemed to handle well, wondering if o3 can replace it.

  • Post Author
    osigurdson
    Posted April 16, 2025 at 6:22 pm

    I have a very basic / stupid "Turing test" which is just to write a base 62 converter in C#. I would think this exact thing would be in github somewhere (thus in the weights) but has always failed for me in the past (non-scientific / didn't try every single model).

    Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today, this test passed which is cool.
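
    For reference, the task itself is small: the test asks for C#, but the algorithm is just repeated divmod against a 62-character alphabet. A sketch of the same idea in Python (alphabet ordering and zero handling are arbitrary choices here):

        ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

        def to_base62(n: int) -> str:
            # Encode a non-negative integer as a base-62 string.
            if n == 0:
                return ALPHABET[0]
            digits = []
            while n:
                n, rem = divmod(n, 62)
                digits.append(ALPHABET[rem])
            return "".join(reversed(digits))

        def from_base62(s: str) -> int:
            # Decode a base-62 string back to an integer.
            n = 0
            for ch in s:
                n = n * 62 + ALPHABET.index(ch)
            return n

        assert from_base62(to_base62(1234567890)) == 1234567890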

  • Post Author
    pcdoodle
    Posted April 16, 2025 at 6:29 pm

    It seems to be getting better. I used to use my custom "Turbo Chad" GPT based on 4o and now the default models are similar. Is it learning from my previous annoyances?

    It has been getting better IMO.

  • Post Author
    fpgaminer
    Posted April 16, 2025 at 6:30 pm

    On the vision side of things: I ran my torture test through it, and while it performed "well", about the same level as 4o and o1, it still fails to handle spatial relationships well, and did hallucinate some details. OCR is a little better it seems, but a more thorough OCR focused test would be needed to know for sure. My torture tests are more focused on accurately describing the content of images.

    Both seem to be better at prompt following and have more up to date knowledge.

    But honestly, if o3 was only at the same level as o1, it'd still be an upgrade since it's cheaper. o1 is difficult to justify in the API due to cost.

  • Post Author
    taytus
    Posted April 16, 2025 at 6:35 pm

    This is a mess. I do follow AI news, and I don't know if this is "better/faster/cheaper" than 4.1.

    Why are they doing this?

  • Post Author
    rsanheim
    Posted April 16, 2025 at 6:42 pm

    `ETOOMANYMODELS`

    Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development? Not just openAI, but across the main cloud offerings and feasible local models?

    I know there are the benchmarks, and directories like huggingface, and you can get a 'feel' for things by scanning threads here or other forums.

    I'm thinking more of something that provides use-case tailored "top 3" choices by collecting and summarizing different data points. For example:

    * agent & tool based dev (cloud) – [top 3 models]
    * agent & tool based dev (local) – m1, m2, m3
    * code review / high level analysis – …
    * general tech questions – …
    * technical writing (ADRs, needs assessments, etc) – …

    Part of the problem is how quickly the landscape changes every day, and also that just relying on benchmarks isn't enough: it ignores cost and, more importantly, actual user experience (which I realize is incredibly hard to aggregate & quantify).

  • Post Author
    Sol-
    Posted April 16, 2025 at 6:44 pm

    Interesting that using tools to zoom around the image is useful for the model. I was kind of assuming that these models were beyond such things and could attend to all aspects of the image simultaneously, but perhaps their input is still limited in resolution? Very cool, in any case; spooky progress as always.

  • Post Author
    sbochins
    Posted April 16, 2025 at 6:51 pm

    So far, with my random coding/design question that I asked o1 last week, o3 did substantially better. It’s more like a mid-level engineer and less like an intern.

  • Post Author
    tymscar
    Posted April 16, 2025 at 6:52 pm

    Gave Codex a go with o4-mini and it's disappointing…
    Here you can see my attempts. It completely fails on something a mid-level engineer could do after getting used to the tools:
    https://xcancel.com/Tymscar/status/1912578655378628847

  • Post Author
    dr_kiszonka
    Posted April 16, 2025 at 6:59 pm

    I want to be excited about this, but after chatting with 4.1 about a simple app screenshot and watching it continuously forget and hallucinate, I am increasingly sceptical of OpenAI's announcements. (No coding involved, so the context window was likely < 10% full.)

  • Post Author
    iandanforth
    Posted April 16, 2025 at 7:01 pm

    o3 failed the first test I gave it. I wanted it to create a bar chart of the first 10 Fibonacci numbers using Python (it did this easily), and then use that image as input to generate an infographic of the chart with an animal theme. It failed in two ways: it didn't have access to the visual output from Python, and, when I gave it a screenshot of that output, it failed in standard GenAI fashion by producing poor/incomplete text and not adhering exactly to the bar heights, which were critical in this case.

    So one failure that could be resolved with better integration on the back end and then an open problem with image generation in general.
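
    For context, the first (successful) step of that test amounts to roughly the following, assuming the sequence starts 1, 1 and that matplotlib is the charting library used:

        import matplotlib.pyplot as plt

        # First 10 Fibonacci numbers, assuming the sequence starts 1, 1.
        fib = [1, 1]
        while len(fib) < 10:
            fib.append(fib[-1] + fib[-2])

        plt.bar(range(1, 11), fib)
        plt.xlabel("n")
        plt.ylabel("Fibonacci(n)")
        plt.title("First 10 Fibonacci numbers")
        plt.savefig("fib_bars.png")  # this image then becomes the input for the infographic step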

  • Post Author
    erikw
    Posted April 16, 2025 at 7:06 pm

    Interesting… I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA hash that NixOS needs, and wrote a test suite. The test suite indicates that it even did GUI testing- not sure whether that is a hallucination or not though. Nevertheless, it one-shotted the installation instructions for me, and I don't see how it could have calculated the package hash without downloading, so I think this indicates some very interesting new capabilities. Highly impressive.

  • Post Author
    AcerbicZero
    Posted April 16, 2025 at 7:18 pm

    I can't even get ChatGPT to tell me which chatgpt to use.

  • Post Author
    lubitelpospat
    Posted April 16, 2025 at 7:41 pm

    Sooo… are any of these (or their distils) getting open-sourced/open-weighted?

  • Post Author
    croemer
    Posted April 16, 2025 at 7:42 pm

    I wonder where o3 and o4-mini will land on the LMarena leaderboard. When might we see them there?

  • Post Author
    jumploops
    Posted April 16, 2025 at 8:11 pm

    The big step function here seems to be RL on tool calling.

    Claude 3.7/3.5 are the only models that seem to be able to handle "pure agent" use cases well (agent in a loop, not in an agentic workflow scaffold[0]; a rough sketch of the difference follows the links below).

    OpenAI has made a bet on reasoning models as the core to a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked a Claude Code workaround[1]).

    o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.

    My hunch is that these models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7.

    tl;dr – GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions", which led to GPT-3.5/GPT-4 and ultimately the success of ChatGPT. This new agent paradigm requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input.

    [0]https://www.anthropic.com/engineering/building-effective-age…

    [1]https://github.com/1rgs/claude-code-proxy

    [2]https://openai.com/index/openai-codex/
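
    A rough sketch of the distinction being drawn, with the scaffold steps and the action format both being illustrative assumptions rather than any vendor's actual implementation:

        # "Agentic workflow" scaffold: the developer fixes the sequence of steps.
        def scaffolded_workflow(task: str, llm) -> str:
            plan = llm(f"Write a plan for: {task}")
            code = llm(f"Implement this plan:\n{plan}")
            review = llm(f"Review this code:\n{code}")
            return llm(f"Revise the code given this review:\n{review}\n\n{code}")

        # "Pure agent" loop: the model picks the next tool and decides when to stop.
        # Here `llm` is assumed to return a dict like {"tool": ..., "args": ...} or {"done": ...}.
        def agent_in_a_loop(task: str, llm, tools: dict, max_steps: int = 20) -> str:
            history = [task]
            for _ in range(max_steps):
                action = llm("\n".join(history))
                if "done" in action:
                    return action["done"]
                observation = tools[action["tool"]](action.get("args"))
                history.append(f"{action['tool']} -> {observation}")
            return "gave up"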

  • Post Author
    waltercool
    Posted April 16, 2025 at 8:18 pm

    [dead]

  • Post Author
    siva7
    Posted April 16, 2025 at 8:23 pm

    So what are they selling with the $200 subscription? Only a model that has now caught up with a competitor that sells for 1/10 of the price?
