
OpenAI Researchers Find That AI Is Unable to Solve Most Coding Problems by colinprince


27 Comments

  • Post Author
    chasing0entropy
    Posted February 24, 2025 at 4:59 am

    The models were restricted from accessing the internet and forced to develop their own solutions internally.

    I think researchers will find that human coders are unable to solve most coding problems without access to the internet.

  • Post Author
    aszantu
    Posted February 24, 2025 at 5:01 am

    it's so much easier to learn from examples than from documentation in my opinion, documentation is, what I use when I want to know additional parameters or downsides of a functionality. I'm no coder though.

  • Post Author
    _def
    Posted February 24, 2025 at 5:04 am

    > even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year.

    "low/high level" starts to lose its meaning to me because it gets used in opposite ways

  • Post Author
    nostrebored
    Posted February 24, 2025 at 5:05 am

    This mirrors what I've seen. I've found that LLMs are most helpful in places where I have the most experience.

    Maybe this is because of explicitness in the prompt and preempting edge cases. Maybe it's because I know exactly what should be done. In these cases, I will still sometimes be surprised by a more complete answer than I was envisioning, or a few edge cases that weren't front of mind.

    But if I have _no_ idea things go wildly off course. I was doing some tricky frontend work with dynamically placed reactflow nodes and bezier curve edges. It took me easily 6 hours of bashing my head against the problem, and it was hard to stop using the assistant because of sunk cost. But I probably would have gotten more out of it and been faster if I'd just sat down and really broken down the problem for a few hours and then moved to implement.

    The most tempting part of LLMs is letting them figure out design when you're in a time crunch. And the way they solve things when you understand the domain and the bottom-up view of the work is deceptive in terms of capability.

    And in this case, it's hoping that people on Upwork understand their problems deeply. If they did, they probably wouldn't be posting on Upwork; that's what they're trying to pay for.

  • Post Author
    WiSaGaN
    Posted February 24, 2025 at 5:10 am

    So this is an in-house benchmark, coming after their undisclosed partnership with a previous benchmark company. I really hope they don't have their next model vastly outperform on this benchmark in the coming weeks.

  • Post Author
    pton_xd
    Posted February 24, 2025 at 5:15 am

    Well, OpenAI does currently have 288 job openings, including plenty of software engineers, so that says something.

  • Post Author
    mrayycombi
    Posted February 24, 2025 at 5:19 am

    Despite the lackluster coding performance, AI has PROVEN it's able to provide a rationale for profit-taking job cuts, layoffs, reduced stock grants, and increased executive bonuses.

    So it's not ALL bad news.

  • Post Author
    marban
    Posted February 24, 2025 at 5:22 am

    I prompted up a very basic Flask scaffold via Windsurf and once it reached a certain code size, it just started to remove or weirdly rewrite old parts to handle the context. ("You're right let's move that back in"). Didn't end well.

  • Post Author
    blindriver
    Posted February 24, 2025 at 5:22 am

    They should feed it bootcamp study materials and the Cracking the Coding Interview book to improve its ability to code.

  • Post Author
    axelfontaine
    Posted February 24, 2025 at 5:24 am

    … so far

  • Post Author
    jasonthorsness
    Posted February 24, 2025 at 5:27 am

    Half of the work is specification and iteration. I think there's a focus on full SWE replacement because it's sensational, but we'll more likely end up with SWEs able to focus on the less patterned or ambiguous work, made far more productive by the LLM handling subtasks more efficiently. I don't see how full SWE replacement can happen unless non-SWE people using LLMs become technical enough to get what they need out of them, in which case they have probably just become SWEs anyway.

  • Post Author
    realitysballs
    Posted February 24, 2025 at 5:29 am

    I believe the outcome of this type of article is actually positive. The ‘SWE-Lancer’ benchmark provides visibility into a more pragmatic assessment of LLM capabilities.

    Ironically, it actually refutes Altman's claims mentioned in the same article. Hard to replace engineers when you create a benchmark you can't score decently on.

  • Post Author
    rvz
    Posted February 24, 2025 at 5:33 am

    The benchmark for assessing AI models' 'coding' ability should be fixing bugs in actual real-world, production-grade repositories such as the Linux kernel, Firefox, SQLite, or other large, well-known codebases.

    Not these HackerRank, LeetCode, or past IOI and IMO problems, for which solutions already exist and the model is just reproducing the most optimal solution copied from someone else.

    If it can't manage most unseen coding problems with no existing solutions, what hope does it have of correctly explaining and fixing bugs in very complex repositories with 1M-10M+ lines of code?

  • Post Author
    m3kw9
    Posted February 24, 2025 at 5:35 am

    It solved a lot of mine.

  • Post Author
    spartanatreyu
    Posted February 24, 2025 at 5:35 am

    Link to the original paper: https://arxiv.org/pdf/2502.12115

    TL;DR:

    They tested with programming tasks and manager tasks.

    The vast majority of tasks given require bugfixes.

    Claude 3.5 Sonnet (the best performing LLM) passed 21.1% of programmer tasks and 47.0% of manager tasks.

    The LLMs have a higher probability of passing the tests when they are given more attempts, but there's not a lot of data showing where the improvement tails off (probably because of how expensive it is to run the tests).
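    (For reference, the "more attempts" curve in papers like this is usually computed with the standard pass@k estimator from the Codex paper; whether SWE-Lancer uses exactly that estimator is an assumption on my part. A minimal TypeScript sketch:)

```typescript
// Standard unbiased pass@k estimator (Chen et al., 2021): given n sampled
// attempts of which c passed, the probability that at least one of k
// randomly chosen attempts passes. Computed as a running product for
// numerical stability.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // every size-k subset must contain a passing attempt
  let allFail = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// Example: 10 attempts, 2 passing -> pass@1 = 0.200, pass@5 ≈ 0.778.
// The estimator assumes attempts are independent, which repeated LLM attempts
// at the same task often are not, so real curves tend to tail off earlier.
console.log(passAtK(10, 2, 1).toFixed(3));
console.log(passAtK(10, 2, 5).toFixed(3));
```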

    Personally, I have other concerns:

    – A human asked to review repeated LLM attempts to resolve a problem is going to review things less thoroughly after a few attempts, and over time will let false positives slip through

    – An LLM being asked to review repeated LLM attempts to resolve a problem is going to lead to the LLM convincing itself that it is correct with no regard for the reality of the situation.

    – LLM use increases code churn in a code base

    – Increased code churn is known to be bad for the health of projects

  • Post Author
    jr-ai-interview
    Posted February 24, 2025 at 5:40 am

    This has been obvious for a couple of years to anyone in the industry who has faced an onslaught of PRs to review from AI-enabled coders who sometimes can't even explain the changes being made at all. Great job calling it AI.

  • Post Author
    ldjkfkdsjnv
    Posted February 24, 2025 at 5:43 am

    I’ve got 15 years of coding experience at some of the biggest tech companies. My personal opinion is that most people have no clue how good these AI coding systems already are. If you use something like RepoPrompt, where you selectively choose which files to include in the prompt, and then also provide a clear description of what changes you want to make—along with a significant portion of the source code—a model like O1Pro will nail the solution the first time.

    The real issue is that people are not providing proper context to the models. Take any random coding library you’re interfacing with, like a Postgres database connection client. The LLM isn’t going to inherently know all of the different configurations and nuances of that client. However, if you pass in the source code for the client along with the relevant portions of your own codebase, you’re equipping the model with the exact information it needs.

    Every time you do this, including a large prompt size—maybe 50,000 to 100,000 tokens—you dramatically improve the model’s ability to generate an accurate and useful response. With a strong model like O1Pro, the results can be exceptional. The key isn’t that these models are incapable; it’s that users aren’t feeding them the right data.
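    As a rough sketch of that workflow (the file paths, helper name, and task text below are made up for illustration, not anything from RepoPrompt or O1Pro):

```typescript
import { readFileSync } from "fs";

// Concatenate hand-picked source files into one large prompt so the model
// sees the client library's actual code alongside a precise change request.
function buildPrompt(files: string[], task: string): string {
  const sections = files.map((path) => {
    const source = readFileSync(path, "utf8");
    return `### ${path}\n\`\`\`\n${source}\n\`\`\``;
  });
  return [
    "You are modifying the codebase below. Make only the requested change.",
    ...sections,
    `### Task\n${task}`,
  ].join("\n\n");
}

const prompt = buildPrompt(
  ["src/db/postgresClient.ts", "src/api/reports.ts"], // illustrative paths
  "Add statement timeout and connection pool size options to the client."
);

// Very rough size check (~4 characters per token) against the 50,000-100,000
// token budget mentioned above.
console.log(`prompt is roughly ${Math.round(prompt.length / 4)} tokens`);
```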

  • Post Author
    yieldcrv
    Posted February 24, 2025 at 5:43 am

    I saw this

    still, chain of thought is great for LeetCode 75

    Since interviewers “want to see how you think” (and get the right answer in less time than other candidates on average)

    I can now see how you’re supposed to think (and get the right answer in less time than other candidates on average, for now)

  • Post Author
    mohsen1
    Posted February 24, 2025 at 5:46 am

    I wonder how many of the solutions that pass the SWE-Lancer evals would not be accepted by the poster due to low quality.

    I've been trying so many things to automate solving bugs and adding features 100% by AI, and I have to admit it's been a failure. Without someone who can read and fully understand the AI-generated code and suggest improvements (an SWE in the loop), AI code is mostly not good.

  • Post Author
    pzo
    Posted February 24, 2025 at 5:49 am

    > The models weren't allowed to access the internet

    How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (write it in Notepad, then compile only once and execute once), without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?

    So I don't think it's the best comparison from which to make any judgement. A future benchmark should test agents that are allowed 5-10 minutes to solve the problem, with access to the internet, documentation, a linter, and a terminal with MCP servers.
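    Something like this minimal harness is what I have in mind; callModel and the tool implementations are placeholders, not a real SDK:

```typescript
// Sketch of an agentic benchmark harness: the model gets a few tools (shell,
// docs search) and a wall-clock budget before it must submit a patch.
type ToolCall = { tool: "shell" | "search_docs" | "submit_patch"; input: string };

async function runAgent(
  task: string,
  callModel: (transcript: string) => Promise<ToolCall>,
  tools: Record<string, (input: string) => Promise<string>>,
  budgetMs: number = 10 * 60 * 1000 // the 5-10 minute window suggested above
): Promise<string | null> {
  let transcript = `Task:\n${task}\n`;
  const deadline = Date.now() + budgetMs;
  while (Date.now() < deadline) {
    const call = await callModel(transcript);
    if (call.tool === "submit_patch") {
      return call.input; // final patch, graded against the hidden tests
    }
    const observation = await tools[call.tool](call.input);
    transcript += `\n[${call.tool}] ${call.input}\n${observation}\n`;
  }
  return null; // time ran out with no submission, scored as a failure
}
```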

  • Post Author
    rurp
    Posted February 24, 2025 at 5:50 am

    I recently had to do a one-off task using SQL in a way that I wasn't too familiar with. Since I could explain conceptually what I needed but didn't know all the right syntax this seemed like a perfect use case to loop in Claude.

    The first couple of back-and-forths went OK, but it quickly gave me some SQL that was invalid. I sent back the exact error and line number, and it responded by changing all of the aliases but repeating the same logical error. I tried again and this time it rewrote more of the code, but still used the exact same invalid operation.

    At that point I just went ahead and read some docs and other resources and solved things the traditional way.

    Given all of the hype around LLMs I'm honestly surprised to see top models still failing in such basic and straightforward ways. I keep trying to use LLMs in my regular work so that I'm not missing out on something potentially great but I still haven't hit a point where they're all that useful.

  • Post Author
    anandnair
    Posted February 24, 2025 at 5:52 am

    Coding, especially the type mentioned in the article (building an app based on a specification), is a highly complex task. It cannot be completed with a single prompt and an immediate, flawless result.

    This is why even most software projects (built by humans) go through multiple iterations before they work perfectly.

    We should consider a few things before asking, "Can AI code like humans?":

    – How did AI learn to code? What structured curriculum was used?

    – Did AI receive mentoring from an experienced senior who has solved real-life issues that the AI hasn't encountered yet?

    – Did the AI learn through hands-on coding or just by reading Stack Overflow?

    If we want to model AI as being on par with (or even superior to) human intelligence, don’t we at least need to consider how humans learn these complex skills?

    Right now, it's akin to giving a human thousands of coding books to "read" and "understand," but offering no opportunity to test their programs on a computer. That’s essentially what's happening!

    Without doing that, I don't think we'll ever be able to determine whether the limitation of current AI is due to its "low intelligence" or because it hasn’t been given a proper opportunity to learn.

  • Post Author
    DarkmSparks
    Posted February 24, 2025 at 5:54 am

    LLMs will never solve this problem; they are basically just glorified copy-and-paste engines, and solving real code problems requires invention, even for the most basic tasks. The best they will manage in their current direction is to reason that they don't have the capability or capacity to actually solve the problem, rather than just getting it wrong the vast majority of the time.

  • Post Author
    internet101010
    Posted February 24, 2025 at 6:04 am

    I believe it. I couldn't even get o1 or Claude 3.5 to write a Tampermonkey script that would turn off auto-scroll to bottom in LibreChat, even when uploading the HTML and JavaScript as context.

    Apparently it has to do with overflow-anchor or something in React? Idk. I gave up.
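    For what it's worth, the kind of thing I was asking for looks roughly like this (a TypeScript sketch; the selectors are guesses at LibreChat's markup, and whether CSS scroll anchoring is even the real culprit, rather than React-driven scrolling, is exactly what I couldn't figure out):

```typescript
// Inject a stylesheet that disables CSS scroll anchoring on the chat pane,
// so the browser stops snapping the view to the bottom as new content lands.
const style = document.createElement("style");
style.textContent = `
  /* Assumed container selectors; adjust to whatever LibreChat actually renders */
  [class*="message"], [class*="scroll"] {
    overflow-anchor: none !important;
  }
`;
document.head.appendChild(style);
```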

  • Post Author
    simonw
    Posted February 24, 2025 at 6:11 am

    I find the framing of this story quite frustrating.

    The purpose of new benchmarks is to gather tasks that today's LLMs can't solve comprehensively.

    If an AI lab built a benchmark that their models scored 100% on, they would have been wasting everyone's time!

    Writing a story that effectively says "ha ha ha, look at OpenAI's models failing to beat the new benchmark they created!" is a complete misunderstanding of the research.

  • Post Author
    siliconc0w
    Posted February 24, 2025 at 6:18 am

    Interesting that Claude wins despite the other models being more expensive and doing much better in the traditional benchmarks.

  • Post Author
    ChrisArchitect
    Posted February 24, 2025 at 6:25 am
