OpenAI researchers have admitted that even the most advanced AI models still are no match for human coders — even though CEO Sam Altman insists they will be able to beat “low-level” software engineers by the end of this year.
In a new paper, the company’s researchers found that even frontier models, or the most advanced and boundary-pushing AI systems, “are still unable to solve the majority” of coding tasks.
The researchers used a newly-developed benchmark called SWE-Lancer, built on more than 1,400 software engineering tasks from the freelancer site Upwork. Using the benchmark, OpenAI put three large language models (LLMs) — its own o1 reasoning model and flagship GPT-4o, as well as Anthropic’s Claude 3.5 Sonnet — to the test.
Specifically, the new benchmark evaluated how well the LLMs performed on two types of tasks from Upwork: individual contributor tasks, such as resolving bugs or implementing features, and management tasks, in which the model had to choose the best technical proposal for resolving an issue.
27 Comments
chasing0entropy
The models were restricted from accessing the internet and forced to develop their own solutions internally.
I think researchers will find that human coders are unable to solve most coding problems without access to the internet.
aszantu
It's so much easier to learn from examples than from documentation, in my opinion. Documentation is what I use when I want to know additional parameters or the downsides of a piece of functionality. I'm no coder, though.
_def
> even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year.
"low/high level" starts to lose its meaning to me because it gets used in opposite ways
nostrebored
This mirrors what I've seen. I've found that LLMs are most helpful in places where I have the most experience.
Maybe this is because of explicitness in the prompt and preempting edge cases. Maybe it's because I know exactly what should be done. In these cases, I will still sometimes be surprised by a more complete answer than I was envisioning, or a few edge cases that weren't front of mind.
But if I have _no_ idea things go wildly off course. I was doing some tricky frontend work with dynamically placed reactflow nodes and bezier curve edges. It took me easily 6 hours of bashing my head against the problem, and it was hard to stop using the assistant because of sunk cost. But I probably would have gotten more out of it and been faster if I'd just sat down and really broken down the problem for a few hours and then moved to implement.
The most tempting part of LLMs is letting them figure out design when you're in a time crunch. And the way it solves things when you understand the domain and the bottoms-up view of the work is deceptive in terms of capability.
And in this case, it relies on people on Upwork understanding their problems deeply. If they did, they probably wouldn't be posting on Upwork; that understanding is what they're trying to pay for.
WiSaGaN
So this is an in-house benchmark, after their undisclosed partnership with a previous benchmark company. I really hope their next model doesn't vastly outperform on this benchmark in the coming weeks.
pton_xd
Well, OpenAI does currently have 288 job openings, including plenty of software engineers, so that says something.
mrayycombi
Despite the lackluster coding performance, AI has PROVEN it's able to provide a rationale for profit-taking job cuts, layoffs, reduced stock grants, and increased executive bonuses.
So it's not ALL bad news.
marban
I prompted up a very basic Flask scaffold via Windsurf and once it reached a certain code size, it just started to remove or weirdly rewrite old parts to handle the context. ("You're right let's move that back in"). Didn't end well.
blindriver
They should feed it bootcamp study materials and Cracking the Coding Interview book in order to improve its ability to code.
axelfontaine
… so far
jasonthorsness
Half of the work is specification and iteration. I think there’s a focus on full SWE replacement because it’s sensational, but we’ll more end up with SWE able to focus on the less patterned or ambiguous work and made way more productive with the LLM handling subtasks more efficiently. I don’t see how full SWE replacement can happen unless non-SWE people using LLMs become technical enough to get what they need out of them, in which case they probably have just become SWE anyway.
realitysballs
I believe the outcome of this type of article is actually positive. The ‘SWE-Lancer’ benchmark provides visibility into a more pragmatic assessment of LLM capabilities.
Ironically, it actually refutes Altman's claims mentioned in the same article. Hard to replace engineers when you create a benchmark your own models can't score decently on.
rvz
The benchmark for assessing AI models' 'coding' ability should be based on actual real-world, production-grade repositories and on fixing bugs in them, such as the Linux kernel, Firefox, SQLite, or other large, well-known codebases.
Not these HackerRank, LeetCode, or past IOI and IMO problems that we already have the solutions to, where the model is just reproducing the most optimal solution copied from someone else.
If it can't manage most unseen coding problems that have no prior solutions, what hope does it have of correctly explaining and fixing bugs in very complex repositories with 1M-10M+ lines of code?
m3kw9
It solved a lot of mine
spartanatreyu
Link to the original paper: https://arxiv.org/pdf/2502.12115
TL;DR:
They tested both programming tasks and manager tasks.
The vast majority of tasks given require bugfixes.
Claude 3.5 Sonnet (the best performing LLM) passed 21.1% of programmer tasks and 47.0% of manager tasks.
The LLMs have a higher probability of passing the tests when they are given more attempts, but there's not a lot of data showing where the improvement tails off (probably because of how expensive the tests are to run).
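For context, the usual way to quantify "more attempts helps" is the pass@k estimator from the original Codex paper; the sketch below is my own illustration of that curve, not necessarily how SWE-Lancer reports its numbers, and the 21% figure is just borrowed from the Claude pass rate above.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn from
    n recorded attempts of which c passed, solves the task."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Rough illustration: if ~21 of 100 attempts pass, pass@k climbs quickly
# with k before it starts to flatten out.
for k in (1, 2, 5, 10):
    print(f"pass@{k} = {pass_at_k(100, 21, k):.2f}")
```

That curve climbing fast is exactly why "give it more retries" looks impressive in a benchmark and is much less reassuring in a workflow where a human has to check every retry.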
Personally, I have other concerns:
– A human being asked to review repeated LLM attempts to resolve a problem is going to lead that human to review things less thoroughly after a few attempts and over time is going to let false positives slip through
– An LLM being asked to review repeated LLM attempts to resolve a problem is going to lead to the LLM convincing itself that it is correct with no regard for the reality of the situation.
– LLM use increases code churn in a code base
– Increased code churn is known to be bad for the health of projects
jr-ai-interview
This has been obvious for a couple of years to anyone in the industry who has faced an onslaught of PRs to review from AI-enabled coders who sometimes can't even explain the changes being made at all. Great job calling it AI.
ldjkfkdsjnv
I’ve got 15 years of coding experience at some of the biggest tech companies. My personal opinion is that most people have no clue how good these AI coding systems already are. If you use something like RepoPrompt, where you selectively choose which files to include in the prompt, and then also provide a clear description of what changes you want to make—along with a significant portion of the source code—a model like O1Pro will nail the solution the first time.
The real issue is that people are not providing proper context to the models. Take any random coding library you’re interfacing with, like a Postgres database connection client. The LLM isn’t going to inherently know all of the different configurations and nuances of that client. However, if you pass in the source code for the client along with the relevant portions of your own codebase, you’re equipping the model with the exact information it needs.
Every time you do this, including a large prompt size—maybe 50,000 to 100,000 tokens—you dramatically improve the model’s ability to generate an accurate and useful response. With a strong model like O1Pro, the results can be exceptional. The key isn’t that these models are incapable; it’s that users aren’t feeding them the right data.
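A minimal sketch of the workflow described above, assuming the official OpenAI Python client; the file paths, model name, and prompt text are placeholders I've invented for illustration, not anything from the parent comment or the paper.

```python
from pathlib import Path
from openai import OpenAI  # assumes the official OpenAI Python SDK

# Hand-picked files, RepoPrompt-style: the client library you're interfacing
# with plus the parts of your own codebase that the change touches.
FILES = ["db/postgres_client.py", "app/queries.py"]  # hypothetical paths

context = "\n\n".join(
    f"### {path}\n{Path(path).read_text()}" for path in FILES
)

client = OpenAI()
response = client.chat.completions.create(
    model="o1",  # placeholder; use whichever reasoning model you have access to
    messages=[{
        "role": "user",
        "content": (
            f"{context}\n\n"
            "Add connection pooling to the Postgres client above and "
            "update every call site in app/queries.py accordingly."
        ),
    }],
)
print(response.choices[0].message.content)
```

The point is the same one made above: most of the value comes from choosing which 50,000 to 100,000 tokens of real source to include, not from clever prompt wording.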
yieldcrv
I saw this
still, chain of thought is great for LeetCode 75
Since interviewers “want to see how you think” (and get the right answer in less time than other candidates on average)
I can now see how you’re supposed to think (and get the right answer in less time than other candidates on average, for now)
mohsen1
I wonder how many of the solutions that pass the SWE-Lancer evals would not be accepted by the poster due to low quality.
I've been trying so many things to automate solving bugs and adding features 100% by AI, and I have to admit it's been a failure. Without someone who can read and fully understand the AI-generated code and suggest improvements (an SWE in the loop), AI code is mostly not good.
pzo
> The models weren't allowed to access the internet
How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (write it in Notepad, then compile only once and execute once), without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
I don't think that's the best comparison on which to base any judgement. Future benchmarks should test agents that are allowed 5-10 minutes to solve the problem, with access to the internet, documentation, a linter, and a terminal via MCP servers.
rurp
I recently had to do a one-off task using SQL in a way that I wasn't too familiar with. Since I could explain conceptually what I needed but didn't know all the right syntax this seemed like a perfect use case to loop in Claude.
The first couple back and forths went ok but it quickly gave me some SQL that was invalid. I sent back the exact error and line number and it responded by changing all of the aliases but repeated the same logical error. I tried again and this time it rewrote more of the code, but still used the exact same invalid operation.
At that point I just went ahead and read some docs and other resources and solved things the traditional way.
Given all of the hype around LLMs I'm honestly surprised to see top models still failing in such basic and straightforward ways. I keep trying to use LLMs in my regular work so that I'm not missing out on something potentially great but I still haven't hit a point where they're all that useful.
anandnair
Coding, especially the type mentioned in the article (building an app based on a specification), is a highly complex task. It cannot be completed with a single prompt and an immediate, flawless result.
This is why even most software projects (built by humans) go through multiple iterations before they work perfectly.
We should consider a few things before asking, "Can AI code like humans?":
– How did AI learn to code? What structured curriculum was used?
– Did AI receive mentoring from an experienced senior who has solved real-life issues that the AI hasn't encountered yet?
– Did the AI learn through hands-on coding or just by reading Stack Overflow?
If we want to model AI as being on par with (or even superior to) human intelligence, don’t we at least need to consider how humans learn these complex skills?
Right now, it's akin to giving a human thousands of coding books to "read" and "understand," but offering no opportunity to test their programs on a computer. That’s essentially what's happening!
Without doing that, I don't think we'll ever be able to determine whether the limitation of current AI is due to its "low intelligence" or because it hasn’t been given a proper opportunity to learn.
DarkmSparks
LLMs will never solve this problem; they are basically just glorified copy-and-paste engines, and solving real code problems requires invention, even for the most basic tasks. The best they will manage in their current direction is to reason that they don't have the capability or capacity to actually solve the problem, rather than just getting it wrong the vast majority of the time.
internet101010
I believe it. I couldn't even get o1 or Claude 3.5 to write a Tampermonkey script that would turn off auto-scroll to bottom in LibreChat, even when uploading the HTML and JavaScript as context.
Apparently it has to do with overflow-anchor or something in React? Idk. I gave up.
simonw
I find the framing of this story quite frustrating.
The purpose of new benchmarks is to gather tasks that today's LLMs can't solve comprehensively.
If an AI lab built a benchmark that their models scored 100% on, they would have been wasting everyone's time!
Writing a story that effectively says "ha ha ha, look at OpenAI's models failing to beat the new benchmark they created!" is a complete misunderstanding of the research.
siliconc0w
Interesting that Claude wins despite the other models being more expensive and doing much better in the traditional benchmarks.
ChrisArchitect
Previously on source: https://news.ycombinator.com/item?id=43086347