AI Blindspots
Blindspots in LLMs I've noticed while AI coding. Sonnet family emphasis. Maybe I will eventually suggest Cursor rules for these problems.
- Stop Digging
- Black Box Testing
- Preparatory Refactoring
- Stateless Tools
- Bulldozer Method
- Requirements, not Solutions
- Use Automatic Code Formatting
- Keep Files Small
- Read the Docs
- Walking Skeleton
- Use Static Types
- Use MCP Servers
- Mise en Place
- Respect the Spec
- Memento
- Scientific Debugging
- The tail wagging the dog
- Know Your Limits
- Culture Eats Strategy
- Rule of Three
This site is made with Hugo ʕ•ᴥ•ʔ Bear.
19 Comments
ezyang
Hi Hacker News! One of the things about this blog that has gotten a bit unwieldy as I've added more entries is that it's a sort of undifferentiated pile of posts. I want some sort of organization system but I haven't found one that's good. Very open to suggestions!
datadrivenangel
Almost all of these are good things to consider with human coders as well. Product managers take note!
https://ezyang.github.io/ai-blindspots/requirements-not-solu…
fizx
The community seems rather divided as to whether these are intrinsic, or whether we can solve them with today's tech plus more training, heuristics, and workarounds.
mystified5016
Recently I've been writing a resume/hire-me website. I'm not a stellar writer, but I'm alright, so I've been asking various LLMs to review it by just dropping the HTML file in.
Every single one has completely ignored the "Welcome to nginx!" header at the top of the page. I'd left it in, half as a joke to amuse myself, but I expected it would get some kind of reaction from the LLMs, even if just an "it seems you may have forgotten this line."
Kinda weird. I even tried guiding them into seeing it without explicitly mentioning it and I could not get a response.
antasvara
This highlights a thing I've seen with LLMs generally: they make different mistakes than humans. This makes catching the errors much more difficult.
What I mean by this is that we have thousands of years of experience catching human mistakes. As such, we're really good at designing systems that catch (or work around) human mistakes and biases.
LLMs, while impressive and sometimes less mistake-prone than humans, make errors in a fundamentally different manner. We just don't have the intuition and understanding of the way that LLMs "think" (in a broad sense of the word). As such, we have a hard time designing systems that account for this and catch the errors.
teraflop
> I had some test cases with hard coded numbers that had wobbled and needed updating. I simply asked the LLM to keep rerunning the test and updating the numbers as necessary.
Why not take this a step farther and incorporate this methodology directly into your test suite? Every time you push a code change, run the new version of the code and use it to automatically update the "expected" output. That way you never have to worry about failures at all!
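For the record, the auto-update pattern being (sarcastically) proposed looks roughly like this; a minimal pytest-style sketch with a hypothetical UPDATE_GOLDEN flag, after which the test can never really fail:

```python
import json
import os
from pathlib import Path

GOLDEN = Path("tests/golden/output.json")

def compute_output() -> dict:
    # Stand-in for the real code under test.
    return {"total": 42, "items": [1, 2, 3]}

def test_output_matches_golden():
    actual = compute_output()
    # Rewriting the "expected" file whenever it disagrees blesses every
    # behavior change automatically, which is exactly the joke: the test
    # can no longer catch a regression.
    if os.environ.get("UPDATE_GOLDEN") == "1" or not GOLDEN.exists():
        GOLDEN.parent.mkdir(parents=True, exist_ok=True)
        GOLDEN.write_text(json.dumps(actual, indent=2))
    expected = json.loads(GOLDEN.read_text())
    assert actual == expected
```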
Mc91
One thing I do is go to Leetcode, see the optimal big O time and space solutions, then give the LLM the Leetcode medium/hard problem, and limit it to the optimal big O time/space solution and suggest the method (bidirectional BFS). I ask for the solution in some fairly mainstream modern language (although not Javascript, Java or Python). I also say to do it as compact as possible. Sometimes I reiterate that.
It's just a function usually, but it does not always compile. I'd set this as a low bar for programming. We haven't even gotten into classes, architecture, badly-defined specifications and so on.
LLMs are useful for programming, but I'd want them to clear this low hurdle first.
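The kind of task being described might look like LeetCode's Word Ladder problem; below is a minimal sketch of an optimal bidirectional-BFS solution, written in Python purely for illustration (the commenter asks the LLM for other languages):

```python
from string import ascii_lowercase

def ladder_length(begin: str, end: str, word_list: list[str]) -> int:
    """Shortest transformation sequence length, or 0 if none exists."""
    words = set(word_list)
    if end not in words:
        return 0
    # Bidirectional BFS: always expand the smaller frontier, and stop
    # as soon as the two frontiers meet.
    front, back, steps = {begin}, {end}, 1
    while front and back:
        if len(front) > len(back):
            front, back = back, front
        steps += 1
        next_front = set()
        for word in front:
            for i, ch in enumerate(word):
                for c in ascii_lowercase:
                    if c == ch:
                        continue
                    cand = word[:i] + c + word[i + 1:]
                    if cand in back:          # frontiers met
                        return steps
                    if cand in words:
                        words.discard(cand)   # mark visited
                        next_front.add(cand)
        front = next_front
    return 0

# ladder_length("hit", "cog", ["hot","dot","dog","lot","log","cog"]) == 5
```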
logicchains
I found Gemini Flash Thinking Experimental is almost unusable in an agent workflow because it'll eventually accidentally remove a closing bracket, breaking compilation, and be unable to identify and fix the issue even with many attempts. Maybe it has trouble counting/matching braces due to fewer layers?
taberiand
Based on the list, LLMs are at a "very smart junior programmer" level of coding – though with a much broader knowledge base than you'd expect from even a senior. They lack bigger-picture thinking, and default to doing what is asked of them instead of what needs to be done.
I expect the models will continue improving though, I feel like most of it comes down to the ephemeral nature of their context window / the ability to recall and attach relevant information to the working context when prompted.
dataviz1000
Are you using Cursor? I'm using GitHub Copilot in VS Code and I'm wondering whether I'd get more efficiency from a different coding assistant.
boredtofears
Great read, I can definitely confirm a lot of these myself. Would be nice to see this aggregated into some kind of "best practices" document (although hard to say how quickly it'd be out of date).
submeta
> Preparatory refactoring
> Current LLMs, without a plan that says they should refactor first, don’t decompose changes in this way. They will try to do everything at once.
Just today I learned this the hard way. I had created an app for my spouse and myself for sharing and reading news articles, some of them behind paywalls.
Using Cursor, I have a FastAPI backend and a React frontend. When I tasked Cursor with adding extraction of the article text as markdown and then summarizing it, both using OpenAI, the chaos began. Cursor (with the help of Claude 3.7) tackled everything at once and then some. It started writing a module for using OpenAI, then it also changed the frontend to show not only the title and URL but also the extracted markdown and the summary; in doing so it screwed up my UI, deleted some rows in my database, and came up with a module for interacting with OpenAI that did not work. The extraction was screwed, the summary as well.
All of this despite me having detailed cursorrules.
That's when I realized: divide and conquer. Ask it to write one function that works, then one class where the function becomes a method, test it, then move on to the next function, until every piece is working and I can glue them together.
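A rough sketch of that progression, using the openai v1 Python SDK; the names (summarize, ArticleProcessor, the model string) are hypothetical illustrations, not the commenter's actual code:

```python
from openai import OpenAI  # assumes the openai v1 Python SDK

client = OpenAI()

# Step 1: one standalone function, tested on its own before anything else.
def summarize(markdown: str, model: str = "gpt-4o-mini") -> str:
    """Return a short summary of an article given as markdown."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Summarize this article in 3 sentences:\n\n{markdown}"}],
    )
    return resp.choices[0].message.content

# Step 2: only once that works, wrap it in a class the FastAPI layer can use.
class ArticleProcessor:
    def __init__(self, model: str = "gpt-4o-mini") -> None:
        self.model = model

    def summarize(self, markdown: str) -> str:
        return summarize(markdown, model=self.model)
```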
colonCapitalDee
> Preparatory Refactoring says that you should first refactor to make a change easy, and then make the change. The refactor change can be quite involved, but because it is semantics preserving, it is easier to evaluate than the change itself.
> In human software engineering, a common antipattern when trying to figure out what to do is to jump straight to proposing solutions, without forcing everyone to clearly articulate what all the requirements are. Often, your problem space is constrained enough that once you write down all of the requirements, the solution is uniquely determined; without the requirements, it’s easy to devolve into a haze of arguing over particular solutions.
> When you’re learning to use a new framework or library, simple uses of the software can be done just by copy pasting code from tutorials and tweaking them as necessary. But at some point, it’s a good idea to just slog through reading the docs from top-to-bottom, to get a full understanding of what is and is not possible in the software.
> The Walking Skeleton is the minimum, crappy implementation of an end-to-end system that has all of the pieces you need. The point is to get the end-to-end system working first, and only then start improving the various pieces.
> When there is a bug, there are broadly two ways you can try to fix it. One way is to randomly try things based on vibes and hope you get lucky. The other is to systematically examine your assumptions about how the system works and figure out where reality mismatches your expectations.
> The Rule of Three in software says that you should be willing to duplicate a piece of code once, but on the third copy you should refactor. This is a refinement on DRY (Don’t Repeat Yourself) accounting for the fact that it might not necessarily be obvious how to eliminate a duplication, and waiting until the third occurrence might clarify.
These are lessons that I've learned the hard way (for some definition of "learned"; these things are simple but not easy), but I've never seen them phrased so succinctly and accurately before. Well done OP!
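To make the last of those quotes, the Rule of Three, concrete, here is a minimal Python sketch with made-up names:

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    price: float
    qty: int

# The first and second near-identical copies were tolerated:
#   order_total(items) = sum(i.price * i.qty for i in items) * 1.08
#   quote_total(items) = sum(i.price * i.qty for i in items) * 1.08
# A third caller (invoices) is the signal to extract the helper, now
# that it is clear what varies (only the tax rate) and what does not.
def total_with_tax(items: list[LineItem], tax_rate: float = 0.08) -> float:
    return sum(i.price * i.qty for i in items) * (1 + tax_rate)
```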
admiralrohan
Even in the age of vibe coding, I always try to learn as much as possible.
For example, yesterday I was working with the animation library Motion, which I had never worked with before. I used the code suggested by the AI, but at least picked up 2-3 basic animation concepts while reviewing the code.
It's the kind of unfocused, passive learning I've always tried, even before AI.
akomtu
LLMs aren't AI. They are more like librarians with eidetic memory: they can discuss in depth any book in the library, but sooner or later you notice that they don't really understand what they are talking about.
One easy test for AI-ness is the optimization problem. Give it a relatively small, but complex program, e.g. a GPU shader on shadertoy.com, and tell it to optimize it. The output is clearly defined: it's an image or an animation. It's also easy to test how much it's improved the framerate. What's good is this task won't allow the typical LLM bullshitting: if it doesn't compile or doesn't draw a correct image, you'll see it.
The thing is, the current generation of LLMs will blunder at this task.
kleton
Most of these are applicable to the current top models, but he frequently references Claude Sonnet, which is not even above the fold on the leaderboard.
lukev
This is exceptionally useful advice, and precisely the way we should be talking about how to engage with LLMs when coding.
That said, I take issue with "Use Static Types".
I've actually had more success with Claude Code using Clojure than I have with TypeScript (the other thing I tried).
Clojure emphasizes small, pure functions, to a high degree. Whereas (sometimes) fully understanding a strong type might involve reading several files. If I'm really good with my prompting to make sure that I have good example data for the entity types at each boundary point, it feels like it does a better job.
My intuition is that LLMs are fundamentally context-based, so they are naturally suited to an emphasis on functions over pure data, vs requiring understanding of a larger type/class hierarchy to perform well.
But it took me a while to figure out how to build these prompts and agent rules. An LLM programming in a dynamic language without a human supervising the high-level code structure and data model is a recipe for disaster.
oglop
I just talk to an LLM like it's a person who is smart, meaning I expect it to be confidently wrong now and then, but I don't have to worry about hurting its feelings. They are remarkably similar to people, though others seem not to think so, so maybe it's a case of some people finding them easier to work with than others do. I wonder what drives that. Maybe it's the difference between a person who thinks life unfolds before them and a person who views life as a bundle, with each day a fold making up your experience; through this stack you discern the structure which is your life, which sure seems like how these things work.
torginus
I have one more: LLMs are terrible at counting and arithmetic. If your generated code relies on cutting off the first two words of a constant string, you'd better check whether you really need to cut off 12 characters like the LLM says. If it adds 2 numbers, it might be suspect. If you need it to decode a byte sequence, where getting the numbers from the exact right position is necessary... you get the idea.
Took me a day to debug my LLM-generated code, and of course, like all long and fruitless debugging sessions, this one started with me assuming that it couldn't possibly get this wrong. Yet it did.
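A tiny illustration of that failure mode (made-up string, not the commenter's code): hard-coding the character count the model claims is fragile, while expressing the intent and letting the code count is not.

```python
s = "The quick brown fox jumps over the lazy dog"

# The LLM's hard-coded count: "cut off the first two words" became s[12:],
# which is off by two here and silently returns garbage.
print(s[12:])                    # "own fox jumps over the lazy dog"

# Safer: express the intent and let the code do the counting.
print(" ".join(s.split()[2:]))   # "brown fox jumps over the lazy dog"
```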