AI Blindspots
Blindspots in LLMs I've noticed while AI coding. Sonnet family emphasis. Maybe I will eventually suggest Cursor rules for these problems.
- Stop Digging
- Black Box Testing
- Preparatory Refactoring
- Stateless Tools
- Bulldozer Method
- Requirements, not Solutions
- Use Automatic Code Formatting
- Keep Files Small
- Read the Docs
- Walking Skeleton
- Use Static Types
- Use MCP Servers
- Mise en Place
- Respect the Spec
- Memento
- Scientific Debugging
- The tail wagging the dog
- Know Your Limits
- Culture Eats Strategy
- Rule of Three
This site is made with Hugo ʕ•ᴥ•ʔ Bear.
19 Comments
ezyang
Hi Hacker News! One of the things about this blog that has gotten a bit unwieldy as I've added more entries is that it's a sort of undifferentiated pile of posts. I want some sort of organization system but I haven't found one that's good. Very open to suggestions!
datadrivenangel
Almost all of these are good things to consider with human coders as well. Product managers take note!
https://ezyang.github.io/ai-blindspots/requirements-not-solu…
fizx
The community seems rather divided as to whether these are intrinsic, or whether we can solve them with today's tech plus more training, heuristics, and workarounds.
mystified5016
Recently I've been writing a resume/hire-me website. I'm not a stellar writer, but I'm alright, so I've been asking various LLMs to review it by just dropping the HTML file in.
Every single one has completely ignored the "Welcome to nginx!" header at the top of the page. I'd left it in, half as a joke to amuse myself, but I expected it would get some kind of reaction from the LLMs, even if just an "it seems you may have forgotten this line."
Kinda weird. I even tried guiding them into seeing it without explicitly mentioning it and I could not get a response.
antasvara
This highlights a thing I've seen with LLMs generally: they make different mistakes than humans. This makes catching the errors much more difficult.
What I mean by this is that we have thousands of years of experience catching human mistakes. As such, we're really good at designing systems that catch (or work around) human mistakes and biases.
LLMs, while impressive and sometimes less mistake-prone than humans, make errors in a fundamentally different manner. We just don't have the intuition and understanding of the way that LLMs "think" (in a broad sense of the word). As such, we have a hard time designing systems that account for this and catch the errors.
teraflop
> I had some test cases with hard coded numbers that had wobbled and needed updating. I simply asked the LLM to keep rerunning the test and updating the numbers as necessary.
Why not take this a step farther and incorporate this methodology directly into your test suite? Every time you push a code change, run the new version of the code and use it to automatically update the "expected" output. That way you never have to worry about failures at all!
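For the record, the auto-update pattern being (sarcastically) proposed looks roughly like this; a minimal pytest-style sketch with a hypothetical UPDATE_GOLDEN flag, after which the test can never really fail:

```python
import json
import os
from pathlib import Path

GOLDEN = Path("tests/golden/output.json")

def compute_output() -> dict:
    # Stand-in for the real code under test.
    return {"total": 42, "items": [1, 2, 3]}

def test_output_matches_golden():
    actual = compute_output()
    # Rewriting the "expected" file whenever it disagrees blesses every
    # behavior change automatically, which is exactly the joke: the test
    # can no longer catch a regression.
    if os.environ.get("UPDATE_GOLDEN") == "1" or not GOLDEN.exists():
        GOLDEN.parent.mkdir(parents=True, exist_ok=True)
        GOLDEN.write_text(json.dumps(actual, indent=2))
    expected = json.loads(GOLDEN.read_text())
    assert actual == expected
```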
Mc91
One thing I do is go to Leetcode, see the optimal big O time and space solutions, then give the LLM the Leetcode medium/hard problem, and limit it to the optimal big O time/space solution and suggest the method (bidirectional BFS). I ask for the solution in some fairly mainstream modern language (although not Javascript, Java or Python). I also say to do it as compact as possible. Sometimes I reiterate that.
It's just a function usually, but it does not always compile. I'd set this as a low bar for programming. We haven't even gotten into classes, architecture, badly-defined specifications and so on.
LLMs are useful for programming, but I'd want them to clear this low hurdle first.
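The kind of task being described might look like LeetCode's Word Ladder problem; below is a minimal sketch of an optimal bidirectional-BFS solution, written in Python purely for illustration (the commenter asks the LLM for other languages):

```python
from string import ascii_lowercase

def ladder_length(begin: str, end: str, word_list: list[str]) -> int:
    """Shortest transformation sequence length, or 0 if none exists."""
    words = set(word_list)
    if end not in words:
        return 0
    # Bidirectional BFS: always expand the smaller frontier, and stop
    # as soon as the two frontiers meet.
    front, back, steps = {begin}, {end}, 1
    while front and back:
        if len(front) > len(back):
            front, back = back, front
        steps += 1
        next_front = set()
        for word in front:
            for i, ch in enumerate(word):
                for c in ascii_lowercase:
                    if c == ch:
                        continue
                    cand = word[:i] + c + word[i + 1:]
                    if cand in back:          # frontiers met
                        return steps
                    if cand in words:
                        words.discard(cand)   # mark visited
                        next_front.add(cand)
        front = next_front
    return 0

# ladder_length("hit", "cog", ["hot","dot","dog","lot","log","cog"]) == 5
```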
logicchains
I found Gemini Flash Thinking Experimental is almost unusable in an agent workflow because it'll eventually accidentally remove a closing bracket, breaking compilation, and be unable to identify and fix the issue even with many attempts. Maybe it has trouble counting/matching braces due to fewer layers?
taberiand
Based on the list, LLMs are at a "very smart junior programmer" level of coding – though with a much broader knowledge base than you'd expect from even a senior. They lack bigger-picture thinking, and default to doing what is asked of them instead of what needs to be done.
I expect the models will continue improving though, I feel like most of it comes down to the ephemeral nature of their context window / the ability to recall and attach relevant information to the working context when prompted.
dataviz1000
Are you using Cursor? I'm using GitHub Copilot in VS Code and I'm wondering whether I'd get more efficiency from a different coding assistant.
boredtofears
Great read, I can definitely confirm a lot of these myself. Would be nice to see this aggregated into some kind of "best practices" document (although hard to say how quickly it'd be out of date).
submeta
> Preparatory refactoring
> Current LLMs, without a plan that says they should refactor first, don’t decompose changes in this way. They will try to do everything at once.
Just today I learned this the hard way. I had created an app for my spouse and myself for sharing and reading news articles, some of them behind paywalls.
Using Cursor, I have a FastAPI backend and a React frontend. When I tasked Cursor with adding extraction of the article text as markdown and then summarizing it, both using OpenAI, the chaos began. Cursor (with the help of Claude 3.7) tackled everything at once and then some. It started writing a module for using OpenAI, then it also changed the frontend to show not only the title and URL but also the extracted markdown and the summary; in doing so it screwed up my UI, deleted some rows in my database, and came up with a module for interacting with OpenAI that did not work. The extraction was screwed, the summary as well.
All of this despite me having detailed cursorrules.
That's when I realized: divide and conquer. Ask it to write one function that works, then one class where the function becomes a method, test it, then move on to the next function, until every piece is working and I can glue them together.
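A rough sketch of that progression, using the openai v1 Python SDK; the names (summarize, ArticleProcessor, the model string) are hypothetical illustrations, not the commenter's actual code:

```python
from openai import OpenAI  # assumes the openai v1 Python SDK

client = OpenAI()

# Step 1: one standalone function, tested on its own before anything else.
def summarize(markdown: str, model: str = "gpt-4o-mini") -> str:
    """Return a short summary of an article given as markdown."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Summarize this article in 3 sentences:\n\n{markdown}"}],
    )
    return resp.choices[0].message.content

# Step 2: only once that works, wrap it in a class the FastAPI layer can use.
class ArticleProcessor:
    def __init__(self, model: str = "gpt-4o-mini") -> None:
        self.model = model

    def summarize(self, markdown: str) -> str:
        return summarize(markdown, model=self.model)
```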
colonCapitalDee
> Preparatory Refactoring says that you should first refactor to make a change easy, and then make the change. The refactor change can be quite involved, but because it is semantics preserving, it is easier to evaluate than the change itself.
> In human software engineering, a common antipattern when trying to figure out what to do is to jump straight to proposing solutions, without forcing everyone to clearly articulate what all the requirements are. Often, your problem space is constrained enough that once you write down all of the requirements, the solution is uniquely determined; without the requirements, it’s easy to devolve into a haze of arguing over particular solutions.
> When you’re learning to use a new framework or library, simple uses of the software can be done just by copy pasting code from tutorials and tweaking them as necessary. But at some point, it’s a good idea to just slog through reading the docs from top-to-bottom, to get a full understanding of what is and is not possible in the software.
> The Walking Skeleton is the minimum, crappy implementation of an end-to-end system that has all of the pieces you need. The point is to get the end-to-end system working first, and only then start improving the various pieces.
> When there is a bug, there are broadly two ways you can try to fix it. One way is to randomly try things based on vibes and hope you get lucky. The other is to systematically examine your assumptions about how the system works and figure out where reality mismatches your expectations.
> The Rule of Three in software says that you should be willing to duplicate a piece of code once, but on the third copy you should refactor. This is a refinement on DRY (Don’t Repeat Yourself) accounting for the fact that it might not necessarily be obvious how to eliminate a duplication, and waiting until the third occurrence might clarify.
These are lessons that I've learned the hard way (for some definition of "learned"; these things are simple but not easy), but I've never seen them phrased so succinctly and accurately before. Well done OP!
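To make the last of those quotes, the Rule of Three, concrete, here is a minimal Python sketch with made-up names:

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    price: float
    qty: int

# The first and second near-identical copies were tolerated:
#   order_total(items) = sum(i.price * i.qty for i in items) * 1.08
#   quote_total(items) = sum(i.price * i.qty for i in items) * 1.08
# A third caller (invoices) is the signal to extract the helper, now
# that it is clear what varies (only the tax rate) and what does not.
def total_with_tax(items: list[LineItem], tax_rate: float = 0.08) -> float:
    return sum(i.price * i.qty for i in items) * (1 + tax_rate)
```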
admiralrohan
Even in the age of vibe coding, I always try to learn as much as possible.
For example, yesterday I was working with the animation library Motion, which I had never worked with before. I used the code suggested by the AI, but at least picked up 2-3 basic animation concepts while reviewing the code.
It's the kind of unfocused, passive learning I've always tried, even before AI.
akomtu
LLMs aren't AI. They are more like librarians with eidetic memory: they can discuss in depth any book in the library, but sooner or later you notice that they don't really understand what they are talking about.
One easy test for AI-ness is the optimization problem. Give it a relatively small, but complex program, e.g. a GPU shader on shadertoy.com, and tell it to optimize it. The output is clearly defined: it's an image or an animation. It's also easy to test how much it's improved the framerate. What's good is this task won't allow the typical LLM bullshitting: if it doesn't compile or doesn't draw a correct image, you'll see it.
The thing is, the current generation of LLMs will blunder at this task.
kleton
Most of these are applicable to the current top models, but he frequently references Claude Sonnet, which is not even above the fold on the leaderboard.
lukev
This is exceptionally useful advice, and precisely the way we should be talking about how to engage with LLMs when coding.
That said, I take issue with "Use Static Types".
I've actually had more success with Claude Code using Clojure than I have with TypeScript (the other thing I tried).
Clojure emphasizes small, pure functions, to a high degree. Whereas (sometimes) fully understanding a strong type might involve reading several files. If I'm really good with my prompting to make sure that I have good example data for the entity types at each boundary point, it feels like it does a better job.
My intuition is that LLMs are fundamentally context-based, so they are naturally suited to an emphasis on functions over pure data, vs requiring understanding of a larger type/class hierarchy to perform well.
But it took me a while to figure out how to build these prompts and agent rules. An LLM programming in a dynamic language without a human supervising the high-level code structure and data model is a recipe for disaster.
oglop
I just talk to an LLM like it's a person who is smart, meaning I expect it to be confidently wrong now and then, but I don't have to worry about hurting its feelings. They are remarkably similar to people, though others seem not to think so, so maybe it's a case of some people finding them easier to work with than others do. I wonder what drives that. Maybe it's the difference between a person who thinks life unfolds before them and a person who views life as a bundle, with each day a fold making up your experience; through this stack you discern the structure which is your life, which sure seems like how these things work.
torginus
I have one more: LLMs are terrible at counting and arithmetic. If your generated code relies on cutting off the first two words of a constant string, you'd better check whether you really need to cut off 12 characters like the LLM says. If it adds 2 numbers, it might be suspect. If you need it to decode a byte sequence, where getting the numbers from the exact right position is necessary... you get the idea.
Took me a day to debug my LLM-generated code, and of course, like all long and fruitless debugging sessions, this one started with me assuming that it couldn't possibly get this wrong. Yet it did.
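A tiny illustration of that failure mode (made-up string, not the commenter's code): hard-coding the character count the model claims is fragile, while expressing the intent and letting the code count is not.

```python
s = "The quick brown fox jumps over the lazy dog"

# The LLM's hard-coded count: "cut off the first two words" became s[12:],
# which is off by two here and silently returns garbage.
print(s[12:])                    # "own fox jumps over the lazy dog"

# Safer: express the intent and let the code do the counting.
print(" ".join(s.split()[2:]))   # "brown fox jumps over the lazy dog"
```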