
As an experienced LLM user, I don’t use generative LLMs often
by minimaxir
Lately, I’ve been working on codifying a personal ethics statement about my stances on generative AI, as I have been very critical of several aspects of modern GenAI and yet I participate in it. While working on that statement, I’ve been introspecting on how I myself have been utilizing large language models, both for my professional work as a Senior Data Scientist at BuzzFeed and for my personal work blogging and writing open-source software. For about a decade, I’ve been researching and developing tooling around text generation, from char-rnns, to the ability to fine-tune GPT-2, to experiments with GPT-3, and even more experiments with ChatGPT and other LLM APIs. Although I don’t claim to be the best user of modern LLMs out there, I’ve had plenty of experience working against the cons of next-token predictor models and have become very good at finding the pros.
It turns out, to my surprise, that I don’t use them nearly as often as people think engineers do, but that doesn’t mean LLMs are useless for me. It’s a discussion that requires case-by-case nuance.
How I Interface With LLMs
Over the years I’ve utilized all the tricks to get the best results out of LLMs. The most famous trick is prompt engineering, the art of phrasing the prompt in a specific manner to coach the model into generating a specific, constrained output. Additions to prompts such as offering financial incentives to the LLM or simply telling the LLM to make its output better do indeed have a quantifiable positive impact on both adherence to the original prompt and the quality of the output text. Whenever my coworkers ask me why their LLM output is not what they expected, I suggest that they apply more prompt engineering, and it almost always fixes their issues.
No one in the AI field is happy about prompt engineering, especially myself. Attempts to remove the need for prompt engineering with more robust RLHF paradigms have only made it even more rewarding by allowing LLM developers to make use of better prompt adherence. True, “Prompt Engineer” as a job title turned out to be a meme but that’s mostly because prompt engineering is now an expected skill for anyone seriously using LLMs. Prompt engineering works, and part of being a professional is using what works even if it’s silly.
To that end, I never use ChatGPT.com or other normal-person frontends for accessing LLMs because they are harder to control. Instead, I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality and also make it easy to port to code if necessary. Accessing LLM APIs like the ChatGPT API directly allows you to set system prompts that control the “rules” for the generation, and those rules can be very nuanced. Specifying constraints for the generated text such as “keep it to no more than 30 words” or “never use the word ‘delve’” tends to be more effective in the system prompt than putting them in the user prompt as you would with ChatGPT.com. Any modern LLM interface that does not let you explicitly set a system prompt is most likely using its own system prompt which you can’t control: for example, when ChatGPT.com had an issue where it was too sycophantic to its users, OpenAI changed the system prompt to command ChatGPT to “avoid ungrounded or sycophantic flattery.” I tend to use Anthropic Claude’s API, Claude Sonnet in particular, more than any ChatGPT variant because Claude anecdotally is less “robotic” and also handles coding questions much more accurately.
Additionally, with the APIs you can control the “temperature” of the generation, which at a high level controls the creativity of the output. LLMs by default do not select the next token with the highest probability, in order to allow different outputs for each generation, so I prefer to set the temperature to 0.0 so that the output is mostly deterministic, or 0.2 - 0.3 if some light variance is required. Modern LLMs now use a default temperature of 1.0, and I theorize that this higher value is accentuating LLM hallucination issues, where the text outputs are internally consistent but factually wrong.
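As a minimal sketch of what this looks like in practice, the call below goes through the anthropic Python SDK with an explicit system prompt and a temperature of 0.0; the model name and prompts are illustrative placeholders, not values from this post.

```python
# A minimal sketch of a direct API call with a system prompt and temperature,
# assuming the anthropic Python SDK. Model name and prompts are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model alias; use whichever Sonnet you have access to
    max_tokens=256,
    temperature=0.0,  # mostly deterministic output
    system="Keep the response to no more than 30 words. Never use the word 'delve'.",
    messages=[{"role": "user", "content": "Explain why system prompts are useful."}],
)

print(response.content[0].text)
```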
LLMs for Professional Problem Solving!
With that pretext, I can now talk about how I have used generative LLMs over the past couple years at BuzzFeed. Here are outlines of some (out of many) projects I’ve worked on using LLMs to successfully solve problems quickly:
- BuzzFeed site curators developed a new hierarchical taxonomy to organize thousands of articles into a specified category and subcategory. Since we had no existing labeled articles to train a traditional multiclass classification model to predict these new labels, I wrote a script to hit the Claude Sonnet API with a system prompt saying "The following is a taxonomy: return the category and subcategory that best matches the article the user provides." plus the JSON-formatted hierarchical taxonomy, then provided the article metadata as the user prompt, all with a temperature of 0.0 for the most precise results. Running this in a loop for all the articles resulted in appropriate labels (a rough sketch of this loop follows the list).
- After identifying hundreds of distinct semantic clusters of BuzzFeed articles using data science shenanigans, it became clear that there wasn't an easy way to give each one a unique label. I wrote another script to hit the Claude Sonnet API with a system prompt saying "Return a JSON-formatted title and description that applies to all the articles the user provides." with the user prompt containing five articles from that cluster: again, running the script in a loop for all clusters provided excellent results.
- One BuzzFeed writer asked if there was a way to use an LLM to sanity-check grammar questions such as "should I use an em dash here?" against the BuzzFeed style guide. Once again I hit the Claude Sonnet API, this time copy/pasting the full style guide into the system prompt plus a command to "Reference the provided style guide to answer the user's question, and cite the exact rules used to answer the question." In testing, the citations were accurate and present in the source input, and the reasonings were consistent.
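For concreteness, here is a rough sketch of what the first project's loop might look like, assuming the anthropic Python SDK; the taxonomy, article metadata, and model name are made-up placeholders rather than BuzzFeed's actual data.

```python
# Hypothetical sketch of the taxonomy-labeling loop, assuming the anthropic SDK.
# The taxonomy and article metadata below are invented for illustration.
import json
import anthropic

client = anthropic.Anthropic()

taxonomy = {
    "Food": ["Recipes", "Restaurants"],
    "Entertainment": ["TV", "Movies"],
}

system_prompt = (
    "The following is a taxonomy: return the category and subcategory that "
    "best matches the article the user provides.\n\n"
    + json.dumps(taxonomy, indent=2)
)

def label_article(article_metadata: str) -> str:
    """Return the model's category/subcategory label for one article."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model alias
        max_tokens=100,
        temperature=0.0,  # precise, mostly deterministic labels
        system=system_prompt,
        messages=[{"role": "user", "content": article_metadata}],
    )
    return response.content[0].text

articles = ["Title: 30 Easy Weeknight Dinners | Tags: cooking, dinner"]
for article in articles:
    print(label_article(article))
```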
Each of these projects was an off-hand idea pitched in a morning standup or a Slack DM, and yet each one only took an hour or two to complete a proof of concept (including testing) and hand off.
23 Comments
rfonseca
This was an interesting quote from the blog post: "There is one silly technique I discovered to allow a LLM to improve my writing without having it do my writing: feed it the text of my mostly-complete blog post, and ask the LLM to pretend to be a cynical Hacker News commenter and write five distinct comments based on the blog post."
Jerry2
> I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality
Hey Max, do you use a custom wrapper to interface with the API or is there some already established client you like to use?
If anyone else has a suggestion please let me know too.
andy99
Re vibe coding, I agree with your comments, but where I've used it is when I needed to mock up a UI or a website. I have no front-end experience, so making an 80% (probably more like 20%) complete but live demo is still a valuable thing to show to others to get the point across, obviously not to deploy. It's a replacement for drawing a picture of what I think the UI should look like. I feel like this is an under-appreciated use. LLM coding is not remotely ready for real products, but it's great for mock-ups that further internal discussions.
Oras
A JSON response doesn't always work as expected unless you only have a few items to return. In Max's example it's classification.
For anyone trying to return consistent JSON, check out structured output, where you define a JSON schema with required fields; that returns the same structure every time.
I have tested it with high success using GPT-4o-mini.
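A minimal sketch of that approach with the openai Python SDK's structured-output support; the schema, prompts, and model choice below are invented for illustration, not taken from the article.

```python
# Sketch of structured output enforced by a JSON schema, assuming the openai Python SDK.
# The schema, prompts, and model choice are illustrative placeholders.
from pydantic import BaseModel
from openai import OpenAI

class ArticleLabel(BaseModel):
    category: str
    subcategory: str

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify the article the user provides."},
        {"role": "user", "content": "30 Easy Weeknight Dinners"},
    ],
    response_format=ArticleLabel,  # schema is enforced, so required fields are always present
)

label = completion.choices[0].message.parsed
print(label.category, label.subcategory)
```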
behnamoh
> I never use ChatGPT.com or other normal-person frontends for accessing LLMs because they are harder to control. Instead, I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality which also makes it easy to port to code if necessary.
Yes, I also often use the "studio" of each LLM for better results, because in my experience OpenAI "nerfs" models in the ChatGPT UI: models keep forgetting things (probably a limited context length set by OpenAI to reduce costs), and the model is generally less chatty (again, probably to reduce their costs). But I've noticed Gemini 2.5 Pro is the same in the studio and the Gemini app.
> Any modern LLM interface that does not let you explicitly set a system prompt is most likely using their own system prompt which you can’t control: for example, when ChatGPT.com had an issue where…
ChatGPT does have system prompts but Claude doesn't (one of its many, many UI shortcomings which Anthropic never addressed).
That said, I've found system prompts less and less useful with newer models. I can simply preface my own prompt with the instructions and the model follows them very well.
> Specifying specific constraints for the generated text such as “keep it to no more than 30 words” or “never use the word ‘delve’” tends to be more effective in the system prompt than putting them in the user prompt as you would with ChatGPT.com.
I get that LLMs have a vague idea of how many words 30 words is, but they never do a good job on these tasks for me.
ttoinou
Side topic: I haven't seen a serious article about prompt engineering for senior software development pop up on HN, yet a lot of users here have their own techniques they don't share with others.
thefourthchime
[flagged]
Legend2440
>Discourse about LLMs and their role in society has become bifurcated enough such that making the extremely neutral statement that LLMs have some uses is enough to justify a barrage of harassment.
Honestly true and I’m sick of it.
A very vocal group of people are convinced AI is a scheme by the evil capitalists to make you train your own replacement. The discussion gets very emotional very quickly because they feel personally threatened by the possibility that AI is actually useful.
tptacek
There's a thru-line to commentary from experienced programmers on working with LLMs, and it's confusing to me:
> Although pandas is the standard for manipulating tabular data in Python and has been around since 2008, I’ve been using the relatively new polars library exclusively, and I’ve noticed that LLMs tend to hallucinate polars functions as if they were pandas functions which requires documentation deep dives to confirm which became annoying.
The post does later touch on coding agents (Max doesn't use them because "they're distracting", which, as a person who can't even stand autocomplete, is a position I'm sympathetic to), but still: coding agents solve the core problem he just described. "Raw" LLMs set loose on coding tasks throwing code onto a blank page hallucinate stuff. But agenty LLM configurations aren't just the LLM; they're also code that structures the LLM interactions. When the LLM behind a coding agent hallucinates a function, the program doesn't compile, the agent notices it, and the LLM iterates. You don't even notice it's happening unless you're watching very carefully.
lxe
This article reads like "I'm not like other LLM users" tech writing. There are good points about when LLMs are actually useful vs. overhyped, but the contrarian framing undermines what could have been straightforward practical advice. The whole "I'm more discerning than everyone else" positioning gets tiresome in tech discussions, especially when the actual content is useful.
qoez
I've tried it out a ton but the only thing I end up using it for these days is teaching me new things (which I largely implement myself; it can rarely one-shot it anyway). Or occasionally to make short throwaway scripts to do like file handling or ffmpeg.
danbrooks
As a data scientist, this mirrors my experience. Prompt engineering is surprisingly important for getting the expected output, and LLM POCs have quick turnaround times.
Snuggly73
Emmm… why has Claude 'improved' the code by setting SQLite to be threadsafe and then adding locks on every db operation? (You can argue that maybe the callbacks are invoked from multiple threads, but they are not thread safe themselves).
iambateman
> "feed it the text of my mostly-complete blog post, and ask the LLM to pretend to be a cynical Hacker News commenter and write five distinct comments based on the blog post."
It feels weird to write something positive here…given the context…but this is a great idea. ;)
justlikereddit
[flagged]
simonw
> However, for more complex code questions particularly around less popular libraries which have fewer code examples scraped from Stack Overflow and GitHub, I am more cautious of the LLM’s outputs.
That's changed for me in the past couple of months. I've been using the ChatGPT interface to o3 and o4-mini for a bunch of code questions against more recent libraries and finding that they're surprisingly good at using their search tool to look up new details. Best version of that so far:
"This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it."
This actually worked! https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la…
The other trick I've been using a lot is pasting the documentation or even the entire codebase of a new library directly into a long context model as part of my prompt. This works great for any library under about 50,000 tokens total – more than that and you usually have to manually select the most relevant pieces, though Gemini 2.5 Pro can crunch through hundreds of thousands of tokens pretty well without getting distracted.
Here's an example of that from yesterday: https://simonwillison.net/2025/May/5/llm-video-frames/#how-i…
Beijinger
"To that end, I never use ChatGPT.com or other normal-person frontends for accessing LLMs because they are harder to control. Instead, I typically access the backend UIs provided by each LLM service, which serve as a light wrapper over the API functionality which also makes it easy to port to code if necessary."
How do you do this? Do you have to be on a paid plan for this?
morgengold
… but when I do, I let it write regex, SQL commands, simple/complex if else stuff, apply tailwind classes, feed it my console log errors, propose frontend designs … and other little stuff. Saves brain power for the complex problems.
jrflowers
[flagged]
gcp123
While I think the title is misleading/clickbaity (no surprise given the BuzzFeed connection), I'll say that the substance of the article might be one of the most honest takes on LLMs I've seen from someone who actually works in the field. The author describes exactly how I use LLMs – strategically, for specific tasks where they add value, not as a replacement for actual thinking.
What resonated most was the distinction between knowing when to force the square peg through the round hole vs. when precision matters. I've found LLMs incredibly useful for generating regex (who hasn't?) and solving specific coding problems with unusual constraints, but nearly useless for my data visualization work.
The part about using Claude to generate simulated HN criticism of drafts is brilliant – getting perspective without the usual "this is amazing!" LLM nonsense. That's the kind of creative tool use that actually leverages what these models are good at.
I'm skeptical about the author's optimism regarding open-source models though. While Qwen3 and DeepSeek are impressive, the infrastructure costs for running these at scale remain prohibitive for most use cases. The economics still don't work.
What's refreshing is how the author avoids both the "AGI will replace us all" hysteria and the "LLMs are useless toys" dismissiveness. They're just tools – sometimes useful, sometimes not, always imperfect.
geor9e
>Ridiculous headline implying the existence of non-generative LLMs
>Baited into clicking
>Article about generative LLMs
>It's a buzzfeed employee
ziml77
I like that the author included the chat logs. I know there's a lot of times where people can't share them because they'd expose too much info, but I really think it's important when people make big claims about what they've gotten an LLM to do that they back it up.
resource_waste
[flagged]