A new paper released on Friday is making waves in the AI community, not because of the model
it describes, but because it shows how close we are to some very large breakthroughs in AI. The model
is just below state of the art, but it can run on my laptop. More important, it sheds light on how all
this stuff works, and it’s not complicated.
OpenAI was the first to claim inference-time scaling laws: basically, an LLM can get higher performance if it can “think” longer before answering. But, like, how do you do it? How do you make it think longer?
OpenAI and R1 had cool graphs showing performance scaling with average thinking time (this from
the s1 paper):
But how do they control the length of an LLM response? Everyone skipped over that part, but s1
shows us details, and it is fun.
Context: When an LLM “thinks” at inference time, it puts its thoughts inside <think> and </think> XML tags. Once it gets past the end tag, the model is taught to change voice into a confident and authoritative tone for the final answer.
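For illustration (the exact tag convention varies by model; this example is mine, not from the paper), a raw response might look like:

    <think>
    The user asked for 17 * 24. 17 * 20 = 340, 17 * 4 = 68, so 340 + 68 = 408.
    Double-check: 24 * 17 = 240 + 168 = 408. Good.
    </think>
    The answer is 408.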
In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait". It'll then begin to second-guess and double-check its answer. They do this to trim or extend thinking time (trimming is just abruptly inserting "</think>").
It’s really dumb, I love it. It feels like the kind of hack I would try.
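For concreteness, here's a minimal sketch of that kind of budget forcing at inference time. This is my illustration of the trick, not the s1 authors' code; the model name, the <think>/</think> convention, and plain greedy decoding are all assumptions.

    # Sketch of "budget forcing": swap </think> for "Wait" to extend thinking.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # small stand-in; s1 fine-tunes Qwen2.5-32B
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)

    @torch.no_grad()
    def generate_until(text, stop, max_new_tokens=512):
        """Greedily decode until `stop` appears (or the token budget runs out)."""
        ids = tok(text, return_tensors="pt").input_ids
        out = ""
        for _ in range(max_new_tokens):
            next_id = model(ids).logits[0, -1].argmax()
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
            out += tok.decode([next_id.item()])
            if out.endswith(stop):
                return out[: -len(stop)]  # drop the stop string itself
        return out

    def answer_with_forced_thinking(question, extra_rounds=2):
        text = question + "\n<think>\n"
        for _ in range(extra_rounds):
            text += generate_until(text, stop="</think>")
            text += "Wait"  # model tried to stop thinking; force it to keep going
        text += generate_until(text, stop="</think>") + "</think>\n"
        text += generate_until(text, stop=tok.eos_token)  # the final answer
        return text

    print(answer_with_forced_thinking("What is 17 * 24?"))

Trimming is the same move in reverse: once you hit your token budget, append "</think>" yourself and let the model switch into answer mode.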
So for o3-mini-low versus o3-mini-high, that's likely how they do it. They probably trained 3 models, each with a different average thinking time (as measured during training). Eventually the training process begins to encode that behavior into the model weights.
The Entropix Tie In
The trick is so dumb you can do it at inference time too. I’m kicking myself for not understanding
this earlier, because it’s what entropix is all about, and I wrote a lot about entropix.
In entropix, they look at the entropy & varentropy of the logits (and attention) to change how the tokens are selected. In fact, they used tokens like “Wait” to force the model to keep thinking when it looked uncertain.
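Roughly, that looks like the following sketch (my paraphrase of the idea, not the actual entropix code; the thresholds and the injected “Wait” token are illustrative):

    # Entropix-style token selection: measure uncertainty in the next-token
    # logits and inject a "Wait"-like token when it is high.
    import torch
    import torch.nn.functional as F

    def entropy_varentropy(logits):
        """Entropy and varentropy (variance of surprisal) of a next-token distribution."""
        logp = F.log_softmax(logits, dim=-1)
        p = logp.exp()
        ent = -(p * logp).sum(-1)                                # H = E[-log p]
        varent = (p * (-logp - ent.unsqueeze(-1)) ** 2).sum(-1)  # Var[-log p]
        return ent, varent

    def pick_next_token(logits, wait_token_id, ent_thresh=3.0, varent_thresh=3.0):
        ent, varent = entropy_varentropy(logits)
        if ent > ent_thresh and varent > varent_thresh:
            return wait_token_id     # model looks confused: nudge it to reconsider
        return logits.argmax(-1)     # otherwise, plain greedy selection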
6 Comments
bberenberg
In case you’re not sure what S1 is, here is the original paper: https://arxiv.org/html/2501.19393v1
ttyprintk
https://huggingface.co/simplescaling
yapyap
> If you believe that AI development is a prime national security advantage, then you absolutely should want even more money poured into AI development, to make it go even faster.
This, this is the problem for me with people deep in AI. They think it’s the end-all be-all for everything. They have the vision of the ‘AI’ they’ve seen in movies in mind; they see the current ‘AI’ being used, and to them it’s basically almost the same, their brain mentally bridging the concepts and saying it’s only a matter of time.
To me, that’s stupid. I observe the more populist and socially appealing CEOs of these VC startups (Sam Altman being the biggest, of course) just straight up lying to the masses for financial gain.
Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.
And the enthusiasm of the people who are into it feeds into those who aren’t aware of it in the slightest: they see you can chat with a ‘robot’, they hear all this hype from their peers, and they buy into it. We are social creatures after all.
I think using any of this in a national security setting is stupid, wasteful and very, very insecure.
Hell, if you really care about being ahead, pour 500 billion dollars into quantum computing so you can try to break current encryption. That’ll get you so much further than this nonsensical bs.
GTP
Sorry for being lazy, but I just don't have the time right now to read the paper. Is there, in the paper or somewhere else, a benchmark comparison of S1 vs R1 (the full R1, not quantized or distilled)?
swiftcoder
> having 10,000 H100s just means that you can do 625 times more experiments than s1 did
I think the ball is very much in their court to demonstrate they actually are using their massive compute in such a productive fashion. My BigTech experience would tend to suggest that frugality went out the window the day the valuation took off, and they are in fact just burning compute for little gain, because why not…
HenryBemis
> Going forward, it’ll be nearly impossible to prevent distealing (unauthorized distilling). One thousand examples is definitely within the range of what a single person might do in normal usage, no less ten or a hundred people. I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.
(sorry for the long quote)
I will say (naively perhaps) "oh but that is fairly simple". For any API request, add a 5-second delay before the next one for 'unverified' users. Make a "blue check" tier (a-la X/Twitter). For the 'big sales', have a third-party vetting process so that if US Corporation XYZ wants access, they prove themselves worthy/not a Chinese competitor, and then you give them the 1000/min deal.
For everyone else, add the 5-second (or whatever other duration makes sense) timer/overhead and then see them drop from 1000 requests per minute to 500 per day. Or just cap them at 500 per day and close that back-door. And if you get 'many cheap accounts' doing hand-overs (AccountA does 1-500, AccountB does 501-1000, AccountC does 1001-1500, and so on) then you mass block them.
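A minimal sketch of that kind of tiered throttle (illustrative only; the numbers are the commenter's where given and made up otherwise):

    # Per-account rate limiting: a hard daily cap plus a per-request delay,
    # varying by verification tier. Daily counter reset is omitted for brevity.
    import time
    from collections import defaultdict

    DAILY_CAP = {"unverified": 500, "blue_check": 10_000, "vetted": None}
    DELAY_SECONDS = {"unverified": 5, "blue_check": 0, "vetted": 0}

    requests_today = defaultdict(int)
    last_request_at = defaultdict(float)

    def allow_request(account_id, tier):
        """Return False if over the daily cap; otherwise enforce the delay and allow."""
        cap = DAILY_CAP[tier]
        if cap is not None and requests_today[account_id] >= cap:
            return False
        wait = DELAY_SECONDS[tier] - (time.time() - last_request_at[account_id])
        if wait > 0:
            time.sleep(wait)
        requests_today[account_id] += 1
        last_request_at[account_id] = time.time()
        return True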