Ask HN: Any insider takes on Yann LeCun’s push against current architectures? by vessenes

25 Comments

  • Post Author
    ActorNightly
    Posted March 10, 2025 at 8:26 pm

    Not an official ML researcher, but I do happen to understand this stuff.

    The problem with LLMs is that the output is inherently stochastic, i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.

    Energy minimization is more of an abstract approach in which you can use architectures that don't rely on things like differentiability. True AI won't be solely feedforward architectures like current LLMs. To give an answer, they will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or its parameters) at training time, you need something that doesn't rely on continuous values but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and optimize based on that. This is basically how search works with genetic algorithms or PSO.
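
    A minimal sketch of the kind of fitness-driven search described above (a toy genetic algorithm tuning a discrete, non-differentiable "program"; everything here is illustrative, not from any real system):

    ```python
    import random

    TARGET = [3, 1, 0, 2]  # illustrative behaviour we want the search to discover

    def fitness(table):
        # Reward correct outputs, penalize a stand-in for "compute cost".
        correct = sum(1 for a, b in zip(table, TARGET) if a == b)
        return correct - 0.01 * sum(table)

    def mutate(table):
        t = table[:]
        t[random.randrange(len(t))] = random.randrange(4)
        return t

    def crossover(a, b):
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    # No gradients anywhere: selection on a fitness score drives the search.
    pop = [[random.randrange(4) for _ in range(4)] for _ in range(30)]
    for _ in range(100):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:10]
        pop = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                         for _ in range(20)]

    best = max(pop, key=fitness)
    print(best, fitness(best))
    ```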

  • Post Author
    jawiggins
    Posted March 14, 2025 at 5:57 pm

    I'm not an ML researcher, but I do work in the field.

    My mental model of AI advancements is that of a step function with s-curves in each step [1]. Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, yielding rapid progress. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found. Examples of steps include AlexNet demonstrating superior image labeling, LeCun demonstrating deep learning, and now OpenAI demonstrating large transformer models.

    I think in the past, at each stage, people tended to think that the recent progress was a linear or exponential process that would continue forward. This led to people thinking self-driving cars were right around the corner after the introduction of DL in the 2010s, and that super-intelligence is right around the corner now. I think at each stage, the cusp of the S-curve comes as we find where the model is good enough to be deployed and where it isn't. Then companies tend to enter a holding pattern for a number of years, getting diminishing returns from small improvements on their models, until the next algorithmic breakthrough is made.

    Right now I would guess that we are around 0.9 on the S-curve: we can still improve LLMs (as DeepSeek has shown with wide MoE and o1/o3 have shown with CoT), and it will take a few years for the best uses to be brought to market and popularized. As you mentioned, LeCun points out that LLMs have a hallucination problem built into their architecture; others have pointed out that LLMs have had shockingly few revelations and breakthroughs for something that has ingested more knowledge than any living human. I think future work on LLMs is likely to make some improvement on these things, but not much.

    I don't know what it will be, but a new algorithm will be needed to induce the next step on the curve of AI advancement.

    [1]: https://www.open.edu/openlearn/nature-environment/organisati…

  • Post Author
    TrainedMonkey
    Posted March 14, 2025 at 6:03 pm

    This is a somewhat nihilistic take with an optimistic ending. I believe humans will never fix hallucinations. The amount of totally or partially untrue statements people make is significant; especially in tech, it's rare for people to admit that they do not know something. And yet, despite all of that, progress keeps marching forward and maybe even accelerating.

  • Post Author
    ALittleLight
    Posted March 14, 2025 at 6:05 pm

    I've never understood this critique. Models have the capability to say: "oh, I made a mistake here, let me change this" and that solves the issue, right?

    A little bit of engineering and fine tuning – you could imagine a model producing a sequence of statements, then reflecting on the sequence and emitting updates like "statement 7, modify: xzy to xyz".
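
    A rough sketch of that reflect-and-patch loop, where `llm` stands in for a hypothetical text-generation call (not any real API) and corrections come back as "statement <n>: <replacement>":

    ```python
    import re

    def llm(prompt: str) -> str:
        """Hypothetical model call; swap in whatever generation API you actually use."""
        raise NotImplementedError

    def answer_with_reflection(question: str, rounds: int = 2) -> list[str]:
        statements = llm(
            f"Answer step by step, one numbered statement per line:\n{question}"
        ).splitlines()
        for _ in range(rounds):
            critique = llm(
                "Review these statements. Reply OK, or list corrections as "
                "'statement <n>: <replacement>':\n" + "\n".join(statements)
            )
            if critique.strip().upper() == "OK":
                break
            for line in critique.splitlines():
                m = re.match(r"statement (\d+):\s*(.+)", line.strip(), re.IGNORECASE)
                if m and 1 <= int(m.group(1)) <= len(statements):
                    statements[int(m.group(1)) - 1] = m.group(2)  # apply the patch in place
        return statements
    ```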

  • Post Author
    killthebuddha
    Posted March 14, 2025 at 6:12 pm

    I've always felt like the argument is super flimsy because "of course we can _in theory_ do error correction". I've never seen even a semi-rigorous argument that error correction is _theoretically_ impossible. Do you have a link to somewhere where such an argument is made?

  • Post Author
    blueyes
    Posted March 14, 2025 at 6:17 pm

    Sincere question – why doesn't RL-based fine-tuning on top of LLMs solve this, or at least push accuracy above a minimum acceptable threshold in many use cases? OAI has a team doing this for enterprise clients. Several startups rolling out of the current YC batch are doing versions of this.

  • Post Author
    __rito__
    Posted March 14, 2025 at 6:23 pm

    Slightly related: Energy-Based Models (EBMs) are better in theory and yet too resource-intensive. I tried to sell using EBMs to my org, but the price for even a small use case was prohibitive.

    I learned it from: https://youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyo…

    Yann LeCun, and Michael Bronstein and his colleagues, have some similarities in trying to properly sciencify deep learning.

    Yann LeCun's approach, at least for vision, has one core tenet: energy minimization, just like in physics. In his course, he also shows some current architectures/algorithms to be special cases of EBMs.

    Yann believes that understanding the whys of the behavior of DL algorithms is going to be more beneficial in the long term than playing around with hyper-params.

    There is also a case for language being too low-dimensional to lead to AGI, even if it is solved. For example, in a recent video he said that the total amount of data in all digitized books and the internet is about the same as what a human child takes in during its first 4-5 years. He considers this low.

    There are also epistemological arguments that language alone cannot lead to AGI, but I haven't heard him talk about them.

    He also believes that vision is a more important aspect of intelligence, one reason being that it is very high-dimensional. (Edit) Consider an example. Take 4 monochrome pixels. Each pixel can range from 0 to 255, so 4 pixels can create 256^4 = 2^32 combinations. 4 words, by contrast, can be arranged in only 4! = 24 ways. Solving language is easier and therefore lower-stakes. Remember the monkey producing a Shakespeare play by randomly punching typewriter keys? If that number was astronomically big, think how obscenely long it would take a monkey to paint the Mona Lisa by randomly assigning pixel values. Left as an exercise to the reader.
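
    A quick check of the arithmetic (note the word count assumes orderings of a fixed set of 4 words, as in the example):

    ```python
    import math

    pixels = 256 ** 4            # 4 monochrome pixels, 256 intensity levels each
    words = math.factorial(4)    # orderings of a fixed set of 4 words

    print(f"{pixels:,} pixel combinations vs {words} word orderings")
    # 4,294,967,296 pixel combinations vs 24 word orderings
    ```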

    Juergen Schmidhuber has gone a lot quieter now. But he has also said that a world model explicitly included in training and reasoning is better than relying on text or images alone. He has a good paper with Lucas Beyer.

  • Post Author
    hnfong
    Posted March 14, 2025 at 6:29 pm

    I'm not an insider and I'm not sure whether this is directly related to "energy minimization", but "diffusion language models" have apparently gained some popularity in recent weeks.

    https://arxiv.org/abs/2502.09992

    https://www.inceptionlabs.ai/news

    (these are results from two different teams/orgs)

    It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.

  • Post Author
    estebarb
    Posted March 14, 2025 at 6:34 pm

    I have no idea about EBM, but I have researched a bit on the language modelling side. And let's be honest, GPT is not the best learner we can create right now (we ourselves are). GPT needs far more data and energy than a human, so clearly there is a better architecture somewhere waiting to be discovered.

    Attention works, yes. But it is not biologically plausible at all. We don't do quadratic comparisons across a whole book or need to see thousands of samples to understand.
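
    To make the quadratic-comparison point concrete, here is a bare-bones single-head attention sketch in NumPy; the n-by-n score matrix is the part that grows quadratically with sequence length:

    ```python
    import numpy as np

    def attention(Q, K, V):
        # Q, K, V: (n, d) arrays for a sequence of n tokens.
        scores = Q @ K.T / np.sqrt(K.shape[-1])           # (n, n): every token scored against every other
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
        return weights @ V                                # (n, d)

    n, d = 1024, 64
    x = np.random.randn(n, d)
    out = attention(x, x, x)
    print(out.shape, "score-matrix entries:", n * n)      # doubling n quadruples the comparisons
    ```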

    Personally I think that in the future recursive architectures and test time training will have a better chance long term than current full attention.

    Also, I think that OpenAI's biggest contribution is demonstrating that reasoning-like behaviors can emerge from really good language modelling.

  • Post Author
    probably_wrong
    Posted March 14, 2025 at 6:38 pm

    I haven't read Yann LeCun's take. Based on your description alone, my first impression would be: there's a paper [1] arguing that "beam search enforces uniform information density in text, a property motivated by cognitive science". UID claims, in short, that a speaker only delivers as much content as they think the listener can take (no more, no less), and the paper claims that beam search enforces this property at generation time.

    The paper would be a strong argument against your point: if neural architectures are already constraining the amount of information that a text generation system delivers the same way a human (allegedly) does, then I don't see which "energy" measure one could take that could perform any better.

    Then again, perhaps they have one in mind and I just haven't read it.

    [1] https://aclanthology.org/2020.emnlp-main.170/
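
    For concreteness, a minimal beam search sketch over a toy bigram model (illustrative only; not the paper's setup):

    ```python
    import math

    def beam_search(next_log_probs, start, steps, beam_width=2):
        """next_log_probs(tokens) -> {next_token: log_prob}; a hypothetical scoring function."""
        beams = [([start], 0.0)]
        for _ in range(steps):
            candidates = []
            for tokens, score in beams:
                for tok, lp in next_log_probs(tokens).items():
                    candidates.append((tokens + [tok], score + lp))
            # Keep only the highest-scoring prefixes; this pruning is what shapes the
            # information profile of the output compared with pure sampling.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams

    # Toy bigram "model" for demonstration.
    toy = {
        "the": {"statue": math.log(0.6), "cat": math.log(0.4)},
        "statue": {"of": math.log(0.9), "was": math.log(0.1)},
        "of": {"liberty": math.log(0.8), "course": math.log(0.2)},
        "cat": {"sat": math.log(1.0)},
    }
    print(beam_search(lambda t: toy.get(t[-1], {"<eos>": 0.0}), "the", 3))
    ```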

  • Post Author
    tyronehed
    Posted March 14, 2025 at 6:40 pm

    Any transformer-based LLM will never achieve AGI because it's only trying to pick the next word. You need a larger amount of planning to achieve AGI.
    Also, the characteristics of LLMs do not resemble any existing intelligence that we know of. Does a baby require 2 years of statistical analysis to become useful? No. Transformer architectures are parlor tricks. They are a glorified Google, but they're not doing any reasoning or planning.
    If you want that, then you have to base your architecture on the examples of intelligence we are aware of in the universe. And that's not a transformer. In fact, whatever AGI emerges will absolutely not contain a transformer.

  • Post Author
    tyronehed
    Posted March 14, 2025 at 6:42 pm

    The alternative architectures must learn from streaming data, must be error tolerant, and must have the characteristic that similar objects or concepts naturally come near to each other. They must naturally overlap.

  • Post Author
    bitwize
    Posted March 14, 2025 at 7:04 pm

    Ever hear of Dissociated Press? If not, try the following demonstration.

    Fire up Emacs and open a text file containing a lot of human-readable text. Something off Project Gutenberg, say. Then say M-x dissociated-press and watch it spew hilarious, quasi-linguistic garbage into a buffer for as long as you like.

    Dissociated Press is a language model. A primitive, stone-knives-and-bearskins language model, but a language model nevertheless. When you feed it input text, it builds up a statistical model based on a Markov chain, assigning probabilities to each character that might occur next, given a few characters of input. If it sees 't' and 'h' as input, the most likely next character is probably going to be 'e', followed by maybe 'a', 'i', and 'o'. 'r' might find its way in there, but 'z' is right out. And so forth. It then uses that model to generate output text by picking characters at random given the past n input characters, resulting in a firehose of things that might be words or fragments of words, but don't make much sense overall.
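
    A tiny character-level Markov generator in the same spirit (order 2, i.e. "given a few characters of input"; a sketch, not Emacs' actual implementation, and the filename is just a placeholder for any Project Gutenberg text):

    ```python
    import random
    from collections import defaultdict

    def train(text, order=2):
        model = defaultdict(list)
        for i in range(len(text) - order):
            model[text[i:i + order]].append(text[i + order])  # context -> observed next characters
        return model

    def generate(model, length=200, order=2):
        context = random.choice(list(model))
        out = context
        for _ in range(length):
            nxt = random.choice(model.get(context, [" "]))    # frequency-weighted pick of next char
            out += nxt
            context = out[-order:]
        return out

    corpus = open("pg1342.txt", encoding="utf-8").read()      # placeholder: any Project Gutenberg text
    print(generate(train(corpus)))
    ```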

    LLMs are doing the same thing. They're picking the next token (word or word fragment) given a certain number of previous tokens. And that's ALL they're doing. The only differences are really matters of scale: the tokens are larger than single characters, the model considers many, many more tokens of input, and the model is a huge deep-learning model with oodles more parameters than a simple Markov chain. So while Dissociated Press churns out obvious nonsensical slop, ChatGPT churns out much, much more plausible sounding nonsensical slop. But it's still just rolling the dice over and over and choosing from among the top candidates of "most plausible sounding next token" according to its actuarial tables. It doesn't think. Any thinking it appears to do has been pre-done by humans, whose thoughts are then harvested off the internet and used to perform macrodata refinement on the statistical model. Accordingly, if you ask ChatGPT a question, it may well be right a lot of the time. But when it's wrong, it doesn't know it's wrong, and it doesn't know what to do to make things right. Because it's just reaching into a bag of refrigerator magnet poetry tiles, weighted by probability of sounding good given the current context, and slapping whatever it finds onto the refrigerator. Over and over.

    What I think Yann LeCun means by "energy" above is "implausibility". That is, the LLM would instead grab a fistful of tiles — enough to form many different responses — and from those start with a single response and then through gradient descent or something optimize it to minimize some statistical "bullshit function" for the entire response, rather than just choosing one of the most plausible single tiles each go. Even that may not fix the hallucination issue, but it may produce results with fewer obvious howlers.
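
    One way to read that in code: instead of committing to the locally most plausible tile at each step, score whole draft responses with a sequence-level "energy" and keep the lowest. A toy sketch, with `token_logprob` standing in for a hypothetical model score (here the energy is just negative total log-probability, which is surely cruder than whatever LeCun has in mind):

    ```python
    def token_logprob(prefix, token):
        """Hypothetical: log-probability of `token` following `prefix` under some language model."""
        raise NotImplementedError

    def energy(tokens):
        # Lower energy = more plausible as a whole sequence.
        return -sum(token_logprob(tokens[:i], tok) for i, tok in enumerate(tokens))

    def pick_response(candidate_responses):
        # candidate_responses: a "fistful" of complete drafts, each a list of tokens.
        return min(candidate_responses, key=energy)
    ```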

  • Post Author
    janalsncm
    Posted March 14, 2025 at 7:14 pm

    I am an MLE not an expert. However, it is a fundamental problem that our current paradigm of training larger and larger LLMs cannot ever scale to the precision people require for many tasks. Even in the highly constrained realm of chess, an enormous neural net will be outclassed by a small program that can run on your phone.

    https://arxiv.org/pdf/2402.04494

  • Post Author
    jurschreuder
    Posted March 14, 2025 at 7:24 pm

    This concept comes from Hopfield networks.

    If two nodes are on, but the connection between them is negative, this causes energy to be higher.

    If one of those nodes switches off, energy is reduced.

    With two nodes this is trivial. With 10 nodes it's more difficult to solve, and with billions of nodes it is impossible to "solve".

    All you can do then is try to get the energy as low as possible.

    This way neural networks can also find "new" information that they have not learned, but that is consistent with the constraints they have learned about the world so far.
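
    A minimal sketch of that energy function for a tiny Hopfield-style network (states are +1/-1; a unit flip is kept only if it lowers the energy):

    ```python
    import numpy as np

    def energy(W, s):
        # Hopfield energy: E = -1/2 * sum_ij W[i, j] * s[i] * s[j], states s in {-1, +1}
        return -0.5 * s @ W @ s

    def minimize(W, s, sweeps=10):
        s = s.copy()
        for _ in range(sweeps):
            for i in np.random.permutation(len(s)):
                flipped = s.copy()
                flipped[i] = -flipped[i]
                if energy(W, flipped) < energy(W, s):   # greedy: keep the flip only if energy drops
                    s = flipped
        return s

    # Two units joined by a negative weight: both "on" is high energy, flipping one lowers it.
    W = np.array([[0.0, -1.0],
                  [-1.0, 0.0]])
    s = np.array([1.0, 1.0])
    print(energy(W, s), energy(W, np.array([1.0, -1.0])))   # 1.0 vs -1.0
    print(minimize(W, s))
    ```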

  • Post Author
    d--b
    Posted March 14, 2025 at 7:44 pm

    Well, it could be argued that the “optimal response”, i.e. the one that sorta minimizes that “energy”, is settled by LLMs on the first iteration. And further iterations aren’t adding any useful information; in fact they are countless occasions to veer off the optimal response.

    For example, if a prompt is “what is the Statue of Liberty”, the LLM’s first output token is going to be “the”, but it kinda already “knows” that the next ones are going to be “statue of liberty”.

    So to me LLMs already “choose” a response path from the first token.

    Conversely, a LLM that would try and find a minimum energy for the whole response wouldn’t necessarily stop hallucinating. There is nothing in the training of a model that says that “I don’t know” has a lower “energy” than a wrong answer…

  • Post Author
    rglover
    Posted March 14, 2025 at 8:06 pm

    Not an ML researcher, but implementing these systems has shown this opinion to be correct. The non-determinism of LLMs is a feature, not a bug that can be fixed.

    As a result, you'll never be able to get 100% consistent outputs or behavior (like you hypothetically can with a traditional algorithm/business logic). And that has proven out in usage across every model I've worked with.

    There's also an upper-bound problem in terms of context where every LLM hits some arbitrary amount of context that causes it to "lose focus" and develop a sort of LLM ADD. This is when hallucinations and random, unrequested changes get made and a previously productive chat spirals to the point where you have to start over.

  • Post Author
    EEgads
    Posted March 14, 2025 at 8:27 pm

    Yann LeCun understands that this is an electrical engineering and physical statistics-of-machines problem, not a code problem.

    The physics of human consciousness are not implemented in a leaky symbolic abstraction but in the raw physics of existence.

    The sort of autonomous system we imagine when thinking of AGI must be built directly into the substrate and exhibit autonomous behavior out of the box. Our computers are black boxes made in a lab, without centuries of evolving in the analog world and finding a balance to build on. They either can do a task or they cannot. Obviously, just from looking at one, we know how few real-world tasks it can just get up and do.

    Code isn’t magic, it’s instruction to create a machine state. There’s no inherent intelligence to our symbolic logic. It’s an artifact of intelligence. It cannot imbue intelligence into a machine.

  • Post Author
    bobosha
    Posted March 14, 2025 at 8:39 pm

    I argue that JEPA and its Energy-Based Model (EBM) framework fail to capture the deeply intertwined nature of learning and prediction in the human brain—the “yin and yang” of intelligence. Contemporary machine learning approaches remain heavily reliant on resource-intensive, front-loaded training phases. I advocate for a paradigm shift toward seamlessly integrating training and prediction, aligning with the principles of online learning.

    Disclosure: I am the author of this paper.

    Reference:
    (PDF) Hydra: Enhancing Machine Learning with a Multi-head Predictions Architecture. Available from: https://www.researchgate.net/publication/381009719_Hydra_Enh… [accessed Mar 14, 2025].
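
    As a contrast with the front-loaded training phases mentioned above, here is a bare-bones online-learning loop (per-example SGD on a linear model; a generic illustration, not the Hydra architecture from the paper):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    w = np.zeros(3)                      # model weights, updated continuously
    lr = 0.01

    # Training and prediction are interleaved: each observation is predicted first, then learned from.
    for step in range(10_000):
        x = rng.normal(size=3)
        y_true = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1)   # toy data stream
        y_hat = x @ w                    # act on current knowledge
        w += lr * (y_true - y_hat) * x   # immediately update from the observed error

    print(w)                             # approaches [1.0, -2.0, 0.5]
    ```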

  • Post Author
    inimino
    Posted March 14, 2025 at 8:47 pm

    I have a paper coming up that I modestly hope will clarify some of this.

    The short answer should be that it's obvious LLM training and inference are both ridiculously inefficient and biologically implausible, and therefore there has to be some big optimization wins still on the table.

  • Post Author
    snats
    Posted March 14, 2025 at 9:20 pm

    Not an insider, but imo the work on diffusion language models like LLaDA is really exciting. It's pretty obvious that LLMs are good, but they are pretty slow. And in a world where people want agents, a lot of the time you want something that might not be that smart but is capable of going really fast and searching fast. You only need to solve search in a specific domain for most agents; you don't need to solve the entire knowledge of human history in a single set of weights.

  • Post Author
    eximius
    Posted March 14, 2025 at 9:31 pm

    I believe that so long as weights are fixed at inference time, we'll be at a dead end.

    Will Titans be sufficiently "neuroplastic" to escape that? Maybe, I'm not sure.

    Ultimately, I think an architecture around "looping" where the model outputs are both some form of "self update" and "optional actionality" such that interacting with the model is more "sampling from a thought space" will be required.

  • Post Author
    simne
    Posted March 14, 2025 at 9:32 pm

    I'm not a deep researcher, more of an amateur, but I can explain some things.

    The biggest problem with the current approach is that to grow abilities you need to add more neurons, which is not just energy consuming but also knowledge consuming: at GPT-4 level, all the text sources of humanity are already exhausted and the model becomes essentially overfitted. So it looks like multi-modal models appeared not because they are so good, but because they can learn from additional sources (audio/video).

    I have seen a few approaches to overcoming the problem of overfitting, but as I understand it, no universal solution exists.

    For example, one approach is to create synthetic training data from current texts, but this idea is limited by definition.

    So current LLMs appear to have hit a dead end, and researchers are now trying to find an exit from it. I believe that in the next few years somebody will invent some universal solution (probably a complex of approaches) or suggest another architecture, and the progress of AI will continue.

  • Post Author
    giantg2
    Posted March 14, 2025 at 9:34 pm

    I feel like some hallucinations aren't bad. Isn't that basically what a new idea is – a hallucination of what could be? The ability to come up with new things, even if they're sometimes wrong, can be useful and happen all the time with humans.
