Ask HN: Any insider takes on Yann LeCun’s push against current architectures? by vessenes

25 Comments

  • Post Author
    ActorNightly
    Posted March 10, 2025 at 8:26 pm

    Not an official ML researcher, but I do happen to understand this stuff.

    The problem with LLMs is that the output is inherently stochastic, i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.

    Energy minimization is more of an abstract approach in which you can use architectures that don't rely on things like differentiability. True AI won't be solely feedforward architectures like current LLMs. To give an answer, they will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or its parameters) at training time, you need something that doesn't rely on continuous values but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and optimize based on that. This is basically how search works with genetic algorithms or PSO.
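
    A minimal sketch of the kind of fitness-driven search described above (a toy genetic algorithm tuning a discrete, non-differentiable "program"; everything here is illustrative, not from any real system):

    ```python
    import random

    TARGET = [3, 1, 0, 2]  # illustrative behaviour we want the search to discover

    def fitness(table):
        # Reward correct outputs, penalize a stand-in for "compute cost".
        correct = sum(1 for a, b in zip(table, TARGET) if a == b)
        return correct - 0.01 * sum(table)

    def mutate(table):
        t = table[:]
        t[random.randrange(len(t))] = random.randrange(4)
        return t

    def crossover(a, b):
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    # No gradients anywhere: selection on a fitness score drives the search.
    pop = [[random.randrange(4) for _ in range(4)] for _ in range(30)]
    for _ in range(100):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:10]
        pop = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                         for _ in range(20)]

    best = max(pop, key=fitness)
    print(best, fitness(best))
    ```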

  • Post Author
    jawiggins
    Posted March 14, 2025 at 5:57 pm

    I'm not an ML researcher, but I do work in the field.

    My mental model of AI advancements is that of a step function with s-curves in each step [1]. Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, yielding rapid progress. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found. Examples of steps include AlexNet demonstrating superior image labeling, LeCun demonstrating deep learning, and now OpenAI demonstrating large transformer models.

    I think in the past, at each stage, people tended to think that the recent progress was a linear or exponential process that would continue forward. This led to people thinking self-driving cars were right around the corner after the introduction of DL in the 2010s, and that super-intelligence is right around the corner now. I think at each stage, the cusp of the S-curve comes as we find where the model is good enough to be deployed and where it isn't. Then companies tend to enter a holding pattern for a number of years, getting diminishing returns from small improvements on their models, until the next algorithmic breakthrough is made.

    Right now I would guess that we are around 0.9 on the S-curve: we can still improve LLMs (as DeepSeek has shown with wide MoE and o1/o3 have shown with CoT), and it will take a few years for the best uses to be brought to market and popularized. As you mentioned, LeCun points out that LLMs have a hallucination problem built into their architecture; others have pointed out that LLMs have had shockingly few revelations and breakthroughs for something that has ingested more knowledge than any living human. I think future work on LLMs is likely to make some improvement on these things, but not much.

    I don't know what it will be, but a new algorithm will be needed to induce the next step on the curve of AI advancement.

    [1]: https://www.open.edu/openlearn/nature-environment/organisati…

  • Post Author
    TrainedMonkey
    Posted March 14, 2025 at 6:03 pm

    This is a somewhat nihilistic take with an optimistic ending. I believe humans will never fix hallucinations. The amount of totally or partially untrue statements people make is significant; especially in tech, it's rare for people to admit that they do not know something. And yet, despite all of that, progress keeps marching forward and maybe even accelerating.

  • Post Author
    ALittleLight
    Posted March 14, 2025 at 6:05 pm

    I've never understood this critique. Models have the capability to say: "oh, I made a mistake here, let me change this" and that solves the issue, right?

    A little bit of engineering and fine tuning – you could imagine a model producing a sequence of statements, then reflecting on the sequence and emitting updates like "statement 7, modify: xzy to xyz".
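
    A rough sketch of that reflect-and-patch loop, where `llm` stands in for a hypothetical text-generation call (not any real API) and corrections come back as "statement <n>: <replacement>":

    ```python
    import re

    def llm(prompt: str) -> str:
        """Hypothetical model call; swap in whatever generation API you actually use."""
        raise NotImplementedError

    def answer_with_reflection(question: str, rounds: int = 2) -> list[str]:
        statements = llm(
            f"Answer step by step, one numbered statement per line:\n{question}"
        ).splitlines()
        for _ in range(rounds):
            critique = llm(
                "Review these statements. Reply OK, or list corrections as "
                "'statement <n>: <replacement>':\n" + "\n".join(statements)
            )
            if critique.strip().upper() == "OK":
                break
            for line in critique.splitlines():
                m = re.match(r"statement (\d+):\s*(.+)", line.strip(), re.IGNORECASE)
                if m and 1 <= int(m.group(1)) <= len(statements):
                    statements[int(m.group(1)) - 1] = m.group(2)  # apply the patch in place
        return statements
    ```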

  • Post Author
    killthebuddha
    Posted March 14, 2025 at 6:12 pm

    I've always felt like the argument is super flimsy because "of course we can _in theory_ do error correction". I've never seen even a semi-rigorous argument that error correction is _theoretically_ impossible. Do you have a link to somewhere where such an argument is made?

  • Post Author
    blueyes
    Posted March 14, 2025 at 6:17 pm

    Sincere question – why doesn't RL-based fine-tuning on top of LLMs solve this, or at least push accuracy above a minimum acceptable threshold in many use cases? OAI has a team doing this for enterprise clients. Several startups rolling out of the current YC batch are doing versions of this.

  • Post Author
    __rito__
    Posted March 14, 2025 at 6:23 pm

    Slightly related: Energy-Based Models (EBMs) are better in theory and yet too resource-intensive. I tried to sell using EBMs to my org, but the price for even a small use case was prohibitive.

    I learned it from: https://youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyo…

    Yann LeCun, and Michael Bronstein and his colleagues, have some similarities in trying to properly sciencify deep learning.

    Yann LeCun's approach, at least for vision, has one core tenet: energy minimization, just like in physics. In his course, he also shows some current architectures/algorithms to be special cases of EBMs.

    Yann believes that understanding the whys of the behavior of DL algorithms is going to be more beneficial in the long term than playing around with hyper-params.

    There is also a case for language being too low-dimensional to lead to AGI, even if it is solved. For example, in a recent video he said that the total amount of data in all digitized books and the internet is about the same as what a human child takes in during its first 4-5 years. He considers this low.

    There are also epistemological arguments that language alone cannot lead to AGI, but I haven't heard him talk about them.

    He also believes that vision is a more important aspect of intelligence, one reason being that it is very high-dimensional. (Edit) Consider an example. Take 4 monochrome pixels. Each pixel can range from 0 to 255, so 4 pixels can create 256^4 = 2^32 combinations. 4 words, by contrast, can be arranged in only 4! = 24 ways. Solving language is easier and therefore lower-stakes. Remember the monkey producing a Shakespeare play by randomly punching typewriter keys? If that number was astronomically big, think how obscenely long it would take a monkey to paint the Mona Lisa by randomly assigning pixel values. Left as an exercise to the reader.
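
    A quick check of the arithmetic (note the word count assumes orderings of a fixed set of 4 words, as in the example):

    ```python
    import math

    pixels = 256 ** 4            # 4 monochrome pixels, 256 intensity levels each
    words = math.factorial(4)    # orderings of a fixed set of 4 words

    print(f"{pixels:,} pixel combinations vs {words} word orderings")
    # 4,294,967,296 pixel combinations vs 24 word orderings
    ```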

    Juergen Schmidhuber has gone a lot quieter now. But he has also said that a world model explicitly included in training and reasoning is better than relying on text or images alone. He has a good paper with Lucas Beyer.

  • Post Author
    hnfong
    Posted March 14, 2025 at 6:29 pm

    I'm not an insider and I'm not sure whether this is directly related to "energy minimization", but "diffusion language models" have apparently gained some popularity in recent weeks.

    https://arxiv.org/abs/2502.09992

    https://www.inceptionlabs.ai/news

    (these are results from two different teams/orgs)

    It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.

  • Post Author
    estebarb
    Posted March 14, 2025 at 6:34 pm

    I have no idea about EBM, but I have researched a bit on the language modelling side. And let's be honest, GPT is not the best learner we can create right now (we ourselves are). GPT needs far more data and energy than a human, so clearly there is a better architecture somewhere waiting to be discovered.

    Attention works, yes. But it is not biologically plausible at all. We don't do quadratic comparisons across a whole book or need to see thousands of samples to understand.
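
    To make the quadratic-comparison point concrete, here is a bare-bones single-head attention sketch in NumPy; the n-by-n score matrix is the part that grows quadratically with sequence length:

    ```python
    import numpy as np

    def attention(Q, K, V):
        # Q, K, V: (n, d) arrays for a sequence of n tokens.
        scores = Q @ K.T / np.sqrt(K.shape[-1])           # (n, n): every token scored against every other
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
        return weights @ V                                # (n, d)

    n, d = 1024, 64
    x = np.random.randn(n, d)
    out = attention(x, x, x)
    print(out.shape, "score-matrix entries:", n * n)      # doubling n quadruples the comparisons
    ```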

    Personally I think that in the future recursive architectures and test time training will have a better chance long term than current full attention.

    Also, I think that OpenAI's biggest contribution is demonstrating that reasoning-like behaviors can emerge from really good language modelling.

  • Post Author
    probably_wrong
    Posted March 14, 2025 at 6:38 pm

    I haven't read Yann LeCun's take. Based on your description alone, my first impression would be: there's a paper [1] arguing that "beam search enforces uniform information density in text, a property motivated by cognitive science". UID claims, in short, that a speaker only delivers as much content as they think the listener can take (no more, no less), and the paper claims that beam search enforces this property at generation time.

    The paper would be a strong argument against your point: if neural architectures are already constraining the amount of information that a text generation system delivers the same way a human (allegedly) does, then I don't see which "energy" measure one could take that could perform any better.

    Then again, perhaps they have one in mind and I just haven't read it.

    [1] https://aclanthology.org/2020.emnlp-main.170/
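
    For concreteness, a minimal beam search sketch over a toy bigram model (illustrative only; not the paper's setup):

    ```python
    import math

    def beam_search(next_log_probs, start, steps, beam_width=2):
        """next_log_probs(tokens) -> {next_token: log_prob}; a hypothetical scoring function."""
        beams = [([start], 0.0)]
        for _ in range(steps):
            candidates = []
            for tokens, score in beams:
                for tok, lp in next_log_probs(tokens).items():
                    candidates.append((tokens + [tok], score + lp))
            # Keep only the highest-scoring prefixes; this pruning is what shapes the
            # information profile of the output compared with pure sampling.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams

    # Toy bigram "model" for demonstration.
    toy = {
        "the": {"statue": math.log(0.6), "cat": math.log(0.4)},
        "statue": {"of": math.log(0.9), "was": math.log(0.1)},
        "of": {"liberty": math.log(0.8), "course": math.log(0.2)},
        "cat": {"sat": math.log(1.0)},
    }
    print(beam_search(lambda t: toy.get(t[-1], {"<eos>": 0.0}), "the", 3))
    ```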

  • Post Author
    tyronehed
    Posted March 14, 2025 at 6:40 pm

    Any transformer-based LLM will never achieve AGI because it's only trying to pick the next word. You need a larger amount of planning to achieve AGI.
    Also, the characteristics of LLMs do not resemble any existing intelligence that we know of. Does a baby require 2 years of statistical analysis to become useful? No. Transformer architectures are parlor tricks. They are a glorified Google, but they're not doing any reasoning or planning.
    If you want that, then you have to base your architecture on the examples of intelligence we are aware of in the universe. And that's not a transformer. In fact, whatever AGI emerges will absolutely not contain a transformer.

  • Post Author
    tyronehed
    Posted March 14, 2025 at 6:42 pm

    The alternative architectures must learn from streaming data, must be error tolerant, and must have the characteristic that similar objects or concepts naturally come near to each other. They must naturally overlap.

  • Post Author
    bitwize
    Posted March 14, 2025 at 7:04 pm

    Ever hear of Dissociated Press? If not, try the following demonstration.

    Fire up Emacs and open a text file containing a lot of human-readable text. Something off Project Gutenberg, say. Then say M-x dissociated-press and watch it spew hilarious, quasi-linguistic garbage into a buffer for as long as you like.

    Dissociated Press is a language model. A primitive, stone-knives-and-bearskins language model, but a language model nevertheless. When you feed it input text, it builds up a statistical model based on a Markov chain, assigning probabilities to each character that might occur next, given a few characters of input. If it sees 't' and 'h' as input, the most likely next character is probably going to be 'e', followed by maybe 'a', 'i', and 'o'. 'r' might find its way in there, but 'z' is right out. And so forth. It then uses that model to generate output text by picking characters at random given the past n input characters, resulting in a firehose of things that might be words or fragments of words, but don't make much sense overall.
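
    A tiny character-level Markov generator in the same spirit (order 2, i.e. "given a few characters of input"; a sketch, not Emacs' actual implementation, and the filename is just a placeholder for any Project Gutenberg text):

    ```python
    import random
    from collections import defaultdict

    def train(text, order=2):
        model = defaultdict(list)
        for i in range(len(text) - order):
            model[text[i:i + order]].append(text[i + order])  # context -> observed next characters
        return model

    def generate(model, length=200, order=2):
        context = random.choice(list(model))
        out = context
        for _ in range(length):
            nxt = random.choice(model.get(context, [" "]))    # frequency-weighted pick of next char
            out += nxt
            context = out[-order:]
        return out

    corpus = open("pg1342.txt", encoding="utf-8").read()      # placeholder: any Project Gutenberg text
    print(generate(train(corpus)))
    ```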

    LLMs are doing the same thing. They're picking the next token (word or word fragment) given a certain number of previous tokens. And that's ALL they're doing. The only differences are really matters of scale: the tokens are larger than single characters, the model considers many, many more tokens of input, and the model is a huge deep-learning model with oodles more parameters than a simple Markov chain. So while Dissociated Press churns out obvious nonsensical slop, ChatGPT churns out much, much more plausible sounding nonsensical slop. But it's still just rolling the dice over and over and choosing from among the top candidates of "most plausible sounding next token" according to its actuarial tables. It doesn't think. Any thinking it appears to do has been pre-done by humans, whose thoughts are then harvested off the internet and used to perform macrodata refinement on the statistical model. Accordingly, if you ask ChatGPT a question, it may well be right a lot of the time. But when it's wrong, it doesn't know it's wrong, and it doesn't know what to do to make things right. Because it's just reaching into a bag of refrigerator magnet poetry tiles, weighted by probability of sounding good given the current context, and slapping whatever it finds onto the refrigerator. Over and over.

    What I think Yann LeCun means by "energy" above is "implausibility". That is, the LLM would instead grab a fistful of tiles — enough to form many different responses — and from those start with a single response and then through gradient descent or something optimize it to minimize some statistical "bullshit function" for the entire response, rather than just choosing one of the most plausible single tiles each go. Even that may not fix the hallucination issue, but it may produce results with fewer obvious howlers.
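
    One way to read that in code: instead of committing to the locally most plausible tile at each step, score whole draft responses with a sequence-level "energy" and keep the lowest. A toy sketch, with `token_logprob` standing in for a hypothetical model score (here the energy is just negative total log-probability, which is surely cruder than whatever LeCun has in mind):

    ```python
    def token_logprob(prefix, token):
        """Hypothetical: log-probability of `token` following `prefix` under some language model."""
        raise NotImplementedError

    def energy(tokens):
        # Lower energy = more plausible as a whole sequence.
        return -sum(token_logprob(tokens[:i], tok) for i, tok in enumerate(tokens))

    def pick_response(candidate_responses):
        # candidate_responses: a "fistful" of complete drafts, each a list of tokens.
        return min(candidate_responses, key=energy)
    ```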

  • Post Author
    janalsncm
    Posted March 14, 2025 at 7:14 pm

    I am an MLE not an expert. However, it is a fundamental problem that our current paradigm of training larger and larger LLMs cannot ever scale to the precision people require for many tasks. Even in the highly constrained realm of chess, an enormous neural net will be outclassed by a small program that can run on your phone.

    https://arxiv.org/pdf/2402.04494

  • Post Author
    jurschreuder
    Posted March 14, 2025 at 7:24 pm

    This concept comes from Hopfield networks.

    If two nodes are on, but the connection between them is negative, this causes energy to be higher.

    If one of those nodes switches off, energy is reduced.

    With two nodes this is trivial. With 10 nodes it's more difficult to solve, and with billions of nodes it is impossible to "solve".

    All you can do then is try to get the energy as low as possible.

    This way neural networks can also find "new" information that they have not learned, but that is consistent with the constraints they have learned about the world so far.
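
    A minimal sketch of that energy function for a tiny Hopfield-style network (states are +1/-1; a unit flip is kept only if it lowers the energy):

    ```python
    import numpy as np

    def energy(W, s):
        # Hopfield energy: E = -1/2 * sum_ij W[i, j] * s[i] * s[j], states s in {-1, +1}
        return -0.5 * s @ W @ s

    def minimize(W, s, sweeps=10):
        s = s.copy()
        for _ in range(sweeps):
            for i in np.random.permutation(len(s)):
                flipped = s.copy()
                flipped[i] = -flipped[i]
                if energy(W, flipped) < energy(W, s):   # greedy: keep the flip only if energy drops
                    s = flipped
        return s

    # Two units joined by a negative weight: both "on" is high energy, flipping one lowers it.
    W = np.array([[0.0, -1.0],
                  [-1.0, 0.0]])
    s = np.array([1.0, 1.0])
    print(energy(W, s), energy(W, np.array([1.0, -1.0])))   # 1.0 vs -1.0
    print(minimize(W, s))
    ```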

  • Post Author
    d--b
    Posted March 14, 2025 at 7:44 pm

    Well, it could be argued that the “optimal response”, i.e. the one that sorta minimizes that “energy”, is settled by LLMs on the first iteration. And further iterations aren’t adding any useful information; in fact they are countless occasions to veer off the optimal response.

    For example, if a prompt is “what is the Statue of Liberty”, the LLM’s first output token is going to be “the”, but it kinda already “knows” that the next ones are going to be “statue of liberty”.

    So to me LLMs already “choose” a response path from the first token.

    Conversely, a LLM that would try and find a minimum energy for the whole response wouldn’t necessarily stop hallucinating. There is nothing in the training of a model that says that “I don’t know” has a lower “energy” than a wrong answer…

  • Post Author
    rglover
    Posted March 14, 2025 at 8:06 pm

    Not an ML researcher, but implementing these systems has shown this opinion to be correct. The non-determinism of LLMs is a feature, not a bug that can be fixed.

    As a result, you'll never be able to get 100% consistent outputs or behavior (like you hypothetically can with a traditional algorithm/business logic). And that has proven out in usage across every model I've worked with.

    There's also an upper-bound problem in terms of context where every LLM hits some arbitrary amount of context that causes it to "lose focus" and develop a sort of LLM ADD. This is when hallucinations and random, unrequested changes get made and a previously productive chat spirals to the point where you have to start over.

  • Post Author
    EEgads
    Posted March 14, 2025 at 8:27 pm

    Yann LeCun understands that this is an electrical engineering and physical statistics-of-machines problem, not a code problem.

    The physics of human consciousness are not implemented in a leaky symbolic abstraction but in the raw physics of existence.

    The sort of autonomous system we imagine when thinking of AGI must be built directly into the substrate and exhibit autonomous behavior out of the box. Our computers are black boxes made in a lab, without centuries of evolving in the analog world and finding a balance to build on. They either can do a task or they cannot. Obviously, just from looking at one, we know how few real-world tasks it can just get up and do.

    Code isn’t magic, it’s instruction to create a machine state. There’s no inherent intelligence to our symbolic logic. It’s an artifact of intelligence. It cannot imbue intelligence into a machine.

  • Post Author
    bobosha
    Posted March 14, 2025 at 8:39 pm

    I argue that JEPA and its Energy-Based Model (EBM) framework fail to capture the deeply intertwined nature of learning and prediction in the human brain—the “yin and yang” of intelligence. Contemporary machine learning approaches remain heavily reliant on resource-intensive, front-loaded training phases. I advocate for a paradigm shift toward seamlessly integrating training and prediction, aligning with the principles of online learning.

    Disclosure: I am the author of this paper.

    Reference:
    (PDF) Hydra: Enhancing Machine Learning with a Multi-head Predictions Architecture. Available from: https://www.researchgate.net/publication/381009719_Hydra_Enh… [accessed Mar 14, 2025].
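
    As a contrast with the front-loaded training phases mentioned above, here is a bare-bones online-learning loop (per-example SGD on a linear model; a generic illustration, not the Hydra architecture from the paper):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    w = np.zeros(3)                      # model weights, updated continuously
    lr = 0.01

    # Training and prediction are interleaved: each observation is predicted first, then learned from.
    for step in range(10_000):
        x = rng.normal(size=3)
        y_true = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1)   # toy data stream
        y_hat = x @ w                    # act on current knowledge
        w += lr * (y_true - y_hat) * x   # immediately update from the observed error

    print(w)                             # approaches [1.0, -2.0, 0.5]
    ```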

  • Post Author
    inimino
    Posted March 14, 2025 at 8:47 pm

    I have a paper coming up that I modestly hope will clarify some of this.

    The short answer should be that it's obvious LLM training and inference are both ridiculously inefficient and biologically implausible, and therefore there has to be some big optimization wins still on the table.

  • Post Author
    snats
    Posted March 14, 2025 at 9:20 pm

    Not an insider, but imo the work on diffusion language models like LLaDA is really exciting. It's pretty obvious that LLMs are good, but they are pretty slow. And in a world where people want agents, a lot of the time you want something that might not be that smart but is capable of going really fast and searching fast. You only need to solve search in a specific domain for most agents; you don't need to solve the entire knowledge of human history in a single set of weights.

  • Post Author
    eximius
    Posted March 14, 2025 at 9:31 pm

    I believe that so long as weights are fixed at inference time, we'll be at a dead end.

    Will Titans be sufficiently "neuroplastic" to escape that? Maybe, I'm not sure.

    Ultimately, I think an architecture around "looping" where the model outputs are both some form of "self update" and "optional actionality" such that interacting with the model is more "sampling from a thought space" will be required.

  • Post Author
    simne
    Posted March 14, 2025 at 9:32 pm

    I'm not a deep researcher, more of an amateur, but I can explain some things.

    The biggest problem with the current approach is that to grow abilities you need to add more neurons, which is not just energy consuming but also knowledge consuming: at GPT-4 level, all the text sources of humanity are already exhausted and the model becomes essentially overfitted. So it looks like multi-modal models appeared not because they are so good, but because they can learn from additional sources (audio/video).

    I have seen a few approaches to overcoming the problem of overfitting, but as I understand it, no universal solution exists.

    For example, one approach is to create synthetic training data from current texts, but this idea is limited by definition.

    So current LLMs appear to have hit a dead end, and researchers are now trying to find an exit from it. I believe that in the next few years somebody will invent some universal solution (probably a complex of approaches) or suggest another architecture, and the progress of AI will continue.

  • Post Author
    giantg2
    Posted March 14, 2025 at 9:34 pm

    I feel like some hallucinations aren't bad. Isn't that basically what a new idea is – a hallucination of what could be? The ability to come up with new things, even if they're sometimes wrong, can be useful and happen all the time with humans.
