The criticisms are not straw men; they are actually well grounded in math. Take, for instance, his promotion of energy-based models.

In a probability-distribution model, the model is always forced to output a probability over a set of tokens, even if all of the candidate states are nonsense. In an energy-based model, the model can infer that a state makes no sense at all and can backtrack by itself.
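
A minimal sketch of the contrast, in Python with toy scores and a hypothetical rejection threshold (an illustration of the idea, not anyone's actual implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    logits = rng.normal(size=5)  # scores for 5 candidate tokens

    # Probability model: softmax forces a normalized distribution, so
    # even if every candidate is nonsense, the mass still sums to 1
    # and something has to be sampled.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    assert np.isclose(probs.sum(), 1.0)

    # Energy-based model: each candidate gets an unnormalized energy
    # (lower = more plausible). Nothing forces a choice: if every
    # energy sits above a threshold, reject them all and backtrack.
    energies = -logits
    THRESHOLD = 0.0  # hypothetical cutoff, chosen for illustration
    if energies.min() > THRESHOLD:
        print("all candidates implausible -> backtrack")
    else:
        print("accept candidate", int(energies.argmin()))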

Notice that diffusion models, DINO, and other successful models are energy-based models, or end up being good proxies for the data density (density is a proxy for entropy, i.e. information).
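
For the diffusion case in particular, the usual bridge is the score: score matching trains the network to estimate the gradient of the log-density, which is exactly a negative energy gradient, so the learned score determines an energy up to an additive constant:

    s_\theta(x) \approx \nabla_x \log p(x) = -\nabla_x E_\theta(x)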

Finally, all probability models can be thought of as energy-based, but not all EBMs output probability distributions.
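
Concretely, this is the Gibbs/Boltzmann correspondence: any density defines an energy via E(x) = -log p(x), but an energy function only yields a distribution when the normalizer is finite:

    p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}, \qquad Z_\theta = \int e^{-E_\theta(x)} \, dx

If Z_\theta diverges, the model can still rank states by plausibility even though no normalized distribution exists.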

So, his argument is not against transformers or the architectures themselves, but more about the learned geometry.

I'm really fucking math dumb. Can you explain what the "well grounded" part is, for the mathematically challenged?

Because all I've seen from the "energy based" approach in practice is a lot of hype and not a lot of results. If it isn't applicable to LLMs, then what is it applicable to? Where does it give an advantage? Why would you want it?

I really, genuinely don't get that.
