2025: The Year in LLMs

ksec · 2026-01-01T07:40:29 1767253229

All these improvement in a single year, 2025. While this may seem obvious to those who follows along the AI / LLM news. It may be worth pointing out again ChatGPT was introduced to us in November 2022.

I still dont believe AGI, ASI or Whatever AI will take over human in short period of time say 10 - 20 years. But it is hard to argue against the value of current AI, which many of the vocal critics on HN seems to have the opinion of. People are willing to pay $200 per month, and it is getting $1B dollar runway already.

Being more of a Hardware person, the most interesting part to me is the funding of all the developments of latest hardware. I know this is another topic HN hate because of the DRAM and NAND pricing issue. But it is exciting to see this from a long term view where the pricing are short term pain. Right now the industry is asking, we have together over a trillion dollar to spend on Capex over the next few years and will even borrow more if it needs to be, when can you ship us 16A / 14A / 10A and 8A or 5A, LPDDR6, Higher Capacity DRAM at lower power usage, better packaging, higher speed PCIe or a jump to optical interconnect? Every single part of the hardware stack are being fused with money and demand. The last time we have this was Post-PC / Smartphone era which drove the hardware industry forward for 10 - 15 years. The current AI can at least push hardware for another 5 - 6 years while pulling forward tech that was initially 8 - 10 years away.

I so wished I brought some Nvidia stock. Again, I guess no one knew AI would be as big as it is today, and it is only just started.

wpietri · 2026-01-01T13:26:48 1767274008

This is not a great argument:

> But it is hard to argue against the value of current AI [...] it is getting $1B dollar runway already.

The psychic services industry makes over $2 billion a year in the US [1], with about a quarter of the population being actual believers. [2].

[1] The https://www.ibisworld.com/united-states/industry/psychic-ser...

[2] https://news.gallup.com/poll/692738/paranormal-phenomena-met...

apexalpha · 2026-01-01T13:36:19 1767274579

What if these provide actual value through placebo-effect?

wpietri · 2026-01-01T15:07:44 1767280064

I think we have different definitions of "actual value". But even if I pick the flaccid definition, that isn't proof of value of the thing itself, but of any placebo. In which case we can focus on the cheapest/least harmful placebo. Or, better, solving the underlying problem that the placebo "helps".

computably · 2026-01-01T16:02:31 1767283351

I'll preface by saying I fully agree that psychics aren't providing any non-placebo value to believers, although I think it's fine to provide entertainment for non-believers.

> Or, better, solving the underlying problem that the placebo "helps".

The underlying problems are often a lack of a decent education and a generally difficult/unsatisfying life. Systemic issues which can't be meaningfully "solved" without massive resources and political will.

wpietri · 2026-01-02T13:34:33 1767360873

If we look back over the last century or so, I think we've made excellent progress on that. The main current barrier is that we've lately let people with various pathologies run wild, but historically that creates enough problems that the political will emerges. See, e.g., the American and French revolutions, or India's independence, or the US civil war and Reconstruction.

jay_kyburz · 2026-01-01T19:03:44 1767294224

Actually, I'd go one step further and say they are harmful to everybody else.

It might just be my circles, but I've seen Carl Sagans quote everywhere in the last couple of months.

"“Science is more than a body of knowledge; it is a way of thinking. I have a foreboding of an America in my children’s or grandchildren’s time—when the United States is a service and information economy; when nearly all the key manufacturing industries have slipped away to other countries; when awesome technological powers are in the hands of a very few, and no one representing the public interest can even grasp the issues; when the people have lost the ability to set their own agendas or knowledgeably question those in authority; when, clutching our crystals and nervously consulting our horoscopes, our critical faculties in decline, unable to distinguish between what feels good and what’s true, we slide, almost without noticing, back into superstition and darkness.”"

recursive · 2026-01-01T13:59:50 1767275990

You talking about psychics or LLMs?

grosswait · 2026-01-01T14:17:42 1767277062

ctoth · 2026-01-01T17:06:40 1767287200

2022/2023: "It hallucinates, it's a toy, it's useless."

2024/2025: "Okay, it works, but it produces security vulnerabilities and makes junior devs lazy."

2026 (Current): "It is literally the same thing as a psychic scam."

Can we at least make predictions for 2027? What shall the cope be then! Lemme go ask my psychic.

wpietri · 2026-01-02T03:28:09 1767324489

I suppose it's appropriate that you hallucinated an argument I did not make, attacked the straw man, and declared victory.

ben_w · 2026-01-02T10:07:15 1767348435

Ironically, the human tendency to read far too much into things for which we have far too little data, does seem to still be one of the ways we (and all biological neural nets) are more sample-efficient than any machine learning.

I have no idea if those two points, ML and brains, are just different points on the same Pareto frontier of some useful metrics, but I am increasingly suspecting they might be.

bopbopbop7 · 2026-01-01T17:25:38 1767288338

2022/2023: "Next year software engineering is dead"

2024: "Now this time for real, software engineering is dead in 6 months, AI CEO said so"

2025: "I know a guy who knows a guy who built a startup with an LLM in 3 hours, software engineering is dead next year!"

What will be the cope for you this year?

ben_w · 2026-01-02T10:28:04 1767349684

I went from using ChatGPT 3.5 for functions and occasional scripts…

… to one of the models in Jan 2024 being able to repeatedly add features to the same single-page web app without corrupting its own work or hallucinating the APIs it had itself previously generated…

… to last month using a gifted free week of Claude Code to finish one project and then also have enough tokens left over to start another fresh project which, on that free left-over credit, reached a state that, while definitely not well engineered, was still better than some of the human-made pre-GenAI nonsense I've had to work with.

Wasn't 3 hours, and I won't be working on that thing more this month either because I am going to be doing intensive German language study with the goal of getting the language certificate I need for dual citizenship, but from the speed of work? 3 weeks to make a startup is already plausible.

I won't say that "software engineering" is dead. In a lot of cases however "writing code" is dead, and the job of the engineer should now be to do code review and to know what refactors to ask for.

bopbopbop7 · 2026-01-02T17:10:39 1767373839

So you did some basic web development and built a "not well engineered" greenfield app that you didn't ship, and from that your conclusion is that "writing code is dead"?

aspenmartin · 2026-01-01T20:15:30 1767298530

The cope + disappointment will be knowing that a large population of HN users will paint a weird alternative reality. There are a multitude of messages about AI that are out there, some are highly detached from reality (on the optimistic and pessimistic side). And then there is the rational middle, professionals who see the obvious value of coding agents in their workflow and use them extensively (or figure out how to best leverage them to get the most mileage). I don't see software engineering being "dead" ever, but the nature of the job _has already changed_ and will continue to change. Look at Sonnet 3.5 -> 3.7 -> 4.5 -> Opus 4.5; that was 17 months of development and the leaps in performance are quite impressive. You then have massive hardware buildouts and improvements to stack + a ton of R&D + competition to squeeze the juice out of the current paradigm (there are 4 orders of magnitude of scaling left before we hit real bottlenecks) and also push towards the next paradigm to solve things like continual learning. Some folks have opted not to use coding agents (and some folks like yourself seem to revel in strawmanning people who point out their demonstrable usefulness). Not using coding agents in Jan 2026 is defensible. It won't be defensible for long.

bopbopbop7 · 2026-01-01T20:25:58 1767299158

Please do provide some data for this "obvious value of coding agents". Because right now the only thing obvious is the increase in vulnerabilities, people claiming they are 10x more productive but aren't shipping anything, and some AI hype bloggers that fail to provide any quantitative proof.

aspenmartin · 2026-01-01T20:39:15 1767299955

Sure: at my MAANG company, where I watch the data closely on adoption of CC and other internal coding agent tools, most (significant) LOC are written by agents, and most employees have adopted coding agents as WAU, and the adoption rate is positively correlated with seniority.

Like a lot of things LLM related (Simon Willison's pelican test, researchers + product leaders implementing AI features) I also heavily "vibe" check the capabilities myself on real work tasks. The fact of the matter is I am able to dramatically speed up my work. It may be actually writing production code + helping me review it, or it may be tasks like: write me a script to diagnose this bug I have, or build me a streamlit dashboard to analyze + visualize this ad hoc data instead of me taking 1 hour to make visualizations + munge data in a notebook.

> people claiming they are 10x more productive but aren't shipping anything, and some AI hype bloggers that fail to provide any quantitative proof.

what would satisfy you here? I feel you are strawmanning a bit by picking the most hyperbolic statements and then blanketing that on everyone else.

My workflow is now:

- Write code exclusively with Claude

- Review the code myself + use Claude as a sort of review assistant to help me understand decisions about parts of the code I'm confused about

- Provide feedback to Claude to change / steer it away or towards approaches

- Give up when Claude is hopelessly lost

It takes a bit to get the hang of the right balance but in my personal experience (which I doubt you will take seriously but nevertheless): it is quite the game changer and that's coming from someone who would have laughed at the idea of a $200 coding agent subscription 1 year ago

Denzel · 2026-01-02T02:30:35 1767321035

We probably work at the same company, given you used MAANG instead of FAANG.

As one of the WAU (really DAU) you’re talking about, I want to call out a couple things: 1) the LOC metrics are flawed, and anyone using the agents knows this - eg, ask CC to rewrite the 1 commit you wrote into 5 different commits, now you have 5 100% AI-written commits; 2) total speed up across the entire dev lifecycle is far below 10x, most likely below 2x, but I don’t see any evidence of anyone measuring the counterfactuals to prove speed up anyways, so there’s no clear data; 3) look at token spend for power users, you might be surprised by how many SWE-years they’re spending.

Overall it’s unclear whether LLM-assisted coding is ROI-positive.

aspenmartin · 2026-01-02T13:21:36 1767360096

Oh yes all of this I agree with. I had tried to clarify this above but your examples are clearer: my point is: all measures and studies I have personally seen of AI impact on productivity have been deeply flawed for one reason or another.

Total speed up is WAY less than 10x by any measure. 2x seems too high too.

By data alone it’s a bit unclear of impact I agree. But I will say there seems to be a clear picture that to me, starting from a prior formed from personal experience, indicates some real productivity impact today, with a trajectory that suggests these claims of a lot of SWE work being offloaded to agents over the next few years seems not that far fetched.

- adoption and retention numbers internally and externally. You can argue this is driven by perverse incentives and/or the perception performance mismatch but I’m highly skeptical of this even though the effects of both are probably really, it would be truly extraordinary to me if there weren’t at least a ~10-20% bump in productivity today and a lot of headroom to go as integration gets better and user skill gets better and model capabilities grow

- benchmark performance, again benchmarks are really problematic but there are a lot of them and all of them together paint a pretty clear picture of capabilities truly growing and growing quickly

- there are clearly biases we can think of that would cause us to overestimate AI impact, but there are also biases that may cause us to underestimate impact: e.g. I’m now able to do work that I would have never attempted before. Multitasking is easier. Experiments are quicker and easier. That may not be captured well by e.g. task completion time or other metrics.

I even agree: quality of agentic code can be a real risk, but:

- I think this ignores the fact that humans have also always written shitty code and always will; there is lots of garbage in production believe me, and that predates agentic code

- as models improve, they can correct earlier mistakes

- it’s also a muscle to grow: how to review and use humans in the loop to improve quality and set a high bar

ben_w · 2026-01-02T10:53:32 1767351212

To add to your point:

If the M stands for Meta, I would also like to note that as a user, I have been seeing increasingly poor UI, of the sort I'd expect from people committing code that wasn't properly checked before going live, as I would expect from vibe coding in the original sense of "blindly accept without review". Like, some posts have two copies of the sender's name in the same location on screen with slightly different fonts going out of sync with each other.

I can easily believe the metrics that all [MF]AANG bonuses are denominated in are going up, our profession has had jokes about engineers gaming those metrics even back when our comics were still printed in books: https://imgur.com/bug-free-programs-dilbert-classic-tyXXh1d

bopbopbop7 · 2026-01-01T20:54:53 1767300893

Anecdotes don’t prove anything, ones without any metrics, and especially at MAANG where AI use is strongly incentivized.

Evidence is peer reviewed research, or at least something with metrics. Like the METR study that shows that experienced engineers often got slower on real tasks with AI tools, even though they thought they were faster.

aspenmartin · 2026-01-01T22:16:11 1767305771

That's why I gave you data! METR study was 16 people using Sonnet 3.5/3.7. Data I'm talking about is 10s of thousands of people and is much more up to date.

Some counter examples to METR that are in the literature but I'll just say: "rigor" here is very difficult (including METR) because outcomes are high dimensional and nuanced, or ecological validity is an issue. It's hard to have any approach that someone wouldn't be able to dismiss due to some issue they have with the methodology. The sources below also have methodological problems just like METR

https://arxiv.org/pdf/2302.06590 -- 55% faster implementing HTTP server in javascript with copilot (in 2023!) but this is a single task and not really representative.

https://demirermert.github.io/Papers/Demirer_AI_productivity... -- "Though each experiment is noisy, when data is combined across three experiments and 4,867 developers, our analysis reveals a 26.08% increase (SE: 10.3%) in completed tasks among developers using the AI tool. Notably, less experienced developers had higher adoption rates and greater productivity gains." (but e.g. "completed tasks" as the outcome measure is of course problematic)

To me, internal company measures for large tech companies will be most reliable -- they are easiest to track and measure, the scale is large enough, and the talent + task pool is diverse (junior -> senior, different product areas, different types of tasks). But then outcome measures are always a problem...commits per developer per month? LOC? task completion time? all of them are highly problematic, especially because its reasonable to expect AI tools would change the bias and variance of the proxy so its never clear if you're measuring the change in "style" or the change in the underlying latent measure of productivity you care about

bopbopbop7 · 2026-01-01T22:38:44 1767307124

To be fair, I’ll take a non-biased 16 person study over “internal measures” from a MAANG company that burned 100s of billions on AI with no ROI that is now forcing its employees to use AI.

anorwell · 2026-01-02T02:04:56 1767319496

What do you think about the METR 50% task length results? About benchmark progress generally?

ben_w · 2026-01-02T11:02:33 1767351753

I don't speak for bopbopbop7, but I will say this: my experience of using Claude Code has been that it can do much longer tasks than the METR benchmark implies are possible.

The converse of this is that if those tasks are representative of software engineering as a whole, I would expect a lot of other tasks where it absolutely sucks.

This expectation is further supported by the number of times people pop up in conversations like this to say for any given LLM that it falls flat on its face even for something the poster thinks is simple, that it cost more time than it saved.

As with supposedly "full" self driving on Teslas, the anecdotes about the failure modes are much more interesting than the success: one person whose commute/coding problem happens to be easy, may mistake their own circumstances for normal. Until it does work everywhere, it doesn't work everywhere.

When I experiment with vibe coding (as in, properly unsupervised), it can break down large tasks into small ones and churn through each sub-task well enough, such that it can do a task I'd expect to take most of a sprint by itself. Now, that said, I will also say it seems to do these things a level of "that'll do" not "amazing!", but it does do them.

But I am very much aware this is like all the people posting "well my Tesla commute doesn't need any interventions!" in response to all the people pointing out how it's been a decade since Musk said "I think that within two years, you'll be able to summon your car from across the country. It will meet you wherever your phone is … and it will just automatically charge itself along the entire journey."

It works on my [use case], but we can't always ship my [use case].

aspenmartin · 2026-01-02T00:08:21 1767312501

I could have guessed you would say that :) but METR is not an unbiased study either. Maybe you mean that METR is less likely to intentionally inflate their numbers?

If you insist or believe in a conspiracy I don’t think there’s really anything I or others will be able to say or show you that would assuage you, all I can say is I’ve seen the raw data. It’s a mess and again we’re stuck with proxies (which are bad since you start conflating the change in the proxy-latent relationship with the treatment effect). And it’s also hard and arguably irresponsible to run RCTs.

All I will say is: there are flaws everywhere. METR results are far from conclusive. Totally understandable if there is a mismatch between perception and performance. But also consider: even if task takes the same or even slightly more time, one big advantage for me is that it substantially reduces cognitive load so I can work in parallel sessions on two completely different issues.

bopbopbop7 · 2026-01-02T00:28:24 1767313704

I bet it does reduce your cognitive load, considering you, in your own words "Give up when Claude is hopelessly lost". No better way to reduce cognitive load.

aspenmartin · 2026-01-02T01:03:22 1767315802

I give up using Claude when it gets hopelessly lost, and then my cognitive load increases.

Ianjit · 2026-01-01T23:59:26 1767311966

Meta internal study showed a 6-12% productivity uplift.

https://youtu.be/1OzxYK2-qsI?si=8Tew5BPhV2LhtOg0

insin · 2026-01-01T22:40:57 1767307257

> - Give up when Claude is hopelessly lost

You love to see "Maybe completely waste my time" as part of the normal flow for a productivity tool

aspenmartin · 2026-01-02T00:09:51 1767312591

That negates everything else? If you have a tool that can boost you for 80% of your work and for the other 20% you just have to do what you’re already doing, is that bad?

shimman · 2026-01-02T01:45:44 1767318344

There's a reason why sunk cost IS a fallacy and not a sound strategy.

Ianjit · 2026-01-01T23:56:35 1767311795

The productivity uplift is massive, Meta got a 6-12% productivity uplift from AI coding!

https://youtu.be/1OzxYK2-qsI?si=8Tew5BPhV2LhtOg0

ben_w · 2026-01-02T10:43:33 1767350613

> You then have massive hardware buildouts and improvements to stack + a ton of R&D + competition to squeeze the juice out of the current paradigm (there are 4 orders of magnitude of scaling left before we hit real bottlenecks)

This is a surprising claim. There's only 3 orders of magnitude between US data centre electricity consumption and worldwide primary energy (as in, not just electricity) production. Worldwide electricity supply is about 3/20ths of world primary energy, so without very rapid increases in electricity supply there's really only a little more than 2 orders of magnitude growth possible in compute.

Renewables are growing fast, but "fast" means "will approach 100% of current electricity demand by about 2032". Which trend is faster, growth of renewable electricity or growth of compute? Trick question, compute is always constrained by electricity supply, and renewable electricity is growing faster than anything else can right now.

aspenmartin · 2026-01-02T13:31:25 1767360685

This is not my own claim, it’s based on the following analysis from Epoch: https://epoch.ai/blog/can-ai-scaling-continue-through-2030

But I forgot how old that article is: it’s 4 orders of magnitude past GPT-4 in terms of total compute which is I think only 3.5 orders of magnitude from where we are today (based on 4.4x scaling/yr)

nsxwolf · 2026-01-01T20:28:04 1767299284

The nature of my job has always been fighting red tape, process, and stake holders to deploy very small units of code to production. AI really did not help with much of that for me in 2025.

I'd imagine I'm not the only one who has a similar situation. Until all those people and processes can be swept away in favor of letting LLMS YOLO everything into production, I don't see how that changes.

aspenmartin · 2026-01-01T20:41:12 1767300072

No I think that's extremely correct. I work at a MAANG where we have the resources to hook up custom internal LLMs and agents to actually deal with that but that is unique to an org of our scale.

jillesvangurp · 2026-01-01T09:37:53 1767260273

2025 was the year of development tool using AI agents. I think we'll shift attention to non development tool using AI agents. Most business users are still stuck using chat gpt as some kind of grand oracle that will write their email or powerpoint slides. There are bits and pieces of mostly technology demo level solutions but nothing that is widely used like AI coding tools are so far. I don't think this is bottle necked on model quality.

I don't need an AGI. I do need a secretary type agent that deals with all the simple but yet laborious non technical tasks that keep infringing on my quality engineering time. I'm CTO for a small startup and the amount of non technical bullshit that I need to deal with is enormous. Some examples of random crap I deal with: figuring out contracts, their meaning/implication to situations, and deciding on a course of action; Customer offers, price calculations, scraping invoices from emails and online SAAS accounts, formulating detailed replies to customer requests, HR legal work, corporate bureaucracy, financial planning, etc.

A lot of this stuff can be AI assisted (and we get a lot of value out of ai tools for this) but context engineering is taking up a non trivial amount of my time. Also most tools are completely useless at modifying structured documents. Refactoring a big code base, no problem. Adding structured text to an existing structured document, hardest thing ever. The state of the art here is an ff-ing sidebar that will suggest you a markdown formatted text that you might copy/paste. Tool quality is very primitive. And then you find yourself just stripping all formatting and reformatting it manually. Because the tools really suck at this.

arcatech · 2026-01-01T14:42:48 1767278568

> Some examples of random crap I deal with: figuring out contracts, their meaning/implication to situations, and deciding on a course of action

This doesn’t sound like bullshit you should hand off to an AI. It sounds like stuff you would care about.

jillesvangurp · 2026-01-01T15:48:14 1767282494

I do care about it; kind of my duty as a co-founder. Which is why I'm spending double digit percentages of my time doing this stuff. But I absolutely could use some tools to cut down on a lot of the drudgery that is involved with this. And me reading through 40 pages of dense legal German isn't one of my strengths since I 1) do not speak German 2) am not a lawyer and 3) am not necessarily deeply familiar with all the bureaucracy, laws, etc.

But I can ask intelligent questions about that contract from an LLM (in English) and shoot back and forth a few things, come up with some kind of action plan, and then run it by our laywers and other advisors.

That's not some kind of hypothetical thing. That's something that happened multiple times in our company in the last few months. LLMs are very empowering for dealing with this sort of thing. You still need experts for some stuff. But you can do a lot more yourself now. And as we've found out, some of the "experts" that we relied on in the past actually did a pretty shoddy job. A lot of this stuff was about picking apart the mess they made and fixing it.

As soon as you start drafting contracts, it gets a lot harder. I just went through a process like that as well. It involves a lot of manual work that is basically about formatting documents, drafting text, running pdfs and text snippets through chat gpt for feedback, sparring, criticism, etc. and iterating on that. This is not about vibe coding some contract but making sure every letter of a contract is right. That ultimately involves lawyers and negotiating with other stakeholders but it helps if you come prepared with a more or less ready to sign off on document.

It's not about handing stuff off but about making LLMs work for you. Just like with coding tools. I care about code quality as well. But I still use the tools to save me a lot of time.

simonw · 2026-01-01T16:07:49 1767283669

One of the lessons I learned running a startup is that it doesn't matter how good the professionals you hire are for things like legal and accounting, you still need to put work in yourself.

Everyone makes mistakes and misses things, and as the co-founder you have to care more about the details than anyone else does.

I would have loved to have weird-unreliable-paralegal-Claude available back when I was doing that!

nrclark · 2026-01-01T14:47:41 1767278861

Agree. Even asking it can anchor your thinking.

jennyholzer3 · 2026-01-02T14:16:09 1767363369

you don't need AGI, you need human labor

topaztee · 2026-01-01T15:55:00 1767282900

`Also most tools are completely useless at modifying structured documents`

we built a tool for this for the life science space and are opening it up to the general public very soon. Email me I can give you access (topaz at vespper dot com)

utopiah · 2026-01-01T08:20:36 1767255636

> All these improvement in a single year

> hard to argue against the value of current AI

> People are willing to pay $200 per month, and it is getting $1B dollar runway already.

Those are 3 different things. There can be a LOT of fast and significant improvements but still remain extremely far from the actual goal, so far it looks like actually little progress.

People pay for a lot of things, including snake oil, so convincing a lot of people to pay a bit is not in itself a proof of value, especially when some people are basically coerced into this, see how many companies changed their "strategy" to mandating AI usage internally, or integration for a captive audience e.g. Copilot.

Finally yes, $1B is a LOT of money for you and I... but for the largest corporations it's actually not a lot. For reference Google earned that in revenue... per day in 2023. Anyway that's still a big number BUT it still has to be compared with, well how much does OpenAI burn. I don't have any public number on that but I believe the consensus is that it's a lot. So until we know that number we can't talk about an actual runway.

aspenmartin · 2026-01-01T20:23:06 1767298986

> People pay for a lot of things, including snake oil, so convincing a lot of people to pay a bit is not in itself a proof of value

But do you really believe e.g. Claude code is snake oil? I pay $200 / month for Claude, which is something I would have thought monumentally insane maybe 1-2 years ago (e.g. when ChatGPT came out with their premium subscription price I thought that seemed so out of touch). I don't think we would be seeing the subscription rates and the retention numbers if it really was snake oil.

> Finally yes, $1B is a LOT of money for you and I... but for the largest corporations it's actually not a lot. For reference Google earned that in revenue... per day in 2023. Anyway that's still a big number BUT it still has to be compared with, well how much does OpenAI burn. I don't have any public number on that but I believe the consensus is that it's a lot. So until we know that number we can't talk about an actual runway.

this gets brought up a lot but I'm not sure I understand why folks on a forum called YCombinator, a startup accelerator, would make this sound like an obvious sign of charlatanism; operating at a loss is nothing new and anthropic / openAI strategy seems perfectly rational: they are scaling and capturing market share, and TAM is insane.

pjc50 · 2026-01-01T09:39:03 1767260343

Investing a trillion dollars for a revenue of a billion dollars doesn't sound great yet.

signatoremo · 2026-01-02T08:03:11 1767340991

Companies benefiting from trillion dollars spent during the dotcom era certainly make more than a billion dollars, for the last 20 years.

Intellectual dishonesty is certainly rampant on HN.

steveBK123 · 2026-01-01T14:11:55 1767276715

Indeed, its the old Uber playbook at nearly two extra orders of magnitude.

It is a large enough number to simply run out of private capital to consume before it turns cash flow positive.

Lots of things sell well if sold at such a loss. I’d take a new Ferrari for $2500 if it was on offer.

pjc50 · 2026-01-01T19:34:11 1767296051

Did Uber actually do a lot of capital investment? They don't own the cars, for example.

shimman · 2026-01-02T01:46:45 1767318405

Uber nakedly broke the law and beat down labor, I'm honestly shocked none of the executives went to prison.

simonw · 2026-01-01T19:55:45 1767297345

I believe they spent a huge amount of money on incentives to help sign up drivers, and discounts to help attract customers.

pjc50 · 2026-01-02T01:08:16 1767316096

Yes, but that's loss leader rather than capital investment. You can't put a customer on the balance sheet and depreciate them. Once you've paid for a free ride, you own nothing tangible.

aoeusnth1 · 2026-01-01T18:56:12 1767293772

You say that as if Uber's playbook didn't work. Try this: https://www.google.com/finance/quote/UBER:NYSE

derwiki · 2026-01-01T14:56:02 1767279362

Uber’s playbook worked for Uber

layer8 · 2026-01-01T22:13:41 1767305621

> Every single part of the hardware stack are being fused with money and demand. The last time we have this was Post-PC / Smartphone era which drove the hardware industry forward for 10 - 15 years. The current AI can at least push hardware for another 5 - 6 years while pulling forward tech that was initially 8 - 10 years away.

It’s very unclear how much end-consumer hardware and DIY builders will benefit from that, as opposed to server-grade hardware that only makes sense for the enterprise marker. It could have the opposite effect, like hardware manufacturers leaving the consumer market (as in the case of Micron), because there’s just not that much money in it.

HumblyTossed · 2026-01-01T19:05:29 1767294329

It's a great tool, but right now it's only being used to feed the greed.

>> Again, I guess no one knew AI would be as big as it is today, and it is only just started.

People have been saying similar about self driving cars for years now. "AI" is another one of those expensive ideas that we'll get 85% of the way there and then to get the other 15% will be way more expensive than anyone will want to pay for. It's already happening - HW prices and electricity - people are starting to ask, "if I put more $ into this machine, when am I actually going to start getting money out?" The "true believers" are like, soon! But people are right to be hugely skeptical.

jliptzin · 2026-01-01T19:45:28 1767296728

There are some things it's really great at. For example, handling a css layout. If we have to spend trillions of dollars and get nothing else out of it other than being able to vertically center a <div> without wrestling with css and wanting to smash the keyboard in the process, it will all have been worth it.

falkensmaize · 2026-01-02T05:15:42 1767330942

Not to be cheeky, but isn’t this just

display: flex; align-items:center;

now?

aspenmartin · 2026-01-01T20:28:23 1767299303

I agree -- skepticism is totally healthy. And there are so many great ways to poke holes in the true underlying narratives (not the headlines that people seem to pull from). E.g. evaluation science is a wasteland (not for wont of very smart people trying very hard to get them right). How do we tackle the power requirements in a way that is sustainable? Etc. etc.

But stuff like this im not sure I understand:

> It's a great tool, but right now it's only being used to feed the greed.

if its a great tool, then how is it _only_ being used to "feed the greed" and what do you mean by that?

Also I think folks are quick to make analogies to other points in history: "AI is like the dot com boom we're going to crash and burn" and "AI is like {self driving cars, crypto, etc} and the promises will all be broken, its all hype" but this removes the nuance: all of these things are extremely different with very specific dynamics that in _some_ ways may be similar but in many crucial and important ways are completely different.

HumblyTossed · 2026-01-02T00:49:07 1767314947

>> if its a great tool, then how is it _only_ being used to "feed the greed" and what do you mean by that?

Look around?

signatoremo · 2026-01-02T08:06:30 1767341190

You meant someone using Claude Code is greedy?

aspenmartin · 2026-01-02T01:02:20 1767315740

Very confused, I still don’t know what you mean at all

coffeebeqn · 2026-01-01T08:03:40 1767254620

Seems like Nvidia will be focusing on the super beefy GPUs and leaving the consumer market to a smaller player

Flow · 2026-01-01T09:55:34 1767261334

I don't get why Nvidia can't do both? Is it because of the limited production capabilities of the factories?

ACCount37 · 2026-01-01T10:20:22 1767262822

Yes. If you're bottlenecked on silicon and secondaries like memory, why would you want to put more of those resources into lower margin consumer products if you could use those very resources to make and sell more high margin AI accelerators instead?

From a business standpoint, it makes some sense to throttle the gaming supply some. Not to the point of surrendering the market to someone else probably, but to a measurable degree.

ksec · 2026-01-01T12:42:57 1767271377

We will have to wait and see but my bet is that Nvidia will move to Leading Edge node N2 earlier now they have the Margin to work with. Both Hopper and Blackwell were too late in the design cycle. The AI hype and continue to buy the latest and great leaving Gaming at a mainstream node.

Nvidia using Mainstream node has always been the norm considering most Fab capacity always goes to Mobile SoC first. But I expect the internet / gamers will be angry anyway because Nvidia does not provide them with the latest and greatest.

In reality the extra R&D cost for designing with leading edge will be amortised by all the AI order which give Nvidia competitive advantage at the consumer level when they compete. That is assuming there are competition because most recent data have shown Nvidia owning 90%+ of discreet market share, 9% for AMD and 1% for Intel.

_s · 2026-01-01T08:20:18 1767255618

AMD owns a lot of the consumer market already; handhelds, consoles, desktop rigs and mobile ... they are not a small player.

ac29 · 2026-01-02T09:05:51 1767344751

Intel's client computing revenue was greater than AMD's entire revenue last quarter

utopiah · 2026-01-01T08:21:28 1767255688

They said "smaller" not small.

chias · 2026-01-01T08:31:40 1767256300

These are not all improvements. Listed:

* The year of YOLO and the Normalization of Deviance

* The year that Llama lost its way

* The year of alarmingly AI-enabled browsers

* The year of the lethal trifecta

* The year of slop

* The year that data centers got extremely unpopular

Y_Y · 2026-01-01T18:33:14 1767292394

Not that YOLO, PJ Reddie released that in 2015

mbesto · 2026-01-01T15:26:09 1767281169

Said differently - the year we start to see all of the externalities of a globally scaled hyped tech trend.

steveBK123 · 2026-01-01T14:16:39 1767276999

> * The year that data centers got extremely unpopular

I was discussing the political angle with a friend recently. I think Big Tech Bro / VC complex has done themselves a big disservice by aligning so tightly with MAGA to the point AI will be a political issue in 2026 & 2028.

Think about the message they’ve inadvertently created themselves - AI is going to replace jobs, it’s pushing electric prices up, we need the government to bail us out AND give us a regulatory light touch.

Super easy campaign for Dems - big tech trumpers are taking your money, your jobs, causing inflation, and now they want bailouts !!

Atomic_Torrfisk · 2026-01-01T16:06:39 1767283599

> People are willing to pay $200 per month

Some people are of course, but how many?

> ... People are willing to pay $200 per month

This is just low-key hype. Careful with your portfolio...

ACCount37 · 2026-01-01T10:21:24 1767262884

Is the AI progress in 2025 an outstanding breakthrough? Not really. It's impressive but incremental.

Still, the gap between the capabilities of a cutting edge LLM and that of a human is only this wide. There are only this many increments it takes to cross it.

belter · 2026-01-01T19:26:12 1767295572

>> But it is hard to argue against the value of current AI, which many of the vocal critics on HN seems to have the opinion of.

What is the concrete business case? Can anyone point to a revenue producing company using AI in production, and where AI is a material driver of profits?

Tool vendors don’t count. I’m not interested in how much money is being made selling shovels...show me a miner who actually struck gold please.

tim333 · 2026-01-02T00:13:13 1767312793

A lot of programmers seem willing to pay for the likes of Claude Code, presumably because it helps them get more done. Programmers cost money so that's a potential cost saving?

tstrimple · 2026-01-01T10:02:12 1767261732

[flagged]

cherryteastain · 2026-01-01T10:54:36 1767264876

Sam Altman [1] certainly seems to talk about AGI quite a bit

[1] https://blog.samaltman.com/reflections

ACCount37 · 2026-01-01T10:30:14 1767263414

Honestly, I wouldn't be surprised if a system that's an LLM at its core can attain AGI. With nothing but incremental advances in architecture, scaffolding, training and raw scale.

Mostly the training. I put less and less weight on "LLMs are fundamentally flawed" and more and more of it on "you're training them wrong". Too many "fundamental limitations" of LLMs are ones you can move the needle on with better training alone.

The foundation of LLM is flexible and capable, and the list of "capabilities that are exclusive to human mind" is ever shrinking.

tim333 · 2026-01-01T15:27:48 1767281268

They seem to be missing a bit on learning as you go and thinking about things and getting new insights.

ACCount37 · 2026-01-01T23:03:32 1767308612

In-context learning and reasoning cover that already, and you could expand on that. Nothing prevents an LLM from fine-tuning itself either, other than its own questionable fine tuning skills and the compute budget.

HarHarVeryFunny · 2026-01-01T15:28:18 1767281298

That depends on how you define AGI - it's a meaningless term to use since everyone uses it to mean different things. What exactly do you mean ?!

Yes, there is a lot that can be improved via different training, but at what point is it no longer a language model (i.e. something that auto-regressively predicts language continuations)?

I like to use an analogy to the children's "Stone Soup" story whereby a "stone soup" (starting off as a stone in a pot of boiling water) gets transformed into a tasty soup/stew by strangers incrementally adding extra ingredients to "improve the flavor" - first a carrot, then a bit of beef, etc. At what point do you accept that the resulting tasty soup is not in fact stone soup?! It's like taking an auto-regressively SGD-trained Transformer, and incrementally tweaking the architecture, training algorithm, training objective, etc, etc. At some point it becomes a bit perverse to choose to still call it a language model

Some of the "it's just training" changes that would be needed to make today's LLMs more brain-like may be things like changing the training objective completely from auto-regressive to predicting external events (with the goal of having it be able to learn the outcomes of it's own actions, in order to be able to plan them), which to be useful would require the "LLM" to then be autonomous and act in some (real/virtual) world in order to learn.

Another "it's just training" change would be to replace pre/mid/post-training with continual/incremental runtime learning to again make the model more brain-like and able to learn from it's own autonomous exploration of behavior/action and environment. This is a far more profound, and ambitious, change than just fudging incremental knowledge acquisition for some semblance of "on the job" learning (which is what the AI companies are currently working on).

If you put these two "it's just training/learning" enhancements together then you've now got something much more animal/human-like, and much more capable than an LLM, but it's already far from a language model - something that passively predicts next word every time you push the "generate next word" button. This would now be an autonomous agent, learning how to act and control/exploit the world around it. The whole pre-trained, same-for-everyone, model running in the cloud, would then be radically different - every model instance is then more like an individual learning based on it's own experience, and maybe you're now paying for compute for the continual learning compute rather than just "LLM tokens generated".

These are "just" training (and deployment!) changes, but to more closely approach human capability (but again, what to you mean by "AGI"?) there would also need to be architectural changes and additions to the "Transformer" architecture (add looping, internal memory, etc), depending on exactly how close you want to get to human/animal capability.

ACCount37 · 2026-01-02T11:23:55 1767353035

> which to be useful would require the "LLM" to then be autonomous and act in some (real/virtual) world in order to learn.

You described modern RLVR for tasks like coding. Plug an LLM into a virtual env with a task. Drill it based on task completion. Force it to get better at problem-solving.

It's still an autoregressive next token prediction engine. 100% LLM, zero architectural changes. We just moved it past pure imitation learning and towards something else.

HarHarVeryFunny · 2026-01-02T13:34:01 1767360841

Yes, if all you did was replace current pre/mid/post training with a new (elusive holy grail) runtime continual learning algorithm, then it would definitely still just be a language model. You seem to be talking about it having TWO runtime continual learning algorithms, next-token and long-horizon RL, but of course RL is part of what we're calling an LLM.

It's not obvious if you just did this without changing the learning objective from self-prediction (auto-regressive) to external prediction whether you'd actually gain much capability though. Auto-regressive training is what makes LLMs imitators - always trying to do same as before.

In fact, if you did just let a continual learner autonomously loose in some virtual environment, why would you expect it do do anything different, other than continual learning from whatever it was exposed to in the environment, from putting a current LLM in a loop, together with tool use as a way to expose it to new data? An imitative (auto-regressive) LLM doesn't have any drive to do anything new - if you just keep feeding it's own output back in as an input, then it's basically just a dynamical system that will eventually settle down into some attractor states representing the closure of the patterns it has learnt and is generating.

If you want the model to behave in a more human/animal-like self-motivated agentic fashion, then I think the focus has to be on learning how to act to control and take advantage of the semi-predictable environment, which is going to be based on having predicting the environment as the learning objective (vs auto-regressive), plus some innate drives (curiosity, boredom, etc) to bias behavior to maximize learning and creative discovery.

Continual learning also isn't going to magically solve the RL reward problem (how do you define and measure RL rewards in the general, non-math/programming, case?). In fact post-training is a very human-curated affair since humans have identified math and programming as tasks where this works and have created these problem-specific rewards. If you wanted the model to discover it's own rewards at runtime, as part of your new runtime RL algorithm perhaps, then you'd have to figure how to bake that into the architecture.

ACCount37 · 2026-01-02T18:07:57 1767377277

No. There are no architectural changes and no "second runtime learning algorithm". There's just the good old in-context learning that all LLMs get from pre-training. RLVR is a training stage that pressures the LLM to take advantage of it on real tasks.

"Runtime continual learning algorithm" is an elusive target of questionable desirability - given that we already have in-context learning, and "get better at SFT and RLVR lmao" is relatively simple to pull off and gives kickass gains in the here and now.

I see no reason why "behave in a more human/animal-like self-motivated agentic fashion" can't be obtained from more RLVR, if that's what you want to train your LLMs for.

tantony · 2026-01-02T18:57:29 1767380249

Claude Opus 4.5 has been a big step up for me personally, and I used to think Sonnet 3.5 was good. It is an amazing deal at $20.

Just yesterday, it helped me parse out and understand a research paper - complete with step-by-step examples (this one: https://research.nvidia.com/sites/default/files/pubs/2016-03...). I will now go ahead and implement it myself, possibly relegating some of the more grunt-work type tasks to Claude code.

Without it, I would have been struggling through the paper for days, wading through WGSL shader code and there would be a higher chance that I just give up on it since this is for a side project and not my $job.

It has been a major force multiplier just for learning things. I have had a $20 subscription for about a year now. I bump it up to the $100 plan if I happen to be working on some project that eats through the $20 allocation. This happens to be one such month. I will probably go back to the $20 plan after this month. I continue to get a lot of value out of it.

andai · 2026-01-01T09:06:01 1767258361

Re: yolo mode

I looked into docker and then realized the problem I'm actually trying to solve was solved in like 1970 with users and permissions.

I just made a agent user limited to its own home folder, and added my user to its group. Then I run Claude code etc as the agent user.

So it can only read write /home/agent, and it cannot read or write my files.

I add myself to agent group so I can read/write the agent files.

I run into permission issues sometimes but, it's pretty smooth for the most part.

Oh also I gave it root to a $3 VPS. It's so nice having a sysadmin! :) That part definitely feels a bit deviant though!

staeff777 · 2026-01-01T18:16:58 1767291418

I really like this idea and just tried some steps for myself. create user with homedir: sudo useradd -m agent add myself to agent group: sudo usermod -a -G agent $USER

Allow agent group to agent home dir: sudo chmod -R 770 /home/agent

Start a new shell with the group (or login/logoff): newgrp agent Now you should be able to change into the agent home.

Allow your user to sudo as agent: echo "$USER ALL=(agent) NOPASSWD: ALL" |sudo tee -a /etc/sudoers.d/$USER-as-agent now you can start your agent using sudo: sudo -u agent your_agent

works nice.

jillesvangurp · 2026-01-01T09:20:30 1767259230

I use a qemu vm for running codex cli in yolo mode and use simple ssh based git operations for getting code in and out of there. Works great. And you can also do fun things like let it loose on multiple git projects in one prompt. The vm can run docker as well which helps with containerized tests and other more complicated things. One thing I've started to observe is that you spend more time waiting for tool execution than for model inference. So having a fast local vm is better than a slower remote one.

andai · 2026-01-01T19:34:14 1767296054

Re: yolo mode

https://markdownpastebin.com/?id=1ef97add6ba9404b900929ee195...

My notes from back when I set this up! Includes instructions for using a GUI file explorer as the agent user. As well as setting up a systemd service to fix the permissions automatically.

(And a nice trick which shows you which GUI apps are running as which user...)

However, most of these are just workarounds for the permission issue I kept running into, which is that Claude Code would for some reason create files with incorrect permissions so that I couldn't read or write those files from my normal account.

If someone knows how to fix that, or if someone at Anthropic is reading, then most of this Rube Goldberg machine becomes unnecessary :)

some_developer · 2026-01-01T11:42:13 1767267733

Docker in docker, with opencode.

Opencode plus some scripts on host and in its container works well to run yolo and only see what it needs (via mounting). Has git tools but can't push etc. is thought how to run tests with the special container-in-container setup.

Including pre-configured MCPs, skills, etc.

The best part is that it just works for everyone on the team, big plus.

knicholes · 2026-01-01T16:07:26 1767283646

cgroups and namespaces

ogou · 2026-01-01T03:07:43 1767236863

This is a good tooling survey of the past year. I have been watching it as a developer re-entering the job market. The job descriptions closely parallel the timeline used in the post. That's bizarre to me because these approaches are changing so fast. I see jobs for "Skill and Langchain experts with production-grade 0>1 experience. Former founders preferred". That is an expertise that is just a few months old and startups are trying to build whole teams overnight with it. I'm sure January and February will have job postings for whatever gets released that week. It's all so many sand castles.

weatherlite · 2026-01-01T06:58:56 1767250736

> Skill and Langchain experts with production-grade 0>1 experience.

Also , it's just normal backend work - calling a bunch of APIs. What am I missing here?

XenophileJKO · 2026-01-01T08:52:54 1767257574

That is like saying training tensorflow models is just calling some APIs.

Actually making a system like this work seems easy, but isn't really.

(Though with the CURRENT generation or two of models it has gotten "pretty easy" I think. Before that, not so much.)

weatherlite · 2026-01-01T12:45:51 1767271551

No idea about training tenserflow models - is it super complex or is it just calling a couple of APIs ? Langchain is literally calling an API. Maybe you need to get good with prompting or whatever, but I don't see where the complexity lies. Please let me know.

andy99 · 2026-01-01T14:36:04 1767278164

Having used both Tensorflow (though I expect they mean PyTorch which is way more popular, and I have also used) and langchain, they are nothing alike.

They he ML frameworks are much closer to implementing the mathematics of neural networks, with some abstractions but much closer to the linear algebra level. It requires an understanding of the underlying theory.

Langchain is a suite of convenience functions for composing prompts to LLMs. I wouldn’t consider there to be some real domain knowledge one would need to use it. There is a learning curve but it’s about learning the different components rather than learning a whole new academic discipline.

HarHarVeryFunny · 2026-01-01T16:36:30 1767285390

There's a big difference between building an ML framework like Tensorflow or PyTorch (I built a Lua Torch-like one in C++ myself) and just using it to build/train a model.

Building the model may range from very simple if you are just recreating a standard architecture, or be a research endeavor if you are designing something completely new.

The difficulty/complexity of then training the model depends on what it is. For something simple like a CNN for image recognition, it's really just a matter of selecting a few hyperparameters and letting it rip. At the other end of the spectrum you've got LLMs where training (and coping with instabilities) is something of a black art, with RL training completely different from pre-training, and there is also the issue of designing/discovering a pre/mid/post training curriculum.

But anyways, the actual training part can be very simple, not requiring too much knowledge of what's going on under the hood, depending on the model.

ogou · 2026-01-01T22:50:45 1767307845

You're right, none of these new tools are disciplines. They are vendor specific approaches that are very recent. That's part of my overall point. Who is out there with 2+ years of very narrow tooling experience at another company at a senior level and is available for a rando startup (or desparate enterprise looking for bolt-on AI features) at a fraction of the pay? Not many, I'm sure. We can level up, do training, and maybe stand up a demo project. But that won't satisfy an ATS scan. It's unrealistic.

jennyholzer3 · 2026-01-02T14:18:08 1767363488

LLM addicts are cult members. Proficiency with buzz words is used to demonstrate status within the cult.

walthamstow · 2026-01-01T08:24:06 1767255846

Buzzwords.

waldrews · 2026-01-01T01:03:16 1767229396

Remember, back in the day, when a year of progress was like, oh, they voted to add some syntactic sugar to Java...

nrhrjrjrjtntbt · 2026-01-01T04:42:19 1767242539

More like 6 different new nosql databases and js frameworks.

dotancohen · 2026-01-01T06:18:38 1767248318

A Wordpress zero day and Linux not on the desktop. Netcraft confirms it.

crystal_revenge · 2026-01-01T05:04:11 1767243851

That must have been a long time back. Having lived through the time when web pages were served through CGI and mobile phones only existed in movies, when SVMs where the new hotness in ML and people would write about how weird NNs were, I feel like I've seen a lot more concrete progress in the last few decades than this year.

This year honestly feels quite stagnant. LLMs are literally technology that can only reproduce the past. They're cool, but they were way cooler 4 years ago. We've taken big ideas like "agents" and "reinforcement learning" and basically stripped them of all meaning in order to claim progress.

I mean, do you remember Geoffrey Hinton's RBM talk at Google in 2010? [0] That was absolutely insane for anyone keeping up with that field. By the mid-twenty teens RBMs were already outdated. I remember when everyone was implementing flavors of RNNs and LSTMs. Karpathy's character 2015 RNN project was insane [1].

This comment makes me wonder if part of the hype around LLMs is just that a lot of software people simply weren't paying attention to the absolutely mind-blowing progress we've seen in this field for the last 20 years. But even ignoring ML, the world's of web development and mobile application development have gone through incredible progress over the last decade and a half. I remember a time when JavaScript books would have a section warning that you should never use JS for anything critical to the application. Then there's the work in theorem provers over the last decade... If you remember when syntactic sugar was progress, either you remember way further back than I do, or you weren't paying attention to what was happening in the larger computing world.

0. https://www.youtube.com/watch?v=VdIURAu1-aU

1. https://karpathy.github.io/2015/05/21/rnn-effectiveness/

HarHarVeryFunny · 2026-01-01T19:45:21 1767296721

> LLMs are literally technology that can only reproduce the past.

That's incorrect on many levels. They are drawing upon, and reproducing, language patterns from "the past", but they are combining those patterns in ways that may have never have been seen before. They may not be truly creative, but they are still capable of generating novel outputs.

> They're cool, but they were way cooler 4 years ago.

Maybe this year has been more about incremental progress with LLMs than the shock/coolness factor of talking to an LLM for the first time, but the utility of them, especially for programming, has dramatically increased this year, really in the last 6 months.

The improvement in "AI" image and video generation has also been impressive, to the point now that fake videos on YouTube can often only be identified as such by common sense rather that the fact that they don't look real.

Incremental improvement can often be more impressive that innovation, whose future importance can be hard to judge when it first appears. How many people read "Attention is all you need" in 2017 and thought "Wow! This is going to change the world!". Not even the authors of the paper thought that.

handoflixue · 2026-01-01T05:10:13 1767244213

> LLMs are literally technology that can only reproduce the past.

Funny, I've used them to create my own personalized text editor, perfectly tailored to what I actually want. I'm pretty sure that didn't exist before.

It's wild to me how many people who talk about LLM apparently haven't learned how to use them for even very basic tasks like this! No wonder you think they're not that powerful, if you don't even know basic stuff like this. You really owe it to yourself to try them out.

crystal_revenge · 2026-01-01T05:29:42 1767245382

> You really owe it to yourself to try them out.

I've worked at multiple AI startups in lead AI Engineering roles, both working on deploying user facing LLM products and working on the research end of LLMs. I've done collaborative projects and demos with a pretty wide range of big names in this space (but don't want to doxx myself too aggressively), have had my LLM work cited on HN multiple times, have LLM based github projects with hundreds of stars, appeared on a few podcasts talking about AI etc.

This gets to the point I was making. I'm starting to realize that part of the disconnect between my opinions on the state of the field and others is that many people haven't really been paying much attention.

I can see if recent LLMs are your first intro to the state of the field, it must feel incredible.

threethirtytwo · 2026-01-01T10:56:47 1767265007

Over half of HN still thinks it’s a stochastic parrot and that it’s just a glorified google search.

The change hit us so fast a huge number of people don’t understand how capable it is yet.

Also it certainly doesn’t help that it still hallucinates. One mistake and it’s enough to set someone against LLMs. You really need to push through that hallucinations are just the weak part of the process to see the value.

CamperBob2 · 2026-01-01T17:24:51 1767288291

The problem I see, over and over, is that people pose poorly-formed questions to the free ChatGPT and Google models, laugh at the resulting half-baked answers that are often full of errors and hallucinations, and draw conclusions about the technology as a whole.

Either that, or they tried it "last year" or "a while back" and have no concept of how far things have gone in the meantime.

It's like they wandered into a machine shop, cut off a finger or two, and concluded that their grandpa's hammer and hacksaw were all anyone ever needed.

habinero · 2026-01-01T21:05:13 1767301513

No, frankly it's the difference between actual engineers and hobbyists/amateurs/non-SWEs.

SWEs are trained to discard surface-level observations and be adversarial. You can't just look at the happy path, how does the system behave for edge cases? Where does it break down and how? What are the failure modes?

The actual analogy to a machine shop would be to look at whether the machines were adequate for their use case, the building had enough reliable power to run and if there were any safety issues.

It's easy to Clever Hans yourself and get snowed by what looks like sophisticated effort or flat out bullshit. I had to gently tell a junior engineer that just because the marketing claims something will work a certain way, that doesn't mean it will.

threethirtytwo · 2026-01-01T21:55:42 1767304542

What you’re describing is just competent engineering, and it’s already been applied to LLMs. People have been adversarial. That’s why we know so much about hallucinations, jailbreaks, distribution shift failures, and long-horizon breakdowns in the first place. If this were hobbyist awe, none of those benchmarks or red-teaming efforts would exist.

The key point you’re missing is the type of failure. Search systems fail by not retrieving. Parrots fail by repeating. LLMs fail by producing internally coherent but factually wrong world models. That failure mode only exists if the system is actually modeling and reasoning, imperfectly. You don’t get that behavior from lookup or regurgitation.

This shows up concretely in how errors scale. Ambiguity and multi-step inference increase hallucinations. Scaffolding, tools, and verification loops reduce them. Step-by-step reasoning helps. Grounding helps. None of that makes sense for a glorified Google search.

Hallucinations are a real weakness, but they’re not evidence of absence of capability. They’re evidence of an incomplete reasoning system operating without sufficient constraints. Engineers don’t dismiss CNC machines because they crash bits. They map the envelope and design around it. That’s what’s happening here.

Being skeptical of reliability in specific use cases is reasonable. Concluding from those failure modes that this is just Clever Hans is not adversarial engineering. It’s stopping one layer too early.

habinero · 2026-01-02T13:18:51 1767359931

> If this were hobbyist awe, none of those benchmarks or red-teaming efforts would exist.

Absolutely not true. I cannot express how strongly this is not true, haha. The tech is neat, and plenty of real computer scientists work on it. That doesn't mean it's not wildly misunderstood by others.

> Concluding from those failure modes that this is just Clever Hans is not adversarial engineering.

I feel like you're maybe misunderstanding what I mean when I refer to Clever Hans. The Clever Hans story is not about the horse. It's about the people.

A lot of people -- including his owner-- were legitimately convinced that a horse could do math, because look, literally anyone can ask the horse questions and it answers them correctly. What more proof do you need? It's obvious he can do math.

Except of course it's not true lol. Horses are smart critters, but they absolutely cannot do arithmetic no matter how much you train them.

The relevant lesson here is it's very easy to convince yourself you saw something you 100% did not see. (It's why magic shows are fun.)

CamperBob2 · 2026-01-02T15:32:53 1767367973

Except of course it's not true lol. Horses are smart critters, but they absolutely cannot do arithmetic no matter how much you train them.

These things are not horses. How can anyone choose to remain so ignorant in the face of irrefutable evidence that they're wrong?

https://arxiv.org/abs/2507.15855

It's as if a disease like COVID swept through the population, and every human's IQ dropped 10 to 15 points while our machines grew smarter to an even larger degree.

jheez3 · 2026-01-02T02:46:35 1767321995

I wish there was a way to discern posts from legit clever people from the not-so.

Its annoying to see posts from people who lag behind in intelligence and just dont get it - people learn at different rates. Some see way further ahead.

threethirtytwo · 2026-01-02T05:26:29 1767331589

A good way to filter is for you to look in the mirror. Only the person in the mirror sees further ahead than anyone else.

CamperBob2 · 2026-01-01T21:13:21 1767302001

You sound pretty certain. There's often good money to be made in taking the contrarian view, where you have insights that the so-called "smart money" lacks. What are some good investments to make in the extreme-bear case, in which we're all just Clever Hans-ing ourselves as you put it? Do you have skin in the game?

habinero · 2026-01-02T11:56:30 1767354990

My dude, I assure you "humans are really good at convincing themselves of things that are not true" is a very, very well known fact. I don't know what kind of arbitrage you think exists in this incredibly anodyne statement lol.

If you want a financial tip, don't short stock and chase market butterflies. Instead, make real professional friends, develop real skills and learn to be friendly and useful.

I made my money in tech already, partially by being lucky and in the right place at the right time, and partially because I made my own luck by having friends who passed the opportunity along.

Hope that helps!

CamperBob2 · 2026-01-01T05:36:55 1767245815

That's all very impressive, to be sure. But are you sure you're getting the point? As of 2025, LLMs are now very good at writing new code, creating new imagery, and writing original text. They continue to improve at a remarkable rate. They are helping their users create things that didn't exist before. Additionally, they are now very good at searching and utilizing web resources that didn't exist at training time.

So it is absurdly incorrect to say "they can only reproduce the past." Only someone who hasn't been paying attention (as you put it) would say such a thing.

windexh8er · 2026-01-01T07:30:50 1767252650

> They are helping their users create things that didn't exist before.

That is a derived output. That isn't new as in: novel. It may be unique but it is derived from training data. LLMs legitimately cannot think and thus they cannot create in that way.

ordersofmag · 2026-01-01T13:46:53 1767275213

I will find this often-repeated argument compelling only when someone can prove to me that the human mind works in a way that isn't 'combining stuff it learned in the past'.

5 years ago a typical argument against AGI was that computers would never be able to think because "real thinking" involved mastery of language which was something clearly beyond what computers would ever be able to do. The implication was that there was some magic sauce that human brains had that couldn't be replicated in silicon (by us). That 'facility with language' argument has clearly fallen apart over the last 3 years and been replaced with what appears to be a different magic sauce comprised of the phrases 'not really thinking' and the whole 'just repeating what it's heard/parrot' argument.

I don't think LLM's think or will reach AGI through scaling and I'm skeptical we're particularly close to AGI in any form. But I feel like it's a matter of incremental steps. There isn't some magic chasm that needs to be crossed. When we get there I think we will look back and see that 'legitimately thinking' wasn't anything magic. We'll look at AGI and instead of saying "isn't it amazing computers can do this" we'll say "wow, was that all there is to thinking like a human".

windexh8er · 2026-01-01T14:35:02 1767278102

> 5 years ago a typical argument against AGI was that computers would never be able to think because "real thinking" involved mastery of language which was something clearly beyond what computers would ever be able to do.

Mastery of words is thinking? In that line of argument then computers have been able to think for decades.

Humans don't think only in words. Our context, memory and thoughts are processed and occur in ways we don't understand, still.

There's a lot of great information out there describing this [0][1]. Continuing to believe these tools are thinking, however, is dangerous. I'd gather it has something to do with logic: you can't see the process and it's non-deterministic so it feels like thinking. ELIZA tricked people. LLMs are no different.

[0] https://archive.is/FM4y8 [0] https://www.theverge.com/ai-artificial-intelligence/827820/l... [1] https://www.raspberrypi.org/blog/secondary-school-maths-show...

CamperBob2 · 2026-01-01T16:59:12 1767286752

Mastery of words is thinking?

That's the crazy thing. Yes, in fact, it turns out that language encodes and embodies reasoning. All you have to do is pile up enough of it in a high-dimensional space, use gradient descent to model its original structure, and add some feedback in the form of RL. At that point, reasoning is just a database problem, which we currently attack with attention.

No one had the faintest clue. Even now, many people not only don't understand what just happened, but they don't think anything happened at all.

ELIZA, ROFL. How'd ELIZA do at the IMO last year?

meindnoch · 2026-01-01T19:31:12 1767295872

So people without language cannot reason? I don't think so.

CamperBob2 · 2026-01-01T19:53:05 1767297185

There's no such thing as people without language, except for infants and those who are so mentally incapacitated that the answer is self-evidently "No, they cannot."

Language is the substrate of reason. It doesn't need to be spoken or written, but it's a necessary and (as it turns out) sufficient component of thought.

windexh8er · 2026-01-01T22:36:00 1767306960

There are quite a few studies to refute this highly ignorant comment. I'd suggest some reading [0].

From the abstract: "Is thought possible without language? Individuals with global aphasia, who have almost no ability to understand or produce language, provide a powerful opportunity to find out. Astonishingly, despite their near-total loss of language, these individuals are nonetheless able to add and subtract, solve logic problems, think about another person’s thoughts, appreciate music, and successfully navigate their environments. Further, neuroimaging studies show that healthy adults strongly engage the brain’s language areas when they understand a sentence, but not when they perform other nonlinguistic tasks like arithmetic, storing information in working memory, inhibiting prepotent responses, or listening to music. Taken together, these two complementary lines of evidence provide a clear answer to the classic question: many aspects of thought engage distinct brain regions from, and do not depend on, language."

[0] https://pmc.ncbi.nlm.nih.gov/articles/PMC4874898/

CamperBob2 · 2026-01-02T01:34:06 1767317646

Yeah, you can prove pretty much anything with a pubmed link. Do dead salmon "think?" fMRI says maybe!

https://pmc.ncbi.nlm.nih.gov/articles/PMC2799957/

The resources that the brain is using to think -- whatever resources those are -- are language-based. Otherwise there would be no way to communicate with the test subjects. "Language" doesn't just imply written and spoken text, as these researchers seem to assume.

emp17344 · 2026-01-02T17:24:06 1767374646

There’s linguistic evidence that, while language influences thought, it does not determine thought - see the failure of the strong Sapir-Whorf hypothesis. This is one of the most widely studied and robust linguistic results - we actually know for a fact that language does not determine or define thought.

CamperBob2 · 2026-01-02T17:53:20 1767376400

How's the replication rate in that field? Last I heard it was below 50%.

How can you think without tokens of some sort? That's half of the question that has to be answered by the linguists. The other half is that if language isn't necessary for reasoning, what is?

We now know that a conceptually-simple machine absolutely can reason with nothing but language as inputs for pretraining and subsequent reinforcement. We didn't know that before. The linguists (and the fMRI soothsayers) predicted none of this.

emp17344 · 2026-01-02T18:00:45 1767376845

Read about linguistic history and make up your own mind, I guess. Or don’t, I don’t care. You’re dismissing a series of highly robust scientific results because they fail to validate your beliefs, which is highly irrational. I'm no longer interested in engaging with you.

windexh8er · 2026-01-01T23:10:11 1767309011

> ELIZA, ROFL. How'd ELIZA do at the IMO last year?

What's funny is the failure to grasp any contextual framing of ELIZA. When it came out people were impressed by it's reasoning, it's responses. And in your line of defense it could think because it had mastery of words!

But fast forward the current timeline 30 years. You will have been of the same camp that argued on behalf of ELIZA when the rest of the world was asking, confusingly: how did people think ChatGPT could think?

CamperBob2 · 2026-01-02T01:52:40 1767318760

No one was impressed with ELIZA's "reasoning" except for a few non-specialist test subjects recruited from the general population. Admittedly it was disturbing to see how strongly some of those people latched onto it.

Meanwhile, you didn't answer my question. How'd ELIZA do on the IMO? If you know a way to achieve gold-medal performance at top-level math and programming competitions without thinking, I for one am all ears.

arcatech · 2026-01-01T14:46:32 1767278792

> I will find this often-repeated argument compelling only when someone can prove to me that the human mind works in a way that isn't 'combining stuff it learned in the past'.

This is the definition of the word ‘novel’.

Kerrick · 2026-01-01T08:02:50 1767254570

That is a pedantic distinction. You can create something that didn't exist by combining two things that did exist, in a way of combining things that already existed. For example, you could use a blender to combine almond butter and sawdust. While this may not be "novel", and it may be derived from existing materials and methods, you may still lay claim to having created something that didn't exist before.

For a more practical example, creating bindings from dynamic-language-A for a library in compiled-language-B is a genuinely useful task, allowing you to create things that didn't exist before. Those things are likely to unlock great happiness and/or productivity, even if they are derived from training data.

windexh8er · 2026-01-01T14:41:52 1767278512

> That is a pedantic distinction. You can create something that didn't exist by combining two things that did exist, in a way of combining things that already existed.

This is the definition of a derived product. Call it a derivative work if we're being pedantic and, regardless, is not any level of proof that LLMs "think".

threethirtytwo · 2026-01-01T16:59:43 1767286783

Pedantic and not true. The LLM has stochastic processes involved. Randomness. That’s not old information. That’s newly generated stuff.

jama211 · 2026-01-01T09:07:20 1767258440

Yeah you’ve lost me here I’m sorry. In the real world humans work with AI tools to create new things. What you’re saying is the equivalent of “when a human writes a book in English, because they use words and letters that already exist and they already know they aren’t creating anything new”.

nl · 2026-01-01T11:09:58 1767265798

What does "think" mean?

Why is that kind of thinking required to create novel works?

Randomness can create novelty.

Mistakes can be novel.

There are many ways to create novelty.

Also I think you might not know how LLMs are trained to code. Pre-training gives them some idea of the syntax etc but that only gets you to fancy autocomplete.

Modern LLMs are heavily trained using reinforcement data which is custom task the labs pay people to do (or by distilling another LLM which has had the process performed on it).

zingar · 2026-01-01T08:07:45 1767254865

Could you give us an idea of what you’re hoping for that is not possible to derive from training data of the entire internet and many (most?) published books?

techpression · 2026-01-01T08:42:22 1767256942

This is the problem, the entire internet is a really bad set of training data because it’s extremely polluted.

Also the derived argument doesn’t really hold, just because you know about two things doesn’t mean you’d be able to come up with the third, it’s actually very hard most of the time and requires you to not do next token prediction.

threethirtytwo · 2026-01-01T11:01:20 1767265280

The emergent phenomenon is that the LLM can separate truth from fiction when you give it a massive amount of data. It can figure the world out just as we can figure it out when we are as well inundated with bullshit data. The pathways exist in the LLM but it won’t necessarily reveal that to you unless you tune it with RL.

ahtihn · 2026-01-01T13:23:23 1767273803

> The emergent phenomenon is that the LLM can separate truth from fiction when you give it a massive amount of data.

I don't believe they can. LLMs have no concept of truth.

What's likely is that the "truth" for many subjects is represented way more than fiction and when there is objective truth it's consistently represented in similar way. On the other hand there are many variations of "fiction" for the same subject.

threethirtytwo · 2026-01-01T16:50:33 1767286233

They can and we have definitive proof. When we tune LLM models with reinforcement learning the models end up hallucinating less and becoming more reliable. Basically in a nut shell we reward the model when telling the truth and punish it when it’s not.

So think of it like this, to create the model we use terabytes of data. Then we do RL which is probably less than one percent of additional data involved in the initial training.

The change in the model is that reliability is increased and hallucinations are reduced at a far greater rate than one percent. So much so that modern models can be used for agentic tasks.

How can less than one percent of reinforcement training get the model to tell the truth greater than one percent of the time?

The answer is obvious. It ALREADY knew the truth. There’s no other logical way to explain this. The LLM in its original state just predicts text but it doesn’t care about truth or the kind of answer you want. With a little bit of reinforcement it suddenly does much better.

It’s not a perfect process and reinforcement learning often causes the model to be deceptive an not necessarily tell the truth but it more gives an answer that may seem like the truth or an answer that the trainer wants to hear. In general though we can measurably see a difference in truthfulness and reliability to an extent far greater than the data involved in training and that is logical proof it knows the difference.

Additionally while I say it knows the truth already this is likely more of a blurry line. Even humans don’t fully know the truth so my claim here is that an LLM knows the truth to a certain extent. It can be wildly off for certain things but in general it knows and this “knowing” has to be coaxed out of the model through RL.

Keep in mind the LLM is just auto trained on reams and reams of data. That training is massive. Reinforcement training is done on a human basis. A human must rate the answers so it is significantly less.

habinero · 2026-01-01T20:46:45 1767300405

> The answer is obvious. It ALREADY knew the truth. There’s no other logical way to explain this.

I can think of several offhand.

1. The effect was never real, you've just convinced yourself it is because you want it to be, ie you Clever Hans'd yourself.

2. The effect is an artifact of how you measure "truth" and disappears outside that context ("It can be wildly off for certain things")

3. The effect was completely fabricated and is the result of fraud.

If you want to convince me that "I threatened a statistical model with a stick and it somehow got more accurate, therefore it's both intelligent and lying" is true, I need a lot less breathless overcredulity and a lot more "I have actively tried to disprove this result, here's what I found"

threethirtytwo · 2026-01-01T21:46:12 1767303972

You asked for something concrete, so I’ll anchor every claim to either documented results or directly observable training mechanics.

First, the claim that RLHF materially reduces hallucinations and increases factual accuracy is not anecdotal. It shows up quantitatively in benchmarks designed to measure this exact thing, such as TruthfulQA, Natural Questions, and fact verification datasets like FEVER. Base models and RL-tuned models share the same architecture and almost identical weights, yet the RL-tuned versions score substantially higher. These benchmarks are external to the reward model and can be run independently.

Second, the reinforcement signal itself does not contain factual information. This is a property of how RLHF works. Human raters provide preference comparisons or scores, and the reward model outputs a single scalar. There are no facts, explanations, or world models being injected. From an information perspective, this signal has extremely low bandwidth compared to pretraining.

Third, the scale difference is documented by every group that has published training details. Pretraining consumes trillions of tokens. RLHF uses on the order of tens or hundreds of thousands of human judgments. Even generous estimates put it well under one percent of the total training signal. This is not controversial.

Fourth, the improvement generalizes beyond the reward distribution. RL-tuned models perform better on prompts, domains, and benchmarks that were not part of the preference data and are evaluated automatically rather than by humans. If this were a Clever Hans effect or evaluator bias, performance would collapse when the reward model is not in the loop. It does not.

Fifth, the gains are not confined to a single definition of “truth.” They appear simultaneously in question answering accuracy, contradiction detection, multi-step reasoning, tool use success, and agent task completion rates. These are different evaluation mechanisms. The only common factor is that the model must internally distinguish correct from incorrect world states.

Finally, reinforcement learning cannot plausibly inject new factual structure at scale. This follows from gradient dynamics. RLHF biases which internal activations are favored, it does not have the capacity to encode millions of correlated facts about the world when the signal itself contains none of that information. This is why the literature consistently frames RLHF as behavior shaping or alignment, not knowledge acquisition.

Given those facts, the conclusion is not rhetorical. If a tiny, low-bandwidth, non-factual signal produces large, general improvements in factual reliability, then the information enabling those improvements must already exist in the pretrained model. Reinforcement learning is selecting among latent representations, not creating them.

You can object to calling this “knowing the truth,” but that’s a semantic move, not a substantive one. A system that internally represents distinctions that reliably track true versus false statements across domains, and can be biased to express those distinctions more consistently, functionally encodes truth.

Your three alternatives don’t survive contact with this. Clever Hans fails because the effect generalizes. Measurement artifact fails because multiple independent metrics move together. Fraud fails because these results are reproduced across competing labs, companies, and open-source implementations.

If you think this is still wrong, the next step isn’t skepticism in the abstract. It’s to name a concrete alternative mechanism that is compatible with the documented training process and observed generalization. Without that, the position you’re defending isn’t cautious, it’s incoherent.

CamperBob2 · 2026-01-02T15:39:15 1767368355

Your three alternatives don’t survive contact with this. Clever Hans fails because the effect generalizes. Measurement artifact fails because multiple independent metrics move together. Fraud fails because these results are reproduced across competing labs, companies, and open-source implementations.

He doesn't care. You might as well be arguing with a Scientologist.

closewith · 2026-01-01T12:16:23 1767269783

By that definition, nearly all commercial software development (and nearly all human output in general) is derived output.

windexh8er · 2026-01-01T14:56:08 1767279368

Wow.

You’re using ‘derived’ to imply ‘therefore equivalent.’ That’s a category error. A cookbook is derived from food culture. Does an LLM taste food? Can it think about how good that cookie tastes?

A flight simulator is derived from aerodynamics - yet it doesn’t fly.

Likewise, text that resembles reasoning isn’t the same thing as a system that has beliefs, intentions, or understanding. Humans do. LLMs don't.

Also... Ask an LLM what's the difference between a human brain and an LLM. If an LLM could "think" it wouldn't give you the answer it just did.

CamperBob2 · 2026-01-01T16:52:41 1767286361

Ask an LLM what's the difference between a human brain and an LLM. If an LLM could "think" it wouldn't give you the answer it just did.

I imagine that sounded more profound when you wrote it than it did just now, when I read it. Can you be a little more specific, with regard to what features you would expect to differ between LLM and human responses to such a question?

Right now, LLM system prompts are strongly geared towards not claiming that they are humans or simulations of humans. If your point is that a hypothetical "thinking" LLM would claim to be a human, that could certainly be arranged with an appropriate system prompt. You wouldn't know whether you were talking to an LLM or a human -- just as you don't now -- but nothing would be proved either way. That's ultimately why the Turing test is a poor metric.

windexh8er · 2026-01-01T23:19:47 1767309587

> Right now, LLM system prompts are strongly geared towards not claiming that they are humans or simulations of humans. If your point is that a hypothetical "thinking" LLM would claim to be a human, that could certainly be arranged with an appropriate system prompt. You wouldn't know whether you were talking to an LLM or a human -- just as you don't now -- but nothing would be proved either way. That's ultimately why the Turing test is a poor metric.

The mental gymnastics here is entertainment at best. Of course the thinking LLM would give feedback on how it's actually just a pattern model over text - well, we shouldn't believe that! The LLM was trained to lie about it's true capabilities in your own admission?

How about these...

What observable capability would you expect from "true cognitive thought" that a next-token predictor couldn’t fake?

Where are the system’s goals coming from—does it originate them, or only reflect the user/prompt?

How does it know when it’s wrong without an external verifier? If the training data says X and the answer is Y - how will it ever know it was wrong and reach the correct conclusion?

CamperBob2 · 2026-01-02T01:47:30 1767318450

How does it know when it’s wrong without an external verifier? If the training data says X and the answer is Y - how will it ever know it was wrong and reach the correct conclusion?

You need to read a few papers with publication dates after 2023.

closewith · 2026-01-01T16:55:50 1767286550

You’re arguing against a straw man. No one is claiming LLMs have beliefs, intentions, or understanding. They don’t need them to be economically useful.

windexh8er · 2026-01-01T22:28:50 1767306530

Oh yes, they are.

And beyond people claiming that LLMs are basically sentient you have people like CamperBob2 who made this wild claim:

"""There's no such thing as people without language, except for infants and those who are so mentally incapacitated that the answer is self-evidently "No, they cannot."

Language is the substrate of reason. It doesn't need to be spoken or written, but it's a necessary and (as it turns out) sufficient component of thought."""

Let that sink. They literally think that there's no such thing as people without language. Talk about a wild and ignorant take on life in general!

CamperBob2 · 2026-01-02T01:48:27 1767318507

How'd they communicate with the test subjects?

That's "language."

weatherlite · 2026-01-01T07:15:20 1767251720

> So it is absurdly incorrect to say "they can only reproduce the past."

Also , a shitton of what we do economically is reproducing the past with slight tweaks and improvements. We all do very repetitive things and these tools cut the time / personnel needed by a significant factor.

crystal_revenge · 2026-01-01T06:07:27 1767247647

I think the confusion is people's misunderstanding of what 'new code' and 'new imagery' mean. Yes, LLMs can generate a specific CRUD webapp that hasn't existed before but only based on interpolating between the history of existing CRUD webapps. I mean traditional Markov Chains can also produce 'new' text in the sense that "this exact text" hasn't been seen before, but nobody would argue that traditional Markov Chains aren't constrained by "only producing the past".

This is even more clear in the case of diffusion models (which I personally love using, and have spent a lot of time researching). All of the "new" images created by even the most advanced diffusion models are fundamentally remixing past information. This is really obvious to anyone who has played around with these extensively because they really can't produce truly novel concepts. New concepts can be added by things like fine-tuning or use of LoRAs, but fundamentally you're still just remixing the past.

LLMs are always doing some form of interpolation between different points in the past. Yes they can create a "new" SQL query, but it's just remixing from the SQL queries that have existed prior. This still makes them very useful because a lot of engineering work, including writing a custom text editor, involve remixing existing engineering work. If you could have stack-overflowed your way to an answer in the past, an LLM will be much superior. In fact, the phrase "CRUD" largely exists to point out that most webapps are fundamentally the same.

A great example of this limitation in practice is the work that Terry Tao is doing with LLMs. One of the largest challenges in automated theorem proving is translating human proofs into the language of a theorem prover (often Lean these days). The challenge is that there is not very much Lean code currently available to LLMs (especially with the necessary context of the accompanying NL proof), so they struggle to correctly translate. Most of the research in this area is around improving LLM's representation of the mapping from human proofs to Lean proofs (btw, I personally feel like LLMs do have a reasonably good chance of providing major improvements in the space of formal theorem proving, in conjunction with languages like Lean, because the translation process is the biggest blocker to progress).

When you say:

> So it is absurdly incorrect to say "they can only reproduce the past."

It's pretty clear you don't have a solid background in generative models, because this is fundamentally what they do: model an existing probability distribution and draw samples from that. LLMs are doing this for a massive amount of human text, which is why they do produce some impressive and useful results, but this is also a fundamental limitation.

But a world where we used LLMs for the majority of work, would be a world with no fundamental breakthroughs. If you've read The Three Body Problem, it's very much like living in the world where scientific progress is impeded by sophons. In that world there is still some progress (especially with abundant energy), but it remains fundamentally and deeply limited.

PeterHolzwarth · 2026-01-01T06:35:15 1767249315

Just an innocent bystander here, so forgive me, but I think the flack you are getting is because you appear to be responding to claims that these tools will reinvent everything and introduce a new halcyon age of creation - when, at least on hacker news, and definitely in this thread, no one is really making such claims.

Put another way, and I hate to throw in the now over-used phrase, but I feel you may be responding to a strawman that doesn't much appear in the article or the discussion here: "Because these tools don't achieve a god-like level of novel perfection that no one is really promising here, I dismiss all this sorta crap."

Especially when I think you are also admitting that the technology is a fairly useful tool on its own merits - a stance which I believe represents the bulk of the feelings that supporters of the tech here on HN are describing.

I apologize if you feel I am putting unrepresentative words in your mouth, but this is the reading I am taking away from your comments.

signatoremo · 2026-01-01T07:54:07 1767254047

Lot of impressive points. They are also irrelevant. The majority of people also only extrapolate from the knowledge they acquired in the past. That’s why there is the concept of inventor, someone who comes up with new ideas. Many new inventions are also based on existing ideas. Is that the reason to dismiss those achievements?

Do you only take LLM seriously if it can be another Einstein?

> But a world where we used LLMs for the majority of work, would be a world with no fundamental breakthroughs.

What do you consider recent fundamental breakthroughs?

Even if you are right, human can continue to work on hard problems while letting LLM handle the majority of derivative work

aoeusnth1 · 2026-01-01T19:01:26 1767294086

> It's pretty clear you don't have a solid background in generative models, because this is fundamentally what they do: model an existing probability distribution and draw samples from that.

After post-training, this is definitively NOT what an LLM does.

oedemis · 2026-01-01T15:31:04 1767281464

as architectures evolve, i think it can be that we learn more "side effects".. back in 2020 openai researchers said "GPT-3 is applied without any gradient updates or fine-tuning" the model emerges at a certain level of scale...

throwaway7783 · 2026-01-01T06:21:58 1767248518

Would you say that LLMs can discover patterns hitherto unknown? It would still be generating from the past, but patterns/connections not made before.

uxcolumbo · 2026-01-01T08:15:55 1767255355

How do human brains create something novel and what will it take for AIs to do the same?

threethirtytwo · 2026-01-01T11:24:05 1767266645

> It's pretty clear you don't have a solid background in generative models, because this is fundamentally what they do

You don’t have a solid background. No one does. We fundamentally don’t understand LLMs, this is an industry and academic opinion. Sure there are high level perspectives and analogies we can apply to LLMs and machine learning in general like probability distributions, curve fitting or interpolations… but those explanations are so high level that they can essentially be applied to humans as well. At a lower level we cannot describe what’s going on. We have no idea how to reconstruct the logic of how an LLM arrived at a specific output from a specific input.

It is impossible to have any sort of deterministic function, process or anything produce new information from old information. This limitation is fundamental to logic and math and thus it will limit human output as well.

You can combine information you can transform information you can lose information. But producing new information from old information from deterministic intelligence is fundamentally impossible in reality and therefore fundamentally impossible for LLMs and humans. But note the keyword: “deterministic”

New information can literally only arise through stochastic processes. That’s all you have in reality. We know it’s stochastic because determinism vs. stochasticism are literally your only two viable options. You have a bunch of inputs, the outputs derived from it are either purely deterministic transformations or if you want some new stuff from the input you must apply randomness. That’s it.

That’s essentially what creativity is. There is literally no other logical way to generate “new information”. Purely random is never really useful so “useful information” arrives only after it is filtered and we use past information to filter the stochastic output and “select” something that’s not wildly random. We also only use randomness to perturb the output a little bit so it’s not too crazy.

In the end it’s this selection process and stochastic process combined that forms creativity. We know this is a general aspect of how creativity works because there’s literally no other way to do it.

LLMs do have stochastic aspects to them so we know for a fact it is generating new things and not just drawing on the past. We know it can fit our definition of “creative” and we can literally see it be creative in front of your eyes.

You’re ignoring what you see with your eyes and drawing your conclusions from a model of LLMs that isn’t fully accurate. Or you’re not fully tying the mechanisms of how LLMs work with what creativity or generating new data from past data is in actuality.

The fundamental limitation with LLMs is not that it can’t create new things. It’s that the context window is too small to create new things beyond that. Whatever it can create it is limited to the possibilities within that window and that sets a limitation on creativity.

What you see happening with LEAN can also be an issue with the context window being too small. If we have an LLM with a giant context window bigger than anything before… and pass it all the necessary data to “learn” and be “trained” on lean it can likely start to produce new theorems without literally being “trained”.

Actually I wouldn’t call this a “fundamental” problem. More fundamental is the aspect of hallucinations. The fact that LLMs produce new information from past information in the WRONG way. Literally making up bullshit out of thin air. It’s the opposite problem of what you’re describing. These things are too creative and making up too much stuff.

We have hints that LLMs know the difference between hallucinations and reality but coaxing it to communicate that differentiation to us is limited.

jheez3 · 2026-01-02T02:52:32 1767322352

"You don’t have a solid background.

If you want to go around huffing and puffing your chest about a subject area, you kinda do fella. Credibility.

threethirtytwo · 2026-01-02T05:23:26 1767331406

Not only is what he saying in direct contradiction to what people with credibility have said, but his claimed credentials can be utter bullshit.

This is the internet bro. Credibility is irrelevant because identities can never be verified. So the only thing that matters is the strength and rationality of an argument.

That’s the point of hacker news substantive content not some battle of comparison of credentials or useless quips (like yours) with zero substance. Say something worth reading if you have anything to say at all, otherwise nobody cares.

jheez3 · 2026-01-02T11:32:11 1767353531

[flagged]

threethirtytwo · 2026-01-02T11:54:27 1767354867

Strong point. I especially liked the part where you addressed literally anything.

handoflixue · 2026-01-01T05:42:57 1767246177

Seriously, all that familiarity and you think an LLM "literally" can't invent anything that didn't already exist?

Like, I'm sorry, but you're just flat-out wrong and I've got the proof sitting on my hard drive. I use this supposedly impossible program daily.

windexh8er · 2026-01-01T07:26:04 1767252364

Do you also think LLMs "think"?

From what you've described an LLM has not invented anything. LLMs that can reason have a bit more slight of hand but they're not coming up with new ideas outside of the bounds of what a lot of words have encompassed in both fiction and non.

Good for you that you've got a fun token of code that's what you've always wanted, I guess. But this type of fantasy take on LLMs seems to be more and more prevalent as of late. A lot of people defending LLMs as if they're owed something because they've built something or maybe people are getting more and more attached to them from the conversational angle. I'm not sure, but I've run across more people in 2025 that are way too far in the deep end of personifying their relationships with LLMs.

Kerrick · 2026-01-01T08:05:43 1767254743

Hang on, you're now saying that if something has ever been described in fiction it doesn't count as invention? So if somebody literally developed a working photon torpedo, that isn't new because "Star Trek Did It"?

phatfish · 2026-01-01T09:21:44 1767259304

Is there any danger an LLM is going to create a working photo torpedo?

ben_w · 2026-01-01T10:03:22 1767261802

Well, they can use tools, and tools includes physics simulations, so if it is possible (and FWIW the tool-free "intuition" of ChatGPT is "there will never be an age of antimatter"), then why couldn't LLMs grind those tools to get a solution?

windexh8er · 2026-01-01T15:41:33 1767282093

You seem to be pretty far down the rabbit hole. How about this... You task an LLM to create a photon torpedo. If it can truly think then it should be able to provide you with something tangible. When you've got that in hand let us all know.

Back to the land of reality... Describing something in fiction doesn’t magically make it "not an invention". Fiction can anticipate an idea, but invention is about producing a working, testable implementation and usually involves novel technical methods. "Star Trek did it" is at most prior art for the concept, not a blueprint for the mechanism. If you can't understand that differential then maybe go ask an LLM.

Kerrick · 2026-01-01T15:48:13 1767282493

I didn't say anything about an LLM. I said "somebody" not "some predictive text engine."

9rx · 2026-01-01T13:31:13 1767274273

When a computer is able to invent things, we’ve achieved AGI. Do you believe we are already in the AGI era, or is the inventor in this case actually you?

bigyabai · 2026-01-01T06:05:02 1767247502

FWIW, your "evidence" is a text editor. I'm glad you made a tool that works for you, but the parent's point stands; this is a 200-level course-curriculum homework assignment. Tens of thousands of homemade editors exist, in various states of disrepair and vain overengineering.

least · 2026-01-01T06:27:48 1767248868

The difference between those is the person is actually using this text editor that they built with the help of LLMs. There's plenty of people creating novel scripts and programs that can accommodate their own unique specifications.

If a programmer creating their own software (or contracting it out to a developer) would be a bespoke suit and using software someone or some company created without your input is an off the rack suit, I'd liken these sorts of programs as semi-bespoke, or made to measure.

"LLMs are literally technology that can only reproduce the past" feels like an odd statement. I think the point they're going for is that it's not thinking and so it's not going to produce new ideas like a human would? But literally no technology does that. That is all derived from some human beings being particularly clever.

LLMs are tools. They can enable a human to create new things because they are interfacing with a human to facilitate it. It's merging the functional knowledge and vision of a person and translating it into something else.

resize2996 · 2026-01-01T18:32:41 1767292361

compilers can only produce machine code. so unorginal.

ctxc · 2026-01-01T09:37:49 1767260269

Some people cannot be convinced simply because their expectation of "novel" is something that appears in an Asimov novel.

I for one think your work is pretty cool - even though I haven't seen it, using something you built everyday is a claim not many can make!

Greduan · 2026-01-01T07:39:19 1767253159

Text editors in a thousand flavours has indeed already been programmed though. I don't think you understood what op meant.

Curious, does it perform at the limit of the hardware? Was it programmed in a tools language (like C++, Rust, C, etc.) or in a web tech?

zingar · 2026-01-01T08:15:25 1767255325

What is the point that you believe would be demonstrated by a new text editor running at the limit of hardware in a compiled editor? Would that point apply to every other text editor that exists already?

fmbb · 2026-01-01T08:39:38 1767256778

Is your new text editor open source?

nsxwolf · 2026-01-01T20:34:24 1767299664

The LLM didn't invent any new technology to do that, though. You used the LLM to reorganize Lego building blocks of knowledge into something new.

Without you, there was nothing.

waldrews · 2026-01-01T06:59:00 1767250740

I'm being hyperbolic of course, but I'm a little dismissive of the progress that happened since the days of BBS's and car based cell phones - we just got more connectivity, more capacity, more content, bigger/faster. Likewise, my attitude toward machine learning before 2023 is a smug 'heh, these computer scientists are doing undisciplined statistics at scale, how nice for them.' Then all of a sudden the machines woke up and started arguing with me, coherently, even about niche topics I have a PhD in. I can appreciate in retrospect how much of the machine learning progress ultimately went into that, but, like fusion, the magic payoff was supposed to be decades away and always remain decades away. This wasn't supposed to happen in my lifetime. 2025 progress isn't the 2023 shock, but this was the year LLM's-as-programmers (and LLM's-as-mathematicians, and...) went from 'isn't that cute, the machine is trying' to 'an expert with enough time would make better choices than the machine did,' and that makes for a different world. More so than, going from a Commodore Vic 20 with 4k of RAM and a modem to the latest Macbook.

ako · 2026-01-01T10:29:30 1767263370

> This year honestly feels quite stagnant. LLMs are literally technology that can only reproduce the past.

Is this such a big limitation? Most jobs are basically people trained on past knowledge applying it today. No need to generate new knowledge.

And a lot of new knowledge is just combining 2 things from the past in a new way.

mr_toad · 2026-01-01T23:46:42 1767311202

Most people are capable of long-term learning. Some people are capable of discovering and inventing new things. I think the two are related, and current NN architecture doesn’t allow this. An AI that can cobble together a CRUD application to spec is one thing. An AI that can come up with a new idea for a successful app on its own is a completely different ball game.

throwup238 · 2026-01-01T01:08:09 1767229689

> they voted to add some syntactic sugar to Java...

I remember when we just wanted to rewrite everything in Rust.

Those were the simpler times, when crypto bros seemed like the worst venture capitalism could conjure.

OGEnthusiast · 2026-01-01T01:27:03 1767230823

Crypto bros in hindsight were so much less dangerous than AI bros. At least they weren't trying to construct data centers in rural America or prop up artificial stocks like $NVDA.

SauntSolaire · 2026-01-01T02:43:12 1767235392

Instead they were building crypto mining warehouses in rural America and propping up artificial currencies like BTC.

ryandrake · 2026-01-01T04:48:42 1767242922

Crazy how the two most hyped and funded technologies of the decade were: energy wasting fake money for criminals and energy wasting plagiarism machines.

quaintpartridge · 2026-01-01T02:12:55 1767233575

They were, just not as many. https://www.wired.com/story/the-worlds-biggest-bitcoin-mine-...

zahlman · 2026-01-01T02:50:53 1767235853

Speaking of which, we never found out the details (strike price/expiration) of Michael Burry's puts, did we? It seems he could have made bank if he'd waited one more month...

kamranjon · 2026-01-01T03:25:54 1767237954

I think they expire in March 2026 if the NVIDIA stock drops to $140 a share? Something close to that I think.

mgfist · 2026-01-01T05:26:53 1767245213

It's funny how people complain about the rust belt dying and factories leaving rural communities and so on, then when someone wants to build something that can provide jobs and tax revenue, everyone complains.

jakeydus · 2026-01-01T05:42:43 1767246163

How many people are employed at the average data center? A few dozen? Versus a steel mill, that’s nothing. A chicken plant in Nebraska closed down this last month. 3200 people lost their jobs. You think Meta will fill it with GPUs and the whole town will have jobs again?

scotty79 · 2026-01-01T07:25:51 1767252351

Many more are employed while building it. And they will never stop building. It's modern version of rail. But instead of distances it will cover the area.

uxcolumbo · 2026-01-01T08:26:45 1767256005

Will local folks get those jobs to build the data center?

And if so, what happens to those builders once the data center is built?