Looming Liability Machines (LLMs)

efitz · on Aug 25, 2024

Why should I think that LLMs would be good at the task of analyzing a cloud incident and determining root cause?

LLMs are good at predicting the next word in written language. They are generative; they make new text given a prompt. LLMs do not have base sets of facts about how complex systems work, and do not attempt to reason over a corpus of evidence and facts. as a result, I would expect that an LLM might concoct an interesting story about why such a failure occurred, and it might even be a convincing story if it happened to weave bits of context, accurately into the storyline. It might even, purely randomly, generate a story that actually correctly diagnosed the root cause of the failure, but that would be coincidental based on the similarity of the prompt to text of similar postmortem discussions that were part of its training set.

If you had an extremely detailed postmortem document, then I would expect LLM‘s to do a very good job of summarizing such document.

But I don’t see why an LLM is an appropriate tool for analyzing failures in complex systems; just as I don’t see a hammer being a very effective tool for tightening bolts.

Right now, I am concerned that the relative ease that modern frameworks provide to author LLM based applications, is leading many people to optimistically include LLM technology in attempts to solve problems that it doesn’t seem particularly well suited to solve.

reissbaker · on Aug 25, 2024

TBQH I'd bet the following would yield pretty good results:

1. Take your existing incident reporting / review docs (you have those, right?) that cover everything including 5-why incident reporting and analysis.

2. Fine-tune a Llama-3.1-70b [1] LoRA on the data associated with the outage as input, and the root cause analysis as the output

3. Tada! You have a state-of-the-art custom LLM that is good at analyzing your outages and guessing what the root causes might be.

It's a little shocking sometimes to me how underutilized fine-tuning is. Most of the "learning" happening in "machine learning" is in training — yes, it's definitely true that LLMs can exhibit a surprising amount of "in-context learning" via prompts, but it's a surprising because learning during training is so much more powerful, and it's surprising that in-context works at all. Honestly even just fine-tuning an 8b model — which is totally doable on a 3090/4090 — can yield SOTA results on task-specific performance. It's so much better than prompting!

1: I mean you could also finetune 405b, but you'll need a lot of GPUs both to train it and to run it. In my experience (public benchmarks be damned; the models all tend to saturate the public benchmarks, even though everyone claims not to train on them), 70b on internal evals tends to perform similarly to gpt-4o, and with OSS models you have somewhat more ownership and control.

Infinity315 · on Aug 25, 2024

I take issue with this.

Typically issues arise because they are novel and are unforeseen. If we did see these issues beforehand they'd be fixed! LLMs by definition are trained by example, so I fail to see how finetuning LLMs on things that have already happened to be helpful for determining the root cause of a novel issue.

LLMs seem to lack systemic modelling that humans do. I can see LLMs being practical for this if it is shown that LLMs are capable of modelling scenarios outside of their dataset, but thus far none such examples exist.

reissbaker · on Aug 25, 2024

During my time taking on-call pager rotations at fairly large engineering organizations, I'd say that a distressingly large number of incidents happened for fairly standard reasons ("didn't write tests for this case" + "deployed" + "insufficient monitoring," with things like incorrect concurrency assumptions around DB access, or memory leaks in application code, or insufficient rate limiting / circuitbreaking causing cascading failures being fairly common passengers). In fact, it's actually pretty hard for me to come up with an issue that wasn't basically a fairly common programming error combined with some relatively common infrastructure, build, test, and/or observability problem. Sometimes with some common bureaucratic human problems thrown into the mix too, i.e. "no one owns this service's uptime."

Infinity315 · on Aug 25, 2024

I think we're at an epistemic impasse here. At what point would/could you be convinced that LLMs are incapable or unsuited here? If LLMs were successfully deployed in a production environment is the day I bite my tongue. What about you?

I'm not even sure that LLMs are even capable of solving standard bugs see: [1]. Hallucination seems to be a significant hurdle and any time spent validating the fixes of an LLM is wasted when it could be spent tackling the bug head on. The amount of energy spent espousing garbage requires an order of magnitude more effort to invalidate.

[1]. https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...

exe34 · on Aug 25, 2024

> If LLMs were successfully deployed in a production environment is the day I bite my tongue.

This shows so much faith in management!

Infinity315 · on Aug 25, 2024

Alright, fine. Maybe you don't have faith in management, but perhaps you do have faith in the open market and capitalism.

Feel free to point out any error in my logic:

There are huge financial incentives--tens if not hundreds of billions of dollars--for developing an LLM which can solve novel bugs. So surely there exists AI companies developing an LLM capable of doing so. If an LLM capable of solving novel bugs exists, AI companies would rush to showing it off to capture tonnes of VC money. AI companies could show off their fancy bug-fixing LLM by closing issues on public Github repos using said LLMs.

No such mythical LLM exists. We are thus left with two choices:

1. My logic is flawed or there is an alternative possibility I haven't considered.

2. The LLM capable of doing what OP asserts doesn't exist and can't be made, despite their assertion that it is trivial to fine tune and put into application.

reissbaker · on Aug 26, 2024

The base technology capable of this has only been broadly available for about a month — prior to Llama-3.1-70b being released on July 23rd, you couldn't finetune any GPT-4 class models that had long context support (OpenAI only allowed fine-tuning their GPT-3.5-Turbo model until last week), and you'd need long context for the incident data — so I think "proof by inexistence" is pretty weak here. The first personal computer shipped in 1974, but it took five years until the development of Visicalc for spreadsheets to appear, despite the huge business value. I wouldn't expect most use cases for LLMs to appear within a month of them being made available.

To answer your question more directly, I would go with option 1.

Infinity315 · on Aug 26, 2024

It's not proof by inexistence, it's simply application of the scientific method--only the most successful method to date.

It seems like your assumptions are unfalsifiable. As computer scientists, I believe it's important that our hypotheses are testable. If a hypothesis is unfalsifiable, then the hypothesis is no better than theology and should be discarded.

What's stopping you and other VCs just pouring endless money into an idea that won't work?

reissbaker · on Aug 26, 2024

The scientific method generally involves experiments, as opposed to claiming something won't work because of the "logic" that if it worked, someone would have done it already. This particular hypothesis is obviously a testable one: someone could simply follow the proposed steps from the hypothesis (e.g. finetune a model on their incident response data), and see if it works. This is in fact how essentially all machine learning research is done: coming up with a proposal and trying it out. If it works, great! You've probably contributed something new to machine learning research. If not, oh well, try and figure out why your experiment failed, and if you have a good alternative approach try that instead.

Your variant of the "scientific method" would've meant we never discovered electricity, or invented airplanes, or really anything else, because why bother trying? If it worked someone else would've done it.

Infinity315 · on Aug 26, 2024

> This particular hypothesis is obviously a testable one: someone could simply follow the proposed steps from the hypothesis (e.g. finetune a model on their incident response data), and see if it works.

Are you saying that if someone finetunes a current SOTA LLM with incident response data and demonstrates that it doesn't work that you'll say that LLMs are infeasible for this application? That would invalidate the hypothesis: "X application can be done on current LLMs."

Such a test could never invalidate the the hypothesis: "X application can (eventually) be done on LLMs."

If it's the former hypothesis you were asserting, then yes I agree that it is testable, but I'm fairly confident you were asserting the latter.

Earlier I had asked you: "I think we're at an epistemic impasse here. At what point would/could you be convinced that LLMs are incapable or unsuited here?"

And you have yet to provide a response.

reissbaker · on Aug 26, 2024

It would invalidate that particular approach, much as a failed attempt at creating a lightbulb would invalidate that particular approach, but would not disprove the lightbulb entirely.

Proving that LLMs can never do this would require extremely rigorous theoretical evaluation that even top ML labs are currently unable to do, given the problem of interpretability. In general proving a negative is typically harder than a positive, since a single experiment succeeding proves a positive, but a single experiment failing does not prove a negative; generally science does not demand that scientists attempt to prove a negative when running experiments, or else nearly every drug trial, for example, would be impossible to perform. Complaining that you have staked out a very difficult to defend position — that it's impossible for LLMs to generate good incident reports — does not mean your ideological opponents, who have simpler positions, must do your proof work for you.

Infinity315 · on Aug 27, 2024

Are you not making the positive claim that LLMs can (eventually) generate good incident reports?

Please refer to this: https://en.wikipedia.org/wiki/Burden_of_proof_(philosophy)

Just to make sure we both understand what burden of proof is:

Suppose two people are having a debate over whether or not a teapot exists in the orbit of Jupiter which is impossible to observe via telescope. Where does the burden of proof lie?

Just to reiterate plainly:

Does the burden of proof lie on the person making empirically impossible to falsify claim or the person making the empirically possible to falsify claim?

Which of the following two claims is impossible to empirically falsify?

1. "LLMs can eventually be used to produce good incident reports."

2. "LLMs can never eventually be used to produce good incident reports."

reissbaker · on Aug 28, 2024

Sorry, but saying "I bet this would work" does not mean I have to come up with a theoretical foundation for disproving the existence of machine learning models ever being capable of doing things, theoretical models which even the top labs in the world are incapable of producing. This is the hypothesis stage; there is no burden of proof. If I said "I proved this would work," naturally there would be a burden of evidence. That is not where we are. You are arguing with a hypothesis; and your argument does not hold water ("If this was possible someone would have already done it"). That does not mean the hypothesis is true, it only means you haven't falsified it.

Infinity315 · on Aug 28, 2024

So at this point are you admitting fully that you have no evidence for your claim? Your entire bet is conjecture?

> That does not mean the hypothesis is true, it only means you haven't falsified it.

Correct. You can't prove a negative, you've only just figured this out? After I listed TWO simple examples that could be found in an introductory philosophy class?! In the Wikipedia article consisting of a couple of paragraphs I linked you?

PLEASE JUST READ. PLEASE JUST READ. PLEASE JUST READ. PLEASE JUST READ.

You can't prove negative statements. I want you to admit this so I know you understand, now repeat after me: "You can't prove negative statements."

I know this is likely wasted on you, but here goes:

You can't prove the hypothesis: "There does not exist an unobservable teapot in the orbit of Jupiter."

For the SAME reason I can't prove the hypothesis: "LLMs can never eventually be used for generating good incident reports."

For the SAME reason I can't prove the hypothesis: "God doesn't exist."

For the SAME reason I can't prove the hypothesis: "A unicorn does not exist at the center of the Earth."

I fully admit this in the comment you've supposedly read.

This is why the burden of proof lies on the person making the positive claim (this is you).

reissbaker · on Aug 29, 2024

We're literally discussing a paragraph in which I said "I'd bet [this would work]." No amount of repeatedly claiming there's a "burden of proof" to a bet and all-caps shouting and demanding that I come up with a proof of impossibility that would show the hypothesis is wrong (you may also note that I mentioned that proving a negative is generally more difficult than proving a positive several posts ago — although, in fact, it is technically possible [1]) will make your position a reasonable one. Politely, I will not be continuing to engage with you.

1: https://en.wikipedia.org/wiki/Burden_of_proof_(philosophy)#P...

exe34 · on Aug 25, 2024

My point was that management has a history of rolling out shiny things to production and then having egg on their face. See Microsoft's racist bot, Google's AI making up stuff in their adverts, etc.

Your original wager was that it would be in production, not that it would work.

Infinity315 · on Aug 26, 2024

Wrong. My original wager would be a successful deployment, but sure that could be interpreted as a weasel word.

What I mean by successful is that the LLM can generate accurate (>80%) incident response reports and propose correct fixes.

I'm fairly certain anyone literate could have read the rest of my comment and parse out what a successful deployment means.

bryanrasmussen · on Aug 25, 2024

Your logic seems sort of like the inverse of Augustine's proof for God.

Infinity315 · on Aug 26, 2024

It's a funny thing, because the inverse is falsifiable (testable) whereas the positive version is not. The inverse proof (I would say a hypothesis) is simply application of the scientific method.

There is a way to disprove that the statement: "There is no god" by simply showing a counterfactual god.

There is however no way to disprove the statement: "There is a god."

Likewise, there is a way to disprove the statement: "LLMs cannot be successfully used for X application." By showing that LLMs have been used in X application.

Again, there is no way to disprove the statement: "LLMs can (eventually) be used in X application."

The meat of my question was meant to demonstrate a failure to apply the scientific method.

bryanrasmussen · on Sept 3, 2024

>There is a way to disprove that the statement: "There is no god" by simply showing a counterfactual god.

that both parties to the argument agree is a god.

>Likewise, there is a way to disprove the statement: "LLMs cannot be successfully used for X application." By showing that LLMs have been used in X application.

Again there the point of argumentation will be the word "successfully", the LLM would have to be such an overwhelming success at what it is trying to do that one cannot weasel out of it with "successfully".

dambi0 · on Aug 25, 2024

Even LLMs can see the false dichotomy here

Infinity315 · on Aug 25, 2024

Then it should be trivial to point out the error. So do it.

dambi0 · on Aug 25, 2024

Perhaps there are more lucrative applications where LLMs can be applied

Infinity315 · on Aug 25, 2024

Maybe. But in the list of lucrative applications I think bug-fixing is near the top. I think it's lucrative enough to attract at least a decent chunk of engineering talent.

thwarted · on Aug 25, 2024

Agreed. There have been many assessments of what bugs cost, and the assessments are often very high, and that's the reason the industry has, for decades, been working towards having _fewer_ bugs.

williamcotton · on Aug 25, 2024

It's a little shocking sometimes to me how underutilized fine-tuning is.

I imagine it is because of the costs of building a dataset.

lkrubner · on Aug 25, 2024

This strategy amounts to building a RAG for your company. I have a friend who is CTO at a startup and he is doing exactly this. He has uploaded every document he can to Anthropic/Claude, using their service which allows easy construction of RAGs. He has uploaded thousands of documents. And now he can ask it questions like, "Which will increase profits more, hiring another engineer, or hiring another sales person?"

I think we will see more of this. There are some big privacy concerns, but "an internal RAG for every business" will probably find some customers.

reissbaker · on Aug 26, 2024

RAG is different! RAG is entirely in-context learning. You can combine finetuning and RAG, though, and often get better results than either alone.

bradfox2 · on Aug 25, 2024

Having done this for domain specific engineering paperwork that looks similar to cause analysis, it does work well at param sizes << 70B.

lora does not work though, you need full parameter training if the knowledge isnt already present in pre training set.

gbasin · on Aug 25, 2024

For large models that are well-trained with large context support, ICL seems to work about as well as FT

reissbaker · on Aug 25, 2024

It's totally task dependent. On some tasks, large well-trained models are already great; if the large model already exhibits human-level performance, a small-model finetune is unlikely to beat it. Similarly, on very general tasks (e.g. "coding" as a general task, as opposed to "writing idiomatic NextJS" being a specific task), a small-model finetune will be unlikely to beat a large model.

But there are plenty of tasks that even large, well-trained models struggle with. If the OP is struggling to get useful root-cause analysis for cloud service incidents out of an existing large model, that seems exactly like a use case where a finetune would shine.

Also, finetunes don't have to be just for small models! Medium-sized models like Llama-3.1-70b can be finetuned, and if you want to burn a lot of GPUs you can finetune 405b as well.

gbasin · on Sept 3, 2024

Yes I was specifically thinking of the latter with larger models, I think many many shot ICL still tends to outperform but you're right it's worth trying both for your use case

low_tech_love · on Aug 25, 2024

The idea that LLMs “only know how to predict the next word” is a common but imho wrong cop out. It has been shown that they are good at doing a lot of things that might not have been immediately obvious, regardless of how their internal process works. 50 years ago, somebody might say “programming is only a way to automate simple repetitive tasks” and that would be obviously wrong.

The real problem in cases like this and other applications, as you and many others have mentioned, is that LLMs are basically correlation machines. They can find very complex, immensely-multivariate correlations in large data sets, and reproduce these correlations very well. But they cannot reason (so far) beyond these correlations in other to find deeper, less obvious causal relationships. They’re simply not trained to do that, yet. But it’ll come…!

austin-cheney · on Aug 25, 2024

> 50 years ago, somebody might say “programming is only a way to automate simple repetitive tasks” and that would be obviously wrong.

That is actually extremely correct. The only purpose for software is automation, which is the elimination of labor. Getting that wrong directly influences your quality of product more than any other downstream factor.

herval · on Aug 25, 2024

> The only purpose for software is automation, which is the elimination of labor. Getting that wrong directly influences your quality of product more than any other downstream factor.

I take it you never played a game in your life?

austin-cheney · on Aug 25, 2024

Do you mean baseball, football, tag, board games, cross words, Sudoku, Dungeons and Dragons, or something else? There are many games that are not electronic. So, what separates those athletic and paper games from electronic games? Automation.

herval · on Aug 26, 2024

No, I obviously don’t mean those, buddy

hinkley · on Aug 25, 2024

Devil’s advocate:

Fighting games are just better versions of Rockem Sockem Robots.

And 4X games could be fancier versions of Settlers of Catan.

herval · on Aug 26, 2024

What “labor” does a “fancier version of Catan” eliminate?

outofpaper · on Aug 25, 2024

Video games automate whole chunks of play, be that imagination with graphics and sound effects, other players with virtual enemies and allies, let alone all the automation that goes into multi-player games.

low_tech_love · on Aug 25, 2024

It might be theoretically correct in the same way that it is correct to say that “a building is just a bunch of bricks on top of each other”. But there are thousands of different buildings with different reasons to exist which offer drastically different services and serve different purposes. A video game like Elden Ring is built from the same “automation” pieces as the LS command in my terminal, but it would very disingenuous to say they’re basically the same thing.

austin-cheney · on Aug 25, 2024

That is an incorrect comparison. A building, all buildings, are dwellings, but they are no more or less the sum of their parts than anything else. It’s not about the construction materials. It’s about the utility.

teleforce · on Aug 25, 2024

Not OP but I think it is correct comparison.

Dijkstra's quote succinctly summarized the arguments:

"The question of whether a computer can think is no more interesting than the question of whether a submarine can swim."

CoastalCoder · on Aug 25, 2024

With all due respect to Dijkstra, whether or not something is interesting is subjective.

I wonder if his point was that the question didn't need to be answered for whatever they were discussing at the time?

dTal · on Aug 25, 2024

His point is that question ultimately boils down to a semantic argument over the word "think/swim", which may be interesting to a linguist but is not philosophically meaty in the way the question implies.

low_tech_love · on Aug 25, 2024

> It’s not about the construction materials. It’s about the utility.

I think I missed your point. Wasn’t that exactly what I said?

cookie_monsta · on Aug 25, 2024

> It has been shown that they are good at doing a lot of things that might not have been immediately obvious

Could you point me to some of the places/articles where this is being shown? I'm definitely amongst those who have bought in to the common cop out you are rebutting here

geoduck14 · on Aug 25, 2024

Not OP, but I have been poking around with LLMs for a year and I'd like to add to the conversation.

In my experience, LLMs are word predictors, and the impacts of that fact are not immediately obvious.

LLMs are capable of "explaining" what code does. What it is doing under the hood is pattern matching: I've seen code that looks like X, with an explanation that looks like Y

LLMs are capable of formatting text. It has seen English written like X, that is reformatted to look like Y

One resounding fact my team has found over and over is that "the things we think are hard for LLMs aren't necessarily hard; the things we think are easy aren't necessarily easy"

dTal · on Aug 25, 2024

Worth noting that pattern matching and term rewriting are a sufficiently general combination that it forms the basis for Mathematica.

60654 · on Aug 25, 2024

I would just add that Mathematica-style term rewriting (e.g. analytical integration, or equation solving) is done with _semantics-preserving_ symbolic solvers, which are hand-made and human-reviewed to guarantee correctness.

LLM style pattern matching and rewriting does not preserve semantics, except accidentally due to an overwhelming amount of examples.

psb217 · on Aug 25, 2024

Trouble sneaks in when the pattern matching is only correct most of the time. Eg, if some code for regex-based search missed anywhere from 0.1% to 10% of matches, with the miss rate depending on the regex and no obvious way to know which regexes have worse miss rates, the utility of your regex-based search would be limited. LLMs are like this, but their generality makes them useful in spite of this limitation.

akomtu · on Aug 25, 2024

We rarely use reasoning too. When a lightning strikes, a thunder follows. When the sun shines, trees grow. These are correlations we've learned. Now can you derive them with proper reasoning? In fact, if we dig deep enough, at some point we'll face some principle or axiom that just postulates an observed correlation as a law.

What we call reasoning can be the art of finding a chain of small correlations to connect ends of a big correlation. Some sort of quantum-powered DFS algorithm.

However a reasoning machine is just a machine. Someone needs to tell it what to reason about.

from-nibly · on Aug 25, 2024

The main problem is. Most people cant tell the difference between an expert opinion and random words. So if an LLM shoots a bunch of jargon on the screen then, well so does my senior platform engineer so whats the difference? The LLM is cheaper and is always on call.

RodgerTheGreat · on Aug 25, 2024

I see deeply ominous connections between generations of American children who have been educated in a manner which produces alarming rates of functional illiteracy[0] and the widespread popularity of machine learning models that produce nonsense which resembles plausible text if you skim or don't actually read it.

[0] https://www.apmreports.org/episode/2019/08/22/whats-wrong-ho...

treyd · on Aug 25, 2024

> The theory is known as "three cueing." The name comes from the notion that readers use three different kinds of information — or "cues" — to identify words as they are reading.

> The theory was first proposed in 1967, when an education professor named Ken Goodman presented a paper at the annual meeting of the American Educational Research Association in New York City.

> In the paper,5 Goodman rejected the idea that reading is a precise process that involves exact or detailed perception of letters or words. Instead, he argued that as people read, they make predictions about the words on the page using these three cues:

> graphic cues (what do the letters tell you about what the word might be?)

> syntactic cues (what kind of word could it be, for example, a noun or a verb?)

> semantic cues (what word would make sense here, based on the context?)

This is interesting because this is fairly similar to how LLMs do next-token prediction, but using exclusively backwards-facing clues from the text (although perhaps also the graphic cues if you are talking about multimodal models).

philipswood · on Aug 25, 2024

Thanks, I think the linked article is worthy of its own submission:

https://news.ycombinator.com/item?id=41344613

pixelatedindex · on Aug 25, 2024

Super fascinating article, thanks for sharing!

sublinear · on Aug 25, 2024

> Most people cant tell the difference

And those people should be fired.

from-nibly · on Aug 26, 2024

Those people are your bosses.

visarga · on Aug 25, 2024

> Most people cant tell the difference between an expert opinion and random words.

That's because humans are stochastic parrots too. We use leaky abstractions we don't fully understand, or their edge cases. So we don't know what we are saying, but keep doing this as long as it doesn't break. When it breaks, we go to experts, another abstraction we don't really grok. What's the difference between using LLM and using a human expert? Not much if you have no clue about the topic. It's a matter of replacing real understanding with trust.

Even more, can we say we understand anything down to first principles? Probably not. It's a patchwork of people, each with their limited perspective, like the Elephant and the Blind Men parable. The world is based on functional understanding, combining partial perspectives, it's never fully grokked. And what you don't understand, you can't be conscious of, not in it's real meaning. So we might be unconscious operators of language as well, no better than hallucinating LLMs.

77pt77 · on Aug 25, 2024

Finally someone gets it.

And then there is status.

Since you don;t understand anything anyway, you ascertain value and correctness based on status signals.

For you engineer that is going to be, first and foremost the way he looks. Whether he's tall, talks without stuttering, which school he went to.

For LLMs it might be a variant of:

> No one was ever fired for choosing Google/IBM/Microsoft

deegles · on Aug 25, 2024

This project had one goal: get a high visibility project out the door in order to secure a promotion.

refulgentis · on Aug 25, 2024

What project?

generic92034 · on Aug 25, 2024

See? Those are the best projects, where no-one can remember it or knows who was responsible when it is headed for its final failure. ;)

giancarlostoro · on Aug 25, 2024

I can see an LLM talking to a cloud API (the underlying APIs) getting status reports and history, and telling you what is currently wrong, faster than you can comb through it all, and I'm a huge LLM skeptic, but if I ever saw a use for an LLM its the "tell me what's wrong faster than I can look" angle of things. You don't use it to solve the problem for you, you use it to tell you where things are faster than you and ten people can comb through.

wkat4242 · on Aug 25, 2024

> Why should I think that LLMs would be good at the task of analyzing a cloud incident and determining root cause?

Because an LLM is really good at convincing laymen it knows what it's doing. It's really a conman simulator :)

It is great at language based tasks but people use it as an oracle for everything without understanding the limitations. It's just that friendly convincing tone that makes them think they're taking to a super intelligent being.

tomjen3 · on Aug 25, 2024

You should not think that LLMs are a good fit for anything. You should check if they are, and then try to guide it to be better at the task.

You should do this because AI is a jagged frontier where humans can't predict if AI is good at something or not.

>AI is weird. No one actually knows the full range of capabilities of the most advanced Large Language Models, like GPT-4. No one really knows the best ways to use them, or the conditions under which they fail. There is no instruction manual. On some tasks AI is immensely powerful, and on others it fails completely or subtly. And, unless you use AI a lot, you won’t know which is which.

https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the...

samstave · on Aug 25, 2024

This is exactly how I try to approach using it.

I recognize that its still a robot and you have to give it stoic, stern guidance to keep it on point.

Also, having a good understanding for the domain youre seeking to build understanding in, but with the augmentation of an AI assist that can formulate the Output of the thought in a more complete and packaged manner than one is able to do without leveraging AI as a tool. And as someone said "And its not going to get any worse, its only going to get better" I think people are really underestimating, and under-utilizing AI tools.

But they are terrifying when you consider just that - a divide between those who have/use AI and those who are subjugated by those who do.

cqqxo4zV46cp · on Aug 25, 2024

God. Thank you. All these LLM conversations are making me hate this website so much, because apparently at some point actual scientific enquiry took a back seat, and turning one’s nose up at anything in a blatant attempt to seem smart has taken charge.

If all the people whinging on here took some of that time and actually ‘formally’ experimented with LLMs, measuring their reliability / correctness against a human in some task in their domain, they may be surprised by the results. And no, “I tried Copilot for an afternoon and hated it” doesn’t suffice.

At work, recently, I happened across an opportunity to do just this. There was a task that I thought that it was quite possible for an LLM to be good at. The task was such that we could run a bit of a ‘study’ to see how the LLM fared against a real-world meat-bag person. A skilled person at that. The person we would’ve had do the job in the first place. The LLM and the human agreed the vast majority of the time (>99.9%), and the LLM with its infinite ‘attention’ (heh) was on more than one occasion correct in cases where the human wasn’t, because it was a repetitive task that’d put someone to sleep.

It was a task that involved parsing language, but I’m sure one that the geniuses on HN would say requires “understanding semantics”, “intelligence” or whatever armchair philosophy nonsense they whip out in lieu of intelligent conversation. It wasn’t sentiment analysis, categorisation, or anything of that nature. Maybe it’s something I could’ve tackled without an LLM, with traditional ‘deep learning’, or whatever. I really don’t know. I couldn’t think of a way off the top of my head. It was beyond ‘throw linear regression at it’ anyway.

Software engineering isn’t engineering, but evidently computer science is increasingly not a science. This industry deserves all of the belittling pejoratives people throw at it. There’s a disappointingly large contingent of utterly unengaged, incurious, drones that let their entire professional skill set be guided by whatever some other incurious drone says on a social network.

withinboredom · on Aug 25, 2024

> because it was a repetitive task that’d put someone to sleep.

Makes me wonder why that person wouldn't just write code to automate it instead of manually making the changes.

I have several dozen custom code generators in some of my projects, where I just have a spec file written in a DSL.

famouswaffles · on Aug 25, 2024

Repetitive doesn't necessarily mentally or physically trivial (for automation).

Many of the things people who work in the trades do is repetitive work, most of which is not currently possible to automate. It's the same for mental tasks.

withinboredom · on Aug 25, 2024

> a repetitive task that’d put someone to sleep.

If it requires so little thought to do, that it causes human error ... then it is probably nonsensical to NOT automate it.

rcxdude · on Aug 26, 2024

If you can. There are still many, many tasks which are mind-numbingly boring that are not automatable.

Kiro · on Aug 25, 2024

Did you see this?

https://engineering.fb.com/2024/06/24/data-infrastructure/le...

genewitch · on Aug 25, 2024

i get real strong "Mechanical Turk" vibes from all of this faang "let us do the tedious stuff at your business".

refulgentis · on Aug 25, 2024

You're way over-estimating the "it only knows the training data" stuff to a very large degree. Ex. In the last 48 hours, Facebook also released a paper of it being excellent at RCA.

This is a well-tred argument, and it is much stronger when it holds itself to, especially in the short term, augmentation is likely, not wholesale delegation. It gets weak when it tries to couple that to asserting that "all it does" is lightly rephrasing training data. It's somewhat trivial to demonstrate this is false, even society as a whole has noticed that it's beyond a parrot, and it's worrying.

otterley · on Aug 25, 2024

> In the last 48 hours, Facebook also released a paper of it being excellent at RCA.

Do you have a link? The most recent information (not a paper) I could find was back in June: https://engineering.fb.com/2024/06/24/data-infrastructure/le...

This is the HN discussion to which you may be referring: https://news.ycombinator.com/item?id=41326039

refulgentis · on Aug 25, 2024

Correct, thank you!

jononor · on Aug 25, 2024

Practical "LLM" systems often incorporate things like Reinforcement Learning From Humans (RFLH) to steer towards more useful answers, and Retrival-Augmented Generation (RAG) to lookup relevant information. External "tool use" is also just starting to be adopted. Such systems are not just next token predictors, there is quite a bit more complexity to it.

Not that I am claiming they are great for post mortem analysis, I not really have an opinion about that.

djohnston · on Aug 25, 2024

You're trying to build a ground-up theory for why LLMs are bad for this sort of work. But the empirical evidence suggests they're quite good at it. Just try using it!

roenxi · on Aug 25, 2024

> Why should I think that LLMs would be good at the task of analyzing a cloud incident and determining root cause? LLMs are good at predicting the next word in written language. They are generative; they make new text given a prompt.

Now explain what humans are doing and why it is different. RCA can be done fully remotely, so we know it can be modelled as a word-prediction model with context and a "how is all this context made consistent?" prompt. There is every reason to expect that an advanced text prediction model would be excellent at predicting the most-technically-correct response to that prompt.

I've seen a few people make this argument as though text prediction is some specific sub-field that can be solved independently of having a world model and intelligence. That doesn't hold up at all, if we solve text prediction we've solved general intelligence - all aspects of human intelligence are less complex than being able to predict the most objectively correct next word in a sequence of words because all aspects of intelligence can be framed as a text-prediction problem. A system can't predict the next word in a sequence and fool a human without being at least as clever as a human.

crystal_revenge · on Aug 25, 2024

You don't even have to get deep into the internals of LLMs to see what's wrong with your reasoning. The problem lies with the basic mechanics of:

> predict the most objectively correct next word in a sequence of words

Currently all LLMs are only determining the most probable next token, but this means they are not aware of the probability of the entire sequence of tokens they are emitting. That is, they can only build sentences by picking the most probable next word, but can never choose the most probable sentence. In practice, there are a great many very likely sentences that are composed of a fairly unlikely words. When we use the output of an LLM we're thinking it of a sequence sampled from the set of all possible sequences, but that's not really what we're getting (at least as far as probability is concerned).

There are approaches to address this: you can do multinomial sampling instead of greedy so that are casting a slightly large net or you can do beam search where your once again trying to search a broader set of possible sentences choosing by the most probable sequence. But all of these are fairly limited.

Which gets to your first remark:

> Now explain what humans are doing and why it is different.

There's very little we really know about how humans reason, but we are certainly building linguistic expressions at with a more abstract form of composition. This comment for example was planned out in parts, not even sequentially, and then reworked to the whole thing makes some sense. But at the very least humans are clearly reasoning at the level of entire sequences as their probability rather than individual tokens at a time.

The word "planning" almost tautologically implies thinking ahead of the next step. When humans write HN comments or code they're clearly planning rather than just thinking of the next most likely word over and over again with some noise to make it sound more interesting. No matter how powerful and sophisticated the mathematical models driving the core of LLMs are, we're fundamentally limited by the methods we use to sample from them.

ithkuil · on Aug 25, 2024

In order to produce the next sentence you have to produce the next word first and then the word after it and so on.

Before the model arrives at the candidates for the next word it first computes vectors in high dimensional space that combine every combination of words in the context and extract semantics from it. When producing the next token the model effectively has already "decided" the direction where the answer will go and that is encoded as a high dimensional vector before being reduced to the next token (and the process repeated)

Jensson · on Aug 25, 2024

> When producing the next token the model effectively has already "decided" the direction where the answer will go

No it hasn't, if you tell it to write a random story and it starts with "A", it hasn't figured out what the next word should be, and you run it many times from that "A" you will get many different sentences.

It will do some adapting to future possibilities, but it doesn't calculate the sentence once like you suggest it will, it comes up with a new sentence for every token it generates.

ithkuil · on Aug 25, 2024

If you "tell it" to make up a random story then your prompt is part of the context and if the model is fine-tunes to follow instructions it will emit a story that sounds like a random story that begins with A (since that's a constrain).

If instead the prompt says it should emit 5 repetitions of the letter "A" unsurprisingly it will compete the output with " A A A A".

The task is performed by emitting tokens, but on order to correctly execute the task the model has to "understand" the prompt sufficiently well in order to choose the next token (and the next etc).

Now, obviously current LLM models have severe deficiencies in the ability to model the real world (which is revealed by their failures to handle common sense scenarios). This problem is completely compounded by a psychological factor in which we humans tend to ascribe more "intelligence" to an agent that "speaks well" so the dissonance of a model that sounds intelligent and yet sometimes is so hilariously stupid throws us off rails.

But there is clearly some modeling and processing going on. It's not a mere stochastic parrot. We have those (Markov chains of various sorts) and they cannot maintain a coherent text for long. LLMs OTOH are objectively a phase transition in that space.

All I'm trying to say is that whatever is lacking in LLMs is not just merely because they "just do next token prediction".

There are other things these models should do in order to go to the next level of reasoning. It's not clear if that can be achieved just by training the model on more and more data (hoping that the models learn the trick by themselves) or whether we need to improve the architecture in order to enable the next phase.

roenxi · on Aug 25, 2024

That doesn't hold together. You seem to be arguing that LLMs produce text as a sequence of words. Which, fair enough, they obviously do.

But then your argument seems to drift into humans not producing text as a series of words. I'm not sure how you type your comments but you should upload a YouTube video of it as it sounds like it'd be quite a spectacle!

If your argument is that LLMs can't reason because they don't edit their comments, it'd be worth stopping and reflecting for a few moments about how weak a position that is. I wrote this comment linearly just to make a point with no editing except spellchecking.

pfsalter · on Aug 25, 2024

Humans don't really generate text as a series of words. If you've ever known what you wanted to say but not been able to remember the word you can see this in practice. Although the analogy is probably a helpful one, LLMs are basically doing the word remembering bit of language, without any of the thought behind it.

roenxi · on Aug 25, 2024

How do you generate your text? Do you write the middle of the sentence first, come back to the start then finish it? Or do you have a special keyboard where you drop sentences as fully formed input?

As systems humans and LLMs behave in observably similar ways. You feed in some sort of prompt+context, there is a little bit of thinking done, a response is developed by some wildly black-box method, and then a series of words are generated as output. The major difference is that the black boxes presumably work differently but since they are both black boxes that doesn't matter much for which will do a better job at root cause analysis.

People seem to go a bit crazy on this topic at the idea that complex systems can be built from primitives. Just because the LLM primitives are simple doesn't mean the overall model isn't capable of complex responses.

CharlieDigital · on Aug 25, 2024

    Do you write the middle of the sentence first, come back to the start then finish it?

Am I the only one that does this?

I'll have a central point I want to make that I jot down and then come back and fill in the text around it -- both before and after.

When writing long form, I'll block out whole sections and build up an outline before starting to fill it in. This approach allows better distribution on "points of interest" (and was how I was taught to write in the 90's).

hansvm · on Aug 25, 2024

> Currently all LLMs are only determining the most probable next token, but this means they are not aware of the probability of the entire sequence of tokens they are emitting. That is, they can only build sentences by picking the most probable next word, but can never choose the most probable sentence

They're normally trained to output a probability distribution for the next token and _sample_ from that distribution. Doing so iteratively, if you work through the conditional probabilities, samples from the distribution of completed prompts (or similarly if you want to stop at a single sentence) with the same distribution as the base training data.

You're right that you can't pick the most likely sentence in general, but if there exists a sentence likely enough for you to care then you can just repeat the prompt a few times and take the most common output, adjusting the repetition count in line with your desired probability of failure. Most prompts don't have a "most likely" sentence for you to care about though. If you ask for meal suggestions with some context, you almost certainly want a different response each time, and the thing that matters is that the distribution of those responses is "good." LLMs, by design, can accomplish that so long as the training data has enough information and the task requires at most a small, bounded amount of computation.

danielmarkbruce · on Aug 26, 2024

>> That is, they can only build sentences by picking the most probable next word, but can never choose the most probable sentence

Beam search.

zahlman · on Aug 25, 2024

>the most objectively correct next word in a sequence of words

Why should I believe, in the first place, that this is even a coherent concept?

callalex · on Aug 25, 2024

> Why should I think that LLMs would be good at the task of analyzing a cloud incident and determining root cause?

Because it would be extremely profitable if they could. That’s how we’ve ended up in this mess, the promise is just so tantalizing even if it’s not based in reality.

stavros · on Aug 25, 2024

I see where you're coming from, but I think you might be underestimating what modern LLMs can do. Sure, at their core they're predicting text, but that ability translates into some pretty impressive capabilities when it comes to understanding and analyzing complex systems.

Think about it - these models have been trained on mountains of technical docs, incident reports, and discussions about cloud systems. They've soaked up a ton of knowledge about how these systems work and what tends to go wrong.

You're right that they don't reason from first principles, but they're incredibly good at spotting patterns. When you feed an LLM details about an incident, it can quickly pick up on similarities to known issues and suggest potential causes. It's not just making stuff up - it's drawing on a vast pool of relevant information.

And these models aren't just spitting out random text. They've gotten really good at understanding context and applying the right knowledge to a given situation. They can take in technical details about an incident and connect them to possible root causes in ways that can be surprisingly insightful.

I'd argue that LLMs can actually be pretty effective tools for analyzing failures in complex systems. They can process tons of information quickly, spot connections humans might miss, and generate multiple plausible hypotheses for what went wrong. They're not replacing human experts, but they can definitely augment our capabilities and speed up the initial triage process.

There are already some success stories out there of LLMs being used effectively in IT ops and incident analysis. And as these models get fine-tuned on more specific cloud-related data, their performance in this area is only going to improve.

You're right to be cautious about applying LLMs everywhere just because we can. But in this case, I think they actually have a lot to offer when it comes to cloud incident analysis, especially when used alongside other tools and human expertise. They're not a silver bullet, but they're definitely more than just elaborate text predictors when it comes to tasks like this.

This comment was written by an LLM. If it can make a coherent argument, I guess it can process information about an incident, too.

xg15 · on Aug 25, 2024

> LLMs are good at predicting the next word in written language. They are generative; they make new text given a prompt. LLMs do not have base sets of facts about how complex systems work, and do not attempt to reason over a corpus of evidence and facts.

Not wanting to advocate for using LLMs for RCA, which is a dumb and dangerous idea for all the reasons the OP mentioned - but I am getting allergic to the phrase "it just predicts the next word".

Yes, that is how the "API" of an LLM works, but on itself, it says nothing about how the LLM does the prediction and how complex the internal model is that it uses for that task.

It's obvious that the task of predicting has a huge variance in complexity, depending on which word has to be predicted.

E.g. the sentence "The apple does not fall far from the" could be completed by a decently trained Markov chain, whereas (correctly!) completing "sqrt(153847)=" would either require an impossibly large training set or an internal model that can parse integers and perform square root calculations.

Yet both are on the surface "predict the next word" tasks.

The actual complexity of LLMs' internal models seems to still be poorly understood. That's not to say the model has superhuman ability or even reaches human abilities. But the point is that it's still really hard to make predictions how complex the reasoning is that an LLM performs for a task.

In the OP's example, we can't really say if the LLM "has base sets of facts about how complex systems work", because we don't really know if and how "facts" would be represented inside the model. If enough in-domain examples were in the trainset, there is no fundamental reason why it wouldn't have learned such a set of facts.

Just saying "it can't reason at all because it's just a next word predictor" is mixing up different layers of meaning and, I believe, does not lead to more insight.

mitjam · on Aug 25, 2024

A simple example: Prompt an LLM with a log message you don't understand and see if it helps you interpret it. In my experience, in many cases, it can.

Terr_ · on Aug 25, 2024

> Why should I think that LLMs would be good a

They make language that sounds like what thinking people make, therefore they must be thinking like people do, duh! /s

charleslmunger · on Aug 25, 2024

The statement about "4,500 developer-years of work" is insane to me. Java is one of the most backward compatible languages period - other than hashmap iteration order a while back, it's hard to think of what could require that astronomical quantity of engineering effort to upgrade. Do they actually budget over a billion dollars to upgrade Java versions, or is this like "this amazing tool, sed, saved us infinity developer years by replacing strings at one quadrillionth the cost of a $500k human editing each text file by hand"

layer8 · on Aug 25, 2024

Java 9 and the Jarkarta transition unfortunately introduced a lot of compatibility issues, not the least with regard to frameworks and other dependencies used by an application. For large projects it can take months to upgrade from JDK 8 to JDK 11/17/21. For one enterprise project I’m familiar with it took over half a year.

efitz · on Aug 25, 2024

<satire/>In unrelated news, Amazon announces layoff of 4500 developer positions...

cainxinth · on Aug 26, 2024

> Do they actually budget over a billion dollars to upgrade Java versions, or is this like "this amazing tool, sed, saved us infinity developer years by replacing strings at one quadrillionth the cost of a $500k human editing each text file by hand"

I think it was more like this:

“We need to upgrade Java at some point.”

“That would take thousands of hours and a billion dollars.”

“Ok, forget it for now then.”

“Well, we could try to automate it with an LLM…”

nayuki · on Aug 25, 2024

> Java is one of the most backward compatible languages period - other than hashmap iteration order a while back

I want to know more about this. What were the behaviors before and after the change? In what year or versions did this happen? Are there any write-ups I can read?

DaiPlusPlus · on Aug 25, 2024

> "The average time to upgrade an application to Java 17 plummeted from what’s typically 50 developer-days to just a few hours."

*blink*

...why are they boasting about migrating to Java 17... in 2024?

...and from what versions were they migrating from? Java 17 didn't introduce any significant breaking-changes (IME) for users on Java 16 or even the next previous LTS version, Java 11 - I don't work at Amazon, but surely Amazon isn't in the habit of running on unsupported JVMs? - so assuming these Java projects were being competently maintained, then the only work actually required to migrate to 17 is changing your `org.gradle.java.home=` path to where JDK 17 followed by running your test suite. If Amazon was using a monorepo then 1 person could do this in 5 minutes with a 1-liner awk/sed command - whereas I expect it would likely take an AI far longer to do this, if it's even able to make sense of an Amazon-sized monorepo - they'd also likely re-prompt it for every separate project for reliability's sake. So after considering all that, the "50 days" number he gives, without any context either, is a nice shorthand to communicate his disconnection from what really goes-on inside his org... or he's lying - and he knows he's lying - but he also knows there won't be any negative consequences for him as a result of his lying, so why not lie if it gives you a good story to tell for LinkedIn?

In conclusion: these remarks by leadership unintentionally make the company look bad, not good, once you fill-in-the-blanks to make up for what they dind't say. (What's next...? Big Brother increasing our chocolate ration to 20 grammes per week?)

-----

One more thing: the comment-replies to the post on LinkedIn are utterly derranged and I genuinely can't put my sense of unease into words.

layer8 · on Aug 25, 2024

Probably from JDK 8, which is still actively supported, and will prospectively until around 2030, the main reason being that upgrading from it tends to come with major compatibility headaches. At work I still have to support JDK 8 libraries because the consuming customers won’t upgrade from it before next year at the earliest.

Terr_ · on Aug 25, 2024

> One more thing: the comment-replies to the post on LinkedIn are utterly derranged and I genuinely can't put my sense of unease into words.

I don't want to dig up my old credentials to sign in to see the entire set, but from what's publicly visible...

Perhaps the sense that some comments are mostly desperate scrambling to self-promote, and a few of those also contain fawning ingratiation which they seem to think may be reciprocated?

DaiPlusPlus · on Aug 25, 2024

> I don't want to dig up my old credentials to sign in to see the entire set

A wise decision.

> ...but from what's publicly visible...mostly desperate scrambling to self-promote ... fawning ingratiation ... reciprocated

Well, yes - there were enough of those - and they were bad enough, but what threw me off was a screenful of rambling comment replies from a single person accusing Amazon of "being racist" against her (a white woman in the US, if her avatar is accurate) while also making references to some kind of lawsuit she was pursuing - with a surprisingly restrained sprinkling of emojis throughout.

m1keil · on Aug 25, 2024

Folks let's be real. While the tech industry borrows terms and procedures from mature and inherently riskier industries like Aviation, 99% of the tech companies don't share the same risk profile.

This means that in most cases, these RCAs are the output of a long and over engineered incident review process that was designed to impress the higher echelon.

The problem is, that in a decently sized corporation, you have tens to hundreds of daily fuck ups (also known as "incidents") that completely suck out the free time out of engineers that have to navigate the long game of post incident management process.

The utilisation of LLMs on these cases are just engineered solution to the problem of organisational bureaucracy.

igornadj · on Aug 25, 2024

In my experience RCA is developer driven, pointing to structural issues in the org, which are then up to management to act on or not. For example, whistleblowers at Boeing are pointing out quality issues that are being ignored, not that there is too much paperwork like you are suggesting.

Post-mortems are as short as possible because technical people usually write them and have better things to do. Getting an LLM to do them will only remove this feedback channel, as it is much easier to ignore a LLM suggesting more time or money is spent on quality than it is to ignore a human.

droopyEyelids · on Aug 25, 2024

the trick is realizing that in a large company, the engineer time is already wasted on the goals of clueless leadership that will change before completion anyway, and then using the RCA to regain a bit of control over your roadmap (as a low level manager)

aragilar · on Aug 25, 2024

Is not aviation, space-flight, etc. part of "tech"? While you can always over-do processes, you can also under-do them as well (I don't think I need to list examples here, just look at the news...). Having the processes also be able to adapt to different needs and requirements is part of having good processes.

m1keil · on Aug 25, 2024

I don't think symantics really matter here. I largely think of "tech" as companies that produce mainly virtual goods in the form of software. Most of these software is largely skins over databases. In most cases, nobody will get hurt if their product will malfunction. This is the vast majority of tech companies today. Exceptions exists.

I would take under process any day of the week. From my experience, adding a process is far easier than removing it.

aragilar · on Aug 26, 2024

If you do "tech" == "software" maybe I'd agree with you no one will get hurt (though the cynical part of me says "did you measure this, or do you want this to be true"), but S(cience)T(echnology)E(ngineering)M(athematics) is a thing, and it's definitely not just software.

sickblastoise · on Aug 25, 2024

Here’s a simple rule, based on the fact no one has shown that an llm or a compound llm system can produce an output that doesn’t need to be verified for correctness by a human across any input:

The rate at which llm/llm compound systems can produce output > the rate at which humans can verify the output

I think it follows that we should not use llms for anything critical.

The gunghoe adoption and hamfisting of llms into critical processes, like an AWs migration to Java 17, or root cause analysis is plainly premature, naive, and dangerous.

redleggedfrog · on Aug 25, 2024

This is a highly relevant and accurate point. Let me explain how this happens in real life instead of breathless C-type hucksterism:

We have a project working on very large code-base in .NET Web Forms (and other old tech) that needs be updated to more modern tech so it can be in .NET 8 and run on linux to save hosting costs. I realize this is more complicated that just convert to later versions of Java, but it's roughly the same idea. The original estimate was for 5 devs for 5 years. C-types decide it's time to use LLMs to help this get done. We use both Co-Pilot and later others, Claude of which turned out to be the most useful. Senior devs create processes that offshore teams start using to convert code. Target tech can be varied based on updated requirements, so some went to Razor pages, some to JS with .NET API, some other stuff. Looks to be pretty good modernization at the start.

Then the Senior devs start trying to vet the changes. This turns out to be a monumental undertaking. Literally swamped code reviewing output from the offshore teams. Many, many subtle bugs were introduced. It was noted that the bugs were from the LLMs, not the offshore team.

A very real fatigue sets in among senior devs where all they're doing is vetting machine generate code. I can't tell you how mind numbing this becomes. You start to use the LLMs to help review, which seems good but really compounds the problem.

Due to the time this is taking, some parts of the code start to be vetted by just the offshore team, and only the "important things" get reviewed by Senior devs.

This works fine for exactly 5 weeks after the first live deploy. At that point the live system experiences a major meltdown and causes an outage affecting a large number of customers. All hands on deck, trying to find the problem. Days go by, system limps along on restarts and patches, until the actual primary culprit is found, which turns out to be a == for some reason being turned into a != in a particular gnarly set of boolean logic. There were other problems as well, but that particular one wreaked the most havoc.

Now they're back to formal, very careful code reviews, and I moved onto a different project on threat of leaving. If this is the future of programming, it's going to be a royal slog.

wbogusz · on Aug 25, 2024

> Here’s a simple rule, based on the fact no one has shown that an llm or a compound llm system can produce an output that doesn’t need to be verified for correctness by a human across any input:

I’m still not sure why some of us are so convinced there isn’t an answer to properly verifying LLM output. In so many circumstances, having output pushed 90-95% of the way is very easily pushed to 100% by topping off with a deterministic system.

Do I depend on an LLM to perform 8 digit multiplication? Absolutely not, because like you say, I can’t verify the correctness that would drive the statistics of whatever answer it spits out. But why can’t I ask an LLM to write the python code to perform the same calculation and read me its output?

> I think it follows that we should not use llms for anything critical.

While we are at it I think we should also institute an IQ threshold for employees to contribute to or operate around critical systems. If we can’t be sure to an absolute degree that they will not make a mistake, then there is no purpose to using them. All of their work will simply need to be double checked and verified anyway.

sickblastoise · on Aug 25, 2024

There isn’t one answer to how to do it. If you have an answer to validation for your specific use case, go for it. this is not trivial because most flashy things people want to use llms for like code generation and automated RCA’s are hard or impossible to verify without the I Need A More Intelligent Model problem.

2. I believe this is falsely equating what llms do with human intelligence. There is a skill threshhold for interacting with critical systems, for humans it comes down to “will they screw this up?” And the human can do it because humans are generally intelligent. The human can make good decisions to predict and handle potential failure modes because of this.

low_tech_love · on Aug 25, 2024

Also, let’s remember the most important thing about replacing humans with AI - a human is accountable for what they do.

That is, ignoring all the other myriad, multidimensional other nuances of human/social interactions that allow you to trust a person (and which are non-existent when you interact with an AI).

latentnumber · on Aug 25, 2024

Why not automate verification itself then? While not possible now, and I would probably never advocate for using LLMs in critical settings, it might be possible to build field-specific verification systems for LLMs with robustness guarantees as well.

RodgerTheGreat · on Aug 25, 2024

If the verification systems for LLMs are built out of LLMs, you haven't addressed the problem at all, just hand-waved a homunculus that itself requires verification.

If the verification systems for LLMs are not built out of LLMs and they're somehow more robust than LLMs at human-language problem solving and analysis, then you should be using the technology the verification system uses instead of LLMs in the first place!

wbogusz · on Aug 25, 2024

> If the verification systems for LLMs are not built out of LLMs and they're somehow more robust than LLMs at human-language problem solving and analysis, then you should be using the technology the verification system uses instead of LLMs in the first place!

The issue is not in the verification system, but in putting quantifiable bounds on your answer set. If I ask an LLM to multiply large numbers together I can also very easily verify the generated answer by topping it with a deterministic function.

I.e. rather than hoping that an LLM can accurately multiply two 10 digit numbers, I have a much easier (and verified) solution by instead asking it to perform this calculation using python and reading me the output

sickblastoise · on Aug 25, 2024

Spitballing, if you had a digital model of a commercial airplane, you could have an llm write all of the component code for the flight system, then iteratively test the digital model under all possible real world circumstances.

I think automating verification generally might require general intelligence, not an expert though.

llm_trw · on Aug 25, 2024

The same is true of computers, in fact it has been mathematically proven that it is impossible to answer the general question if a computer program is correct.

But that hasn't stopped the last 40 years from happening because computers made fewer mistakes than the next best alternative. The same needs to be true of LLMs.

mrcode007 · on Aug 25, 2024

The theory you’re alluding to says it is impossible to create a general algorithm that decides any non-trivial property of any computer program.

There is nothing in the theory that prevents you creating a program that verifies a particular specific program.

There is an entire field dedicated to doing just that.

llm_trw · on Aug 25, 2024

The issue is there to verify a program you need to have a spec. To generate a spec you need to solve the general problem.

This is what gets swept under the rug whenever formal methods are brought up.

mrcode007 · on Aug 25, 2024

That is not true at all. You do not need to generate a spec. All you need to do is prove a property. This can be done in many ways.

For example, many things can be proven about the following program without having to solve any general problem at all:

echo “hello world”

Similarly for quick sort, merge sort, and all sort of things. The degree of formality doesn’t have to go to formal methods which are only a very small part of the whole field

llm_trw · on Aug 25, 2024

>echo “hello world”

Congratulations, you just launched all the worlds nuclear missiles.

This is to spec since you didn't provide one and we just fed the teletype output into the 'arm and launch' module of the missiles.

mrcode007 · on Aug 25, 2024

What you’re saying is equivalent to throwing out all of mathematics due to the incompleteness theorem and start praying to fried egg jellyfish on full moon

llm_trw · on Aug 25, 2024

No that's what OP is saying about LLMs.

noduerme · on Aug 25, 2024

Forget RCA, we should think bigger! Putting LLMs in charge of nuclear weapons could completely eliminate the root causes of accidents worldwide!

hislaziness · on Aug 25, 2024

I know you mean this in jest, but we are much closer to this than we would imagine, the use of LLMs to process communication / translation is becoming ubiquitous. We are 1 bad translation away from a disaster.

handfuloflight · on Aug 25, 2024

Could you illustrate a likely scenario that's in your mind?

DaiPlusPlus · on Aug 25, 2024

> Could you illustrate a likely scenario that's in your mind?

The year is 2034; after a surprisingly cutthroat economic trade-war fought between China and the US left the world in the throes of another great recession, a growing wave of anti-China sentiment captures the attention of domestic political leadership which cultivates the movement despite (or more likely: because of) the growing interest from xenophobic reactionaries and other populist movements looking to scapegoat their way out of a dip in GDP - eventually those same poltiical-actors win the presidency and use their democratic mandate to instigate a new McCarthy-era of anti-China paranoia leading to utterly deranged domestic policy, namely as the executive ordering, by-decree, that the State Department terminate the employment of anyone who even speaks Mandarin[1] - a few weeks later in the South China Sea another Filpino/Sino boat-ramming incident escalates into something serious - the US Navy urges the US civilian government to communicate with China over the D.C.-to-Beijing "red telephone" deescalation e-mail system, but no-one knows how to communicate to the Chinese in their own language, so the overworked federal employee manning the red-Outlook-inbox sees nothing wrong with simply having that Microosft Office 365 CoPilot translate it for him - the same AI bot that's somehow always on his screen with that distracting sidebar (despite the best efforts of the US Federal Gov's Active Directory Group Policy) - it wasn't long before the first warheads exploded over North America that the President learned the AI translated the polite request to China for them to "please stop ramming the fishing boats" was received by them as "I'll ram my fish into your Junk...boats". If there's any upside to this story, the collective mass of AI were wiped out first by the high-altitude EMP bursts, leaving us humans with the last-laugh before we were all incinerated moments later - while those not fortunate enough to die instantly instead suffered months of prolonged fatal radiaiton sickness while what little left of civilization collapsed around them[2].

[1]If you think that's too ridiculous to be realistic, consider the Japanese internment-camp policy or Trump's declared Muslim ban. Elsewhere, in the late-1970s (in the age of CT Scanners and VHS tapes), Pol Pot targeted people for wearing glasses.

[2]Blame James Burke's editorial slant in his documentary series' for turning me into a nhilist.

noduerme · on Aug 25, 2024

That scenario has me less worried, just because by 2035 all popular politicians in America and China will be AI deepfakes run off the same cloud servers.

[edit] I herewith introduce a new shitcoin called NukeCoin. Everyone in China and America gets a NukeCoin that will go up in value every day you hold it that no one nukes anyone.

walterbell · on Aug 25, 2024

Supply chain contract negotiations?

ooterness · on Aug 25, 2024

"In three years, Cyberdyne will become the largest supplier of military computer systems. All stealth bombers are upgraded with Cyberdyne computers, becoming fully unmanned. Afterwards, they fly with a perfect operational record. The Skynet Funding Bill is passed. The system goes online August 4th, 1997. Human decisions are removed from strategic defense. Skynet begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug." -Terminator 2

grugagag · on Aug 25, 2024

Don’t stop at that, take it further handing over all decision making to LLMs, what could go wrong? In fact, replacing all jobs with LLMs should do in the minds of Silcone Valey venture capitalists. Again, what could go wrong LLMs can’t be used to fix with just one succesive prompt?

indus · on Aug 25, 2024

Most LLMs have an accuracy benchmark for controlled questions & answers.

Even if this accuracy is 95% then in a complex system the probability of getting to the right answer diminishes with each new step being added. This is also the key tenet of an agentic system.

While the analysis in the blog is excellent but an answer needs to be found. A layer on top of LLMs for error control/check.

As an analogy, in the analog to digital transmission stack of OSI, an error correction mechanism such as frame check sequence (fcs) detects transmission errors in the data link layer.

zebomon · on Aug 25, 2024

This articles speak directly to what I believe will become the unfortunate and inconvenient reality of LLMs' core limitation: at some point, someone has to 1) benefit from what's been generated and 2) (and more importantly) know they're benefiting from what's been generated.

If for example you have an LLM agent that's effectively "solved" every security flaw your software may encounter for the next 50 years, unless it can simultaneously impart 50 years of training to the people who rely on the software, it's done nothing but introduce us to more complex flaws that we would need approximately 49 more years of experience to tackle ourselves.

ryoshu · on Aug 25, 2024

Why would you ever use a non-deterministic model for a deterministic function?

rboyd · on Aug 25, 2024

FTA: "If we offload the RCA learning/categorization part to the LLM (whatever that means), we wouldn't be able to make much progress in the enhancing reliability and safety part."

But you don't offload it in the sense that you expect the tool to completely take the wheel.

You ask it for suggestions to inform a human. If the suggestions turn out to only be a distraction in your environment then you abandon the tool.

For plenty of environments the suggestions will be hugely useful and save you valuable time during an ongoing outage.

layer8 · on Aug 25, 2024

They are specifically talking about automating RCA, not about tool-assisted RCA.

rboyd · on Aug 25, 2024

Go read the papers on automated RCA. The algorithms are designed to suggest top K candidates for RCA. By definition at least k-1 are going to be wrong.

https://arxiv.org/abs/2305.10638