LLMs absolutely produce reams of hard-to-debug code. It's a real problem.
But "Teams that care about quality will take the time to review and understand LLM-generated code" is already failing. Sounds nice to say, but you can't review code being generated faster than you can read it. You either become a bottleneck (defeats the point) or you rubber-stamp it (creates the debt). Pick your poison.
Everyone's trying to bolt review processes onto this. That's the wrong layer. That's how you'd coach a junior dev, who learns. AI doesn't learn. You'll be arguing about the same 7 issues forever.
These things are context-hungry but most people give them nothing. "Write a function that fixes my problem" doesn't work, surprise surprise.
We need different primitives. Not "read everything the LLM wrote very carefully", but ways to feed it the why, the motivation, the discussion, the prior art. Otherwise yeah, we're building a mountain of code nobody understands.
We use the various instruction .md files for the agents and update them with common issues and pitfalls to avoid, as well as pointers to the coding standards doc.
Gemini and Claude at least seem to work well with it, but they still make mistakes sometimes (e.g. C++ auto is a recurring one: the context markdown file clearly says not to use it, yet it keeps creeping in). I think as the models improve and get better at instruction following, this will get better.
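To make that concrete, here's roughly the shape of the thing; the helper and types are made up for illustration, assuming a standards doc that prefers explicit types over auto:

```cpp
#include <string>
#include <vector>

// Hypothetical helper, purely for illustration.
std::vector<std::string> collect_names() { return {"alice", "bob"}; }

void instructed_style() {
    // What the instruction .md asks for: the type is visible at the call site.
    std::vector<std::string> names = collect_names();
    (void)names;
}

void what_the_model_keeps_writing() {
    // What the agents keep producing anyway, despite the context file.
    auto names = collect_names();
    (void)names;
}
```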
Not saying this is "the solution" but it gets some of the way.
I think we need to move away from "vibe coding", to more caring about the general structure and interaction of units of code ourselves, and leave the AI to just handle filling in the raw syntax and typing the characters for us. This is still a HUGE productivity uplift, but as an engineer you are still calling the shots on a function by function, unit by unit level of detail. Feels like a happy medium.
It does rather invite the question of whether the most popular programming languages today are conducive to "more caring about the general structure and interaction of units of code" in the first place. Intuitively it feels like something more along the lines of, say, Ada SPARK, with its explicit module interfaces and features like design by contract, would be better suited to this.
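For a rough idea of what design by contract buys you, here's a C++ stand-in (not SPARK) that fakes contracts with plain assertions; the function and its conditions are invented for the sake of the example:

```cpp
#include <cassert>

// Rough stand-in for SPARK-style contracts: the requirement and the guarantee
// are written down next to the interface, so both a human reviewer and an LLM
// have something explicit to check the body against.
int checked_divide(int dividend, int divisor) {
    assert(divisor != 0);  // precondition: the caller must not pass zero
    int quotient = dividend / divisor;
    // postcondition: quotient and remainder reconstruct the dividend
    assert(quotient * divisor + dividend % divisor == dividend);
    return quotient;
}
```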
Same thing with syntax - so far we've been optimizing for humans, and humans work best at a certain level of terseness and context-dependent implicitness (when things get too verbose, it's visually difficult to parse), even at the cost of some ambiguity. But for LLMs verbosity may well be a good thing to keep the model grounded, so perhaps stuff like type inference, even for locals, is a misfeature in this context. In fact, I wonder if we'd get better results if we forced the models to e.g. spell out the type of each expression in full, maybe even outright ban stuff like method chains and require each call result to be bound to some variable (thus forcing the LLM to give it a name, effectively making a note on what it thinks it's doing).
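Hand-waving a bit, the "spell everything out" style could look something like this; the Order/Db types and the threshold are made up purely for illustration:

```cpp
#include <string>
#include <vector>

// Invented types, just to have something to compute over.
struct Order { std::string id; double total; };

struct Db {
    std::vector<Order> fetch_orders() const { return {{"a1", 10.0}, {"a2", 250.0}}; }
};

double large_order_total(const Db& db) {
    // A terse, human-optimized version might be a one-line chain, e.g.
    //   return sum(filter(db.fetch_orders(), is_large));
    // The verbose version forces the model to name and type every step,
    // effectively writing down what it thinks each intermediate is.
    std::vector<Order> all_orders = db.fetch_orders();

    std::vector<Order> large_orders;
    for (const Order& order : all_orders) {
        bool order_is_large = order.total >= 100.0;
        if (order_is_large) {
            large_orders.push_back(order);
        }
    }

    double total_of_large_orders = 0.0;
    for (const Order& order : large_orders) {
        total_of_large_orders = total_of_large_orders + order.total;
    }
    return total_of_large_orders;
}
```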
Literate programming also feels like it should fit in here somewhere...
So, basically, a language that would be optimized specifically for LLMs to write, and for humans to read and correct.
Going beyond the language itself, there's also a question of ecosystem stability. Things that work today should continue to work tomorrow. This includes not just the language, but all the popular libraries.
And what are we doing instead? We're having them write Python and JavaScript, of all things. One language famous for its extreme dynamism, with a poorly bolted-on static type system; the other much the same, but also notorious for its footguns and package churn.
It's better if the bottleneck is just reviewing, instead of both coding and reviewing, right?
We've developed plenty of tools for this (linting, fuzzing, testing, etc). I think what's going on is people who are bad at architecting entire projects and quickly reading/analyzing code are having to get much better at that and they're complaining. I personally enjoy that kind of work. They'll adapt, it's not that hard.
There are plenty of changes that don't require deep review, though. If you've written a script that's, say, a couple of fancy find/replaces, you probably don't need to review every usage. Check 10 of 500, make sure it passes lint/tests/typecheck, and it's likely fine.
The problem is that LLM-driven changes require this adversarial review on every line, because you don't know the intent. Human changes have a coherence to them that speeds up review.
(And if your company culture is line-by-line review of every PR, regardless of complexity ... congratulations, I think? But that's wildly outside the norm.)
A proper line-by-line review tops out at 400-500 lines per hour, at which point the reviewer is spent and should take a 30-minute break. It's a great process if you're building a spaceship, I guess.
Yes, "just take the time to review and understand LLM-generated code" is the new "just don't write bad code and you won't have any bugs". As an industry, we know from years of writing bugs despite not wanting to that this is impossible at scale. Reviewing all the AI code to make sure it's good code doesn't scale either. It will not work, and it will take the industry 5-10 years to figure that out.
I've had some (anecdotal) success reframing how I think about my prompts and the context I give the LLM. Once I started thinking about it as reducing the probability space of output through priming via context+prompting I feel like my intuition for it has built up. It also becomes a good way to inject the "theory" of the program in a re-usable way.
It still takes a lot of thought and effort up front to put that together, and I'm not quite sure where the crossover point between easier-to-do-it-myself and hand-off-to-the-LLM is.
The correct primitives are the tests. Ensure your model is writing tests as you go, and make sure you review the tests, which should be pretty readable. Don't merge until both old and new tests pass. Invest in your test infrastructure so that your test suite doesn't get too slow, as it will be in the hot path of your model checking future work.
Legacy code is that which lacks tests. Still true in the LLM age.
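As a rough sketch of what "readable enough to review" might mean in practice (slugify and its cases are invented here, not anyone's actual code):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Made-up example function so the test has something to exercise.
std::string slugify(const std::string& title) {
    std::string slug;
    for (char c : title) {
        if (c == ' ') {
            slug += '-';
        } else {
            slug += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
    }
    return slug;
}

int main() {
    // Each case states intent plainly; reviewing these is far cheaper than
    // reviewing the implementation line by line.
    assert(slugify("Hello World") == "hello-world");
    assert(slugify("LLM Code") == "llm-code");
    assert(slugify("") == "");
    return 0;
}
```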
> You either become a bottleneck (defeats the point)
How...?
When I find code snippets on StackOverflow, I read them before pasting them into my IDE. I'm the bottleneck. Therefore there's no point in using StackOverflow...?