LLMs absolutely produce reams of hard-to-debug code. It's a real problem.
But "Teams that care about quality will take the time to review and understand LLM-generated code" is already failing. Sounds nice to say, but you can't review code being generated faster than you can read it. You either become a bottleneck (defeats the point) or you rubber-stamp it (creates the debt). Pick your poison.
Everyone's trying to bolt review processes onto this. That's the wrong layer. That's how you'd coach a junior dev, who learns. AI doesn't learn. You'll be arguing about the same 7 issues forever.
These things are context-hungry but most people give them nothing. "Write a function that fixes my problem" doesn't work, surprise surprise.
We need different primitives. Not "read everything the LLM wrote very carefully", but ways to feed it the why, the motivation, the discussion, the prior art. Otherwise yeah, we're building a mountain of code nobody understands.
We use the various instruction .md files for the agents and update them with common issues and pitfalls to avoid, as well as pointers to the coding standards doc.
Gemini and Claude at least seem to work well with it, but they still make mistakes sometimes (e.g. C++ auto is a recurring one: the context markdown file clearly says not to use it, yet it keeps creeping in). I think as the models improve and get better at instruction following, this will get better.
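To make that concrete, here's roughly the shape of the thing; the helper and types are made up for illustration, assuming a standards doc that prefers explicit types over auto:

```cpp
#include <string>
#include <vector>

// Hypothetical helper, purely for illustration.
std::vector<std::string> collect_names() { return {"alice", "bob"}; }

void instructed_style() {
    // What the instruction .md asks for: the type is visible at the call site.
    std::vector<std::string> names = collect_names();
    (void)names;
}

void what_the_model_keeps_writing() {
    // What the agents keep producing anyway, despite the context file.
    auto names = collect_names();
    (void)names;
}
```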
Not saying this is "the solution" but it gets some of the way.
I think we need to move away from "vibe coding", to more caring about the general structure and interaction of units of code ourselves, and leave the AI to just handle filling in the raw syntax and typing the characters for us. This is still a HUGE productivity uplift, but as an engineer you are still calling the shots on a function by function, unit by unit level of detail. Feels like a happy medium.
It does rather invite the question of whether the most popular programming languages today are conducive to "more caring about the general structure and interaction of units of code" in the first place. Intuitively it feels like something more along the lines of, say, Ada SPARK, with its explicit module interfaces and features like design by contract, would be better suited to this.
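For a rough idea of what design by contract buys you, here's a C++ stand-in (not SPARK) that fakes contracts with plain assertions; the function and its conditions are invented for the sake of the example:

```cpp
#include <cassert>

// Rough stand-in for SPARK-style contracts: the requirement and the guarantee
// are written down next to the interface, so both a human reviewer and an LLM
// have something explicit to check the body against.
int checked_divide(int dividend, int divisor) {
    assert(divisor != 0);  // precondition: the caller must not pass zero
    int quotient = dividend / divisor;
    // postcondition: quotient and remainder reconstruct the dividend
    assert(quotient * divisor + dividend % divisor == dividend);
    return quotient;
}
```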
Same thing with syntax - so far we've been optimizing for humans, and humans work best at a certain level of terseness and context-dependent implicitness (when things get too verbose, it's visually difficult to parse), even at the cost of some ambiguity. But for LLMs verbosity may well be a good thing to keep the model grounded, so perhaps stuff like type inference, even for locals, is a misfeature in this context. In fact, I wonder if we'd get better results if we forced the models to e.g. spell out the type of each expression in full, maybe even outright ban stuff like method chains and require each call result to be bound to some variable (thus forcing the LLM to give it a name, effectively making a note on what it thinks it's doing).
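Hand-waving a bit, the "spell everything out" style could look something like this; the Order/Db types and the threshold are made up purely for illustration:

```cpp
#include <string>
#include <vector>

// Invented types, just to have something to compute over.
struct Order { std::string id; double total; };

struct Db {
    std::vector<Order> fetch_orders() const { return {{"a1", 10.0}, {"a2", 250.0}}; }
};

double large_order_total(const Db& db) {
    // A terse, human-optimized version might be a one-line chain, e.g.
    //   return sum(filter(db.fetch_orders(), is_large));
    // The verbose version forces the model to name and type every step,
    // effectively writing down what it thinks each intermediate is.
    std::vector<Order> all_orders = db.fetch_orders();

    std::vector<Order> large_orders;
    for (const Order& order : all_orders) {
        bool order_is_large = order.total >= 100.0;
        if (order_is_large) {
            large_orders.push_back(order);
        }
    }

    double total_of_large_orders = 0.0;
    for (const Order& order : large_orders) {
        total_of_large_orders = total_of_large_orders + order.total;
    }
    return total_of_large_orders;
}
```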
Literate programming also feels like it should fit in here somewhere...
So, basically, a language that would be optimized specifically for LLMs to write, and for humans to read and correct.
Going beyond the language itself, there's also a question of ecosystem stability. Things that work today should continue to work tomorrow. This includes not just the language, but all the popular libraries.
And what are we doing instead? We're having them write Python and JavaScript, of all things. One language famous for its extreme dynamism, with a poorly bolted-on static type system; the other much the same, but also notorious for its footguns and package churn.
It's better if the bottleneck is just reviewing, instead of both coding and reviewing, right?
We've developed plenty of tools for this (linting, fuzzing, testing, etc). I think what's going on is people who are bad at architecting entire projects and quickly reading/analyzing code are having to get much better at that and they're complaining. I personally enjoy that kind of work. They'll adapt, it's not that hard.
There are plenty of changes that don't require deep review, though. If you've written a script that's, say, a couple of fancy find/replaces, you probably don't need to review every usage. Check 10 of 500, make sure it passes lint/tests/typecheck, and it's likely fine.
The problem is that LLM-driven changes require this adversarial review on every line, because you don't know the intent. Human changes have a coherence to them that speeds up review.
(And if your company culture is line-by-line review of every PR, regardless of complexity ... congratulations, I think? But that's wildly outside the norm.)
A proper line-by-line review tops out at 400-500 lines per hour, at which point the reviewer is spent and should take a 30-minute break. It's a great process if you're building a spaceship, I guess.
Yes, "just take the time to review and understand LLM-generated code" is the new "just don't write bad code and you won't have any bugs". As an industry, we know from years of writing bugs despite not wanting to that this is impossible at scale. Reviewing all the AI code to make sure it's good code doesn't scale either. It will not work, and it will take the industry 5-10 years to figure that out.
I've had some (anecdotal) success reframing how I think about my prompts and the context I give the LLM. Once I started thinking about it as reducing the probability space of output through priming via context+prompting I feel like my intuition for it has built up. It also becomes a good way to inject the "theory" of the program in a re-usable way.
It still takes a lot of thought and effort up front to put that together, and I'm not quite sure where the crossover point between easier-to-do-it-myself and hand-off-to-the-LLM is.
The correct primitives are the tests. Ensure your model is writing tests as you go, and make sure you review the tests, which should be pretty readable. Don't merge until both old and new tests pass. Invest in your test infrastructure so that your test suite doesn't get too slow, as it will be in the hot path of your model checking future work.
Legacy code is that which lacks tests. Still true in the LLM age.
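As a rough sketch of what "readable enough to review" might mean in practice (slugify and its cases are invented here, not anyone's actual code):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Made-up example function so the test has something to exercise.
std::string slugify(const std::string& title) {
    std::string slug;
    for (char c : title) {
        if (c == ' ') {
            slug += '-';
        } else {
            slug += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
    }
    return slug;
}

int main() {
    // Each case states intent plainly; reviewing these is far cheaper than
    // reviewing the implementation line by line.
    assert(slugify("Hello World") == "hello-world");
    assert(slugify("LLM Code") == "llm-code");
    assert(slugify("") == "");
    return 0;
}
```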
> You either become a bottleneck (defeats the point)
How...?
When I find code snippets on StackOverflow, I read them before pasting them into my IDE. I'm the bottleneck. Therefore there's no point in using StackOverflow...?