Between this and http://myticker.com (posted recently), I want to share a theory of mine:
1) the internet is mostly made up of spaces where the median opinion is vanishingly rare among actual humans.
2) the median internet opinion is that of a person who is deep into the topic they're writing about.
The net result is that for most topics, you will feel moderate to severe anxiety about being "behind" on what you should be doing.
I'm 40, and I'm active. I ran a half marathon last weekend. I spent 5 hours climbing with my kids this weekend. My reaction to these articles, emotionally, was "I'm probably going to die of heart disease," because my cholesterol is a bit high and my BMI is 30. When I was biking 90 miles a week, my VO2 max was "sub-standard."
Let's assume this information is true. That's OK. It's all dialed up to 11, and you don't have to do anything about it right now.
Across the population as a whole, a BMI of 30 is a basically negligible increase in all-cause mortality. For someone otherwise reasonably active, I wouldn't stress about the number. The ideal is somewhere around 27.
BMI is useful for screening purposes but on an individual basis it's meaningless as a predictor of all-cause mortality. What really matters is body composition, or more specifically amount of visceral fat (subcutaneous fat doesn't matter nearly as much).
Where are you getting this number? Over 27% body fat is a health risk. For an active but not muscular individual, 30 BMI is at least 33% body fat, likely higher.
Yes, those are the definitions we have assigned to the number. However, independent of the arbitrary labels, the actual impact on health matters more to me.
Don't feel bad about your VO2 max, the baseline and ceiling are largely genetic. Most people can only bump VO2 max by about 10-15% even with absurd training regimens. Same goes with many of the markers people track - you can control them to an extent, but some people just have high blood pressure or poor lipid profiles and thus need intervention.
Thanks for saying that. Even when I ran every day, with the occasional VO2 max sprint day, my Apple Watch never placed me anywhere but Below Average for VO2 max. It was disheartening. Some of these metrics actually put you off training.
I don't think we ever get away from the code being the source of truth. There has to be one source of truth.
If you want to go all in on specs, you must fully commit to allowing the AI to regenerate the codebase from scratch at any point. I'm an AI optimist, but this is a laughable stance with current tools.
That said, the idea of operating on the codebase as a mutable, complex entity, at arm's length, makes a TON of sense to me. I love touching and feeling the code, but as soon as there's 1) schedule pressure and 2) a company's worth of code, operating at a systems level of understanding just makes way more sense. Defining what you want done, using a mix of user-centric intent and architecture constraints, seems like a super high-leverage way to work.
The feedback mechanisms are still pretty tough, because you need to understand what the AI is implicitly doing as it works through your spec. There are decisions you didn't realize you needed to make, until you get there.
We're thinking a lot about this at https://tern.sh, and I'm currently excited about the idea of throwing an agentic loop around the implementation itself. Adversarially have an AI read through that huge implementation log and surface where it's struggling. It's a model that gives real leverage, especially over the "watch Claude flail" mode that's common in bigger projects/codebases.
> There are decisions you didn't realize you needed to make, until you get there.
is the key insight and the biggest stumbling block for me at the moment.
At the moment (encouraged by my company) I'm experimenting with as-hands-off-as-possible agent usage for coding. And it is _unbelievably_ frustrating to see the agent get 99% of the code right in the first pass, only to misunderstand why a test is now failing and then completely mangle both its own code and the existing tests as it tries to "fix" the "problem". If I'd just given it a better spec to start with, it probably wouldn't have started producing garbage.
But I didn't know that before working with the code! So to develop a good spec, I either have to have the agent stop constantly so I can intervene, or dive into the code myself to begin with, and at that point I may as well write the code anyway, since writing the code is not the slow bit.
And my process now (and what we're baking into the product) is:
- Make a prompt
- Run it in a loop over N files. Full agentic toolkit, but don't be wasteful (no "full typecheck, run the test suite" on every file).
- Have an agent check the output. Look for repeated exploration, look for failures. Those imply confusion.
- Iterate the prompt to remove the confusion.
First pass on the current project (a Vue 3 migration) went from 45 min of agentic time on 5 files to 10 min on 50 files, and the latter passed tests/typecheck/my own scrolling through it.
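For concreteness, here's a rough sketch of that loop in Python. The helpers (`run_agent`, `review_transcripts`, `revise_prompt`) are hypothetical stand-ins for whatever agent runner you use, not real APIs:

```python
from pathlib import Path

def run_agent(prompt: str, file: Path) -> str:
    """Hypothetical: run one constrained agentic pass over a file, return its transcript."""
    raise NotImplementedError("plug in your agent runner here")

def review_transcripts(transcripts: list[str]) -> list[str]:
    """Hypothetical: a second agent reads the transcripts and returns confusion signals
    (repeated exploration, failures, backtracking)."""
    raise NotImplementedError("plug in the reviewing agent here")

def revise_prompt(prompt: str, confusion: list[str]) -> str:
    """Human-in-the-loop: edit the prompt to remove the sources of confusion."""
    raise NotImplementedError("edit the prompt by hand")

def run_batch(prompt: str, files: list[Path]) -> list[str]:
    # Full tool access per file, but skip wasteful global steps
    # (whole-project typecheck, full test suite) on every file.
    return [run_agent(prompt, f) for f in files]

def tune_then_scale(prompt: str, all_files: list[Path]) -> None:
    while True:
        transcripts = run_batch(prompt, all_files[:5])   # small pilot batch
        confusion = review_transcripts(transcripts)
        if not confusion:
            break
        prompt = revise_prompt(prompt, confusion)
    run_batch(prompt, all_files)                          # full run once the prompt is clean
```

The point is the shape: pilot on a handful of files, let a second agent flag confusion in the transcripts, fix the prompt, and only then spend the tokens on the full batch.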
> Adversarially have an AI read through that huge implementation log and surface where it's struggling.
That's a good idea: have a specification, divide it into chunks, have an army of agents each implement a chunk, have another agent identify weak points, incomplete implementations, and bugs, and then have an army of agents fix the issues.
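A hypothetical sketch of that fan-out/fan-in shape, with placeholder functions standing in for the actual agent calls:

```python
def implement_chunk(chunk: str) -> str:
    """Hypothetical: one implementing agent per spec chunk; returns its output/log."""
    raise NotImplementedError

def review_implementation(artifacts: list[str]) -> list[str]:
    """Hypothetical: a reviewer agent surfaces weak points, incomplete work, and bugs."""
    raise NotImplementedError

def fix_issue(issue: str) -> None:
    """Hypothetical: a fix-up agent per surfaced issue."""
    raise NotImplementedError

def build_from_spec(spec_chunks: list[str]) -> None:
    artifacts = [implement_chunk(c) for c in spec_chunks]   # fan out: army of agents
    for issue in review_implementation(artifacts):          # fan in: adversarial review
        fix_issue(issue)                                    # fix-up pass
```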
The reason code can serve as the source of truth is that it's precise enough to describe intent, since programming languages are well-specified. Compilers have freedom in how they translate code into assembly, and two different compilers (or even different optimization flags) will produce distinct binaries. Yet all of them preserve the same intent and observable behaviour that the programmer cares about. Runtime performance or instruction order may vary, but the semantics remain consistent.
For spec-driven development to truly work, perhaps what's needed is a higher-level spec language that can express user intent precisely, at the level of abstraction where the human understanding lives, while ensuring that the lower-level implementation is generated correctly.
A programmer could then use LLMs to translate plain English into this “spec language,” which would then become the real source of truth.
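Purely as an illustration (not an existing language), here's one shape such a spec could take, sketched as Python data structures: intent and observable behaviour are first-class, and everything below that level is left to generation.

```python
from dataclasses import dataclass, field

@dataclass
class Behaviour:
    given: str   # precondition, in domain terms
    when: str    # user-visible action
    then: str    # observable outcome that must survive any regeneration

@dataclass
class FeatureSpec:
    intent: str
    constraints: list[str] = field(default_factory=list)
    behaviours: list[Behaviour] = field(default_factory=list)

# Invented example content, just to show the level of abstraction.
password_reset = FeatureSpec(
    intent="Users can recover access to their account via email",
    constraints=["reset links expire after 30 minutes", "no user enumeration"],
    behaviours=[
        Behaviour(
            given="a registered email address",
            when="the user requests a password reset",
            then="a single-use reset link is emailed within one minute",
        ),
    ],
)
```

The labels and fields here are invented; the point is that the spec talks about behaviour the user can observe, which is what would need to stay stable across regenerated implementations.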
If you're thinking about, e.g., upgrading to Django 5, there's a bunch of changes that are sort of codemod-shaped. It's possible that there's no existing codemod that works for you.
Tern can write that tool for you, then use it. It gives you more control in certain cases than simply asking the AI to do something that might appear hundreds of times in your code.
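As a sketch of what such a generated tool might look like, assuming (purely for illustration) the Django 5.0 removal of the `django.utils.timezone.utc` alias in favour of `datetime.timezone.utc`, a naive regex-based codemod could be:

```python
# Hypothetical one-shot codemod: rewrite imports of the removed
# django.utils.timezone.utc alias to use datetime.timezone.utc instead.
import re
from pathlib import Path

IMPORT_RE = re.compile(r"^from django\.utils\.timezone import utc$", re.MULTILINE)

def rewrite(path: Path) -> bool:
    src = path.read_text(errors="ignore")
    new = IMPORT_RE.sub("from datetime import timezone", src)
    if new == src:
        return False
    # Bare references to `utc` now need to become `timezone.utc`.
    # (Naive: skips dotted names, but would still touch strings/comments.)
    new = re.sub(r"(?<!\.)\butc\b", "timezone.utc", new)
    path.write_text(new)
    return True

if __name__ == "__main__":
    changed = [p for p in Path(".").rglob("*.py") if rewrite(p)]
    print(f"rewrote {len(changed)} files")
```

A real tool would work on the syntax tree (e.g. with libcst) and handle aliased imports and qualified references, but the shape is the same: a small, reviewable program you run once, instead of hundreds of individual AI edits.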
"OK let's scale this to 100m users" --> Tells me how it would. No schema change.
"Did you update the schema?" --> Updates the schema, tells me what it did.
We've been running into this EXACT failure mode with current models, and it's so irritating. Our agent plans migrations, so it's code-adjacent, but the output is a structured plan (basically: tasks, which are prompt + regex. What to do; where to do it.)
The agent really wants to talk to you about it. Claude wants to write code about it. None of the models want to communicate with the user primarily through tool use, even when (as I'm sure ChartDB is) HEAVILY prompted to do so.
I think there's still a lot of value there, but it's a bummer that, for a while, we as users are going to have to keep reminding LLMs to use their tools beyond the first prompt.
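To make the "prompt + regex" plan above concrete, here's a hypothetical sketch of that task structure (illustrative only, not Tern's actual schema):

```python
import re
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Task:
    prompt: str    # what to do at each match site
    pattern: str   # where to do it (regex over the codebase)

    def locations(self, root: Path) -> list[tuple[Path, int]]:
        regex = re.compile(self.pattern)
        hits: list[tuple[Path, int]] = []
        for path in root.rglob("*.py"):  # file glob is illustrative
            lines = path.read_text(errors="ignore").splitlines()
            for lineno, line in enumerate(lines, start=1):
                if regex.search(line):
                    hits.append((path, lineno))
        return hits

# Invented example task, just to show the shape of a plan.
plan = [
    Task(
        prompt="Replace the legacy logging call with structlog, preserving fields.",
        pattern=r"logging\.getLogger\(",
    ),
]
```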
I asked it to abstract an event-specific table into a general-purpose "events" table, which it did, but it kept the specific table. I asked it to delete that table and it said it did, but did not. I got stuck in a loop asking it to remove a table that the LLM insisted was not part of the schema but was still present in the diagram.
It was easier to close the tab than fire a human, but other than that not a great experience.
Isn't this what the agents are for? You assign them jobs to make changes, then evaluate those changes. There's a necessary orchestration piece, and maybe even a triage role to sort through things to do and errors to fix.
It really seems like all the next big leaps in AI are going to be fine-tuning fit-for-purpose models.
Everything past GPT5 has been ... fine. It's better at chat (sort of, depending on your tone preference) and way better at coding/tool use. In our product (plan out a migration with AI), they've gotten worse, because they want to chat or code. I'd have expected the coding knowledge to generalize, but no! Claude especially really wants to change our code or explain the existing plan to me.
We're getting around it with examples and dynamic prompts, but it's pretty clear that fine-tuning is in our future. I suspect most of the broad-based AI success is going to look like that in the next couple years.
We'll need to find a way to make fine-tuning happen on consumer hardware. I hope we do that sooner rather than later. $196 is not awful, but it's still pretty steep for hobbyists.
Well, fine-tuning is possible on consumer hardware; the problem is that it would be slow and that you're limited in the size of the dataset you can use in the process.
If you want to follow the approach in this paper and synthetically augment a dataset, using an LLM for that (instead of a smaller model) just makes sense, and then the entire process can't easily be run on your local machine.
LLMs absolutely produce reams of hard-to-debug code. It's a real problem.
But "Teams that care about quality will take the time to review and understand LLM-generated code" is already failing. Sounds nice to say, but you can't review code being generated faster than you can read it. You either become a bottleneck (defeats the point) or you rubber-stamp it (creates the debt). Pick your poison.
Everyone's trying to bolt review processes onto this. That's the wrong layer. That's how you'd coach a junior dev, who learns. AI doesn't learn. You'll be arguing about the same 7 issues forever.
These things are context-hungry but most people give them nothing. "Write a function that fixes my problem" doesn't work, surprise surprise.
We need different primitives. Not "read everything the LLM wrote very carefully", but ways to feed it the why, the motivation, the discussion, and the prior art. Otherwise yeah, we're building a mountain of code nobody understands.
We use the various instruction .md files for the agents and update them with common issues and pitfalls to avoid, as well as pointers to the coding standards doc.
Gemini and Claude at least seem to work well with it, but they sometimes still make mistakes (e.g. using C++ auto is a recurring one, even though the context markdown file clearly says not to). I think as the models improve and get better at instruction handling it will get better.
Not saying this is "the solution" but it gets some of the way.
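For concreteness, a snippet of the kind of thing such an instructions file might contain (the rules and paths here are made up, except the `auto` one mentioned above):

```markdown
## Coding standards (full list in docs/cpp-style.md)
- Do not use `auto` for local variables; spell out the type.
- Run clang-format before committing; never reformat unrelated lines.

## Known pitfalls
- Tests under tests/legacy/ are quarantined; do not "fix" them.
- Prefer extending existing helpers in src/util/ over adding new ones.
```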
I think we need to move away from "vibe coding", to more caring about the general structure and interaction of units of code ourselves, and leave the AI to just handle filling in the raw syntax and typing the characters for us. This is still a HUGE productivity uplift, but as an engineer you are still calling the shots on a function by function, unit by unit level of detail. Feels like a happy medium.
It does rather invite the question of whether the most popular programming languages today are conducive to "more caring about the general structure and interaction of units of code" in the first place. Intuitively it feels that something more like, say, Ada SPARK, with its explicit module interfaces and features like design by contract, would be better suited to this.
Same thing with syntax: so far we've been optimizing for humans, and humans work best at a certain level of terseness and context-dependent implicitness (when things get too verbose, it's visually difficult to parse), even at the cost of some ambiguity. But for LLMs, verbosity can well be a good thing to keep the model grounded, so perhaps stuff like type inference, even for locals, is a misfeature in this context. In fact, I wonder if we'd get better results if we forced the models to spell out the type of each expression in full, and maybe even outright banned things like method chains, requiring each call result to be bound to some variable (thus forcing the LLM to give it a name, effectively making a note of what it thinks it's doing).
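A toy illustration of that contrast, in Python (the "LLM-optimized" style is hypothetical, just to show the idea): the first version leans on chaining and inference, the second binds and annotates every intermediate step, so the writer has to state what it thinks each one is.

```python
from collections import Counter

# Terse, human-optimized style: chained calls, types left to the reader.
def top_words_terse(text: str) -> list[str]:
    return [w for w, _ in Counter(text.lower().split()).most_common(3)]

# Hypothetical LLM-optimized style: every intermediate result is bound to a
# named, annotated variable, effectively a running note of intent.
def top_words_explicit(text: str) -> list[str]:
    lowered: str = text.lower()
    words: list[str] = lowered.split()
    counts: Counter[str] = Counter(words)
    ranked_pairs: list[tuple[str, int]] = counts.most_common(3)
    top_words: list[str] = [pair[0] for pair in ranked_pairs]
    return top_words
```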
Literate programming also feels like it should fit in here somewhere...
So, basically, a language that would be optimized specifically for LLMs to write, and for humans to read and correct.
Going beyond the language itself, there's also a question of ecosystem stability. Things that work today should continue to work tomorrow. This includes not just the language, but all the popular libraries.
And what are we doing instead? We're having them write Python and JavaScript, of all things. One language famous for its extreme dynamism, with a poorly bolted on static type system; another also like that, but also notorious for its footguns and package churn.
It's better if the bottleneck is just reviewing, instead of both coding and reviewing, right?
We've developed plenty of tools for this (linting, fuzzing, testing, etc). I think what's going on is people who are bad at architecting entire projects and quickly reading/analyzing code are having to get much better at that and they're complaining. I personally enjoy that kind of work. They'll adapt, it's not that hard.
There are plenty of changes that don't require deep review, though. If you've written a script that's, say, a couple of fancy find/replaces, you probably don't need to review every usage. Check 10 of 500, make sure it passes lint/tests/typecheck, and it's likely fine.
The problem is that LLM-driven changes require this adversarial review on every line, because you don't know the intent. Human changes have a coherence to them that speeds up review.
(And if your company culture is line-by-line review of every PR, regardless of complexity ... congratulations, I think? But that's wildly outside the norm.)
A proper line-by-line review tops out at 400-500 lines per hour, and the reviewer should be spent and take a 30-minute break afterwards. It's a great process if you're building a spaceship, I guess.
Yes, "just take the time to review and understand LLM-generated code" is the new "just don't write bad code and you won't have any bugs". As an industry, we all know from years of writing bugs despite not wanting to that this task is impossible at scale. Just reviewing all the AI code to make sure it is good code likewise does not scale in the same way. Will not work, and it will take 5-10 years for the industry to figure it out.
I've had some (anecdotal) success reframing how I think about my prompts and the context I give the LLM. Once I started thinking about it as reducing the probability space of the output through priming via context + prompting, I feel like my intuition for it has built up. It also becomes a good way to inject the "theory" of the program in a reusable way.
It still takes a lot of thought and effort up front to put that together, and I'm not quite sure where the line falls between easier-to-do-it-myself and hand-off-to-the-LLM.
The correct primitives are the tests. Ensure your model is writing tests as you go, and make sure you review the tests, which should be pretty readable. Don't merge until both old and new tests pass. Invest in your test infrastructure so that your test suite doesn't get too slow, as it will be in the hot path of your model checking future work.
Legacy code is that which lacks tests. Still true in the LLM age.
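A minimal sketch of that gate, assuming the agent's work lands on a branch and pytest is the suite runner (both assumptions, adapt as needed):

```python
# Only accept an agent's branch if the full test suite (old and new tests) passes.
import subprocess

def suite_passes() -> bool:
    # Shell out to pytest; swap in whatever runs your tests.
    result = subprocess.run(["python", "-m", "pytest", "-q"],
                            capture_output=True, text=True)
    return result.returncode == 0

def accept_agent_branch(branch: str) -> bool:
    subprocess.run(["git", "checkout", branch], check=True)
    if suite_passes():
        return True
    print(f"rejecting {branch}: test suite failed")
    return False
```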
> You either become a bottleneck (defeats the point)
How...?
When I found code snippets on StackOverflow, I read them before pasting them into my IDE. I'm the bottleneck. Therefore there's no point in using StackOverflow...?
This is a very HN sort of sentiment. How can I be persuasive without being gross?
I had a bit of a moment when I first became a PM. (I've done a bunch of things, engineering / sales / founding, but PM only sort of recently.) I realized that my job was to wake up in the morning and pick fights. Or more diplomatically: to tell people they were doing the wrong thing, and they should be doing a different thing, in a way that made them want to listen to me more in the future, not less.
That's the job. In fact, in almost every job, that's the job.
Impact happens when you reach people and they behave differently because of you. That's nothing to be ashamed of. If you do it authentically and with good intent, it's one of the best things you can do with your time.
Think of what you are doing as revealing information about why you think your new approach is better aligned with the business and its goals. Give them room to do the same.
There might be systemic issues getting in the way: you and they having competing OKRs, for example. It's good to surface that and deal with it too.
Right -- the stereotypes of "selling" or "telling" or "persuading" are unhelpful in a lot of contexts.
Even in direct selling, many people don't want to feel they're being sold to! At a minimum, they don't want to feel out of control on a decision they care about. But they're frequently open to learning, even if the constraints on how much time / credit they'll give you are extremely different.
Everybody is different, but the biggest reason I struggle with this right now is the pace of modern life.
Doing hard things is hard, and that means I won't be thinking about the other stuff I have to do. I'm more apt to miss a text from my family when I'm running or writing a document than when I'm vibe coding, because the effort is all-encompassing. Subconsciously, that's stressful, so I steer away from it.
Habits help here, because with enough repetition, I learn that it's OK to disappear for an hour to do the thing. But the real issue is getting the meta-organization of my life right enough that I'm not scared to shut down my ambient executive function for that hour. This shows up as both "I'm too busy to do the hard thing" and "I'm too tired to do the hard thing."
Slowing down isn't the answer, but it's been pretty transformative to notice that that's what I'm worried about.
I agree. There's always so much to do just to stay on top of things. Everything from writing to people down to watering plants and updating software.
Last summer I went to a festival, and for a week I was unreachable, had no working phone, and had no chores. I could eat by showing my bracelet. I didn't even know what time it was. It was blissful.
I feel like every "coding will go away" article (which this mostly, but not fully, is) misses that applications of any reasonable complexity are not fully defined anywhere but in their code. That fact drives everything about modern software.
We've always had the ability to tie the code extremely directly to user experiences. The first problem is that it creates massive code duplication. The second (related) problem is that it creates an enormous, unsquashable number of bugs. There's no internal consistency.
Modern apps are COMPLEX. The abstractions and internal, non-user-facing structure are the only reason they're maintainable at all. I'm working in a 6-month-old, 4-person codebase, and LLM-driven refactors, with me in the driver's seat, miss stuff all the time. If I can forget that a feature exists in this tiny codebase, how do organizations of any size function? Abstraction, interfaces, etc. -- software design.
The future is probably closer to "LLMs help clean up codebases" than "LLMs own the codebase," because the 2nd statement is effectively equal to "I will rewrite my code in English." English is a _bad_ language for describing complex systems. We could do better than Python, sure, but it's already _far_ better than English for the kind of cross-cutting behavior-and-dependency description software requires. It's valuable to have tools that translate intent into compact, high-quality code, but you still end up working with code as the fundamental artifact.
I was talking with somebody about their migration recently [0], and we got to speculating about AI and how it might have helped. There were basically 2 paths:
- Use the AI and ask for answers. It'll generate something! It'll also be pleasant, because it'll replace the thinking you were planning on doing.
- Use the AI to automate away the dumb stuff, like writing a bespoke test suite or new infra to run those tests. It'll almost certainly succeed, and be faster than you. And you'll move onto the next hard problem quickly.
It's funny, because these two things represent wildly different vibes. In the first one, work is so much easier. AI is doing the job. In the second one, work is harder. You've compressed all your thinking work, back-to-back, and you're just doing hard thing after hard thing, because all the easy work happens in the background via LLM.
If you're in a position where there's any amount of competition (like at work, typically), it's hard to imagine where the people operating in the 2nd mode don't wildly outpace the people operating in the first, both in quality and volume of output.
But also, it's exhausting. Thinking always is, I guess.
I’ve tried the second path at work and it’s grueling.
“Almost certainly succeed” requires that you mostly plan out the implementation for it, and then monitor the LLM to ensure that it doesn’t get off track and do something awful. It’s hard to get much other work done in the meantime.
I feel like I’m unlocking, like, 10% or 20% productivity gains. Maybe.
Burning out a substantial portion of the workforce for short-term gains is going to cause way more long-term decline than those gains are worth.
I think the long-term assumption is that the first path trjordan mentioned above, where AI does all the work, is the goal. The second path is a necessary evil until the first, which requires as-yet-unachieved improvements in AI (maybe approaching AGI, maybe not), becomes feasible. Burning out employees doesn't matter since they're still creating more value than they otherwise would, and they'll be replaced by AI anyway.
Yeah, I think this is what I've tried to articulate to people, and you've summed it up well with "You've compressed all your thinking work, back-to-back, and you're just doing hard thing after hard thing." Most of the bottleneck with any system design is the hard things, the unknown things, the unintended-consequences things. The AIs don't help you much with that.
There is a certain amount of regular work that I don't want to automate away, even though maybe I can. That regular work keeps me in the domain. It leads to epiphanies about the hard problems. It adds time, and something to do, in between the hard problems.
> There is a certain amount of regular work that I don't want to automate away, even though maybe I can. That regular work keeps me in the domain. It leads to epiphanies about the hard problems. It adds time, and something to do, in between the hard problems.
Exactly, some kinds of refactors are like this for me. Pretty mindless, kind of relaxing, almost algebraic. It's a pleasant way to wander around the codebase, just cleaning and improving things while you walk down a data or control flow. If you're following a thread you don't really make decisions, but you get better acquainted with parts you don't know, and subconsciously get practice holding some kind of gestalt in your head.
This kind of almost dream-like "grooming" seems important and useful, because it preps you for working with design problems later. Formatting and style-type trivia should absolutely be automated, and real architecture/design work requires active engagement. But there's a sweet spot in the middle.
Even before LLMs maybe you could automate some of these refactors with tools for manipulating ASTs or CSTs, if your language of choice had those tools. But automating everything that can be automated won't necessarily pay off if you're losing fluency that you might need later.
In my experience, a lot of the hard thinking gets done in my back-brain while I'm doing other things, and emerges when I take up the problem again. Doing the regular work gives my back-brain time to percolate; doing hard thing after hard thing doesn't.
Also at the end of the day, humans aren't machines. We are goopy meat and chemistry.
You cannot exclusively do hard things back to back to back every 8 hour day without fail. It will either burn you out, or you will make mistakes, or you will just be miserable.
Human brains do not want to think hard, because millions of years of evolution built brains to be cheap, and they STILL use something like 20% of our resting energy.
I'd actually say that you end up needing to think more in the first example.
Because as soon as you realize that the output doesn't do exactly what you need, or has a bug, or needs to be extended (and has gotten beyond the complexity that AI can successfully update), you now need to read and deeply understand a bunch of code that you didn't write before you can move forward.
I think it can actually be fine to do this, just to see what gets generated as part of the brainstorming process, but you need to be willing to immediately delete all the code. If you find yourself reading through thousands of lines of AI-generated code, trying to understand what it's doing, it's likely that you're wasting a lot of time.
The final prompt/spec should be so clear and detailed that 100% of the generated code is as immediately comprehensible as if you'd written it yourself. If that's not the case, delete everything and return to planning mode.
> I'd actually say that you end up needing to think more in the first example.
Yes, but you are thinking about the wrong things, so the effort gets spent poorly.
It is usually much more efficient to build your own mental model than to search externally for a solution that does exactly what you need. Without that mental model, it is hard to evaluate whether the external solution even does what you want, so it's something you need to do either way.
Depends how complex the task is. Sometimes I’m handed tasks so simple but tedious that AI has meant I can breeze through these instead of burning myself out on them. Sure, it doesn’t speed things up much in terms of time, but I’m way less burnt out at the end because it’s doing all the fiddly stuff that would tire me out. I suspect the tasks I get aren’t that typical though.
Yeah, I think if it's simple enough that you can understand all the code that's generated at a glance, then it's fine. There are definitely tasks that fit this description—my comment was mainly speaking to more complex tasks.
Regarding #2, "automate the dumb/boring stuff": I always think of The Big Short, when Michael Burry said, "yes, I read all the boring spreadsheets, and I now have a contrary position." And he ended up being RIGHT.
For example, I believe writing unit tests is way too important to be fully relegated to the most junior devs, or even to LLM generation! In other fields, "test engineer" is an incredibly prestigious position to hold, for example "lead test engineer, SpaceX/NASA/etc." -- that ain't a slouch job; you are literally responsible for some of the most important validation and engineering work done at the company.
So I do question the notion that we can offload the "simple" stuff and just move on with life. It hasn't fully worked in other fields; for example, have we really outsourced the boring stuff like manufacturing and made things way better? The best companies making the best things typically vertically integrate.
I stay at the architecture, code organization, and algorithm level with AI. I plan things at that level, then have the agent do the full implementation. I have tests (which have been audited both manually and by agents), and I have multiple agents audit the implementation code. The pipeline is 100% automated and produces very good results, and you can still get some engineering vibes from the fact that you're orchestrating a stochastic workflow DAG!
The problem with LLMs is that they are not good enough to do the dumb stuff by themselves, and they are still so dumb that they will bias you once you have to intervene.
But this is the idea behind compilers, type checkers, automated testing, version control, and etc. It's perfectly valid.
AI sloppiness of this blog post aside, it's a reasonable observation.
If you're thinking about how to integrate AI into your system, it's worth asking the question of why your system isn't just ChatGPT.
- Do you have unique data you can pass as context?
- Do you have APIs or actions that are awkward to teach to other systems via MCP?
- Do you have a unique viewpoint that you are writing into your system prompt?
- Do you have a way to structure stored information that's more valuable than freeform text memories?
- etc.
For instance, we [0] are writing an agent that helps you plan migrations. You can do this with ChatGPT, but it hugely benefits from (in descending order of uniqueness) access to
1) a structured memory that's a cross between Asana and the output of `grep` in a spreadsheet,
2) a bunch of best-practice instructions on how to prep your codebase for a migration, and
3) the AI code assistant-style tools like ls, find, bash, etc.
So yeah, we're writing an agent, not building a model. And I'm not worried about ChatGPT doing this, despite the fact that GPT5 is pretty good at it.