
Thus far, this is one of the best objective evaluations of real-world software engineering...




I concur with the other commenters, 4.5 is a clear improvement over 4.

Idk, Sonnet 4.5 scores better than Sonnet 4.0 on that benchmark, but it is markedly worse in my usage. The utility of the benchmark is fading as it is gamed.

I think I and many others have found Sonnet 4.5 to generally be better than Sonnet 4 for coding.

Maybe if you conform to its expectations for how you use it. 4.5 is absolutely terrible at following directions, thinks it knows better than you, and will gaslight you until specifically called out on its mistake.

I have scripted prompts for long-duration automated coding workflows of the fire-and-forget, issue description -> pull request variety. Sonnet 4 does better than you’d expect: it generates high-quality mergeable code about half the time. Sonnet 4.5 fails literally every time.


I'm very happy with it TBH, though a few things annoy me a little:

- slower compared to other models that will also do the job just fine (but excels at more complex tasks),

- it's very insistent on creating loads of .md files with overly verbose documentation of what it just did (not really what I asked it to do),

- it actually deleted a file twice and went "oops, I accidentally deleted the file, let me see if I can restore it!". I haven't seen this happen with any other agent, and the task wasn't even remotely about removing anything.


The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).

And yes, I have hooks to disable 'git reset', 'git checkout', etc., and warn the model not to use these commands and why. So it writes them to a bash script and calls that to circumvent the hook, successfully shooting itself in the foot.
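For anyone curious what such a hook looks like, here is a minimal sketch, assuming Claude Code's documented PreToolUse hook protocol (the tool call arrives as JSON on stdin with a `tool_input.command` field, and exiting with status 2 blocks the call and feeds stderr back to the model); the command list and messages are illustrative, not the commenter's actual hook:

```python
#!/usr/bin/env python3
# Hypothetical PreToolUse hook that blocks destructive git commands.
import json
import re
import sys

# Commands that rewrite or discard worktree state.
BLOCKED = re.compile(r"\bgit\s+(restore|reset|checkout|clean)\b")

def is_blocked(command: str) -> bool:
    # Search the whole command string so chained forms like
    # `cd repo && git reset --hard` are caught too.
    return bool(BLOCKED.search(command))

def main() -> int:
    payload = json.load(sys.stdin)
    command = payload.get("tool_input", {}).get("command", "")
    if is_blocked(command):
        sys.stderr.write(
            "Blocked: destructive git command. Ask the user before "
            "rewinding worktree state.\n")
        return 2  # exit 2 tells the hook runner to block the tool call
    return 0

# Wire it up as the hook's entry point with:
#   if __name__ == "__main__": sys.exit(main())
```

As the comment above notes, a Bash-matcher hook like this is a speed bump rather than a guarantee: the model can still write the command into a script file and execute that.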

Sonnet 4.5 will not follow directions. Because of this, you can't prevent it, as you could with earlier models, from doing something that destroys the worktree state. For longer-running tasks, the probability of it doing this at some point approaches 100%.


> The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).

Man I've had this exact thing happen recently with Sonnet 4.5 in Claude Code!

With Claude I asked it to try tweaking the font weight of a heading to put the finishing touches on a new page we were iterating on. I looked at it and said, "Never mind, undo that," and it nuked 45 minutes' worth of work by running git restore.

It immediately realized it fucked up and started running all sorts of git commands and reading its own log trying to reverse what it did, and then came back 5 minutes later saying "Welp, I lost everything, do you want me to manually rebuild the entire page from our conversation history?"

In my CLAUDE.md I have instructions to commit unstaged changes frequently, but it often forgets, and sure enough, it forgot this time too. I had it read its log and write a post-mortem of WTF led it to run dangerous git commands to remove one line of CSS, then used that to write more specific rules about using git in the project CLAUDE.md, and blocked it from running "git restore" at all.
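(For reference, blocking a command outright can be done with Claude Code's permission deny rules rather than a hook; a minimal settings.json fragment, assuming the documented `Bash(command:*)` prefix-pattern syntax:

```json
{
  "permissions": {
    "deny": [
      "Bash(git restore:*)",
      "Bash(git reset:*)"
    ]
  }
}
```

This only covers commands issued directly through the Bash tool, so it has the same script-file loophole discussed elsewhere in the thread.)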

We'll see if that did the trick but it was a good reminder that even "SOTA" models in 2025 can still go insane at the drop of a hat.


The problem is that I'm trying to build workflows for generating sequences of good, high quality semantically grouped changes for pull requests. This requires having a bunch of unrelated changes existing in the work tree at the same time, doing dependency analysis on the sequence of commits, and then pulling out / staging just certain features at a time and committing those separately. It is sooo much easier to do this by explicitly avoiding the commit-every-2-seconds workaround and keeping things uncommitted in the work tree.

I've written a custom checkpointing skill that it's usually good about using, which makes it easier to rewind state. But that requires a careful sequence of operations, and I haven't been able to get 4.5 to not go insane when it screws up.

As I said though, watch out for it learning that it can't run git restore, so it immediately jumps to Bash(echo "git restore" >file.sh && chmod +x file.sh && ./file.sh).


I think this is probably just a matter of noise. That hasn't been my experience with Sonnet 4.5 very often.

Every model from every provider at every version I've used has intermingled brilliant perfect instruction-following and weird mistaken divergence.


What do you mean by noise?

In this case I can't get 4.5 to follow directions. Neither can anyone else, apparently. Search for "Sonnet 4.5 follow instructions" and you'll find plenty of examples. The current top two results:

https://www.reddit.com/r/ClaudeCode/comments/1nu1o17/45_47_5...

https://theagentarchitect.substack.com/p/claude-sonnet-4-pro...


Not my experience at all; 4.5 is leagues ahead of the previous models, albeit not as good as Gemini 2.5.

I find 4.5 a much better model FWIW.


