Honestly, I'm inclined to think a lot of the people who are wowed by benchmarks and simple tech demos probably aren't doing very much at their day job, or they're working on simple codebases or ones that don't have very many users (more users == more bugs found). When you throw these models at complex software projects like SOAs, big object-oriented codebases, etc., their output can be totally unusable.
I coded it with Bun and openrouter(dot)ai. I have an array of benchmarks, and each benchmark has a grader (for example, checking if the answer equals a certain string, or grading it automatically using another LLM). Then I save all the results to a file and render the percentage correct to a graph.
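For anyone curious what that looks like, here's a minimal sketch of that kind of harness in TypeScript under Bun. The shape (`Benchmark`, `equals`, `llmJudge`, the model slugs) is my guess at the structure, not the commenter's actual code; it assumes `OPENROUTER_API_KEY` is set and uses OpenRouter's OpenAI-compatible chat completions endpoint. The graphing step is left out.

```ts
// Hypothetical grader: takes a model's answer, decides if it's correct.
type Grader = (answer: string) => boolean | Promise<boolean>;

interface Benchmark {
  name: string;
  prompt: string;
  grade: Grader;
}

const API_URL = "https://openrouter.ai/api/v1/chat/completions";

// One chat-completion call via OpenRouter's OpenAI-compatible API.
async function ask(model: string, prompt: string): Promise<string> {
  const res = await fetch(API_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

// Grader 1: exact string match.
const equals = (expected: string): Grader =>
  (answer) => answer.trim() === expected;

// Grader 2: ask another LLM to judge the answer (YES/NO verdict).
const llmJudge = (question: string, judgeModel: string): Grader =>
  async (answer) => {
    const verdict = await ask(
      judgeModel,
      `Question: ${question}\nAnswer: ${answer}\n` +
        `Is this answer correct? Reply YES or NO.`,
    );
    return verdict.trim().toUpperCase().startsWith("YES");
  };

// Example benchmarks (contents are placeholders).
const benchmarks: Benchmark[] = [
  {
    name: "arithmetic",
    prompt: "What is 17 * 23? Reply with only the number.",
    grade: equals("391"),
  },
  {
    name: "capital",
    prompt: "What is the capital of Australia?",
    grade: llmJudge("What is the capital of Australia?", "openai/gpt-4o-mini"),
  },
];

// Run every benchmark against one model, save results, print the score.
async function run(model: string) {
  const results = [];
  for (const b of benchmarks) {
    const answer = await ask(model, b.prompt);
    results.push({ name: b.name, correct: await b.grade(answer) });
  }
  const pct =
    (100 * results.filter((r) => r.correct).length) / results.length;
  // Bun.write is Bun's built-in file writer.
  await Bun.write(
    `results-${model.replace("/", "_")}.json`,
    JSON.stringify({ model, pct, results }, null, 2),
  );
  console.log(`${model}: ${pct.toFixed(1)}% correct`);
}

await run("anthropic/claude-3.5-sonnet");
```

The per-benchmark grader closure is the nice part of this design: exact-match and LLM-as-judge checks plug into the same loop, so the results file and percentage calculation don't care how an answer was graded.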