Honestly, I'm inclined to think a lot of the people who are wowed by benchmarks and simple tech demos probably aren't doing very much at their day job, or they're working on simple codebases or ones that don't have very many users (more users == more bugs found). When you throw these models at complex software projects like SOAs, big object-oriented codebases, etc., their output can be totally unusable.
I coded it with Bun and openrouter(dot)ai. I have an array of benchmarks, and each benchmark has a grader (for example, checking if the answer equals a certain string, or grading it automatically using another LLM). Then I save all the results to a file and render the percentage correct to a graph.
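For anyone curious what that looks like, here's a minimal sketch of that kind of harness in TypeScript under Bun. The shape (`Benchmark`, `equals`, `llmJudge`, the model slugs) is my guess at the structure, not the commenter's actual code; it assumes `OPENROUTER_API_KEY` is set and uses OpenRouter's OpenAI-compatible chat completions endpoint. The graphing step is left out.

```ts
// Hypothetical grader: takes a model's answer, decides if it's correct.
type Grader = (answer: string) => boolean | Promise<boolean>;

interface Benchmark {
  name: string;
  prompt: string;
  grade: Grader;
}

const API_URL = "https://openrouter.ai/api/v1/chat/completions";

// One chat-completion call via OpenRouter's OpenAI-compatible API.
async function ask(model: string, prompt: string): Promise<string> {
  const res = await fetch(API_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

// Grader 1: exact string match.
const equals = (expected: string): Grader =>
  (answer) => answer.trim() === expected;

// Grader 2: ask another LLM to judge the answer (YES/NO verdict).
const llmJudge = (question: string, judgeModel: string): Grader =>
  async (answer) => {
    const verdict = await ask(
      judgeModel,
      `Question: ${question}\nAnswer: ${answer}\n` +
        `Is this answer correct? Reply YES or NO.`,
    );
    return verdict.trim().toUpperCase().startsWith("YES");
  };

// Example benchmarks (contents are placeholders).
const benchmarks: Benchmark[] = [
  {
    name: "arithmetic",
    prompt: "What is 17 * 23? Reply with only the number.",
    grade: equals("391"),
  },
  {
    name: "capital",
    prompt: "What is the capital of Australia?",
    grade: llmJudge("What is the capital of Australia?", "openai/gpt-4o-mini"),
  },
];

// Run every benchmark against one model, save results, print the score.
async function run(model: string) {
  const results = [];
  for (const b of benchmarks) {
    const answer = await ask(model, b.prompt);
    results.push({ name: b.name, correct: await b.grade(answer) });
  }
  const pct =
    (100 * results.filter((r) => r.correct).length) / results.length;
  // Bun.write is Bun's built-in file writer.
  await Bun.write(
    `results-${model.replace("/", "_")}.json`,
    JSON.stringify({ model, pct, results }, null, 2),
  );
  console.log(`${model}: ${pct.toFixed(1)}% correct`);
}

await run("anthropic/claude-3.5-sonnet");
```

The per-benchmark grader closure is the nice part of this design: exact-match and LLM-as-judge checks plug into the same loop, so the results file and percentage calculation don't care how an answer was graded.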