
Does anyone trust benchmarks at this point? Genuine question. Isn't the scientific consensus that they are broken and poor evaluation tools?




They overemphasize tasks with small contexts that are free of noise and red herrings.

Honestly, I am inclined to think a lot of the people who are wowed by benchmarks and simple tech demos probably aren't doing very much at their day job, or they're working either on simple codebases or on ones that don't have very many users (more users == more bugs found). When you throw these models at complex software projects like SOAs, big object-oriented codebases, etc., their output can be totally unusable.

I make my own automated benchmarks

Is there a tool / website that makes this process easy?

I coded it with Bun and openrouter(dot)ai. I have an array of benchmarks, and each benchmark has a grader (for example, checking whether the answer equals a certain string, or grading the answer automatically using another LLM). Then I save all the results to a file and render the percentage correct as a graph.
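For anyone wanting to try the same thing, here is a rough sketch of that structure in TypeScript, not the commenter's actual code. Every name in it (`Model`, `Grader`, `Benchmark`, `exactMatch`, `runBenchmarks`, `percentCorrect`, `stubModel`) is made up for illustration, and the model is a stub function rather than a real API call, so you would swap in your own client.

```typescript
// Hypothetical types; none of these names come from the comment above.
type Model = (prompt: string) => Promise<string>;
type Grader = (answer: string) => boolean | Promise<boolean>;

interface Benchmark {
  name: string;
  prompt: string;
  grade: Grader;
}

interface Result {
  name: string;
  answer: string;
  passed: boolean;
}

// Grader that passes when the trimmed answer equals the expected string.
function exactMatch(expected: string): (answer: string) => boolean {
  return (answer) => answer.trim() === expected;
}

// Run every benchmark against the model, one at a time, and grade each answer.
async function runBenchmarks(model: Model, benchmarks: Benchmark[]): Promise<Result[]> {
  const results: Result[] = [];
  for (const b of benchmarks) {
    const answer = await model(b.prompt);
    const passed = await b.grade(answer);
    results.push({ name: b.name, answer, passed });
  }
  return results;
}

// Percentage of passed results; 0 for an empty run.
function percentCorrect(results: Result[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return (100 * passed) / results.length;
}

// Stub model (an assumption for this sketch) standing in for a real API call.
const stubModel: Model = async (prompt) =>
  prompt.includes("2 + 2") ? "4" : "unsure";

const benchmarks: Benchmark[] = [
  { name: "arithmetic", prompt: "What is 2 + 2? Reply with the number only.", grade: exactMatch("4") },
  { name: "capital", prompt: "Capital of France? Reply with one word.", grade: exactMatch("Paris") },
];
```

Calling `runBenchmarks(stubModel, benchmarks)` and passing the result to `percentCorrect` gives the single number you would plot per model. Because `Grader` may return a promise, an LLM-based grader fits the same shape: it would be an async function that sends the answer to a second model and maps the reply to a boolean. The results array can be serialized with `JSON.stringify` before being written to a file.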


