Simon says if he gets a suspiciously good result he'll just try a bunch of other...

ddalex · 2025-11-18T16:06:39 1763481999

https://www.svgviewer.dev/s/TVk9pqGE giraffe in a Ferrari

jmmcd · 2025-11-18T16:04:21 1763481861

"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that LLMs are always generalising. If a lab focused specifically on pelicans on bicycles, it would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.

BoorishBears · 2025-11-18T19:18:47 1763493527

The gold standard for cheating on a benchmark is SFT (supervised fine-tuning) and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.

Like replacing named concepts with nonsense words in reasoning benchmarks.
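
To make that concrete, here is a rough sketch of the swap in Python. Everything in it is made up for illustration (the function names, the example prompt, the consonant/vowel scheme for the nonsense words); it is not taken from any real benchmark harness. The idea is just: map each named concept to a pronounceable nonsense word, rewrite the prompt consistently, then compare the model's score on the original and the rewritten version. A large drop hints at memorization.

```python
import random
import re


def make_nonsense_word(rng, length=6):
    """Build a pronounceable nonsense word by alternating consonants and vowels."""
    consonants = "bdfgklmnprstvz"
    vowels = "aeiou"
    return "".join(
        rng.choice(consonants if i % 2 == 0 else vowels) for i in range(length)
    )


def swap_named_concepts(prompt, concepts, seed=0):
    """Replace every listed concept in `prompt` with a nonsense word.

    Returns (rewritten_prompt, mapping). The mapping is keyed by the
    lowercased concept, so every occurrence of a concept gets the same
    replacement and the logical structure of the task is preserved.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    mapping = {}
    for concept in concepts:
        word = make_nonsense_word(rng)
        while word in mapping.values():  # keep replacements distinct
            word = make_nonsense_word(rng)
        mapping[concept.lower()] = word

    if not mapping:
        return prompt, {}

    # One pass over the text, longest concepts first, so a replacement is
    # never itself rewritten and longer names win over their prefixes.
    alternatives = "|".join(
        re.escape(c) for c in sorted(mapping, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    rewritten = pattern.sub(lambda m: mapping[m.group(0).lower()], prompt)
    return rewritten, mapping


if __name__ == "__main__":
    original = "Alice is taller than Bob. Bob is taller than Carol. Who is tallest?"
    rewritten, mapping = swap_named_concepts(original, ["Alice", "Bob", "Carol"], seed=1)
    print(rewritten)
    print(mapping)
```

The answer key has to be run through the same mapping, otherwise you are just measuring string mismatch instead of contamination.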

jmmcd · 2025-11-19T09:06:49 1763543209

Yes. But "the gold standard" just means "the most natural, easy and dumb way".