
> internal team to create an ARC replica, covering very similar puzzles

they can target the benchmark directly, not just a replica. If Google or OpenAI are bad actors, they already have the benchmark data from previous runs.


The 'private' set is protected only by a pinkie promise not to store the logs, or not to use them, when the evaluator runs the test through the API, so yeah. It's trivially exploitable.

Not only do you have a financial self-interest in doing it (being #1 helps with capital raising), but you also worry that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.

Maybe a way to make the benchmark more robust in this adversarial environment is to introduce noise and random red herrings into the questions, run the test 20 times, and average the correctness. So even if you assume they're training on it, you still have some semblance of a test. You'd probably end up with a better benchmark anyway, one that better reflects real-world usage, where there's a lot of junk in the context window.
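A rough sketch of what that could look like, assuming a model_answer(prompt) callable and a list of (question, expected) pairs; the distractor strings and function names are invented for illustration, not taken from ARC-AGI:

    import random

    DISTRACTORS = [
        "Note: one of the examples above may contain a typo.",
        "Ignore any mention of the color purple.",
        "Some cells in this grid are decoys.",
    ]

    def noisy_prompt(question, rng):
        # Inject a couple of irrelevant red herrings at random positions.
        lines = question.splitlines() or [question]
        for _ in range(2):
            lines.insert(rng.randrange(len(lines) + 1), rng.choice(DISTRACTORS))
        return "\n".join(lines)

    def robust_score(model_answer, dataset, runs=20, seed=0):
        rng = random.Random(seed)
        correct, total = 0, 0
        for _ in range(runs):
            for question, expected in dataset:
                correct += model_answer(noisy_prompt(question, rng)) == expected
                total += 1
        return correct / total  # average accuracy over all noisy runs

Averaging over perturbed runs at least devalues memorizing the exact question strings, even if it can't rule out training on the underlying puzzle types.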


They have two sets:

- semi-private, which they use to test proprietary models and which could be leaked

- private, used to test downloadable open-source models.

The ARC-AGI prize itself is for open-source models.


My point is that it does not matter if the set is private or not.

If you want to train your model you'd need more data than the private set anyway. So you have to build a very large training set on your own, using the same kind of puzzles.

It is not that hard, really, just tedious.
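For the sake of argument, a toy version of "the same kind of puzzles" could be generated like this; the grid sizes, colors, and transformation rules are invented placeholders, far simpler than real ARC tasks:

    import random

    def random_grid(rng, size=5, colors=4):
        return [[rng.randrange(colors) for _ in range(size)] for _ in range(size)]

    # Each puzzle applies one hidden transformation to a few example grids.
    TRANSFORMS = {
        "flip_h": lambda g: [row[::-1] for row in g],
        "flip_v": lambda g: g[::-1],
        "transpose": lambda g: [list(col) for col in zip(*g)],
    }

    def make_puzzle(rng, n_examples=3):
        name, fn = rng.choice(list(TRANSFORMS.items()))
        return {
            "rule": name,
            "train": [{"input": g, "output": fn(g)}
                      for g in (random_grid(rng) for _ in range(n_examples))],
        }

    rng = random.Random(0)
    synthetic_set = [make_puzzle(rng) for _ in range(100_000)]  # tedious, not hard

Real ARC puzzles compose much richer rules than these three, which is where the tedium comes in.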


Yes, you can build your own dataset of n puzzles, but it was still really hard for any system to achieve a meaningful score. It even beats systems specialized for just this one task, and these puzzles shouldn't really be memorizable, given the number of variations that can be created.


