
Naively tested a set of agents on this task.

Each ran the same spec headlessly in their native harness (one shot).

Results:

    Agent                        Cycles     Time
    ─────────────────────────────────────────────
    gpt-5-2                      2,124      16m
    claude-opus-4-5-20251101     4,973      1h 2m
    gpt-5-1-codex-max-xhigh      5,402      34m
    gpt-5-codex                  5,486      7m
    gpt-5-1-codex                12,453     8m
    gpt-5-2-codex                12,905     6m
    gpt-5-1-codex-mini           17,480     7m
    claude-sonnet-4-5-20250929   21,054     10m
    claude-haiku-4-5-20251001    147,734    9m
    gemini-3-pro-preview         147,734    3m
    gpt-5-2-codex-xhigh          147,734    25m
    gpt-5-2-xhigh                147,734    34m
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
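
The driver for a comparison like this can be as simple as the sketch below. This is not the actual harness; the agent commands, scorer path, and per-agent checkouts are placeholders for whatever headless one-shot invocation each CLI provides.

    # Sketch only: every command and path here is a placeholder, and the
    # scorer is assumed to print a cycle count somewhere in its output.
    import csv
    import subprocess
    import time

    AGENTS = {
        # agent name -> placeholder command that runs the shared spec headlessly
        "agent-a": ["agent-a-cli", "run", "spec.md"],
        "agent-b": ["agent-b-cli", "exec", "spec.md"],
    }

    SCORER = ["./run_scorer.sh"]  # placeholder for the challenge's own cycle counter

    def cycles(workdir):
        # Run the scorer in the agent's working copy and grab the last
        # integer it prints (assumed to be the final cycle count).
        out = subprocess.run(SCORER, cwd=workdir, capture_output=True, text=True)
        nums = [int(tok.replace(",", "")) for tok in out.stdout.split()
                if tok.replace(",", "").isdigit()]
        return nums[-1] if nums else None

    def main():
        rows = []
        for name, cmd in AGENTS.items():
            workdir = f"runs/{name}"          # assumes one fresh checkout per agent
            start = time.time()
            subprocess.run(cmd, cwd=workdir)  # one shot: no retries, no feedback loop
            rows.append({"agent": name,
                         "cycles": cycles(workdir),
                         "minutes": round((time.time() - start) / 60, 1)})
        with open("results.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["agent", "cycles", "minutes"])
            writer.writeheader()
            writer.writerows(rows)

    if __name__ == "__main__":
        main()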




Codex CLI + gpt-5-2-codex-xhigh got to 1,606 cycles with the prompt "beat 1487 cycles. go." in ~53 minutes.

Will you look at this man's prompting skills?!

Serious prompt engineering right here

Wow, is gpt-5-2-codex-xhigh really that good in general? Is this the $200 per month version?

gpt-5.2-codex xhigh with OpenAI Codex on the $20/month plan got to 1,526 cycles with OP's prompt for me. Meanwhile, Claude Code with Opus 4.5 on the team premium plan ($150/month) gave up at 3,433 cycles with a bunch of contrived excuses.

That Claude Opus 4.5 result of 4,973 is what you get if you just vectorize the reference kernel. In fact you should be under 4,900 doing that with very little effort (I tried doing this by hand yesterday).

The performance killer is the "random"-access reads of the tree node data, which the scalar implementation hides, combined with the limited load bandwidth; to tackle that you'd have to rewrite the kernel to optimize how the tree data is loaded and processed.
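
To make that concrete, here is a rough numpy cartoon of the layout idea, nothing more: the real kernel targets the challenge's custom ISA, so every structure and name below is invented for illustration. The point is only the access pattern: the reference-style traversal chases child indices with one dependent, scattered read per step, while a level-by-level layout lets a whole batch of queries advance in lockstep against one small, contiguous threshold table per level.

    import numpy as np

    def lookup_scalar(nodes, key):
        # Reference-style traversal: nodes is a list of (threshold, left, right)
        # tuples; each step is one dependent, effectively random read.
        i = 0
        while nodes[i][1] >= 0:                   # internal node (has a left child)
            threshold, left, right = nodes[i]
            i = left if key < threshold else right
        return i                                  # leaf index

    def lookup_batched(level_thresholds, keys):
        # Layout a vector rewrite could use: a complete tree stored level by
        # level with thresholds contiguous per level. Each level's table can be
        # staged into fast memory once, and the inner step is an independent,
        # vectorizable update over the whole batch of keys.
        pos = np.zeros(len(keys), dtype=np.int64)
        for t in level_thresholds:                # t: thresholds for one level
            pos = 2 * pos + (keys >= t[pos])      # gather within one small hot table
        return pos                                # leaf position in the last level

The per-level indexing is still a gather, but into one small hot table rather than scattered whole nodes, which is roughly the restructuring the scalar version gets away without.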


Very interesting, thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it finished, it seems like there's a lot more potential.

Can you share the agent-comparison harness code or point to something similar? I'd like to learn the basics of benchmarking models in a practical way.


Thanks so much!!

Could you try some open-weight models, e.g. Qwen3-Coder, GLM-4.7, or Devstral-2?

I tried GLM-4.7 running locally on a beefy GPU server. In about 3 minutes it got to 25,846 cycles, but then went in circles for about 90 minutes without making any meaningful progress, repeating the same mistakes and misdiagnosing the cause most of the time. It seems to understand what needs to happen to reach the goal but keeps failing on the implementation side: it recognized that beating the target would require an entirely new approach (it kept leaning towards a wavefront design), but couldn't see a solution given the very limited ISA.

Could you make a repo with solutions given by each model inside a dir/branch for comparison?

Are you giving instructions to a stranger on the internet?

Instructions?! I just asked, since GP already did it. No need to make the top comment's "DDoS attack on other AI companies" joke a reality.

I think he’s asking rather than giving instructions

He's prompting

I do wonder how Grok would compare, specifically their Grok Code Fast model.


