gpt-5.2-codex (xhigh) with OpenAI Codex on the $20/month plan got to 1,526 cycles with OP's prompt for me. Meanwhile, Claude Code with Opus 4.5 on the team premium plan ($150/month) gave up with a bunch of contrived excuses at 3,433 cycles.
That Claude Opus 4.5 result of 4,973 is what you get if you just vectorize the reference kernel. In fact, you should be under 4,900 doing that with very little effort (I tried it by hand yesterday).
The performance killer is the "random"-access reads of the tree node data, which the scalar implementation hides, combined with the lack of load bandwidth; to tackle that you'd have to rewrite the kernel to optimize how the tree data is loaded and processed.
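The reference kernel isn't shown in the thread, so the sketch below is only an illustration of the bottleneck being described, under the assumption that the kernel resembles a per-item tree walk: scalar code hides one data-dependent node load per level, while the "just vectorize it" version turns every level into gathers from the node arrays and ends up limited by load bandwidth. The tree layout, depth, and shapes here are invented for the example.

```python
import numpy as np

# Illustrative stand-in only, NOT the actual reference kernel.
# Nodes are stored structure-of-arrays style; each work item walks the tree.
rng = np.random.default_rng(0)
N_NODES, N_ITEMS, DEPTH = 1024, 4096, 10
left   = rng.integers(0, N_NODES, N_NODES)   # child index arrays
right  = rng.integers(0, N_NODES, N_NODES)
thresh = rng.random(N_NODES)                 # per-node split value
keys   = rng.random(N_ITEMS)                 # one key per work item

def scalar(keys):
    out = np.empty(len(keys), dtype=np.int64)
    for i, k in enumerate(keys):
        node = 0
        for _ in range(DEPTH):
            # one dependent, data-dependent load per level: it looks cheap
            # in scalar code, but this is where the time actually goes
            node = left[node] if k < thresh[node] else right[node]
        out[i] = node
    return out

def vectorized(keys):
    # "just vectorize": process all items level by level, but every level is
    # now a gather (random-index load) into thresh/left/right, so the kernel
    # becomes load-bound rather than arithmetic-bound
    node = np.zeros(len(keys), dtype=np.int64)
    for _ in range(DEPTH):
        go_left = keys < thresh[node]                       # gather
        node = np.where(go_left, left[node], right[node])   # two more gathers
    return node

assert np.array_equal(scalar(keys), vectorized(keys))
```

Going below that plateau would mean restructuring how the node data is loaded and processed, as the comment says, rather than just widening the arithmetic.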
Very interesting, thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it finished, it seems like there is a lot more potential.
Can you share the agent-comparison harness code, or point to something similar? I want to learn the basics of benchmarking models in a practical way.
I tried GLM-4.7 running locally on a beefy GPU server. In about 3 minutes it got to 25,846 cycles, but then went in circles for about 90 minutes without making any meaningful progress, repeating the same mistakes and misdiagnosing the cause most of the time. It seems to understand what needs to happen to reach the goal, but keeps failing on the implementation side. It also seemed to understand that beating the target would require an entirely new approach (it kept leaning towards a wavefront design), but it couldn't find a solution within the very limited ISA.
Each model ran the same spec headlessly in its native harness (one shot).
Results:
Clearly none beat Anthropic's target, but gpt-5.2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
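Since the harness itself isn't posted in the thread, here is a minimal sketch of what a one-shot, headless comparison loop could look like. Every agent command, file name, and the "cycles" output format below is a placeholder assumption, not the real setup.

```python
import re
import subprocess
import time

# Placeholder commands: substitute each agent's actual headless invocation
# against the same spec/prompt file. These names are not real CLIs.
AGENTS = {
    "agent-a": ["agent-a-cli", "--headless", "--prompt-file", "spec.md"],
    "agent-b": ["agent-b-cli", "run", "spec.md"],
}

# Assumes the task's scorer prints something like "1234 cycles".
CYCLES_RE = re.compile(r"(\d+)\s*cycles")

def run_once(name, cmd, timeout_s=3600):
    """Run one agent headlessly, time it, then score whatever it produced."""
    start = time.monotonic()
    subprocess.run(cmd, check=False, timeout=timeout_s)
    elapsed = time.monotonic() - start
    # Re-score with the task's own simulator (placeholder command/paths) so
    # the cycle count is measured independently of the agent's own claims.
    scored = subprocess.run(["./run_simulator", "solution.asm"],
                            capture_output=True, text=True, check=False)
    match = CYCLES_RE.search(scored.stdout)
    cycles = int(match.group(1)) if match else None
    return {"agent": name, "cycles": cycles, "wall_time_s": round(elapsed, 1)}

if __name__ == "__main__":
    results = [run_once(name, cmd) for name, cmd in AGENTS.items()]
    for r in sorted(results,
                    key=lambda r: r["cycles"] if r["cycles"] is not None
                    else float("inf")):
        print(r)
```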