From a quick and shallow view of the paper, it looks very feasible (with a littl...

sleepyeldrazi · 2026-05-16T07:39:07 1778917147

Scratch that, I don't have that kind of money, and 3.5's architecture is a little more divergent from 3's, so it will be a bit less trivial. It does look possible, just not on a student's paycheck.

Boranbruh · 2026-05-16T07:54:46 1778918086

There are websites that let you rent GPUs for cheap, such as QuickPod. Have you checked those P2P GPU rentals out?

sleepyeldrazi · 2026-05-16T09:23:06 1778923386

My plan is to first check whether it works at all, using Qwen3.5 0.8B on my 3090 (it has the same architecture as Qwen3.6 27B, just scaled down a bit). If it does, I'll put up a git repo documenting the process for anyone who wants to use my approach, while I try to convince my uni to lend me H100s for a day.

sleepyeldrazi · 2026-05-16T18:02:11 1778954531

If anyone is interested in watching my 0.8B experiments: https://orthrus.kokoham.com/ . The current code is here: https://git.kokoham.com/sleepy/qwen_orthrus .

The hard part was that the original Orthrus works with transformers, but 3.5 (and 3.6) is a hybrid: 75% GatedDeltaNet + 25% GatedAttention. I am testing a trick that might make it work with the GatedDeltaNet, and dry runs are promising, but only a full training run will reveal if it works. More information is in the repo and on the site under the "What is this all about?" button.

Note: I may restart it or try different configs at different points. If the site is down, there is probably some sort of result/conclusion in the repo.

dot_treo · 2026-05-16T19:45:39 1778960739

And it also looks like the original authors are working on qwen 3.5 too: https://github.com/chiennv2000/orthrus/issues/1#issuecomment...

dot_treo · 2026-05-16T19:16:19 1778958979

I would probably treat each (3 GatedDeltaNet + 1 GatedAttention) group as one transformer block. When generating the next steps, one would then use the KV cache for the gated attention and skip the delta nets entirely.
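To make the grouping concrete, here is a minimal sketch of the layout being described. It only models layer names, not real modules. The 24-layer depth and the function names are hypothetical, chosen for illustration; the 3:1 ratio is the one quoted in this thread.

```python
# Sketch only: models the layer *layout*, not the actual Qwen modules.
# NUM_LAYERS is a hypothetical depth chosen for illustration.
NUM_LAYERS = 24
PERIOD = 4  # 3 GatedDeltaNet layers followed by 1 GatedAttention layer


def layer_pattern(num_layers: int, period: int = PERIOD) -> list[str]:
    """Every `period`-th layer is gated attention; the rest are GatedDeltaNet."""
    return [
        "GatedAttention" if (i + 1) % period == 0 else "GatedDeltaNet"
        for i in range(num_layers)
    ]


def group_into_blocks(pattern: list[str], period: int = PERIOD) -> list[list[str]]:
    """Bundle consecutive layers into (3 GatedDeltaNet + 1 GatedAttention) groups."""
    return [pattern[i:i + period] for i in range(0, len(pattern), period)]


pattern = layer_pattern(NUM_LAYERS)
blocks = group_into_blocks(pattern)
attention_fraction = pattern.count("GatedAttention") / len(pattern)

for index, block in enumerate(blocks):
    print(index, block)
print("attention fraction:", attention_fraction)
```

Under this layout only one layer per group keeps a KV cache, which is what the "skip the delta nets" idea relies on.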

sleepyeldrazi · 2026-05-16T20:04:16 1778961856

It is actually very exciting that they are also working on 3.5. I will keep this toy project up in the meantime, since trying it out and testing things around it helps me learn a bunch.

As for the idea of treating them as a block, that was my initial plan, but the GatedDeltaNet is doing most of the work in 3.5. Bundling them together would hurt acceptance rates drastically, potentially leaving the speedup not much bigger than, or even smaller than, the native MTP.
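For a rough feel of why acceptance rate matters so much, here is a generic speculative-decoding estimate. This is not Orthrus's actual accounting: it assumes each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per verification pass, and it ignores the cost of drafting. The function name and the sample values are made up for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes i.i.d. acceptance with probability `alpha` for each of `k`
    drafted tokens, plus the one token the verifier always contributes.
    Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates, only to show the trend.
for alpha in (0.9, 0.7, 0.5, 0.3):
    print(alpha, round(expected_tokens_per_step(alpha, k=4), 2))
```

The estimate falls toward 1 token per pass as `alpha` drops, which is the point at which the extra machinery stops paying for itself.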