Amazing. Is it possible to do this with Qwen 3.6 27B? Will it work with quants (...

sleepyeldrazi · 2026-05-16T07:18:28 1778915908

From a quick, shallow read of the paper, adapting it to qwen3.6 27B looks very feasible (with a little tinkering). The process looks somewhat similar to training a LoRA, or in a way to distilling your own model: a mini model learns to imitate it, and you glue the two together. I might bite the bullet and rent a GPU to do it for 3.6 27b, as this would solve a lot of my problems.

sleepyeldrazi · 2026-05-16T07:39:07 1778917147

Scratch that, I don't have that kind of money, and 3.5's architecture is a little more divergent from 3's, so it will be a bit less trivial. It does look possible, just not on a student's paycheck.

Boranbruh · 2026-05-16T07:54:46 1778918086

There are websites that let you rent GPUs for cheap, such as QuickPod. Have you checked those P2P GPU rentals out?

sleepyeldrazi · 2026-05-16T09:23:06 1778923386

My plan is to first check whether it works at all using qwen3.5 0.8B (it has the same architecture as qwen3.6 27b, just scaled down a bit) on my 3090. If it does, I'll put up a git repo documenting the process for anyone who wants to use my approach, while I try to convince my uni to lend me H100s for a day.

sleepyeldrazi · 2026-05-16T18:02:11 1778954531

If anyone is interested in watching my 0.8B experiments: https://orthrus.kokoham.com/ . The current code is here: https://git.kokoham.com/sleepy/qwen_orthrus .

The hard part was that the original Orthrus works with transformers, but 3.5 (and 3.6) is hybrid: 75% GatedDeltaNet + 25% GatedAttention. I am testing a trick that might make it work with the GatedDeltaNet, and dry runs are promising, but only a full training run will reveal whether it works. More information is in the repo and on the site under the "What is this all about?" button.

Note: I may restart it or try different configs at various points. If the site is down, there is probably some sort of result or conclusion in the repo.

dot_treo · 2026-05-16T19:45:39 1778960739

And it also looks like the original authors are working on qwen 3.5 too: https://github.com/chiennv2000/orthrus/issues/1#issuecomment...

dot_treo · 2026-05-16T19:16:19 1778958979

I would probably treat each (3 GatedDeltaNet + 1 GatedAttention) group as one transformer block. When generating next steps, one would then use the KV cache for the gated attention and skip the delta nets entirely.
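To make the grouping concrete, here is a minimal sketch of the 3 + 1 layout described in this thread. The layer count of 24, the layer names, and the assumption that the attention layer comes last in each group are all illustrative; none of it is read from an actual Qwen config.

```python
# Hypothetical layout helper. The names and the layer count are illustrative
# assumptions, not values taken from a real Qwen 3.5/3.6 config.
def hybrid_layout(num_layers: int, period: int = 4) -> list[str]:
    """Mark every `period`-th layer as gated attention, the rest as GatedDeltaNet."""
    return [
        "gated_attention" if (i + 1) % period == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


def group_into_blocks(layout: list[str], period: int = 4) -> list[list[str]]:
    """Bundle each run of 3 GatedDeltaNet + 1 GatedAttention into one block."""
    return [layout[i:i + period] for i in range(0, len(layout), period)]


layout = hybrid_layout(24)  # 24 layers is an assumed size
blocks = group_into_blocks(layout)
attention_share = layout.count("gated_attention") / len(layout)

print(len(blocks), "blocks, first block:", blocks[0])
print("share of attention layers:", attention_share)
```

Only the last layer of each block keeps a KV cache in this picture; the three GatedDeltaNet layers carry recurrent state instead, which is what makes skipping them a real design decision rather than a free shortcut.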

sleepyeldrazi · 2026-05-16T20:04:16 1778961856

It is actually very exciting that they are also working on 3.5. I will keep this toy project up in the meantime; trying it out and testing things around it helps me learn a bunch.

As for treating them as a block, that was my initial plan, but the GatedDeltaNet is doing most of the work in 3.5. Bundling them together would hurt acceptance rates drastically, potentially leaving the speedup not much bigger than, or even smaller than, the native MTP's.
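A rough way to see why acceptance rate matters so much is the usual back-of-the-envelope model from the speculative decoding literature. It assumes each drafted token is accepted independently with probability `alpha`, which is a simplification, and `draft_overhead` is a made-up parameter for the total drafting cost per step relative to one target-model pass. The numbers are not measurements of Orthrus or MTP.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass when k tokens are drafted.

    Assumes each drafted token is accepted independently with probability
    alpha; the pass always yields at least one token.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, draft_overhead: float) -> float:
    """Speedup over plain decoding.

    draft_overhead is the assumed cost of producing the k drafts, expressed
    as a fraction of one target-model pass.
    """
    return expected_tokens(alpha, k) / (1 + draft_overhead)


# Illustrative sweep: same draft length and overhead, falling acceptance rate.
for alpha in (0.9, 0.7, 0.5):
    print(alpha, round(expected_tokens(alpha, 4), 2), round(speedup(alpha, 4, 0.4), 2))
```

Under this model the gain shrinks quickly as `alpha` drops, which is the concern with bundling the layers: a cheaper drafter does not help if most of its drafts get rejected.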

0-_-0 · 2026-05-16T12:29:10 1778934550

3.6 already supports multi token generation AFAIK

jbellis · 2026-05-16T13:25:13 1778937913

Yes, but it's not diffusion-based; it's still doing token-at-a-time speculation.

0-_-0 · 2026-05-16T17:48:07 1778953687

I thought it can do multiple tokens at a time

sleepyeldrazi · 2026-05-16T18:51:17 1778957477

Think of this as another way of achieving that. This theoretically has a higher ceiling on how much it can predict at a time and, more importantly, is a lot more memory efficient during actual inference.

regularfry · 2026-05-16T19:26:47 1778959607

There was a chart from the Unsloth folks posted to Reddit in the last couple of days which showed that the draft sweet spot for MTP was 2-3 tokens ahead depending on the quant. That's not much, and I think this might do a lot better. The whole "provably identical distribution" thing is doing a lot of work in my head, and I don't think that's true of the MTP model in qwen's architecture.
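For anyone wondering what an "identical distribution" guarantee looks like in practice, here is a sketch of the standard speculative sampling accept/reject rule, checked with exact fractions. This is the textbook rule, not something the thread confirms Orthrus uses verbatim; the function name and the toy distributions `p` and `q` are made up for illustration.

```python
from fractions import Fraction


def speculative_output_dist(p: dict, q: dict) -> dict:
    """Exact distribution of the token emitted by one accept/reject step.

    p is the target-model distribution and q the draft distribution over the
    same vocabulary. A draft x ~ q is accepted with probability
    min(1, p[x] / q[x]); on rejection, a token is resampled from the
    normalised residual max(0, p - q).
    """
    accept = {}
    for x in p:
        if q[x] == 0:
            accept[x] = Fraction(0)  # never drafted, so never accepted
        else:
            accept[x] = q[x] * min(Fraction(1), p[x] / q[x])
    reject_mass = 1 - sum(accept.values())
    residual = {x: max(Fraction(0), p[x] - q[x]) for x in p}
    z = sum(residual.values())
    return {
        x: accept[x] + (reject_mass * residual[x] / z if z else 0)
        for x in p
    }


# Toy distributions (illustrative values only).
p = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
q = {"a": Fraction(1, 5), "b": Fraction(3, 5), "c": Fraction(1, 5)}

out = speculative_output_dist(p, q)
print(out == p)
```

The draft distribution only changes how often drafts are accepted, not what comes out: however poor `q` is, the emitted token is still distributed according to `p`.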