Fantastic results. Well done. ...So this is built into the way the model works.....

dot_treo · 2026-05-16T09:27:56 1778923676

Just to get it into a GGUF file would be fairly trivial. But using that GGUF file would need a bunch of additional things. One would need to create a new architecture derived from Qwen3, and then probably adapt the speculative decoding functionality.

At the moment not even MTP is merged into llama.cpp, so I wouldn't quite hold my breath for it.

kreelman · 2026-05-16T13:05:54 1778936754

I thought that might be the case. I naively wondered. I'll see if I can understand the paper :-)

Hope the paper gets lots of references and the technique gets a lot of use to save power and time.

There's been several potential big changes for LLM inference efficiency over the last few months. There's been Attention Sequencing (I think it's called..?) Turbo Quant and this one.

Interesting times.

regularfry · 2026-05-16T19:29:02 1778959742

MTP merged today, a couple of hours after your post by the looks of things.

dot_treo · 2026-05-16T21:51:05 1778968265

By the looks of it, it will take a couple more follow up PRs to clean things up a bit and get the most performance from MTP. I hope that by that point it will be easier to add more spec decoding types.

In the meantime I've benchmarked Orthrus some more and got some quite promising results. So I'd be glad if my prediction that it may take some time until it lands in llama.cpp turns out to be wrong.