The hard part was that the original Orthrus works with transformers, but 3.5(and 3.6) is Hybrid: 75% GatedDeltaNet + 25% GatedAttention. I am testing a trick that might make is work with the GatedDeltaNet, and dry runs are promising, but only a full train will reveal if it works. More information in the repo and on the site under the "What is this all about?" button.
Note: i may restart it or try different configs at different points, if the site is down there is probably some sort of result/conclusion in the repo.
I would probably treat the (3 GatedDeltaNet + 1 GatedAttention) Blocks as one transformer block, when generating next steps one would therefore use the kv cache for the gated attention and skip the entire delta nets.
It is actually very exciting that they are also working on 3.5, I will keep this toy project up in the meantime, trying it out and testing things around it helps me learn a bunch.
As for the treating them as a block idea, that was my initial plan, but the GatedDeltaNet is doing most of the work in 3.5. Trying to bundle them together would hurt acceptance rates drastically, potentially making the speed benefits not a lot bigger, or smaller, than the native MTP.
The hard part was that the original Orthrus works with transformers, but 3.5(and 3.6) is Hybrid: 75% GatedDeltaNet + 25% GatedAttention. I am testing a trick that might make is work with the GatedDeltaNet, and dry runs are promising, but only a full train will reveal if it works. More information in the repo and on the site under the "What is this all about?" button.
Note: i may restart it or try different configs at different points, if the site is down there is probably some sort of result/conclusion in the repo.