these 0.5 and 0.6B models etc. are _fantastic_ to use as draft models in speculative decoding. lm studio makes this super easy to do - i have it on like every model i play with now
my concern with these models though, unfortunately, is that the architectures seem to vary a bit, so idk how well it'll work
I suppose that makes sense. For some reason I was under the impression that the models needed to be aligned / have the same tuning, or they'd have different probability distributions and the draft tokens would get rejected really often.
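(Worth noting: matched tuning isn't actually required for *correctness* - the accept/reject rule compensates for distribution mismatch, it just accepts less often when the models disagree. Here's a minimal sketch of that rule from the speculative sampling papers (e.g. Leviathan et al. 2023), assuming both models share a tokenizer/vocab; the function and variable names are purely illustrative:)

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(p_target, p_draft, draft_tokens):
    """Accept/reject rule from speculative sampling (Leviathan et al. 2023).

    p_target[i] / p_draft[i] are the two models' next-token distributions
    at draft position i (so both models must share a vocab/tokenizer).
    The output is distributed exactly as if sampled from the target model;
    a mismatched draft only lowers the acceptance rate, never correctness.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # accept the draft token with probability min(1, p_target/p_draft)
        if rng.random() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
            out.append(tok)
        else:
            # on rejection, resample from the normalized residual
            # max(0, p_target - p_draft); this correction is what keeps
            # the overall output distribution identical to the target's
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            out.append(rng.choice(len(residual), p=residual / residual.sum()))
            break
    # (the full algorithm also samples one bonus token from the target
    # when every draft token is accepted; omitted here for brevity)
    return out

# toy 3-token vocab, one draft position: draft proposes token 1
p_t = [np.array([0.6, 0.3, 0.1])]
p_d = [np.array([0.2, 0.7, 0.1])]
print(speculative_accept(p_t, p_d, [1]))
```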
Have you had any luck getting actual speedups? All the combinations I've tried (smallest 0.6B draft + the largest target I can fit into 24GB) got me slowdowns despite a decent hit rate.
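fwiw there's a simple back-of-envelope cost model (also from the Leviathan et al. paper) that explains how a decent hit rate can still net out to a slowdown: if a draft forward pass isn't cheap enough relative to a target pass, the drafting overhead eats the win. A rough sketch, with purely illustrative numbers:

```python
def expected_speedup(alpha, gamma, c):
    """Back-of-envelope speedup model from Leviathan et al. 2023.

    alpha: per-token acceptance rate ("hit rate")
    gamma: number of tokens drafted per cycle
    c: cost of one draft forward pass relative to one target forward pass

    Expected tokens accepted per verification cycle, divided by the cost
    of that cycle (gamma draft passes plus one target pass).
    """
    tokens_per_cycle = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return tokens_per_cycle / (gamma * c + 1)

# same hit rate, very different outcomes depending on draft cost:
print(expected_speedup(0.7, 4, 0.15))  # ~1.73x -- cheap draft wins
print(expected_speedup(0.7, 4, 0.50))  # ~0.92x -- net slowdown
```

One way that second case happens in practice is when the draft and target compete for the same memory bandwidth or VRAM, so the draft pass isn't as cheap as its parameter count suggests.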