No, it would make it worse. This adds more computation and sacrifices throughput...

yencabulator · 2026-05-27T18:14:17

Non-predicted token generation requires num_of_tokens_output passes over the weights.

Correctly-predicted token generation requires num_of_tokens_output/prediction_size passes over the weights, plus a much smaller model to make those predictions.

Incorrectly-predicted token generation adds some overhead to the above, and that overhead grows as the hit rate drops.

It sounds like good predictions would actually decrease the total overhead while improving latency. (Same FLOPs, but less memory bandwidth consumed -> probably run just as hot, but get more done.)
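
A back-of-the-envelope sketch of that arithmetic, in case it helps. The function name and the all-or-nothing hit model are my own simplifications, not how any particular implementation works: each pass over the big model's weights either accepts the whole predicted batch (with probability hit_rate) or yields a single token. The cost of the small prediction model is ignored.

```python
def target_passes(num_of_tokens_output: int,
                  prediction_size: int = 1,
                  hit_rate: float = 1.0) -> float:
    """Estimate passes over the large model's weights (hypothetical helper).

    Assumption: a pass either accepts all prediction_size tokens
    (probability hit_rate) or produces just one token.
    """
    if prediction_size < 1:
        raise ValueError("prediction_size must be >= 1")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Expected tokens produced per pass over the weights.
    tokens_per_pass = hit_rate * prediction_size + (1.0 - hit_rate) * 1
    return num_of_tokens_output / tokens_per_pass


# No prediction: one pass per output token.
print(target_passes(1000))
# Perfect prediction, 4 tokens at a time.
print(target_passes(1000, prediction_size=4, hit_rate=1.0))
# Half the predictions miss.
print(target_passes(1000, prediction_size=4, hit_rate=0.5))
```

Under this toy model, a hit rate of 0 falls back to one pass per token, so the only loss is the small model's work and the wasted verification FLOPs.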