
No, it would make it worse.

This adds computation and sacrifices throughput to improve the latency of serial, single-user generation.

Large scale providers run inference in batches, sacrificing latency to gain throughput.



Non-predicted token generation requires num_of_tokens_output passes over the weights.

Correctly predicted token generation requires num_of_tokens_output/prediction_size passes over the weights, plus a much smaller model to make those predictions.

Incorrectly predicted token generation adds some overhead to the above, and that overhead grows as the hit rate drops.

It sounds like good predictions would actually decrease the total overhead while improving latency. (Same FLOPs, but less memory bandwidth consumed -> probably run just as hot, but get more done.)
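
To put rough numbers on the three cases above, here is a back-of-the-envelope sketch. It is a toy model, not how any particular inference stack counts: the function names are made up, each predicted token is assumed to be a hit independently with probability hit_rate, one large-model pass is assumed to always yield at least one valid token, and the cost of the small prediction model is ignored.

```python
def baseline_passes(num_of_tokens_output: int) -> float:
    """Non-predicted generation: one pass over the weights per output token."""
    return float(num_of_tokens_output)


def expected_tokens_per_pass(prediction_size: int, hit_rate: float) -> float:
    """Expected tokens committed by one large-model pass (toy assumption).

    The first position in a pass is always valid. Position i is only valid
    if the i predictions before it were all hits, which happens with
    probability hit_rate ** i under the independence assumption.
    """
    return sum(hit_rate ** i for i in range(prediction_size))


def predicted_passes(num_of_tokens_output: int,
                     prediction_size: int,
                     hit_rate: float) -> float:
    """Expected passes over the large model's weights with prediction."""
    per_pass = expected_tokens_per_pass(prediction_size, hit_rate)
    return num_of_tokens_output / per_pass


if __name__ == "__main__":
    n, k = 1000, 4
    print("baseline:", baseline_passes(n))
    for hit_rate in (1.0, 0.8, 0.5, 0.0):
        print(f"hit_rate={hit_rate}:", round(predicted_passes(n, k, hit_rate), 1))
```

With a perfect hit rate this reduces to num_of_tokens_output/prediction_size, and with a hit rate of zero it falls back to the non-predicted count, so under these assumptions the passes over the large weights never exceed the baseline. What a miss costs is the wasted compute on the rejected positions plus the small model's work, neither of which this sketch counts.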



