That’s what I hope for, but everything with unified memory that isn’t bananas expensive has very low memory bandwidth. DGX (Digits), Framework Desktop, and non-Ultra Macs are all around 128 GB/s, and will produce single-digit tokens per second for larger models: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
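To see why, here’s a back-of-envelope sketch (not from the linked benchmarks, and the model sizes are illustrative assumptions): decoding a dense model is roughly memory-bandwidth-bound, since every generated token has to stream all of the weights from memory once, so tokens/s is capped at bandwidth divided by model size in bytes.

    # Rough upper bound on decode speed for a memory-bandwidth-bound dense model.
    # tokens/s <= bandwidth / model_bytes; figures below are assumptions for illustration.

    def max_tokens_per_sec(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
        model_gb = params_b * bytes_per_param  # total weight bytes read per generated token
        return bandwidth_gb_s / model_gb

    for params_b, bytes_per_param, label in [
        (8, 0.5, "8B @ 4-bit"),
        (70, 0.5, "70B @ 4-bit"),
        (70, 2.0, "70B @ fp16"),
    ]:
        print(f"{label}: <= {max_tokens_per_sec(128, params_b, bytes_per_param):.1f} tok/s at 128 GB/s")

A 70B model quantized to ~4 bits is about 35 GB of weights, so 128 GB/s gives you at most ~3.7 tok/s even before any compute or overhead, which is why those machines land in the single digits for larger models.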

So there’s a fundamental tradeoff between cost, inference speed, and hostable model size for the foreseeable future.


