That’s what I hope for, but everything that isn’t bananas expensive with unified memory has very low memory bandwidth. DGX (Digits), the Framework Desktop, and non-Ultra Macs all sit around 128 GB/s, which works out to single-digit tokens per second for larger models: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
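Napkin math for why, assuming a dense model where every generated token has to stream all the weights from memory once, so bandwidth divided by model size is a rough upper bound on decode speed (the 40 GB figure below is just an example, roughly a 70B model at ~4-bit quantization):

  # Upper bound: tokens/s <= memory bandwidth / model size in memory
  bandwidth_gb_s = 128   # DGX Spark / Framework Desktop / non-Ultra Macs
  model_size_gb = 40     # e.g. ~70B dense model at ~4-bit quantization
  print(bandwidth_gb_s / model_size_gb)  # ~3.2 tokens/s, best case

Real throughput comes in under that bound once KV cache reads and compute overhead are factored in.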
So there’s a fundamental tradeoff between cost, inference speed, and hostable model size for the foreseeable future.