I wanted to believe, but anyone who has spent any time trying to run models locally knows this is not going to be solved by the two lines of Python running on ROCm that the example shows.
I am running a 4x GPU rig at home (similar to a mining rig) doing everything from LLMs to content creation. I have learned a lot. Having an AI rig today is much like having an early PC in the 80s. You don't appreciate the possible uses until you have it in your hands.
All you need is a used GPU slapped onto any disused DDR4 mobo. New 5060s, the 16GB models, can do basically everything now.
A couple 5060s and a couple 3060s. They are wired via PCIe risers to an older mobo with an AMD CPU. (I wanted to avoid long 3-fan cards.) It looks like a mining rig, but with thicker PCIe risers. Many LLM tools easily leverage multiple GPUs. Sucks 800W at full load, idles below 50W.
Would you please share a link to your chassis and risers? I have the PCIe lanes, but have not yet encountered a reasonable way to have more than 3 GPUs directly attached to a host, both from physical space and power requirements. External PCIe switch cases are not reasonably available to mortals :/
Just search Amazon for "pci4 riser" and you can get them up to a foot long. Any sort of mining frame will do. Power is a bigger issue. Running multiple power supplies is something I know about but have not done personally. Nor do I want to. I'm happy keeping everything on one circuit.
If you have either PCIe slots or risers you can put them in the one system.
llama.cpp will let you run inference remotely across different systems, but I suspect the latency would be far too high for it to be worthwhile. If you have three systems already then it would cost you a few minutes to test it.
With multiple cards in normal PCI express slots, LLM layers are split across cards.
When you run inference, it runs on one card then the other card. You can repeat this for as many cards as you want.
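A rough sketch of that layer split, for anyone who wants to see it spelled out. This is only an illustration of the idea, not how llama.cpp or any other tool actually allocates layers: `split_layers`, `forward`, the VRAM-proportional rule, and the 16/16/12/12 GB card sizes are all assumptions made up for the example.

```python
from typing import Callable, List


def split_layers(n_layers: int, vram_gb: List[float]) -> List[range]:
    """Give each GPU a contiguous block of layers, sized in proportion to its VRAM.

    The proportional rule is a hypothetical choice for this sketch.
    """
    total = sum(vram_gb)
    assignments = []
    start = 0
    cumulative = 0.0
    for i, vram in enumerate(vram_gb):
        cumulative += vram
        if i == len(vram_gb) - 1:
            end = n_layers  # last card takes whatever is left
        else:
            end = round(n_layers * cumulative / total)
        assignments.append(range(start, end))
        start = end
    return assignments


def forward(x, layers: List[Callable], assignments: List[range]):
    """Run the layers in order: one card works at a time, then hands off."""
    for layer_ids in assignments:
        for j in layer_ids:
            x = layers[j](x)  # in a real setup this runs on the card owning layer j
        # only x (the activations) would be copied to the next card here
    return x


# Example: 32 layers over four cards with assumed VRAM sizes.
plan = split_layers(32, [16, 16, 12, 12])
print([len(r) for r in plan])
```

The weights never move after loading. Each card keeps its own block of layers, and only the small intermediate result crosses the bus.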
You only copy the activations between the cards, which is ~10 MB/sec at runtime, so PCIe width or generation is irrelevant. Even PCIe 1.0 x1 would be sufficient.
There are other software optimisations (row split, tensor parallel) which require fast interlinks like NVLink but you can get a long way without any of that.
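The activation-copy figure can be sanity-checked with some back-of-envelope arithmetic. The sketch below assumes one hidden-state vector per generated token crosses each card boundary; the hidden size, fp16 activations, token rate, and card count are illustrative guesses, not measurements from anyone's rig. Prompt processing pushes many tokens through at once, so it will be burstier than this.

```python
def activation_bandwidth_mb_s(hidden_size: int, bytes_per_value: int,
                              tokens_per_sec: float, n_gpus: int) -> float:
    """Total MB/s of activations crossing card boundaries while generating tokens."""
    boundaries = n_gpus - 1  # one hand-off between each pair of neighbouring cards
    bytes_per_token = hidden_size * bytes_per_value * boundaries
    return bytes_per_token * tokens_per_sec / 1e6


# PCIe 1.0 x1 is roughly 250 MB/s per direction.
PCIE_1_X1_MB_S = 250.0

# Assumed values: hidden size 4096, fp16 (2 bytes), 50 tokens/sec, 4 cards.
rate = activation_bandwidth_mb_s(hidden_size=4096, bytes_per_value=2,
                                 tokens_per_sec=50, n_gpus=4)
print(f"{rate:.2f} MB/s needed vs {PCIE_1_X1_MB_S:.0f} MB/s available")
```

With those assumptions the total comes out around 1.2 MB/s, so there is a lot of headroom even on the slowest link. Loading the model onto the cards in the first place is a separate matter and does benefit from a wider link.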
I am running Open WebUI + Ollama + a 7B model in a Proxmox LXC container. It consumes less than 2GB, the GPU only has 4GB, and it uses 50% CPU. It is very usable, sometimes faster than online ones to start giving you the answer, and 100% offline.
If I replace the GPU with a faster one, I have no need to use online ones.
The main thing to consider is that how you run the models does not need to be coupled to what you send the models (and how you orchestrate agents).
I've used several agent frameworks and they all support many different providers, from cloud to local. These are orthogonal responsibilities. I'm using Vertex AI for cloud and Ollama on a Minisforum with ROCm locally. There is a dropdown to change between them.
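To make the separation concrete, here is a minimal sketch of what that looks like in code. It is not taken from any particular agent framework: `Complete`, `ollama_provider`, `stub_provider`, `summarize`, and the `llama3` model name are hypothetical names for the example. The local provider uses Ollama's `/api/generate` HTTP endpoint on its default port; the cloud side is a stub so the sketch runs without credentials.

```python
import json
import urllib.request
from typing import Callable, Dict

Complete = Callable[[str], str]  # prompt in, text out


def ollama_provider(model: str, host: str = "http://localhost:11434") -> Complete:
    """Local provider: calls Ollama's /api/generate endpoint, non-streaming."""
    def complete(prompt: str) -> str:
        body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(
            f"{host}/api/generate",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]
    return complete


def stub_provider(prefix: str) -> Complete:
    """Stand-in for a cloud provider; a real one would call the vendor's SDK."""
    return lambda prompt: f"{prefix}: {prompt}"


def summarize(complete: Complete, text: str) -> str:
    """The 'agent' side: builds the prompt and knows nothing about where the model runs."""
    return complete(f"Summarize in one sentence:\n{text}")


# The "dropdown": pick a provider by name, the calling code stays the same.
PROVIDERS: Dict[str, Complete] = {
    "local": ollama_provider("llama3"),  # model name is an assumption
    "cloud-stub": stub_provider("cloud"),
}

print(summarize(PROVIDERS["cloud-stub"], "Local rigs are fun."))
```

Swapping `"cloud-stub"` for `"local"` is the only change needed to hit a running Ollama instance instead.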
Perhaps not a good example, but I tried running local models a few times, to much disappointment (it actually made me skeptical of LLMs in general for a while).
My last experiment in January was trying to run a Qwen model locally (RTX 4080; 128GB RAM; 9950X3D). I must have been doing it extremely wrong because the models that I tried either hallucinated severely or got stuck in a loop. The funniest one was stuck in a "but wait, ..." loop.
I fortunately had started experimenting with Claude, so I opted to pay Anthropic more money for tokens (work already covers the bill, this was for personal use).
That whole experience, plus a noisy GPU, put me off the idea of running/building local agents.
I have a Mac Studio with 512GB RAM and ran models of different sizes to test how good local agents are. I agree that local models aren't there yet, but that depends on whether you need a lot of knowledge to answer your question, and I think it should be possible to either distill or train a smaller model that works on a subset of knowledge tailored toward local execution. My main interest is in reducing latency. It feels like local agents that work at high speeds should be an answer to this, but it's not something anyone is trying to solve yet. If I could get a smaller model that runs at incredible speed locally, that could unlock some interesting autoresearching.
Also running gemma-4 on Apple M5 Max. As fast or faster than Opus 4.6 extended, but of course not the same competence. However, great tunability with llama.cpp and no issues related to IP leakage.
Uhmm... I have a local Ollama setup on Linux+AMD, and it was only a bit more involved than this sample. And only because I wanted to run everything in a container.
If you mean that you can't just run the largest unquantized models, then it's indeed true.
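For a rough sense of where that limit sits, the memory needed for the weights alone is just parameter count times bits per weight. The sketch below is a simplification: it ignores the KV cache and runtime overhead, and real quantization formats use somewhat more than their nominal bit width. `weight_memory_gb` is a hypothetical helper, not part of Ollama or llama.cpp.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
```

So a 7B model needs about 14 GB of weights at 16-bit but only about 3.5 GB at 4-bit, which is why quantized models fit on small consumer cards and the big unquantized ones do not.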