I wanted to believe but anyone who has spent any time trying to run models local...

sandworm101 · 2026-04-14T01:01:08 1776128468

I am running q 4xgpu rig at home (similar to a mining rig) doing everything from llms to content creation. I have learned a lot. Having an AI rig today is much like having an early PC in the 80s. You dont appeciate the possible uses until you have it in your hands.

All you need is a used GPU slapped onto any disused ddr4 mobo. New 5060s, the 16gb models, can do basically everything now.

wingtw · 2026-04-14T04:35:41 1776141341

Can you specify a bit, what gpus and how do you wire them "together"? Nvlink?

sandworm101 · 2026-04-14T05:13:00 1776143580

A couple 5060s and a couple 3060s. They are wired via PCI risers to an older mono with an amd cpu. (I wanted to avoid long 3-fan cards.) It looks like a mining rig, but with thicker pci risers. Many llm tools easily leverage multiple GPUs. Sucks 800w at full load, idles below 50w.

MrDrMcCoy · 2026-04-14T14:58:21 1776178701

Would you please share a link to your chassis and risers? I have the PCIE lanes, but not yet encountered a reasonable way to have more than 3 GPUs directly attached to a host, both from physical space and power requirements. External PCIe switch cases are not reasonably available to mortals :/

sandworm101 · 2026-04-15T07:27:26 1776238046

Just search amazon for "pci4 riser" and you can get them up to a foot long. Any sort of mining frame will do. Power is a bigger issue. Running multiple power supplies is something i know about but have not done personally. Nor do i want to. Im happy keeping everything on one circuit.

xrd · 2026-04-14T12:34:17 1776170057

I have three 3090 cards. Are you saying you run them together using specialized hardware or can I somehow combine them using software over Ethernet?

suprjami · 2026-04-18T04:40:37 1776487237

If you have either PCIe slots or risers you can put them in the one system.

llama.cpp will let you run inference remotely across different systems but I suspect this would be far too latent to be worthwhile. If you have three systems already then it would cost you a few minutes to test it.

suprjami · 2026-04-18T04:38:33 1776487113

You don't.

With multiple cards in normal PCI express slots, LLM layers are split across cards.

When you run inference, it runs on one card then the other card. You can repeat this for as many cards as you want.

You only copy the activations between the cards which ~10 MB/sec at runtime so PCIe width or generation is irrelevant. Even PCIe 1.0 x1 would be sufficient.

There are other software optimisations (row split, tensor parallel) which require fast interlinks like NVLink but you can get a long way without any of that.

h4kunamata · 2026-04-13T23:39:46 1776123586

Not entirely.

I am running OpenWeb UI + Ollama + 7B on a Proxmox LXC container, it consumes less than 2GB, the GPU only has 4GB, and 50% CPU, it is very usable, sometimes faster than online ones to start giving you the answer and 100% offline.

If I replace the GPU with a faster one, I have no need to use online ones.

wilkystyle · 2026-04-13T22:15:58 1776118558

Curious to hear more. My experience is limited to llama.cpp on Apple silicon so far, but have been eyeing AMD ecosystem from afar.

craftkiller · 2026-04-13T23:35:27 1776123327

FWIW I run llama.cpp on AMD hardware using Vulkan. I've got no complaints but also nothing else to compare against.

verdverm · 2026-04-13T23:13:47 1776122027

The main thing to consider is that how you run the models does not need to be coupled to the what you send models (and how you orchestrate agents).

I've used several agent frameworks and they all support many different providers from cloud to local. These are orthogonal responsibilities. I'm using VertexAI for cloud and ollama on a minisforum with rocm locally. There is a dropdown to change between them.

nevi-me · 2026-04-13T22:22:55 1776118975

Perhaps not a good example, I tried running local models a few times, to much disappointment (actually made me skeptical of LLMs in general for a while).

My last experiment in January was trying to run a Qwen model locally (RTX 4080; 128GB RAM; 9950X3D). I must have been doing it extremely wrong because the models that I tried either hallucinated severely or got stuck in a loop. The funniest one was stuck in a "but wait, ..." loop.

I fortunately had started experimenting with Claude, so I opted to pay Anthropic more money for tokens (work already covers the bill, this was for personal use).

That whole experience + a noisy GPU, put me off the idea of running/building local agents.

buryat · 2026-04-13T23:11:04 1776121864

I have a Mac Studio with 512GB Ram and ran models of different sizes to test out how local agents are and I agree that local models aren't there yet but that depends on whether you need a lot of knowledge or not to answer your question, and I think it should be possible to either distill or train a smaller model that works on a subset of knowledge tailored toward local execution. My main interest is in reducing the latency and it feels that the local agents that work at high speeds should be an answer to this but it's not something that someone is trying to solve yet. Feels like if I could get a smaller model that could run at incredible speed locally that could unlock some interesting autoresearching.

robwwilliams · 2026-04-14T01:19:05 1776129545

Also running gemma-4 on Apple M5 Max. As fast or faster than Opus 4.6 extended but not of course the same competence. However, great tunability with llama.cpp and no issues related to IP leakage.

verdverm · 2026-04-13T23:18:09 1776122289

I've been running Gemma4, my initial experiments put it around gemini-3-flash levels (vibe evals)

musicale · 2026-04-14T03:51:49 1776138709

> Mac Studio with 512GB Ram

Nice to score one of those.

lostmsu · 2026-04-14T00:54:44 1776128084

I hope you are not running models under Q8, preferably Q8 directly from the vendor.

cyberax · 2026-04-14T00:31:00 1776126660

Uhmm... I have a local Ollama setup on Linux+AMD, and it was only a bit more involved than this sample. And only because I wanted to run everything in a container.

If you mean that you can't just run the largest unquantized models, then it's indeed true.