Field report · 001

What to Buy for Local LLM Inference: Strix Halo, Mac Studio, DGX Spark, or a GPU Rig

A buyer's guide to local LLM hardware, ranked by the one spec that actually decides generation speed: memory bandwidth. Plus where ROCm really stands on Strix Halo.

Author: Erik
Date: 2026-06-25
Read time: 13 min

Ordered by the one number that actually decides it: memory bandwidth. Plus where ROCm really stands on Strix Halo, which comes down to your software stack more than your hardware.

A note on links: some hardware below is linked through affiliate programs, marked (affiliate). If you buy through one, I earn a small commission at no cost to you. It never changes what I recommend or what I run. These are the boxes I'd point you to either way, and I will say plainly where I have no link and recommend the thing anyway.

I've run a Strix Halo box (Ryzen AI Max+ 395) in production since January. If you're shopping for local-LLM hardware, the usual version of the question is "what's the best mini PC for local AI." That's not quite the right one, and taking it at face value is how you end up with the wrong box.

Six months in, I think the buying decision is really two separate ones.

First: is Strix Halo even the right class of machine for what you run, or are you better off with a Mac Studio, a DGX Spark, or a multi-GPU rig? One number, memory bandwidth, decides that more than anything else on the spec sheet.

Second: if you do land on Strix Halo, what should you expect from the software? The chip is the same across all of these boxes, and so, mostly, is the board. What varies is the ROCm software stack, and that's on no spec sheet. It's the part that cost me the most to learn. I spent months convinced it was a hardware problem before a different software stack proved me wrong.

Think of this as a map. I keep it current as boxes ship, firmware moves, and the next silicon lands.

One number decides most of this: memory bandwidth

Token generation is bound by memory bandwidth. Every token you produce means reading all of the model's active weights out of memory at least once: the whole model for a dense one, only the routed experts plus the shared layers for an MoE. So the rate you can move bytes across the bus sets your ceiling on tokens per second. Compute matters for prompt processing (prefill), and capacity decides what fits at all, but generation speed, the thing most people are actually buying for, comes down to bandwidth. I went deep on the why in What 96GB of VRAM Actually Gets You. I will keep it short here.
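A back-of-envelope sketch of that ceiling, assuming decode is purely bandwidth-bound and each token reads every active weight exactly once. The 0.5 bytes per parameter (roughly 4-bit quantization) is an assumption for illustration, not a figure from this guide, and the function name is hypothetical. For an MoE model only the active parameters count, which is why a 3B-active model gets a much higher ceiling than a 31B dense one on the same bus.

```python
def decode_ceiling_tps(bandwidth_gb_s: float,
                       active_params_b: float,
                       bytes_per_param: float) -> float:
    """Upper bound on tokens/s when generation is purely bandwidth-bound."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


STRIX_HALO_GB_S = 256.0   # from the comparison table
BYTES_PER_PARAM = 0.5     # assumption: ~4-bit quantized weights

dense_31b = decode_ceiling_tps(STRIX_HALO_GB_S, 31.0, BYTES_PER_PARAM)
moe_3b_active = decode_ceiling_tps(STRIX_HALO_GB_S, 3.0, BYTES_PER_PARAM)

print(f"31B dense ceiling:     {dense_31b:.1f} t/s")
print(f"3B-active MoE ceiling: {moe_3b_active:.1f} t/s")
```

Treat the result as a ceiling, not a prediction. Measured speeds land well below it once KV-cache reads, shared layers, and runtime overhead are counted, but the gap between the dense and MoE ceilings points the same direction as the measured gap.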

Comparing the four classes of local AI boxes by bandwidth shows why you should consider one over the other:

Class	Bandwidth	Max memory	Rough price	Software
Multi-GPU rig (2 to 4 cards)	~0.9 to 1.8 TB/s per card	24 to 32 GB per card	$2,000 to $6,000+	CUDA
Mac Studio M3 Ultra	819 GB/s	~96 GB now (was 512)	$3,999+	MLX / Metal
Mac Studio M4 Max	546 GB/s	up to 128 GB	$1,999+	MLX / Metal
DGX Spark (GB10) + clones	273 GB/s	128 GB	$3,999 to $4,699	CUDA (full stack)
Strix Halo (Ryzen AI Max+ 395)	256 GB/s	up to 128 GB	~$1,800 to $3,000	Vulkan, ROCm

The most useful thing on that table is the bottom two rows sitting next to each other. The DGX Spark, NVIDIA's $4,000 "personal AI supercomputer," runs at 273 GB/s. A Strix Halo box at less than half the price runs at 256 GB/s. For token generation they're in the same class. So the Spark's price isn't really about inference speed. It's about CUDA and the NVIDIA stack, which is a fair reason to want one. The bandwidth-bound part of your workload just won't notice the difference.

Everything else on the table follows the same logic. The Mac Studio M3 Ultra's 819 GB/s is the real unified-memory bandwidth leap, roughly three times Strix Halo or Spark, and it shows up directly as faster generation. A multi-GPU rig's 1 TB/s-plus per card is faster still, on whatever fits in one card's VRAM. None of this is subtle once you look at the bandwidth numbers instead of the marketing.
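To make the comparison concrete, here is a small sketch that expresses each unified-memory box's bandwidth as a multiple of Strix Halo's. The numbers are copied from the table above; the multi-GPU row is left out because its figure is per card, not per machine. The helper name is hypothetical.

```python
# Memory bandwidth in GB/s, taken from the comparison table.
BANDWIDTH_GB_S = {
    "Mac Studio M3 Ultra": 819,
    "Mac Studio M4 Max": 546,
    "DGX Spark (GB10)": 273,
    "Strix Halo": 256,
}


def relative_to(baseline: str, table: dict) -> dict:
    """Return each entry's bandwidth as a multiple of the baseline's."""
    base = table[baseline]
    return {name: bw / base for name, bw in table.items()}


for name, ratio in relative_to("Strix Halo", BANDWIDTH_GB_S).items():
    print(f"{name:22s} {ratio:.2f}x")
```

The Spark comes out within about 7% of Strix Halo, while the M3 Ultra sits a little over three times higher.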

Run this when: picking your class

A multi-GPU rig (used 3090s, 4090s, 5090s)

Run this when raw generation speed is the priority and your models either fit in one card's VRAM or you're fine sharding across cards. The dense models that crawl on unified memory run fast here. And multi-user serving, batch throughput, and training or fine-tuning are where a GPU rig clearly beats everything else on this list, because you have both the bandwidth and the compute, and you scale by adding cards. On the cards themselves, used 3090s (affiliate) are still the cheapest way to 24 GB, and a new 5090 (affiliate) is the fast end if you'd rather buy current than hunt secondhand.

Skip it when power, heat, and noise matter, or when you'd rather have one capacity number than a sharding problem. Two to four cards plus a board and PSU to feed them draws 700W and up under load, it's loud, and each card caps you at 24 to 32 GB, so reaching Strix Halo's 96 GB takes three or four of them. It's the most capable option on this list, and also loud and power-hungry enough that you'll want it in a different room than the one you work in.
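The card count is simple ceiling division. A minimal sketch, assuming the target is raw VRAM capacity only and ignoring per-card overhead and any uneven split when sharding (both of which push the real number up, not down):

```python
import math


def cards_needed(target_gb: float, vram_per_card_gb: float) -> int:
    """Smallest number of cards whose combined VRAM reaches the target."""
    return math.ceil(target_gb / vram_per_card_gb)


for vram in (24, 32):
    print(f"{vram} GB cards to reach 96 GB: {cards_needed(96, vram)}")
```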

Mac Studio (M4 Max or M3 Ultra)

Run this when you want unified memory done right and you're comfortable on macOS. The M3 Ultra's 819 GB/s is the real thing, and MLX has matured into a genuinely good local-inference path. It's also silent, barely sips power, and is easily the most polished machine here.

Skip it when you need CUDA or ROCm specifically, because you get neither (MLX and Metal only). And check current configs before you commit, because the launch-day reviews miss something: the 2026 DRAM crunch hit the Mac Studio in two rounds. In March the 512 GB option went away, capping the Ultra at 256 GB, and the 256 GB upcharge jumped. In May the Ultra dropped again, down to 96 GB. So the Ultra now tops out around 96 GB, and the M4 Max somewhere between 96 and 128 GB depending on config and stock. Verify against the live Apple Store right before you buy, because the shortage has been moving these month to month. Either way, the "fit a 405B model on your desk" pitch is, for the moment, not a machine you can actually order. An M5 refresh is expected later in the year, so if huge unified memory is the whole reason you want a Mac, it may be worth waiting.

DGX Spark and its clones (ASUS, Dell, HP, Lenovo, Acer, MSI, Gigabyte)

Run this when you specifically need CUDA, NVIDIA's full software stack, FP4/Blackwell features, or development parity with what you'll deploy to datacenter NVIDIA hardware. The Spark is a prototyping and dev box first: the same CUDA libraries as the big DGX systems, in a 1-liter case. It also has a genuine clustering option: two units connect directly over their built-in ConnectX-7 200GbE ports into a 256 GB pool, which none of the others here match cleanly. If you do want this class, the OEM clones sometimes list below NVIDIA's own Founders Edition, often with less onboard storage, so check the config. The ASUS Ascent GX10 (affiliate) is the one most commonly in stock.

Skip it if you're buying it for inference throughput, because 273 GB/s is Strix Halo territory and the "1 petaFLOP" headline only holds for sparse FP4, a narrow case. It's caught the same DRAM crunch too, with a price hike pushing the Founders Edition past $4,000. The premium is worth it if you actually need CUDA and the ecosystem. It isn't worth it if what you want is tokens per second, because the bandwidth doesn't deliver them.

Strix Halo (Ryzen AI Max+ 395)

Run this when you're a single user, your models are MoE-shaped, and you care more about capacity per watt per dollar than peak speed. This is the one I run, and for maybe 80% of what I'd otherwise hand to a small cloud model, an MoE model like Qwen3.6-35B-A3B does 45 to 50 t/s on it. That's honestly fast enough that I stop noticing the latency. You get up to 96 GB of model space on the GPU side, in a box that stays under 200W and costs less than a high-end gaming card.

Skip it if you run dense models. They crawl here: the same chip only manages about 6 t/s on a 31B dense one. It's also not for serving more than a couple of people at once. And if your workflow leans on ROCm for image or video gen or training, that part is still rough, which I get into below.

If Strix Halo is your class: the software stack decides whether ROCm works

This is the part that cost me the most to learn, and I had it wrong at first.

The ROCm problem I kept hitting was on the PyTorch and HIP side, specifically image and video generation through ComfyUI. The moment a model's text encoder loaded onto the GPU, the HSA runtime threw a fault and the process was dead. I filed it as ROCm issue #6182, worked through fourteen configurations, flashed a newer BIOS, and got nowhere. For inference I'd already moved to Vulkan through Mesa's RADV driver, which has run everything I publish since with no slowdown I can measure.

For months I assumed I couldn't run models under ROCm at all. I never actually checked: the image and video stack hit #6182, and I'd never built a HIP llama.cpp to test inference on its own. Vulkan carried everything, so there was no reason to. When I finally tested it, small models loaded clean on the exact same board that supposedly couldn't do ROCm. The board was never the problem. The proof is in the fault itself: when #6182 fires, dmesg is silent, no GPU page fault, no ring reset, nothing kernel-side. The whole failure lives in the HSA userspace runtime. So #6182 is a software bug, and the people running ROCm fine on the same Sixunited board were just on a stack that doesn't hit it.
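If you want to run the same kernel-side check on your own box, a rough sketch follows. It filters kernel log text for amdgpu lines that look like page faults, ring timeouts, or GPU resets. The keyword list is an assumption: exact amdgpu message wording varies by kernel version, so treat an empty result as a hint, not proof. The function name is hypothetical.

```python
import re

# Assumption: these keywords are typical of kernel-side amdgpu trouble.
# Exact message text differs between kernel versions.
KERNEL_FAULT_PATTERN = re.compile(
    r"page fault|ring .*timeout|gpu reset", re.IGNORECASE
)


def kernel_side_gpu_faults(dmesg_text: str) -> list:
    """Return amdgpu log lines that look like kernel-side faults."""
    return [
        line
        for line in dmesg_text.splitlines()
        if "amdgpu" in line.lower() and KERNEL_FAULT_PATTERN.search(line)
    ]
```

Feed it the text of the kernel log captured right after a crash (for example the output of dmesg, which may need root). An empty list is consistent with a failure that lives entirely in userspace.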

That covered the small models. The large ones, the size you'd actually buy this box for, were a separate problem, and the practical fix is a single flag. With memory-mapping on, which is llama.cpp's default, HIP tries to keep the model's file-backed pages GPU-resident, and on this APU that hangs. Pass --no-mmap and llama.cpp allocates fresh buffers and copies the weights in instead, and the same model runs. I checked it across ROCm versions and across a prebuilt and a self-compiled build. So ROCm runs the big models here too; you just need --no-mmap. The flag is a workaround for a memory-configuration cause underneath it, which I will cover in a follow-up. For buying, the flag is all you need.
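As a sketch of what that looks like in practice, here is a small helper that builds a llama.cpp server command line with memory-mapping turned off. The model path and the layer count of 99 (a common way to ask for full GPU offload) are placeholders, not values from this guide, and the helper itself is hypothetical. Only -m, -ngl, and --no-mmap are used, all standard llama.cpp options.

```python
import shlex


def llama_server_cmd(model_path: str,
                     use_mmap: bool = False,
                     gpu_layers: int = 99) -> list:
    """Build an argv list for llama-server; mmap is off by default here."""
    cmd = ["llama-server", "-m", model_path, "-ngl", str(gpu_layers)]
    if not use_mmap:
        # Copy weights into freshly allocated buffers instead of mapping the file.
        cmd.append("--no-mmap")
    return cmd


# Hypothetical model path, for illustration only.
print(shlex.join(llama_server_cmd("/models/example-moe.gguf")))
```

Keeping the flag behind a parameter makes it easy to flip back and compare once the underlying cause is fixed.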

I still run Vulkan. On my box it's about a quarter faster on decode (58 versus 47 t/s on a 35B MoE), and that's the number I care about most, so the practical choice never hinged on any of this. But ROCm inference does work here now, which is the part I'd wrongly written off.

What that means for buying:

For inference, Vulkan is the reliable path on every one of these boxes, and on mine it's about a quarter faster on decode than ROCm. ROCm runs the same models fine once you pass --no-mmap, but since Vulkan is quicker here, you don't need it.
Where ROCm still bites is the PyTorch and HIP path, image or video generation and training. That's what #6182 covers, and it's still open on this box, so if it's your use case, test it in week one rather than assuming it works.
The board barely enters into it. The chip is identical across these boxes, most of them are the same Sixunited board anyway, and which software stack you run, down to a single flag, matters far more than whose logo is on the case.

So here are the boxes. What actually differs between them is price, availability, and memory config, not a per-board ROCm verdict.

Box	Where it sits	Max RAM	Notes
Bosgame M5	Value	128 GB	What I run, on Vulkan. Sixunited AXB35-02 board. Sold direct, no affiliate link from me.
GMKtec EVO-X2 (affiliate)	Value	128 GB	Same Sixunited board as mine. The 96 GB config is the cheapest way into this class.
Minisforum MS-S1 Max (affiliate)	Value to middle	128 GB	The box in the benchmark I reproduced. Often the most aggressively priced 128 GB option.
Beelink GTR9 Pro (affiliate)	Middle	128 GB	Shipping now, and the strongest networking here with dual 10GbE. Its price has swung the most of any box on this list, so check the live listing. One caveat: the 10GbE can drop under sustained GPU load on current firmware.
Framework Desktop	Middle to premium	128 GB	Open and repairable, and you pay a little extra for that. No affiliate link from me, and I recommend it anyway.
AMD Ryzen AI Halo (Micro Center, check availability)	Premium	128 GB	You pay the premium for a stack AMD assembled and validated. See my full piece on it.

I've left hard prices off this table on purpose. The memory crunch is moving these boxes weekly, and they vary by seller and region, so any number printed here would be wrong within weeks. As a rough anchor, the 128 GB configs have lived in the $2,000 to $4,000-plus range, with 96 GB configs several hundred dollars less. The column above is relative positioning. For today's real number, check the live listing before you buy.

The honest read across that table: the GMKtec at 96 GB is the value pick, with the Bosgame in the same tier; the Minisforum is usually the most aggressive on price at 128 GB; the Beelink sits in the middle and carries the best networking; Framework is the one to buy if openness and repairability rank high (and no, I make nothing on it); and the AMD box is what you buy when you'd rather pay a clear premium for a stack AMD assembled and validated than wire it up yourself. They all have the identical chip, and most share the same board, so none of that spread is about the silicon.

Buy the memory now, because you cannot add it later

This one applies to every unified-memory box on this page: the memory is soldered. Strix Halo, Mac Studio, DGX Spark: they all use LPDDR5-class memory soldered down, on the package or on the board, and there's no upgrading it later. Whatever you buy is what you have for the life of the machine.

That matters more than usual in 2026, because the DRAM market is in a crunch. The Spark took a $700 price hike. Apple cut the Mac Studio's big-memory options back to around 96 GB. Prices are up everywhere and the big configs are scarce. So decide what you need to run and buy that capacity now. If you're on the fence between 64 GB and 128 GB on a Strix Halo box, get the 128 GB. You can't add it later, prices keep climbing, and the headroom is what keeps the box useful as models grow.
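A rough way to turn "what you need to run" into a capacity number: weights take about parameters times bits per weight divided by eight, plus headroom for KV cache and everything else. The 8 GB headroom default and the 4-bit examples are assumptions for illustration; real usage depends on context length and runtime. The function names are hypothetical.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB for a model of params_b billion parameters."""
    return params_b * bits_per_weight / 8


def fits(params_b: float, bits_per_weight: float,
         capacity_gb: float, headroom_gb: float = 8.0) -> bool:
    """True if weights plus headroom fit in capacity_gb.

    headroom_gb is an assumed allowance for KV cache and runtime buffers.
    """
    return weights_gb(params_b, bits_per_weight) + headroom_gb <= capacity_gb


for params_b in (35, 405):
    size = weights_gb(params_b, 4)
    verdict = "fits" if fits(params_b, 4, 96) else "does not fit"
    print(f"{params_b}B at 4-bit: ~{size:.1f} GB of weights, {verdict} in 96 GB")
```

Because the memory is soldered, run this for the largest model you expect to want over the machine's lifetime, not only the one you run today.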

The one thing you can add later on these mini PCs is storage. They take standard NVMe, and that market is saner. Buy modest storage up front and expand it when you need to. A Gen4 drive like the Samsung 990 Pro (affiliate) is more than enough here; all it does is load weights off disk.

Expansion and the Thunderbolt question

People ask whether they can add hardware later to get past the bandwidth ceiling. Mostly, no, and it's worth knowing why before you buy expecting to.

Most current Strix Halo boxes give you USB4 at 40 Gbps. The Mac Studio has Thunderbolt 5, which runs 80 Gbps in each direction and peaks at 120 Gbps one way in its asymmetric boost mode, so two to three times faster, and TB5 is starting to appear on newer mini PCs (verify per board, it's not universal yet). TB5 is genuinely nice for fast external storage and peripherals, like a TB5 portable SSD from Sabrent (affiliate). It won't fix the inference ceiling, though:

External GPU. An eGPU over Thunderbolt is capped by the link itself, well below native PCIe, and its VRAM is a separate pool from the unified memory anyway. For inference you can't just combine the two into one capacity number, so it won't get you the bigger pool of fast memory you're actually after.
Multi-box clustering. You can network two boxes, but over Ethernet or USB4, which is slow next to the dedicated 200GbE ConnectX link that lets two DGX Sparks act as one 256 GB machine. So if clustering is central to your plan, that's really a point for the Spark.
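The reason neither option helps is visible once the link speeds are put in the same units as memory bandwidth. A quick sketch, converting the nominal link rates mentioned above from gigabits to gigabytes per second and ignoring protocol overhead (so real throughput is lower still):

```python
def gbps_to_gb_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (no overhead)."""
    return gbps / 8


# Nominal link rates in Gbps, as quoted in this section.
LINKS_GBPS = {
    "USB4": 40,
    "Thunderbolt 5 (peak)": 120,
    "ConnectX-7 200GbE": 200,
}
STRIX_HALO_MEMORY_GB_S = 256  # from the comparison table

for name, gbps in LINKS_GBPS.items():
    gb_s = gbps_to_gb_s(gbps)
    slowdown = STRIX_HALO_MEMORY_GB_S / gb_s
    print(f"{name}: {gb_s:.0f} GB/s, about {slowdown:.0f}x slower than local memory")
```

Even the Spark's dedicated 200GbE link is roughly an order of magnitude slower than the local memory bus, which is why clustering adds capacity rather than speed.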

So treat these as fixed-capability machines. The only real upgrade path is the next silicon generation: Medusa Halo with LPDDR6, which is a new machine, not a card you add. Buy for what runs well today, and plan around that staying fixed.

The decision, compressed

You run dense models, serve multiple users, or train and fine-tune, and you can live with the power and noise: multi-GPU rig.
You want the most unified-memory bandwidth and capacity, polish, and silence, you're fine on macOS, and you can get the config you need: Mac Studio (M3 Ultra), but check what memory is actually orderable right now.
You need CUDA specifically, FP4, datacenter dev parity, or clean two-unit clustering: DGX Spark or a clone, and don't expect inference speed for the money.
You run single-user MoE workloads and want capacity per watt per dollar: Strix Halo, on Vulkan, with ROCm a working option too if you pass --no-mmap.
Inside Strix Halo: GMKtec EVO-X2 or Bosgame M5 for value (the GMKtec's 96 GB config is the cheapest way in), Minisforum MS-S1 Max for the most aggressive price at 128 GB, Beelink GTR9 Pro in the middle if you want the networking, Framework Desktop for openness, the AMD Ryzen AI Halo if you'd rather pay for a pre-validated stack than wire it up yourself.
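The class-level rules above can be sketched as a small function. The order of the checks is an assumption made for illustration (the list itself doesn't rank them), and the parameter names are hypothetical, so read it as a summary aid, not a verdict.

```python
def pick_class(dense_multiuser_or_training: bool,
               needs_cuda_fp4_or_clustering: bool,
               wants_macos_unified_memory: bool) -> str:
    """Map the compressed decision list to a hardware class."""
    # Assumption: this check order is illustrative, not taken from the guide.
    if dense_multiuser_or_training:
        return "multi-GPU rig"
    if needs_cuda_fp4_or_clustering:
        return "DGX Spark or a clone"
    if wants_macos_unified_memory:
        return "Mac Studio (M3 Ultra), after checking orderable memory"
    # Default case: single-user MoE workloads, capacity per watt per dollar.
    return "Strix Halo on Vulkan"


print(pick_class(False, False, False))
```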

What this is, and what I haven't tested

I'm confident about the bandwidth specs and the ranking they produce. The specs are public, and the ranking is just physics: more bandwidth means faster generation. Which class I'd point you toward is my own choice, after a lot of reading and research. I run a Strix Halo box every day, so that one is firsthand. I haven't run a Mac Studio, a DGX Spark, or a multi-GPU rig myself, so for those I'm going on their specs and on reports from people I trust. Treat the secondhand parts as less certain.

The one thing I'd most want you to take from this: on Strix Halo, the chip and the board are both basically commodities, and the real variable is the software stack and how you configure it. More than once I was sure this box had a hardware fault, and each time it turned out to be software. That cuts both ways, so I will be exact. The large-model hang is solved: it came down to memory configuration, and a single flag works around it. The image-gen fault is also software, but it's still open, so that path doesn't work yet. Either way it was never the board. So if someone tells you a particular box does or doesn't run ROCm, ask which stack and which flags they're on, because that's what actually decides it. For inference, Vulkan sidesteps the whole question and just works.

I keep this guide current as boards ship, firmware moves, ROCm gets fixed (or doesn't), and the M5 and Medusa Halo generations land. Subscribe and you'll get the updates, plus the weekly field reports behind every claim on this page. No vendor decks.