Field report · 001

What 96GB of VRAM on Unified-Memory Hardware Actually Gets You for Local LLM Inference

An honest practitioner take from a Bosgame M5 running Strix Halo at full BIOS allocation.

Author
Date
2026-05-09
Read
8 min read

When I bought a Bosgame M5 — Ryzen AI MAX+ 395, 128GB unified memory, 96GB allocated to the iGPU in BIOS — I had a specific assumption in mind: more memory means bigger models. That's the entire pitch of "AI workstation" hardware. Strix Halo, Apple Silicon, NVIDIA's DGX Spark — they all sell against the same intuition. Cloud-grade context windows on hardware you own.

A few weeks of running this thing daily, and that framing turned out to be wrong. Not slightly wrong. Wrong in a way that changes which models are even worth considering on this class of hardware.

The headline number: Qwen 3.6-35B-A3B — a Mixture-of-Experts model with roughly 3B active parameters — generates 45 to 50 tokens per second on this machine. Gemma 4 31B Dense, almost the same parameter count on paper, generates 6. That's an eight-fold gap on identical hardware running the same backend (Vulkan/RADV on Fedora Server 43). One is a real-time conversation partner. The other is a background batch worker.

This post is about why that gap exists, what actually runs well on 96GB of GPU-allocated unified memory, and what it means if you're considering this class of hardware for serious local inference. No demos, no cherry-picked synthetic-prompt benchmarks — just what runs on my machine, what I use it for daily, and where it stops being useful.

If you came here expecting "96GB is the new VRAM," sorry. The truth is more interesting than that.

Why Unified Memory ≠ VRAM

VRAM and unified memory are not interchangeable. They look the same on paper — both are "memory the GPU can use" — but they're optimized for different things, and that difference is the entire reason this hardware class exists.

Strix Halo runs LPDDR5X-8000 in a quad-channel configuration. Theoretical bandwidth: about 256 GB/s. An RTX 4090 clocks around 1 TB/s of VRAM bandwidth. RTX 5090: closer to 1.8 TB/s. That's a 4× to 7× gap, on memory subsystems that on the surface look comparable.

For LLM inference, that gap matters more than almost anything else. Token generation is memory-bandwidth-bound: each token requires reading the model's weights from memory at least once per forward pass. The faster you can pull bytes through the bus, the more tokens per second you produce. This is why a 12GB RTX 4070 will outpace a 96GB Strix Halo on any model that fits in 12GB — same compute regime, much faster bandwidth.

But — and this is the part the spec sheets bury — the bandwidth ceiling only matters for what fits. A 4070 holds 12GB. A 4090 holds 24. A 5090 holds 32. Above that, you're either spilling to system RAM (catastrophic for inference latency), paying for a workstation card, or stacking multiple GPUs. On a Strix Halo box you hold 96GB on the GPU side, in a chassis under 200W, in a desktop form factor.

So the tradeoff is clean: VRAM systems give you bandwidth on what fits. Unified memory gives you capacity on what runs. Neither is "better" — they optimize for different failure modes. The 4090 fails when your model is bigger than 24GB. Strix Halo fails when your model needs to push more bandwidth than 256 GB/s can deliver.

Which failure mode you hit depends almost entirely on architecture. That's where the real story is.

The Setup

Most of what follows is specific to Strix Halo on Fedora 43, but the principles transfer to any unified-memory inference box: BIOS allocation, kernel parameters, mmap behavior, and how you partition workloads across processes.

BIOS: 96GB to the iGPU. The Bosgame M5 ships with 128GB total. BIOS lets you allocate up to 96GB as dedicated VRAM, leaving 32GB for the system. I went with the maximum. The other common option, 64GB, caps your model size hard, and 32GB of system RAM is more than Fedora Server and llama.cpp will realistically use. The math is one-sided: more VRAM, more options. The only argument for less is heavy concurrent system workloads — which a dedicated inference box doesn't have.

OS: Fedora Server 43, headless. SSH-only access. The kernel command line carries three parameters that matter: amd_iommu=off, amdgpu.gttsize=126976, and ttm.pages_limit=32505856. Without the first two, the GPU driver either fails to map the full memory region or hits stability issues under sustained load. Without the third, TTM (the amdgpu DMA-buffer manager) caps the number of simultaneously allocatable pages, and you hit "failed to allocate buffer" errors when loading large models — even with plenty of free RAM available. Defaults will boot. They will not run a 35B model reliably overnight.

Backend: Vulkan/RADV. Mature on gfx1151, low setup friction, and performance lands where the bandwidth math predicts (see numbers below). ROCm 6.4 has official gfx1151 coverage now — so a head-to-head with HIP is finally a meaningful comparison. I haven't run that test yet; it's the next post.

llama.cpp with --no-mmap. Standard mmap behavior is incompatible with how Strix Halo's BIOS-allocated VRAM works in practice. Pages don't stay resident, and you get unpredictable performance dips under load. --no-mmap forces a full upfront load. Boot times go up; everything else gets predictable.

Three llama.cpp processes, three ports. 8080 for on-demand large models, 8081 for the always-loaded conversational hot path (Qwen 3.6-35B-A3B plus an embedding model), 8082 for the reranker. Why three separate processes instead of a single router? Different post.

What Actually Runs Well

So here's the comparison the spec sheets won't show you. Two models, comparable parameter counts, same hardware, same backend. Different worlds.

Qwen 3.6-35B-A3B (MoE, ~3B active)

Tokens per second: 55–58 on cold short prompts, settling to 45–50 t/s as conversation context grows (KV-cache reads pull effective bandwidth down). Time-to-first-token feels instant on short prompts; with 8K–16K of context loaded, it grows but stays in the range I'd call snappy. I run this as my daily driver for tool-use, code research, and anything where I'm in a conversational loop with the model. Quantization: a balanced GGUF quant in roughly the Q5–Q6 byte-per-weight range. Footprint sits comfortably inside the 96GB envelope with generous room for the KV-cache.

Subjective: this feels like running a competent cloud model locally. Not GPT-5 or Sonnet. But for the 80% of tasks I'd otherwise route to a small cloud model, it's fast enough that I stop thinking about latency.

Gemma 4 31B (Dense, Q8_0)

Tokens per second: 6. That's not a typo. Same hardware, same backend, model size on paper roughly comparable to Qwen 3.6.

Six tokens per second is, technically, faster than human reading speed. You can run a streaming chat with it. But the moment the model produces more than a paragraph — a code review, a longer synthesis, agent-style multi-step output — the wait is real. Time-to-first-token is noticeably slower than the MoE, and grows more aggressively with context length.

I don't use Gemma 4 31B for interactive work. I use it for batch jobs where I queue prompts and walk away. That's a legitimate use case. It's not what I bought 96GB of unified memory to do.

(Practitioner caveat: Gemma is at Q8_0 in my testing — the highest-quality quant short of full precision. A Q5 or Q4 Gemma would generate faster, plausibly 10–12 t/s on the same hardware. That doesn't close the gap with Qwen. It narrows it from 8× to maybe 4–5×. The architecture difference dominates.)

Why the gap is structural, not a benchmarking artifact

Memory-bandwidth-bound inference scales with active parameters per token. Strix Halo's 256 GB/s of bandwidth gets divided by however many bytes you have to read on each forward pass.

Qwen 3.6-35B-A3B activates around 3B parameters per token. At a balanced quant, that's roughly 2GB of weights to push through the bus per generated token. 256 / 2 = 128 t/s as a theoretical ceiling. Real-world overhead drags it to 45–58 t/s — about what you'd expect from a well-tuned llama.cpp setup.

Gemma 4 31B Dense activates all 31B parameters per token. At Q8_0 that's around 31GB to push per token. 256 / 31 ≈ 8 t/s ceiling. Real-world: 6. Same overhead pattern, much lower ceiling.

The math cleanly explains the observation. This is not "MoE is hyped, dense is slow because reasons" — it's a direct consequence of how memory-bandwidth-bound inference works on this class of hardware. Any architecture you bring to a 256 GB/s memory subsystem will obey the same equation.

The implication

If you're shopping for unified-memory hardware — Strix Halo, Apple Silicon, DGX Spark — and the models you care about are dense, you're going to be unhappy. A 70B dense model in a quant that fits would generate around 4 t/s on this machine. That's not interactive. That's a cron job.

If the models you care about are MoE — and a lot of the most interesting recent open-weight releases are — this hardware is genuinely transformative. You get cloud-tier interactive speed on weights you control, in a sub-200W chassis, for less than a high-end gaming GPU.

The right question before buying isn't "how much memory do I need?" It's "is the architecture I want to run going to be friendly to memory bandwidth?"

What 96GB Won't Save You From

No hardware solves every problem. Here's what 96GB of GPU-allocated unified memory specifically does not solve.

The biggest open-weight models still don't fit. GLM-5.1 (744B parameters) and the 400B-class Qwen MoE releases sit outside the envelope at any reasonable quant. MoE architecture helps with speed, not capacity — total parameters still need to live somewhere, and at Q4 those models already need well past 96GB. If running the absolute largest open-weight releases is your goal, this hardware isn't your endgame.

Time-to-first-token grows visibly with context length. Generation runs at 45–50 t/s; the prefill phase (prompt processing) by contrast runs around 130 t/s on Qwen 3.6-35B-A3B — because prefill is compute-bound, not bandwidth-bound. Weights are read once and multiplied against the entire prompt as a batched GEMM, instead of once per generated token. Per-token prefill is cheap, but at long prompts (large RAG contexts, repo-wide code search, multi-document summarization), latency × token count is real and you feel it. The constraint here is compute, not the 256 GB/s memory bus.

Multi-user serving is constrained. Single-user interactive use is the sweet spot. If you're trying to run this as a serving endpoint for a small team, concurrent requests will compete for the same bandwidth budget. You can do one or two simultaneous active sessions; you can't do ten. This is a workstation, not a server.

You will still need cloud, sometimes. When a new frontier model drops and you need to evaluate it. When a task genuinely requires Sonnet-tier or GPT-5-tier reasoning. When you need vision, audio, or multimodal capabilities the open-weight ecosystem hasn't caught up to yet. 96GB local doesn't replace a Claude or GPT-5 subscription — it complements one.

The headline I started with — "more memory means bigger models" — turns out to be a half-truth that costs you money if you act on it without nuance.

The real story: on unified-memory hardware, model architecture matters more than model size. Memory bandwidth is the binding constraint for inference speed, and architectures that activate fewer parameters per token (MoE) get an order-of-magnitude speed advantage over dense architectures of the same nominal size. 96GB is enormously useful — but only for the architectures that can use it productively.

If you run open-weight MoE models for interactive work, do most of your inference single-user, and value local sovereignty over absolute capability ceiling, this hardware is a genuine inflection point. You get cloud-tier interactive speed on weights you control, in a chassis that fits under your desk.

If you run primarily dense models, need to serve multiple concurrent users, or need access to the absolute frontier of capability, you are buying for the wrong constraint. A different machine — or a cloud subscription — will serve you better.

Coming up: Vulkan/RADV vs ROCm benchmarked on this exact hardware, with the same models, same prompts, same llama.cpp build. Community reports go both ways. I want my own numbers before I have an opinion.

Discussion

More like this — every Thursday.

Field reports from running open-weights models on the hardware you actually own. No pitches.