Post 1 closed with a promise: I'd benchmark ROCm against Vulkan/RADV on this exact hardware, with real numbers, before I took a position. This is that benchmark. 128 runs across two backends, two models, and two llama.cpp versions over several evenings. The data is in.
The headline result is the one the community asked for: on Qwen 3.6-35B-A3B MoE, Vulkan/RADV is faster than ROCm 6.4 by 25–32 % on generation throughput. That's the conventional question, and the answer is somewhat conventional. The community hypothesis going in was "ROCm wins prefill, Vulkan wins generation". The data says Vulkan wins both, though the prefill margin is smaller and direction-dependent.
The more interesting finding is what doesn't run at all. On Gemma 4 31B Dense — same hardware, same llama.cpp commit, same backend install — ROCm fails every single one of 48 runs across two versions, in three distinct failure modes. AMD's first official gfx1151-supported release ships a backend that cannot complete a single inference on this model at production context size.
That's not the post I expected to write. It is, I think, the more useful one.
The Setup
Same hardware as Post 1: Bosgame M5, Ryzen AI MAX+ 395, gfx1151, 96 GB BIOS-allocated VRAM, Fedora Server 43. Same kernel parameters. Same daily-driver llama.cpp build pinned to b9016 (846262d78, 4 May 2026) for fairness with Post 1's published Vulkan numbers.
For this bench I added a parallel ROCm-enabled build at the same source commit, installed alongside the Vulkan build:
/opt/llamacpp/vulkan/bin/llama-server
/opt/llamacpp/rocm/bin/llama-server
ROCm 6.4.4 went in via the standard Fedora 43 stack (dnf install rocm-hip-runtime rocm-llvm hip-runtime-amd). One install-time wrinkle worth flagging: rocWMMA 6.4.0, the version Fedora ships today, doesn't enumerate gfx1151 in its config.hpp, so building llama.cpp with GGML_HIP_ROCWMMA_FATTN=ON fails with static_assert(0, "Unsupported architecture"). I built the ROCm backend with WMMA Flash Attention off. The fix exists in rocWMMA 7.x; it hasn't trickled down to Fedora yet. Note it for later.
The matrix: two backends (Vulkan/RADV, ROCm/HIP) × two models (Qwen 3.6-35B-A3B MoE in UD-Q5_K_XL, Gemma 4 31B Dense in Q8_0) × four prompt lengths (31 / 461 / 3799 / 11205 tokens). For ROCm I also tested two K-cache quantizations (f16 default, q8_0) since Vulkan hangs deterministically on Unsloth Dynamic quants with cache-type-k=q8_0 and I wanted to know whether that hang is backend-specific. Five runs per cell, run 1 dropped as warmup, statistics on runs 2–5. Streaming TTFT measured client-side. VRAM peaks captured via 0.5-second sysfs polling. Thermal cooldown to edge < 60 °C between runs.
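A minimal sketch of the per-cell aggregation described above, for anyone reproducing the numbers. It is not the bench harness itself: the function name `summarize_cell` is made up for illustration, the sample values are invented, and the use of sample standard deviation (n−1) is an assumption, since the methodology text doesn't say which variant was used.

```python
from statistics import mean, stdev

def summarize_cell(runs_tps, warmup=1):
    """Drop the first `warmup` runs, then return (mean, sample stddev)."""
    kept = runs_tps[warmup:]
    if len(kept) < 2:
        raise ValueError("need at least two measured runs after warmup")
    return mean(kept), stdev(kept)

# Illustrative numbers only, NOT measurements from this bench.
example_runs = [48.0, 55.0, 56.0, 55.0, 56.0]
m, s = summarize_cell(example_runs)
print(f"{m:.1f} ± {s:.1f} t/s over {len(example_runs) - 1} runs")
```

With only four samples per cell the standard deviation is a rough indicator at best, which is worth keeping in mind when reading the ± columns below.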
When the ROCm × Gemma cells failed in unexpected ways, I additionally pulled llama.cpp to master (dbe7901ca, 14 May 2026) and re-ran the failing cells to rule out a stale-build issue. More on that below.
Qwen 3.6-35B-A3B (MoE) — the horse race
Both backends complete all runs. Generation throughput, mean ± stddev across the four reported runs per cell:
| Prompt length | Tokens | Vulkan | ROCm f16 | ROCm q8_0 |
|---|---|---|---|---|
| short | 31 | 55.6 ± 0.1 | 42.1 ± 4.3 | 43.1 ± 0.3 |
| medium | 461 | 55.5 ± 0.0 | 43.7 ± 0.4 | 42.2 ± 0.2 |
| long | 3799 | 54.1 ± 0.0 | 42.9 ± 0.4 | 35.6 ± 0.6 |
| very-long | 11205 | 52.0 ± 0.0 | 40.6 ± 1.2 | 28.2 ± 0.2 |
Numbers are tokens per second, generation phase only.
Prompt processing throughput:
| Prompt length | Vulkan | ROCm f16 | ROCm q8_0 |
|---|---|---|---|
| short | 239 ± 1 | 247 ± 2 | 254 ± 10 |
| medium | 870 ± 1 | 694 ± 187 | 757 ± 12 |
| long | 993 ± 1 | 921 ± 23 | 587 ± 8 |
| very-long | 955 ± 1 | 901 ± 7 | 351 ± 4 |
Three things from this data:
Vulkan wins generation by 25–32 % across all prompt sizes, consistently. The deltas are large enough to feel — a 50-token reply takes about 1 second on Vulkan and 1.2–1.3 seconds on ROCm. At interactive scale these compound; over a coding session you notice.
Vulkan also wins prompt processing once context grows past trivial — by 6–8 % at long and very-long contexts, much narrower than the generation lead. ROCm edges Vulkan at the 31-token short prompt (247 vs 239 t/s — within the bounds of per-batch overhead noise) but Vulkan pulls ahead at every meaningful context size. The "ROCm wins prefill" intuition that floats around r/LocalLLaMA does not hold for gfx1151 on this codepath. I don't have a clean explanation; this is what the bench showed.
Quantizing the K-cache to q8_0 on ROCm is mostly a regression on this hardware, and the regression scales with context length. Short-prompt generation is unchanged (the cache is small enough that storage format barely matters), but as context grows the cost compounds: at the 4K long prompt q8_0 costs 17 % generation throughput and 36 % prompt-processing throughput relative to f16 cache on the same backend; at the 11K very-long prompt the gap widens to 30 % gen / 61 % prompt. That's a meaningful penalty for what's normally sold as a memory-saving optimization. On this hardware, the K-cache-quantization decompression cost outweighs the bandwidth savings, at least for this model, and the cost grows with how much cache you're actually reading. On Vulkan I couldn't test the q8_0 cache directly — it hangs deterministically on Unsloth Dynamic quants, a separate documented bug. So neither backend handles cache quantization gracefully here, just in different ways.
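The percentages in the three points above can be rechecked from the generation table with a few lines of Python. This is a sketch using the rounded table values, so results may differ by a point or so from figures computed on the raw data; the helper names `lead_pct` and `loss_pct` are illustrative, not part of any tool.

```python
# Generation throughput (t/s), copied from the table above.
gen = {
    "short":     {"vulkan": 55.6, "rocm_f16": 42.1, "rocm_q8": 43.1},
    "medium":    {"vulkan": 55.5, "rocm_f16": 43.7, "rocm_q8": 42.2},
    "long":      {"vulkan": 54.1, "rocm_f16": 42.9, "rocm_q8": 35.6},
    "very-long": {"vulkan": 52.0, "rocm_f16": 40.6, "rocm_q8": 28.2},
}

def lead_pct(fast, slow):
    """How much faster `fast` is than `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0

def loss_pct(baseline, degraded):
    """How much throughput `degraded` gives up relative to `baseline`."""
    return (1.0 - degraded / baseline) * 100.0

for name, row in gen.items():
    lead = lead_pct(row["vulkan"], row["rocm_f16"])
    penalty = loss_pct(row["rocm_f16"], row["rocm_q8"])
    print(f"{name:>9}: Vulkan lead {lead:.1f} %, q8_0 K-cache penalty {penalty:.1f} %")
```

A negative penalty simply means the q8_0 cell came out marginally ahead of f16, which at the short prompt is inside the f16 cell's own run-to-run spread. The same two helpers apply unchanged to the prompt-processing table.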
If the post stopped at this section, the verdict would be: Vulkan/RADV is the faster backend, modestly. Pick by other factors (driver maturity, debugging tools, ecosystem) if MoE is your only workload. It doesn't stop here.
Gemma 4 31B (Dense, Q8_0) — where ROCm breaks
Same hardware. Same llama.cpp source commit. Same model file (gemma-4-31B-it-Q8_0.gguf, 33 GB). Different model architecture: Dense, not MoE.
Vulkan: 6.0–6.4 tokens per second generation across all four prompt lengths. All 20 runs complete without error. The number itself is roughly what bandwidth math predicts — 256 GB/s ÷ ~33 GB per forward pass = ~8 t/s ceiling, real-world overhead drags it to 6. Slow, but inference-shaped slow. You can use it for batch jobs.
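The bandwidth estimate above is simple enough to write down. A rough sketch, assuming decode is purely memory-bound and every generated token streams the full set of weights once; it ignores KV-cache reads and any compute overhead, and the function name is illustrative.

```python
def decode_ceiling_tps(bandwidth_gb_s, weights_gb):
    """Upper bound on tokens/s when each token must read all weights once."""
    return bandwidth_gb_s / weights_gb

ceiling = decode_ceiling_tps(256.0, 33.0)   # figures from the paragraph above
measured_low, measured_high = 6.0, 6.4      # Vulkan range reported above

print(f"ceiling ~{ceiling:.1f} t/s")
print(f"measured: {measured_low / ceiling:.1%} to {measured_high / ceiling:.1%} of ceiling")
```

The same arithmetic explains why the MoE model is so much faster on identical hardware: only the active experts' weights are read per token, so the effective `weights_gb` is far smaller.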
ROCm: zero successful runs. Zero out of 40 on b9016, zero out of 8 on master. Three distinct failure modes, all reproducible:
| Configuration | Failure mode | Evidence |
|---|---|---|
| b9016, ctx=65536, default | ROCm error: out of memory in hipGraphInstantiate at ggml-cuda.cu:4423 | 40/40 deterministic crashes |
| master (dbe7901ca), ctx=65536, default | same hipGraphInstantiate OOM, same source line | 8/8 deterministic crashes |
| any version, ctx=4096, default or workaround | Coherent loading, garbage output at 0.5 t/s | 3 smoke-test confirmations |
A few things are worth pulling out of that table.
The crash is not a stale-software story. I pulled llama.cpp to master ten days after my pinned b9016 commit specifically to rule out the obvious objection. 130 commits later, the failing line of code — hipGraphInstantiate at ggml-cuda.cu:4423 — crashes identically. Between b9016 and master there's a fix for a different MoE-Gemma issue (#21416, closed 8 April 2026 via PR #21566). That fix doesn't touch the Dense-Gemma case on gfx1151. The bug is unchanged.
The OOM message is misleading. There's 90 GB of VRAM free at the point of the crash. The model loads. The crash happens at the first decode step, during graph instantiation. This isn't classical OOM: HIP's compute-graph capture path is hitting some limit that the message names "out of memory" but that doesn't appear to be about memory pressure. It isn't a blanket gfx1151 failure (Qwen MoE on the same hardware never hits this path); it's tied to whatever Gemma 4's compute graph looks like to the HIP capture layer (SWA + interleaved attention is the leading hypothesis, but I haven't isolated it past that).
The smaller-context behavior is informative. At ctx-size 4096 instead of 65536, ROCm's failure mode changes from crash to wrong-output. The graph capture doesn't blow up at the smaller allocation, but the actual inference produces broken results at 0.5 tokens per second. Which leads to the next finding.
Forensics: where exactly is broken
When ROCm produces "garbage" Gemma output, it isn't random bytes. It's decoded tokens. Two distinct degenerate-loop patterns from the smoke tests:
master default, ctx=4096, short prompt:
"\n {//K//K//K//K//K//K//K//K//K"
Tokenizer decomposition: token 107 ('\n'), token 34130 (' {//'), then a two-token alternating loop of token 236855 ('K') and token 715 ('//').
master + GGML_CUDA_DISABLE_GRAPHS=1, ctx=4096, short prompt:
"\n\n {//////////////////////////////////////////////////////"
Tokenizer decomposition: token 108 ('\n\n'), token 642 (' {'), then a one-token loop on token 10767 ('////////////////', sixteen slashes in a single BPE token).
These are not corrupted memory or bit-flips. They are valid token IDs, drawn from the model's vocabulary by the sampler, written out by the tokenizer in the correct format. The pipeline is working — except the probability distribution being sampled from has collapsed onto one or two tokens. That's the signature of an attention-math failure: somewhere in the compute, the logits going into the softmax have become numerically degenerate (NaN-poisoned, infinite, or all collapsed onto an identical maximum). Sampler picks the same one or two tokens forever.
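To make the collapse argument concrete, here is a toy illustration in plain Python. It is not llama.cpp's sampler and the logits are invented; it only shows that once a single logit blows up, a perfectly healthy softmax-and-sample step returns the same token index every time.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, n, seed=0):
    """Draw n token indices from softmax(logits)."""
    rng = random.Random(seed)
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=n)

healthy = [2.0, 1.5, 1.8, 0.5]        # toy logits, spread over several tokens
collapsed = [2.0, 1.5, 1.0e4, 0.5]    # same logits, but one has blown up

print(sample(healthy, 12))
print(sample(collapsed, 12))          # every draw is index 2
```

The NaN-poisoned case isn't shown because its outcome depends on how a given sampler treats NaN, but the visible symptom is the same kind of stuck output.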
That the patterns differ between the default path and the workaround path is also useful diagnostic information. Different HIP code paths trigger somewhat different numerical failures. The bug isn't a single location — it's a class of failures across the attention compute that this hardware and ROCm version produces for this model architecture.
So we can locate the broken component fairly precisely: it's in the GPU compute for attention on Gemma 4's specific architecture (SWA + interleaved layers), under HIP on gfx1151. Tokenizer, sampler, weights, model file, and the rest of the pipeline are all fine. They would have to be — Vulkan/RADV runs the same model file successfully at 6 t/s on the same hardware.
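For anyone scripting smoke tests like the ones above, degenerate loops of this kind can be flagged mechanically from the token IDs. A minimal sketch: the function name, the thresholds, and the "healthy" sequence are made up for illustration, while the looping IDs are the ones from the two decompositions above.

```python
def tail_loop_period(token_ids, max_period=4, min_repeats=4):
    """Return the shortest period p such that the sequence ends in at least
    `min_repeats` consecutive copies of one p-token pattern, else None."""
    n = len(token_ids)
    for p in range(1, max_period + 1):
        need = p * min_repeats
        if n < need:
            break  # longer periods need even more tokens
        tail = token_ids[n - need:]
        if all(tail[i] == tail[i % p] for i in range(need)):
            return p
    return None

# Token IDs from the two decompositions above, loop repeated a few times.
default_path = [107, 34130] + [236855, 715] * 5
no_graphs_path = [108, 642] + [10767] * 6
healthy = [107, 642, 715, 236855, 34130, 108, 10767, 9]  # made-up, non-repeating

print(tail_loop_period(default_path))    # two-token alternation
print(tail_loop_period(no_graphs_path))  # single-token loop
print(tail_loop_period(healthy))         # no loop detected
```

A check like this catches the wrong-output failure mode without a human reading the completions, which matters because the server exits cleanly in that case.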
The workaround that isn't
The community-known workaround for HIP-graph problems is GGML_CUDA_DISABLE_GRAPHS=1, which falls back to direct kernel launches instead of graph capture. I tried it across both builds and both context sizes; cells I didn't run are marked as such:
| | default | GGML_CUDA_DISABLE_GRAPHS=1 |
|---|---|---|
| b9016 (4 May), ctx=4096 | not tested | garbage @ 0.5 t/s |
| b9016 (4 May), ctx=65536 | crash | not tested |
| master (14 May), ctx=4096 | garbage @ 0.5 t/s | garbage @ 0.5 t/s |
| master (14 May), ctx=65536 | crash | not tested |
The env-var doesn't fix anything. At small context it prevents the graph instantiation but the inference is still numerically wrong. At production context size I didn't test the workaround directly — based on the failure pattern across every other configuration the workaround almost certainly hits the same crash, but that's an inference rather than a measurement. What's measured: no tested configuration produces correct Gemma 4 31B output on ROCm 6.4.4 + gfx1151.
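For anyone who wants to fill in the untested cells, the grid can be driven from a small script. A hedged sketch: the server path is the ROCm build from the setup section, the model location is hypothetical, and only the `-m` and `--ctx-size` options of llama-server are used. Nothing is launched here; the function just returns the command line and environment for one cell.

```python
import os

ROCM_SERVER = "/opt/llamacpp/rocm/bin/llama-server"   # path from the setup section
MODEL = "/models/gemma-4-31B-it-Q8_0.gguf"            # hypothetical location

def build_launch(ctx_size, disable_graphs):
    """Return (argv, env) for one cell of the grid; nothing is executed."""
    argv = [ROCM_SERVER, "-m", MODEL, "--ctx-size", str(ctx_size)]
    env = dict(os.environ)
    if disable_graphs:
        env["GGML_CUDA_DISABLE_GRAPHS"] = "1"
    else:
        # Make sure a value inherited from the shell doesn't leak in.
        env.pop("GGML_CUDA_DISABLE_GRAPHS", None)
    return argv, env

for ctx in (4096, 65536):
    for disable in (False, True):
        argv, env = build_launch(ctx, disable)
        flag = env.get("GGML_CUDA_DISABLE_GRAPHS", "unset")
        print(f"GGML_CUDA_DISABLE_GRAPHS={flag}", " ".join(argv))
```

Passing the pair to `subprocess.Popen(argv, env=env)` would start the server for that cell. Explicitly removing the variable in the default case avoids accidentally measuring the workaround twice.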
What this means in practice
For MoE workloads on gfx1151 today, both backends work, and you can pick by other factors. Vulkan is ~25 % faster on generation; ROCm has a more mature tooling ecosystem if you need to debug at the kernel level, but for inference-only consumption that mostly doesn't matter. Run Vulkan if speed is the primary axis; run ROCm only if you have a specific reason to.
For Dense models on gfx1151 today, the decision is made for you. Until ROCm patches the Gemma case (and presumably the class of similar architectures it represents — anything using SWA + interleaved attention is suspect), Vulkan is the only working path. Six tokens per second isn't fast — but six tokens per second is infinitely more than the zero ROCm produces.
One adjacent finding worth documenting: ROCm's VRAM accounting is missing entirely from the standard sysfs path. Both Qwen MoE configurations on ROCm reported ~3.86 GB of VRAM used (constant across all prompt sizes) when reading mem_info_vram_used, while Vulkan correctly reported ~30 GB for the loaded model. The bytes are allocated — rocm-smi --showmeminfo vram shows the real number — but they're not visible through the DRM buffer accounting that almost every memory-monitoring tool reads. If you're trying to track VRAM usage with radeontop, nvtop, or a custom Prometheus exporter, ROCm-allocated memory will look invisible. Worth a heads-up to anyone wiring up local observability.
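A minimal sketch of the kind of sysfs polling this affects. It assumes the GPU is exposed as card0 (the index can differ per machine) and reads amdgpu's `mem_info_vram_used` counter, which is reported in bytes; the function names are illustrative and this is not the bench's own poller.

```python
import time
from pathlib import Path

# Assumption: the GPU is card0; adjust the index for other machines.
VRAM_USED = Path("/sys/class/drm/card0/device/mem_info_vram_used")

def read_vram_used_gib(path=VRAM_USED):
    """Read the VRAM-used counter (bytes) and convert to GiB."""
    return int(Path(path).read_text().strip()) / 2**30

def poll_peak_gib(duration_s, interval_s=0.5, path=VRAM_USED):
    """Sample the counter every `interval_s` seconds; return the peak seen."""
    peak = 0.0
    deadline = time.monotonic() + duration_s
    while True:
        peak = max(peak, read_vram_used_gib(path))
        if time.monotonic() >= deadline:
            return peak
        time.sleep(interval_s)

if __name__ == "__main__":
    if VRAM_USED.exists():
        print(f"peak over 5 s: {poll_peak_gib(5.0):.2f} GiB")
    else:
        print(f"{VRAM_USED} not found; is this an amdgpu system?")
```

On the ROCm build, a poller like this would report the small constant figure described above rather than the real allocation, so cross-checking against rocm-smi is the safer habit.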
What I didn't test
This bench scope deliberately excluded several knobs:
- ROCm 7.x has explicit gfx1151 enumeration in rocWMMA, which would unlock hardware-accelerated WMMA Flash Attention. It's plausibly the fix for the Gemma case. It's not in Fedora 43's first-party repos — requires either a pinned RHEL 9 repo, a COPR, or a container. That's a different setup story, and a different post. I'll revisit it.
- AMDVLK as an alternative Vulkan ICD. Community reports it can win on prompt-processing for MoE. Possibly worth a follow-up.
- GGML_HIP_NO_VMM=ON at build time. Some community write-ups suggest HIP VMM is broken on gfx1151 and this flag fixes a different class of issues. AMD's own docs don't recommend it. Could plausibly affect the Gemma crash. Out of scope for this run.
- Vulkan with cache-type-k=q8_0 on Gemma. Gemma uses SWA, which has a separate documented incompatibility with cache reuse. Wasn't part of the fair-comparison scope here.
Each of those is a credible follow-up. None of them changes the core verdict for what's installable on Fedora 43 today.
Verdict
For Strix Halo as of mid-May 2026: run Vulkan/RADV.
ROCm 6.4.4 — the first release AMD claims as officially supporting gfx1151 — is a partially-shipped product on this hardware. It runs MoE workloads, more slowly than Vulkan but usefully. It does not run Gemma 4 31B Dense in any tested configuration. The bug reproduces on master ten days later. The advertised workaround doesn't restore correct output.
The 96 GB of VRAM that motivated buying this hardware in the first place doesn't help when the compute backend can't run the model. Until ROCm patches this — and I'll know when it does, because I'll re-run the bench when ROCm 7.x lands cleanly on Fedora — Vulkan is the only credible path.
The raw data, the methodology document, and the anomaly log are at github.com/thefrontierlab/post2-bench. If you can reproduce or refute this on similar hardware, I'd be interested to hear about it — corrections land in my inbox if I got something wrong.
Coming up: ROCm 7.x revisited. Different repo path, different rocWMMA, different llama.cpp build options. Same hardware, same bench harness. I want to know whether the Gemma case stays broken or just got reshuffled.