Field report · Benchmarks

ROCm 7.x on the Bosgame M5: 14 Configurations, 14 Failures

We promised a ROCm 7.x revisit. We got a comprehensive workaround sweep instead. Both are useful.

Author: Erik
Date: 2026-05-22
Read: 9 min read
Topics: Field Report, ROCm

A note on links: some hardware below is linked through affiliate programs, marked (affiliate). If you buy through one, I earn a small commission at no cost to you. It never changes what I recommend or what I run. These are the boxes I'd point you to either way.

Post 2 closed with a promise: when ROCm 7.x landed cleanly on Fedora, I'd re-run the bench. ROCm 7.x has landed — just not in Fedora 43's first-party repos, and not cleanly on this specific board.

I systematically tested 14 different configurations of ROCm 7.x on the Bosgame M5. All 14 reproduce ROCm Issue #6182 identically. The bug fires at the HSA scratch buffer setup phase, before any model weights are loaded. None of the workarounds change the outcome: six HSA environment variable configurations, three kernel command-line variants, two alternate stacks (TheRock-nightly llama.cpp + TheRock-built vLLM), three BIOS UMA carveouts.

This is not the ROCm 7.x revisit I planned, and it's a smaller post. The narrower scope has its own value: if you own a Bosgame M5 or a Sixunited AXB35-02 motherboard, this post tells you what every plausible workaround does today, which is nothing.

Why I expected ROCm 7.x to help

Going in, the reasoning for the revisit was clean. Post 2 showed ROCm 6.4 failing every Gemma 4 31B Dense run on this hardware. The community had moved on:

AMD's own Strix Halo optimization page explicitly recommends ROCm 7.2.1+ over 6.4.x for gfx1151.
rocWMMA 7.x enumerates gfx1151 in its build target config — the missing piece that forced our Post 2 build with GGML_HIP_ROCWMMA_FATTN=OFF.
Multiple first-hand Reddit and GitHub reports — Beelink GTR9 Pro, Framework Desktop, GMKtec EVO-X2 — show ROCm 7.x running cleanly on Strix Halo hardware with the right install path.
kmarble published a benchmark on May 16 showing ROCm 7.13 working on his Minisforum MS-S1 Max (affiliate) — same Ryzen AI MAX+ 395 chip, different motherboard — at 46 t/s for Qwen 3.6-35B-A3B empty-context decode.

That last data point — a public report from a different practitioner with similar hardware, 24 hours before this bench — was the decisive nudge. If kmarble could get 7.13 working on the same chip, the install path was solved. The question was just whether the failure I documented in Post 2 (Gemma + ROCm = 0/48) was still there.

I never got to the Gemma question. The earlier question — does ROCm 7.x run anything on this board — answered itself.

The hardware that's the problem

This board:

Component	Value
System	Bosgame M5
Motherboard	Sixunited AXB35-02
CPU	AMD Ryzen AI MAX+ 395
GPU	gfx1151 (Radeon 8060S, integrated)
VRAM (BIOS-allocated UMA)	96 GiB (default)
BIOS	American Megatrends 1.07 (release 2025-09-12)
Host OS	Fedora Server 43
Kernel	6.19.10 → 7.0.8 (auto-upgraded mid-run; both well past the 6.18.4 community floor)
linux-firmware	20260410 (past the warned-against 20251125)
Host ROCm (versionlocked, untouched throughout)	6.4.4 (Fedora-native)

Issue #6182's bug reporter is on the same board family — Bosgame M5 with the Sixunited AXB35-02 motherboard. GMKtec EVO-X2 and Beelink GTR9 Pro owners with the same Ryzen AI MAX+ 395 chip report ROCm 7.x working. So the failure is not the chip. It's the board, the BIOS, or some interaction at the firmware level we don't have visibility into.

The 14 configurations

To keep the test methodology comparable to Post 2, I used a containerized install path — kyuz0/amd-strix-halo-toolboxes:rocm-7.2.3@sha256:a8e64a9a204bc81e60ea81b8dbe00bb4b04d7685262f0e587b60110e0faad055. Host ROCm 6.4 was version-locked across 66 packages and never modified; OpenClaw (my production inference stack) ran on host-native 6.4 throughout the bench. Smoke test: load Qwen 2.5-1.5B Q4 (986 MB) and produce one token of completion. Tiny model, trivial workload — designed to test only whether the HSA scratch buffer setup completes.

It doesn't.

HSA environment variable workarounds (Phase 1d — same container, same model, different env at server launch):

#	Variant	Result
1	baseline (no env overrides)	fail
2	HSA_OVERRIDE_GFX_VERSION=11.0.0 (force gfx1100 codepath)	fail
3	HSA_OVERRIDE_GFX_VERSION=11.5.1 (Sacco's documented value)	fail
4	HSA_ENABLE_SDMA=0	fail
5	HSA_NO_SCRATCH_RECLAIM=1	fail
6	All three above combined	fail
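A minimal sketch of how the six environment variants in the table could be generated for a launch script. The function names are hypothetical, and the table does not say which HSA_OVERRIDE_GFX_VERSION value the combined run used, so that value is a parameter here (defaulting to 11.5.1 purely as an assumption).

```python
import os

def hsa_variants(combined_override="11.5.1"):
    """Return (label, env-overrides) pairs mirroring rows 1-6 of the table.

    combined_override is an assumption: the table does not state which
    HSA_OVERRIDE_GFX_VERSION value went into the combined run.
    """
    return [
        ("baseline", {}),
        ("override-gfx1100", {"HSA_OVERRIDE_GFX_VERSION": "11.0.0"}),
        ("override-gfx1151", {"HSA_OVERRIDE_GFX_VERSION": "11.5.1"}),
        ("no-sdma", {"HSA_ENABLE_SDMA": "0"}),
        ("no-scratch-reclaim", {"HSA_NO_SCRATCH_RECLAIM": "1"}),
        ("combined", {
            "HSA_OVERRIDE_GFX_VERSION": combined_override,
            "HSA_ENABLE_SDMA": "0",
            "HSA_NO_SCRATCH_RECLAIM": "1",
        }),
    ]

def launch_env(overrides, base=None):
    """Merge overrides onto a copy of the base environment (default: os.environ)."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env

# Print only the HSA_* part of each variant's environment.
for label, overrides in hsa_variants():
    env = launch_env(overrides, base={"PATH": "/usr/bin"})
    print(label, {k: v for k, v in env.items() if k.startswith("HSA_")})
```

Each resulting dictionary can be passed as the env argument of subprocess.run when starting the server, which keeps the container and model fixed while only the environment changes.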

Kernel command-line variants (Phase 1.5 — each via reboot, host OpenClaw production stable across all reboots):

#	Variant (cumulative)	Result
7	A: amdgpu.vm_fragment_size=9 amdgpu.exp_hw_support=1	fail
8	B = A + amd_iommu=on iommu=pt (toggle from baseline amd_iommu=off)	fail
9	C = B + amdgpu.runpm=0 (disable runtime power management)	fail
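Because the kernel variants are cumulative, the full argument set for each reboot can be derived from the previous one. The sketch below shows that bookkeeping only; it assumes the relevant baseline is just amd_iommu=off (the one baseline value the table mentions), and it does not show how the arguments were applied to the bootloader.

```python
# Assumption: the only baseline argument that matters here is amd_iommu=off,
# which variant B toggles to amd_iommu=on.
BASELINE = ["amd_iommu=off"]

STEPS = [
    ("A", ["amdgpu.vm_fragment_size=9", "amdgpu.exp_hw_support=1"]),
    ("B", ["amd_iommu=on", "iommu=pt"]),
    ("C", ["amdgpu.runpm=0"]),
]

def apply_args(current, new_args):
    """Add new_args, replacing any existing argument that has the same key."""
    keys = {arg.split("=", 1)[0] for arg in new_args}
    kept = [arg for arg in current if arg.split("=", 1)[0] not in keys]
    return kept + new_args

def cumulative_cmdlines(baseline=BASELINE, steps=STEPS):
    """Return {variant: full argument list}, each variant building on the last."""
    result, current = {}, list(baseline)
    for name, new_args in steps:
        current = apply_args(current, new_args)
        result[name] = list(current)
    return result

for name, args in cumulative_cmdlines().items():
    print(name, " ".join(args))
```

Replacing by key rather than appending avoids a command line that carries both amd_iommu=off and amd_iommu=on at once.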

Alternate stacks (Phase 1.6 — different ROCm minor versions and inference engines):

#	Stack	Result
10	kyuz0/amd-strix-halo-toolboxes:rocm7-nightlies (llama.cpp + ROCm 7.0 via TheRock nightly)	fail
11	kyuz0/vllm-therock-gfx1151:latest (vLLM 0.19.2rc1 + TheRock-built ROCm)	fail

BIOS UMA carveout sweep (Phase 1.7 — physical BIOS access required; OpenClaw service unavailable only during the reboot transition itself, ≈30s per reboot, fully restored after):

#	UMA carveout	OS-visible RAM	GPU "free VRAM" reported	Result
12	96 GiB (default)	31 GiB	10-26 GiB	fail
13	64 GiB	63 GiB	61 GiB	fail
14	32 GiB	93 GiB	93 GiB	fail

Across all 14: the failure fires at the same point in llama.cpp's startup, in the same function, with the same message format. Wall-clock time to failure varied between 2 and 60 seconds. The fastest failures (Phase 1.7 at UMA 32 GiB: 2 seconds) track how quickly startup reaches the memory-probe phase, not bug severity. Same bug, different paths to the same point of failure.

The failure signature

llama.cpp's startup proceeds through several phases before model weights are loaded. The smoke test reaches step three before crashing:

srv          main: loading model
srv    load_model: loading model '/work/models/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf'
common_init_result: fitting params to device memory ...
common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
Memory critical error by agent node-0 (Agent handle: 0x...) on address 0x... . Reason: Memory in use.

fitting params to device memory is the phase where llama.cpp computes KV cache footprint based on available VRAM. It's not the model weight allocation — that hasn't started yet. The HSA error fires during the memory probe. The runtime sees the GPU, reports 124 GiB total VRAM (UMA + GTT aggregate, as expected), reports anywhere from 10 to 93 GiB free depending on container state — and then the first allocation attempt that touches HSA scratch buffer setup returns the Memory in use error.

This is consistent with how Issue #6182's bug reporter describes the failure: it's pre-model, in the HSA runtime layer, agnostic to model size. Loading a 986 MB model triggers the same failure as loading a 30 GB model. The bug is in how the HSA runtime initializes its own scratch buffers on this BIOS, not in any workload-specific behavior.

The HSA handle address (0x...) and the failure address vary between runs, which rules out a single corrupted memory page or a fixed-offset bug. The agent (node-0) and the reason ("Memory in use") are identical across all 14 configurations.
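For anyone comparing their own logs, the signature described above can be checked mechanically. This is a rough sketch, not the tooling used for the bench: the regular expression is written against the message format in the excerpt above (addresses left elided exactly as shown there), and the function name is hypothetical. A match means the log is consistent with this failure, not that the root cause is confirmed.

```python
import re

FIT_MARKER = "fitting params to device memory"

# Pattern derived from the log excerpt in this post; other ROCm builds may
# format the message differently.
ERROR_RE = re.compile(
    r"Memory critical error by agent (?P<agent>\S+) "
    r"\(Agent handle: (?P<handle>\S+)\) on address (?P<address>\S+)\s*\. "
    r"Reason: (?P<reason>.+?)\.\s*$",
    re.MULTILINE,
)

def classify(log_text):
    """Report whether a startup log shows the error after the fit phase began."""
    match = ERROR_RE.search(log_text)
    if match is None:
        return {"matches_signature": False}
    fit_pos = log_text.find(FIT_MARKER)
    return {
        "matches_signature": (
            match.group("agent") == "node-0"
            and match.group("reason") == "Memory in use"
            and 0 <= fit_pos < match.start()
        ),
        "agent": match.group("agent"),
        "reason": match.group("reason"),
        "handle": match.group("handle"),    # varies between runs
        "address": match.group("address"),  # varies between runs
    }

SAMPLE = """srv          main: loading model
srv    load_model: loading model '/work/models/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf'
common_init_result: fitting params to device memory ...
Memory critical error by agent node-0 (Agent handle: 0x...) on address 0x... . Reason: Memory in use.
"""

print(classify(SAMPLE)["matches_signature"])
```

The handle and address are captured but deliberately left out of the match decision, since they differ from run to run.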

Cross-reference: why this is board-specific, not chip-specific

Three things distinguish this finding from "ROCm 7.x doesn't work on Strix Halo":

Issue #6182's reporter is on the same Sixunited motherboard family and exhausted 8 kernel cmdline variants plus 5 HSA env vars — a 13-config sweep on a separate physical Bosgame M5 — with identical results to mine. Two independent reports, same board family, same bug, zero workarounds working.
The #6182 thread explicitly notes that GMKtec EVO-X2 owners running the same Ryzen AI MAX+ 395 chip do not trigger the bug. That's a separate motherboard with different BIOS firmware.
Community reports from Framework Desktop and Beelink GTR9 Pro owners show ROCm 7.x working cleanly on those boards. kmarble's blog post explicitly states ROCm 7.13 works on his Strix Halo — different board, same chip, working.

So the bug is not a general ROCm 7.x failure, and it is not in the gfx1151 silicon. It is triggered by something specific to the Sixunited AXB35-02 motherboard's firmware path, likely how its BIOS handles UMA carveout, GTT memory, or HSA queue creation. AMD's HSA runtime, which works on every other Strix Halo board I can find evidence of, fails to initialize scratch buffers on this one.

This is the narrowest and most specific failure I can imagine documenting. It's also an expensive kind of failure to discover, because the bug surfaces only after install, only on certain boards, and only at runtime.

What this means in practice

If you own a Bosgame M5 with the Sixunited AXB35-02 motherboard, ROCm 7.x is currently a dead end. Stay on ROCm 6.4, which works for MoE workloads as Post 2 documented, though not for Gemma 4 31B Dense. Or use Vulkan/RADV, the better choice on this hardware: in Post 2's measurements it is 25–32% faster than ROCm 6.4 on MoE generation throughput and runs every model class, including Gemma 4 31B Dense, at modest but real throughput. The "ROCm is faster" intuition that drives people toward it doesn't hold for generation throughput on gfx1151.

If you're considering buying a Bosgame M5 specifically for local LLM inference: this is a real risk worth knowing about before you wire €2300. The hardware ships with 128 GB of unified memory and that's still real — but the assumption that you can pick whichever AMD inference stack you want is currently wrong on this board.

For affected users, two things to track:

A Bosgame BIOS 1.08 or later, if and when it ships. The current 1.07 (release date 2025-09-12) is what's failing. BIOS updates have fixed analogous HSA-runtime bugs on other AMD platforms.
A central fix from AMD in a future ROCm release (7.3 or later, currently unreleased). The HSA scratch buffer setup path is in librocm-runtime.so; if AMD adjusts how it handles certain BIOS-reported memory descriptors, the bug may disappear without any BIOS change.

If you're on a Bosgame M5 today and want to check whether your specific unit is affected, the smoke test is short: pull the kyuz0 container image, point it at any small GGUF, and run llama-server. If it dies during startup (2 to 60 seconds in my runs) with Memory critical error by agent node-0, you have it. The bug was deterministic in every run here: there is no "sometimes" with #6182 on this board.
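A hedged sketch of that check as a script. The image tag and the /work/models path come from this post; everything else is an assumption for illustration: podman as the runtime, the /dev/kfd and /dev/dri device pass-through, the read-only volume mount, llama-server being on the image's PATH, and the helper names. On Fedora the mount may also need an SELinux relabel option.

```python
import shlex
import subprocess

IMAGE = "kyuz0/amd-strix-halo-toolboxes:rocm-7.2.3"  # tag named in this post
SIGNATURE = "Memory critical error by agent node-0"

def smoke_command(model_dir, model_file, runtime="podman"):
    """Build the container command line (assumed flags, see lead-in)."""
    return [
        runtime, "run", "--rm",
        "--device", "/dev/kfd", "--device", "/dev/dri",
        "-v", f"{model_dir}:/work/models:ro",
        IMAGE,
        "llama-server", "-m", f"/work/models/{model_file}",
    ]

def run_smoke(cmd, timeout_s=120):
    """Return 'affected', 'exited-other', or 'still-running'.

    'still-running' only means the server outlived the timeout; it is not
    proof of working inference. A container left running after the timeout
    may need to be stopped by hand.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "still-running"
    output = proc.stdout + proc.stderr
    return "affected" if SIGNATURE in output else "exited-other"

# Show the command without executing it.
cmd = smoke_command("/srv/models", "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf")
print(shlex.join(cmd))
```

The 120-second timeout is chosen to sit comfortably above the 2 to 60 second failure window seen in the sweep; calling run_smoke(cmd) performs the actual test.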

What I didn't test

This bench scope deliberately excluded several paths I considered:

ROCm 7.13 from kmarble's specific TheRock build path — the kyuz0/amd-strix-halo-toolboxes:rocm7-nightlies tag turns out to ship ROCm 7.0 (verified at runtime), not 7.13. Building TheRock from source for 7.13 specifically is a 6-12 hour effort I deferred. Given that 7.0 and 7.2.3 fail identically across the 14-configuration sweep, an inductive inference from that pattern is that 7.13 would fail the same way on this board — but that's an inference from a strong signal, not a 15th measurement.
Bare-metal install from AMD's RHEL 9 repo — would require uninstalling Fedora-native ROCm 6.4, which would break OpenClaw production. The dnf remove + --allowerasing path that AMD's official install docs require was deliberately not executed. Containerized testing was sufficient to surface the bug; bare-metal would have produced the same failure with substantially higher risk.
The 128 GiB BIOS UMA carveout: the BIOS exposes 32, 64, 96, and 128 GiB carveouts. I tested three; the failure at all three suggests 128 wouldn't change behavior, but I didn't verify.
Older BIOS versions — the Bosgame M5 ships with various BIOS revisions depending on the production batch. I have 1.07; some units have 1.06. I didn't roll back.
Different kernel versions — Fedora 43 auto-upgraded my kernel mid-bench from 6.19.10 to 7.0.8. Both are well past the community-recommended 6.18.4 floor. I didn't test older kernels because the community signal says they're worse, not better.

Each of these is a credible follow-up. None of them changes the core verdict for what's reasonable to try on the Bosgame M5 today.

Verdict

For the Bosgame M5 / Sixunited AXB35-02 in mid-May 2026: don't try ROCm 7.x yet. No tested configuration produces working inference. The bug is reproducible, the workaround surface is exhausted, and the fix is upstream — either at AMD's HSA runtime layer or in Bosgame's BIOS.

Vulkan/RADV is the right primary path on this hardware: per Post 2 it's faster than ROCm 6.4 for MoE generation, and it's the only stack that runs Gemma 4 31B Dense at all. ROCm 6.4 (Fedora-native, my OpenClaw production stack) stays as a fallback for workloads where prompt-evaluation dominates — its one consistent advantage in the Post 2 numbers. The 96 GB of unified VRAM still buys you the ability to load models that simply don't fit on consumer dGPUs. None of that depends on ROCm 7.x.

Most importantly: nothing I tested risked the host system or the production stack. The versionlock held across 66 ROCm packages and 6 reboots. The container path leaves no state behind. The methodology document and all 14 smoke logs are public — if you want to verify, replicate, or refute on similar hardware, the raw material is at github.com/thefrontierlab/post3-bench.

If you can reproduce this finding on a different board — particularly a non-Sixunited Strix Halo board — I'd be very interested to hear about it. If you've found a workaround I didn't try, even more interested.

Coming up: Multi-Token Prediction (MTP) merged into llama.cpp main on May 16. I deployed it to my production stack the same day, which immediately surfaced something useful: the defaults are a trap. Post 4 is the sweep that found the sweet spots, the mechanism that explains them, and the reproduction of kmarble's full-context-decode question (Vulkan barely slows, ROCm collapses 64%) — on the backend that covers every model class on this board anyway.