Logging · 001

Engineering's view on local frontier AI — what actually runs on hardware you control.

A weekly engineering journal benchmarking open-weights models on consumer silicon. Real harnesses, real failures, no vendor decks.

./bench/this-week.md

Qwen 3.6-35B-A3B · MoE ● PASS

tok/s (gen): 47.5

VRAM peak: 94.0 GB

TTFT: 0.18 s

Bosgame M5 Strix Halo · 96GB iGPU · Vulkan/RADV · llama.cpp

Recent dispatches

1. A Reader Fixed Quantized KV Cache on Strix Halo. I Verified It at 262k Context, and His Theory About My Dense Cliff Survived Half Its Test
   A reader's fork makes quantized KV cache faster than f16 on Strix Halo. Verified on a 128GB box up to the full 262k native context. His cache-spill theory for the dense cliff survived half its test, and the failed half is the interesting part.
   6 min read · Jul 30

2. Turning Off the IOMMU Made My Dense Model 37% Faster. Five More Strix Halo Tips, Measured, and One Crashed the GPU
   Two guides said turn the IOMMU off. I was the outlier running it on. Measured: +34 to +38% dense prompt processing. Plus five more tips, mostly busted, and one that crashes the GPU.
   5 min read · Jul 23

3. I Measured the Strix Halo Tuning Tips. One Flag Matters More Than All of Them
   Same box, same build, one flag: +41% for the MoE model, -51% for the dense one. The Strix Halo tuning tips, measured and ranked, including the ones that failed.
   8 min read · Jul 16

4. One llama.cpp Update Made My MoE Model 38% Faster and My Dense Model 45% Slower
   Same box, same models, same flags, three builds in three weeks. What moved, what broke, and the routine that keeps my numbers honest.
   6 min read · Jul 09

5. The Strix Halo Reference Setup Pack
   A proven production setup you can copy: BIOS values, kernel params, working systemd units, and real benchmark numbers.
   12 min read · Jul 07

6. Buy or Wait: Reading the Local LLM Hardware Question in a Memory Crunch
   The instinct is to wait for the next box. In a memory crunch, that's backwards. How to read the buy-or-wait question for Strix Halo, DGX Spark, and Mac Studio.
   6 min read · Jul 02

7. What to Buy for Local LLM Inference: Strix Halo, Mac Studio, DGX Spark, or a GPU Rig
   A buyer's guide to local LLM hardware, ranked by the one spec that actually decides generation speed: memory bandwidth. Plus where ROCm really stands on Strix Halo.
   13 min read · Jun 25

8. AMD Is Selling "First-Class ROCm" on Strix Halo. I've Run the Same Chip for Six Months.
   On June 8, AMD opened pre-orders for a $3,999 box built around the exact chip I've run in production since the start of the year, marketed on full ROCm support, with one of its own demos running on the exact model my board can't load under ROCm.
   8 min read · Jun 18

9. A BIOS Update Won't Fix #6182 — I Tried the Newest One
   The Bosgame M5's ROCm bug is board-specific, not chip-specific — so firmware is the obvious lever. I flashed Bosgame's newest official BIOS hoping to dodge it. It didn't work, and the negative narrows where the fault actually lives.
   4 min read · Jun 11

10. Full Context on a Vulkan-Only Strix Halo: The Decode-Drop Reproduces, but the Sweet Spot Moves
    kmarble showed ROCm decode collapses 64% at full context on Strix Halo, and ROCm+MTP cures it. My board can't run ROCm. The Vulkan half reproduces the drop — but the MTP sweet spot from last week walks left at depth: by 76k, drafting too deep is slower than no speculation at all.
    12 min read · Jun 04

11. MTP Defaults Are a Trap: What 260 Runs Showed About Speculative Decoding on Qwen3.6
    Until May 19, the llama.cpp speculative-decoding default was 16. On Qwen3.6's single MTP head, that default cost up to 75% of generation throughput. Here's where the real sweet spots are — and why they're architecture-specific.
    11 min read · May 28

12. ROCm 7.x on the Bosgame M5: 14 Configurations, 14 Failures
    We promised a ROCm 7.x revisit. We got a comprehensive workaround sweep instead. Both are useful.
    9 min read · May 22

13. Vulkan/RADV vs ROCm 6.4 on Strix Halo: What 128 Benchmark Runs Actually Showed
    The headline isn't where Vulkan wins. It's where ROCm doesn't run at all.
    9 min read · May 14

14. What 96GB of VRAM on Unified-Memory Hardware Actually Gets You for Local LLM Inference
    An honest practitioner take from a Bosgame M5 running Strix Halo at full BIOS allocation.
    8 min read · May 09