  • Published 1 month ago by Donato Capitella

Building a Two-Node AMD Strix Halo Cluster for LLMs with llama.cpp RPC (MiniMax-M2 & GLM 4.6)

In this video, I build a small Strix Halo cluster by linking the Framework Desktop and the HP Z2 Mini workstation. Both systems use AMD Ryzen AI Max “Strix Halo” with unified memory, giving a total of 256 GB available for large-scale models. Using RPC (Remote Procedure Call), the two machines work together as a single logical node for inference. I show the full setup and benchmarks for MiniMax-M2 (Q6_K_XL Unsloth dynamic quant) and GLM 4.6 (Q4_K_XL), running at 17 and 7–8 tokens per second respectively. All configurations use my latest ROCm 7 Toolbox container with RPC pre-compiled. The video also covers network configuration, performance testing with llama-bench, and scalability limits when adding more Strix Halo nodes. Timestamps: 00:00 – Intro 01:48 – Network Setup 04:04 – RPC Setup 06:14 – Running MiniMax-M2 (Q6_K_XL) 16:56 – Running GLM 4.6 (Q4_K_XL) 22:37 – Llama-Bench Results 24:28 – Cluster with 4 Strix Halos? Links & Resources: Strix Halo Toolbox (ROCm 7 + RPC): Framework Desktop (Strix Halo): Strix Halo Homelab guide and Discord (by deseven): MiniMax-M2 model: GLM 4.6 model: RPC documentation: