Getting 20 t/s on dual Sparks using vLLM in tensor parallel mode over InfiniBand with RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4.
The same workflow over Ethernet was giving me 16 t/s.
Same physical port and cable.
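For anyone wanting to reproduce this, here is a minimal two-node launch sketch. The model name is from above; the head-node address is a placeholder, and the NCCL settings are assumptions you should check against your own fabric:

```shell
# On the head Spark: start a Ray cluster for vLLM's multi-node backend
ray start --head --port=6379

# On the second Spark: join the cluster (replace <HEAD_IP> with the head node's address)
ray start --address=<HEAD_IP>:6379

# Keep NCCL on the InfiniBand fabric; setting NCCL_IB_DISABLE=1 instead
# forces the socket (Ethernet) path, which is how I got the 16 t/s number
export NCCL_IB_DISABLE=0

# Serve the model with tensor parallelism split across the two GPUs
vllm serve RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
  --tensor-parallel-size 2
```

Toggling `NCCL_IB_DISABLE` is the easiest way to A/B the two transports without touching cabling, since both runs go over the same physical port.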
It turned out the GB10 is not yet optimized for FP4 quants, so AWQ gave me 25 t/s on the same model.
Also, 40 t/s on MiniMax M2 in 4-bit AWQ is very usable for coding.
Wow, I was able to run GLM-4.6 in 4-bit AWQ on my dual Sparks, and the performance was acceptable. 16 t/s is not fast by any measure, but it's usable. Prompt processing speeds were pretty decent too.
I could only fit 50K context. I suspect that if I optimized my memory footprint, I could push it up to 64K.
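For the context-length tuning, these are the vLLM knobs I'd reach for. A sketch only, with guessed values and a placeholder model ID, not the exact command I ran:

```shell
# Cap the context window explicitly; vLLM otherwise reserves KV cache
# for the model's full advertised context, which may not fit
vllm serve <GLM-4.6-AWQ-model> \
  --tensor-parallel-size 2 \
  --max-model-len 51200 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8
```

Raising `--gpu-memory-utilization` and switching the KV cache to FP8 are the two main levers for stretching toward 64K; both trade a little headroom or precision for cache capacity.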