[ci] Adding gfx1103 coverage by geomin12 · Pull Request #1854 · ROCm/TheRock · GitHub
Need more time to look at it.
What does this mean? Rocm already worked fine on the FW16 since the beginning.
Fine… not really with the iGPU.
TheRock is an official AMD build (even if it is only a preview for now).
For example, with my FW16 (128 GB of RAM, no dGPU):
llama.cpp built with rocm-7.10.0a20251025 gives:
| model | size | params | test | t/s |
|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 13.43 ± 0.51 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 10.68 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 17.20 ± 1.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3 | 23.14 ± 1.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 26.34 ± 0.92 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 32.19 ± 0.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp12 | 33.76 ± 0.39 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 34.30 ± 0.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp24 | 34.60 ± 0.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 36.51 ± 0.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp48 | 37.35 ± 1.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 38.03 ± 2.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp96 | 39.72 ± 0.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 40.03 ± 0.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp192 | 39.52 ± 1.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 39.00 ± 0.69 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp384 | 38.80 ± 0.68 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 37.32 ± 0.16 |
| model | size | params | test | t/s |
|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 17.69 ± 0.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 18.69 ± 0.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3 | 24.74 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 27.78 ± 0.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 37.84 ± 2.63 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp12 | 44.73 ± 3.98 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 52.02 ± 4.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp24 | 54.92 ± 2.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 60.35 ± 6.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp48 | 54.14 ± 0.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 114.08 ± 1.87 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp96 | 128.74 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 146.75 ± 1.92 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp192 | 162.75 ± 2.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 184.04 ± 1.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp384 | 202.39 ± 0.15 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 216.78 ± 1.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp768 | 230.29 ± 0.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1024 | 242.97 ± 1.37 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1536 | 249.27 ± 0.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 251.07 ± 1.70 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3072 | 238.61 ± 0.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4096 | 238.47 ± 0.17 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg16 | 17.83 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512+tg64 | 96.45 ± 0.18 |
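For reference, a build along these lines can be sketched roughly as follows. This is an assumption-laden sketch, not the exact commands used above: the ROCm install path and the GPU target flag name vary between llama.cpp versions (`-DAMDGPU_TARGETS` in older trees, `-DGPU_TARGETS` in newer ones), so check your checkout's docs.

```shell
# Sketch: build llama.cpp against a ROCm preview from TheRock for the
# gfx1103 iGPU. ROCM_PATH and the target-flag spelling are assumptions.
export ROCM_PATH=/opt/rocm   # wherever the rocm-7.10.0a* preview is installed
cmake -S llama.cpp -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1103 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"
```
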
But it can be faster if we take the time to build an optimized backend. For example, on the CPU alone, the ik_llama.cpp fork gives:
| model | size | params | test | t/s |
|---|---|---|---|---|
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp1 | 12.74 ± 0.53 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp2 | 20.43 ± 0.47 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp3 | 24.47 ± 1.33 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp4 | 29.05 ± 0.51 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp8 | 39.80 ± 1.77 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp12 | 43.89 ± 1.01 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp16 | 48.19 ± 0.14 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp24 | 50.36 ± 1.30 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp32 | 57.38 ± 0.30 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp48 | 69.28 ± 1.89 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp64 | 76.33 ± 4.12 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp96 | 87.84 ± 2.44 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp128 | 97.41 ± 2.51 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp192 | 107.12 ± 1.62 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp256 | 116.30 ± 2.83 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp384 | 124.47 ± 2.08 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp512 | 126.85 ± 1.06 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp768 | 136.04 ± 2.22 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp1024 | 138.26 ± 1.67 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp1536 | 138.63 ± 1.32 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp2048 | 136.38 ± 0.79 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp3072 | 131.53 ± 0.57 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp4096 | 123.31 ± 1.45 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | tg16 | 13.39 ± 1.39 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp512+tg64 | 62.91 ± 1.21 |
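The `pp*`/`tg*`/`pp512+tg64` rows in these tables match llama-bench's test names; a run producing them can be sketched like this (the model path is a placeholder, and the exact size list here is trimmed relative to the tables above):

```shell
# Sketch of a llama-bench invocation producing rows like the ones above.
# -p sets prompt (prefill) sizes, -n a generation-length test,
# -pg a combined prompt+generation test.
./build/bin/llama-bench \
    -m gpt-oss-120b-MXFP4.gguf \
    -p 1,2,4,8,16,32,64,128,256,512 \
    -n 16 \
    -pg 512,64
```
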
Ahh okay. I use the dGPU with rocm. Never thought to try the iGPU lol.
I do not have the dGPU… and the iGPU can use all the RAM, so in my case I can run large models like gpt-oss-120B, etc. Pretty good for large MoE models.
Or Mistral-Nemo in FP16…
| model | size | params | test | t/s |
|---|---|---|---|---|
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1 | 2.82 ± 0.00 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1 | 2.81 ± 0.01 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp2 | 5.33 ± 0.17 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp3 | 7.54 ± 0.02 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp4 | 9.38 ± 0.00 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp8 | 15.66 ± 0.03 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp12 | 23.00 ± 0.31 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp16 | 30.82 ± 0.06 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp24 | 45.14 ± 0.08 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp32 | 58.77 ± 0.10 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp48 | 84.73 ± 0.38 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp64 | 106.35 ± 0.12 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp96 | 147.68 ± 0.22 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp128 | 173.62 ± 2.78 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp192 | 147.58 ± 0.67 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp256 | 171.93 ± 0.57 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp384 | 158.70 ± 3.12 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp512 | 157.58 ± 4.81 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp768 | 158.31 ± 3.40 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1024 | 175.72 ± 11.71 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1536 | 178.50 ± 2.54 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp2048 | 171.71 ± 0.17 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp3072 | 173.16 ± 0.58 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp4096 | 158.70 ± 7.36 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | tg16 | 2.67 ± 0.05 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp512+tg64 | 21.19 ± 0.22 |
Or Mistral Small… but the tg (token generation) rate is low.
Note: something looks wrong with my FW16; it only draws ~40 W and runs at a low temperature…
Seriously? How? The BIOS limits the maximum to 8 GB, so how are you able to use more?
Which BIOS are you on? 3.06/3.07 did that, and 3.05 would also do it in certain circumstances.
HIP can allocate from system RAM on Linux (hipHostMalloc), and llama.cpp can do the same with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable; there is no 8 GB limit that way.
Since kernel 6.11 (or 6.12, I can't remember), AMD changed the driver so that on iGPUs device allocations can use VRAM+GTT on Linux (so no special code is needed to use host allocation).
For the 40 W I don't know (I hadn't benchmarked recently; I use the Framework Desktop), but after a restart (I don't know if it was needed… and on that run I also changed the USB port…) I am back to 55/65 W.
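A minimal sketch of the two approaches described above (the env var is a real llama.cpp/ggml knob; the binary and model paths are placeholders):

```shell
# Option 1: have ggml's HIP backend allocate from unified/host memory,
# so model weights are not capped by the BIOS "VRAM" carve-out.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
    ./build/bin/llama-cli -m model.gguf -ngl 99 -p "hello"

# Option 2: on recent kernels, normal device allocations can spill into GTT.
# The amdgpu driver exposes the GTT budget as a module parameter
# (value in MiB; -1 means the driver default, roughly half of system RAM).
cat /sys/module/amdgpu/parameters/gttsize
```
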
With that (ROCm backend):
| model | size | params | test | t/s |
|---|---|---|---|---|
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1 | 3.26 ± 0.03 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1 | 3.28 ± 0.01 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp2 | 6.42 ± 0.07 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp3 | 9.06 ± 0.07 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp4 | 11.35 ± 0.11 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp8 | 17.11 ± 0.07 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp12 | 25.65 ± 0.17 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp16 | 33.91 ± 0.24 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp24 | 49.08 ± 0.56 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp32 | 64.24 ± 0.33 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp48 | 92.70 ± 1.08 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp64 | 116.14 ± 0.88 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp96 | 161.09 ± 0.54 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp128 | 189.96 ± 0.85 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp192 | 159.92 ± 0.53 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp256 | 184.93 ± 0.48 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp384 | 168.24 ± 0.70 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp512 | 174.95 ± 2.95 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp768 | 171.82 ± 1.88 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1024 | 186.54 ± 5.03 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1536 | 192.09 ± 2.03 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp2048 | 184.93 ± 11.78 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp3072 | 187.04 ± 2.65 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp4096 | 177.66 ± 5.01 |
| llama 13B F16 | 22.81 GiB | 12.25 B | tg16 | 3.31 ± 0.00 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp512+tg64 | 25.63 ± 0.03 |
| model | size | params | test | t/s |
|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 18.60 ± 0.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 18.83 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 20.75 ± 0.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3 | 27.65 ± 2.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 33.04 ± 1.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 50.08 ± 2.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp12 | 49.36 ± 0.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 60.69 ± 1.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp24 | 63.38 ± 3.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 71.41 ± 1.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp48 | 67.93 ± 5.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 124.51 ± 1.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp96 | 140.86 ± 2.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 161.33 ± 2.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp192 | 173.46 ± 2.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 199.94 ± 5.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp384 | 220.50 ± 4.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 241.97 ± 4.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp768 | 259.88 ± 3.90 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1024 | 270.64 ± 1.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1536 | 275.05 ± 2.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 282.40 ± 1.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3072 | 271.99 ± 0.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4096 | 270.66 ± 0.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg16 | 18.98 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512+tg64 | 104.82 ± 0.97 |
+15% gain… yes!
(BIOS 3.07.)
It also impacts anyone on a Ryzen Framework 13 (where all we have is the iGPU).
So you can run larger models on the iGPU, without being limited by the dGPU's VRAM?
If so, is that at useable speed?
Just upgraded to a Ryzen AI 9 HX 370 with its 890M iGPU.
Is the 890M already supported too? Or where can I request that?