Look, there is now a ROCm build with official support for the iGPU (780M+?):

[ci] Adding gfx1103 coverage by geomin12 · Pull Request #1854 · ROCm/TheRock · GitHub

Need more time to look at it.

What does this mean? ROCm has already worked fine on the FW16 since the beginning.

Fine… not really with the iGPU:

  • it has not been “supported” by AMD
  • Fedora has had it enabled since FC42, but there are many crashes/instabilities.

TheRock is an official AMD build (even if it is only a preview for now).

For example, with my FW16 (with 128 GB of RAM and no dGPU):

llama.cpp built with rocm-7.10.0a20251025 can achieve the following (rough llama-bench invocations for both runs are sketched after the second table):

  • backend CPU
  • threads 8
  • n_ubatch 4096
  • type_kv: bf16
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 13.43 ± 0.51 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 10.68 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 17.20 ± 1.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3 | 23.14 ± 1.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 26.34 ± 0.92 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 32.19 ± 0.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp12 | 33.76 ± 0.39 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 34.30 ± 0.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp24 | 34.60 ± 0.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 36.51 ± 0.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp48 | 37.35 ± 1.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 38.03 ± 2.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp96 | 39.72 ± 0.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 40.03 ± 0.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp192 | 39.52 ± 1.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 39.00 ± 0.69 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp384 | 38.80 ± 0.68 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 37.32 ± 0.16 |
  • backend ROCm
  • ngl: 999
  • n_ubatch 4096
  • fa: on
  • mmap: off
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 17.69 ± 0.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 18.69 ± 0.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3 | 24.74 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 27.78 ± 0.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 37.84 ± 2.63 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp12 | 44.73 ± 3.98 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 52.02 ± 4.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp24 | 54.92 ± 2.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 60.35 ± 6.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp48 | 54.14 ± 0.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 114.08 ± 1.87 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp96 | 128.74 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 146.75 ± 1.92 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp192 | 162.75 ± 2.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 184.04 ± 1.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp384 | 202.39 ± 0.15 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 216.78 ± 1.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp768 | 230.29 ± 0.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1024 | 242.97 ± 1.37 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1536 | 249.27 ± 0.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 251.07 ± 1.70 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3072 | 238.61 ± 0.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4096 | 238.47 ± 0.17 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg16 | 17.83 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512+tg64 | 96.45 ± 0.18 |
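Roughly, the settings listed above map onto llama-bench flags as in the sketch below (Python just wraps the CLI). The model path is a placeholder, the "backend CPU" run is assumed to be -ngl 0 or a CPU-only build, and this is a reconstruction from the listed parameters rather than the exact command used.

```python
# Hedged sketch of the two llama-bench runs above (placeholder model path;
# flag mapping: -t threads, -ub n_ubatch, -ctk/-ctv type_kv, -ngl, -fa, -mmp).
import subprocess

MODEL = "gpt-oss-120b-mxfp4.gguf"  # placeholder, not the actual file name
PP = "1,2,3,4,8,12,16,24,32,48,64,96,128,192,256,384,512"

# "backend CPU" run: assuming -ngl 0 (or a CPU-only build), 8 threads, bf16 KV cache
subprocess.run([
    "llama-bench", "-m", MODEL,
    "-t", "8", "-ub", "4096",
    "-ctk", "bf16", "-ctv", "bf16",
    "-ngl", "0",
    "-p", PP, "-n", "0",   # pp sweep only, no tg test
], check=True)

# "backend ROCm" run: all layers on the iGPU, flash attention on, mmap off
subprocess.run([
    "llama-bench", "-m", MODEL,
    "-ub", "4096",
    "-ngl", "999",
    "-fa", "1", "-mmp", "0",
    "-p", PP + ",768,1024,1536,2048,3072,4096",
    "-n", "16",            # tg16
    "-pg", "512,64",       # pp512+tg64
], check=True)
```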

But it can be faster if we take the time to create an optimized backend. For example, with only the CPU, the ik_llama.cpp fork can achieve:

ik_llama.cpp

  • backend CPU
  • threads 8
  • n_ubatch 4096
  • type_kv: bf16
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp1 | 12.74 ± 0.53 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp2 | 20.43 ± 0.47 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp3 | 24.47 ± 1.33 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp4 | 29.05 ± 0.51 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp8 | 39.80 ± 1.77 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp12 | 43.89 ± 1.01 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp16 | 48.19 ± 0.14 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp24 | 50.36 ± 1.30 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp32 | 57.38 ± 0.30 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp48 | 69.28 ± 1.89 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp64 | 76.33 ± 4.12 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp96 | 87.84 ± 2.44 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp128 | 97.41 ± 2.51 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp192 | 107.12 ± 1.62 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp256 | 116.30 ± 2.83 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp384 | 124.47 ± 2.08 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp512 | 126.85 ± 1.06 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp768 | 136.04 ± 2.22 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp1024 | 138.26 ± 1.67 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp1536 | 138.63 ± 1.32 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp2048 | 136.38 ± 0.79 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp3072 | 131.53 ± 0.57 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp4096 | 123.31 ± 1.45 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | tg16 | 13.39 ± 1.39 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp512+tg64 | 62.91 ± 1.21 |

Ahh okay. I use the dGPU with ROCm. Never thought to try the iGPU lol.

I don't have the dGPU… and since the iGPU can use all the RAM, in my case I can run large models like oss-120, etc. Pretty good for large MoE models.

Or mistral-nemo in FP16 …

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1 | 2.82 ± 0.00 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1 | 2.81 ± 0.01 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp2 | 5.33 ± 0.17 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp3 | 7.54 ± 0.02 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp4 | 9.38 ± 0.00 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp8 | 15.66 ± 0.03 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp12 | 23.00 ± 0.31 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp16 | 30.82 ± 0.06 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp24 | 45.14 ± 0.08 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp32 | 58.77 ± 0.10 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp48 | 84.73 ± 0.38 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp64 | 106.35 ± 0.12 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp96 | 147.68 ± 0.22 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp128 | 173.62 ± 2.78 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp192 | 147.58 ± 0.67 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp256 | 171.93 ± 0.57 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp384 | 158.70 ± 3.12 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp512 | 157.58 ± 4.81 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp768 | 158.31 ± 3.40 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1024 | 175.72 ± 11.71 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1536 | 178.50 ± 2.54 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp2048 | 171.71 ± 0.17 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp3072 | 173.16 ± 0.58 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp4096 | 158.70 ± 7.36 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | tg16 | 2.67 ± 0.05 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp512+tg64 | 21.19 ± 0.22 |

Or Mistral Small … but the tg is low.

Note: it looks like something is wrong with my FW16; it only draws ~40 W and runs at low temperature…

Seriously? How? The BIOS limits the maximum to 8 GB, so how are you able to use more?

Which BIOS are you on? 3.06/3.07 did that, and 3.05 would also do it in certain circumstances.

HIP (and llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON environment variable) can allocate in system RAM on Linux (hipHostMalloc); there is no limit with that.
Since kernel 6.11 (or 6.12, I can't remember), AMD changed the driver so that on an iGPU, device allocations can use VRAM+GTT on Linux (so no extra code is needed to use host allocation).
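To make the "allocate in RAM" part concrete: hipHostMalloc hands out pinned system memory that the GPU can address, so it is not bounded by the BIOS VRAM carve-out. A minimal illustrative sketch (assuming a working ROCm install with libamdhip64.so on the loader path; this is not llama.cpp's own code):

```python
# Minimal illustration of host allocation through HIP: pinned system RAM that
# the iGPU can access, independent of the BIOS "VRAM" carve-out.
import ctypes

hip = ctypes.CDLL("libamdhip64.so")

def hip_host_alloc(nbytes: int) -> ctypes.c_void_p:
    """hipError_t hipHostMalloc(void **ptr, size_t size, unsigned int flags)"""
    ptr = ctypes.c_void_p()
    err = hip.hipHostMalloc(ctypes.byref(ptr), ctypes.c_size_t(nbytes), ctypes.c_uint(0))
    if err != 0:  # hipSuccess == 0
        raise RuntimeError(f"hipHostMalloc failed: error {err}")
    return ptr

buf = hip_host_alloc(16 << 30)  # 16 GiB, well past an 8 GB UMA limit
hip.hipHostFree(buf)
```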

For the 40 W, I don't know (I hadn't benchmarked recently; I use the Framework Desktop :wink:), but after a restart (I don't know if it was needed… and on that run I also changed the USB port…) I am back to 55/65 W.

With that:

  • backend ROCm
  • ngl: 999
  • n_ubatch 4096
  • fa: on
  • mmap: off
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1 | 3.26 ± 0.03 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1 | 3.28 ± 0.01 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp2 | 6.42 ± 0.07 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp3 | 9.06 ± 0.07 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp4 | 11.35 ± 0.11 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp8 | 17.11 ± 0.07 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp12 | 25.65 ± 0.17 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp16 | 33.91 ± 0.24 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp24 | 49.08 ± 0.56 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp32 | 64.24 ± 0.33 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp48 | 92.70 ± 1.08 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp64 | 116.14 ± 0.88 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp96 | 161.09 ± 0.54 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp128 | 189.96 ± 0.85 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp192 | 159.92 ± 0.53 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp256 | 184.93 ± 0.48 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp384 | 168.24 ± 0.70 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp512 | 174.95 ± 2.95 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp768 | 171.82 ± 1.88 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1024 | 186.54 ± 5.03 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1536 | 192.09 ± 2.03 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp2048 | 184.93 ± 11.78 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp3072 | 187.04 ± 2.65 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp4096 | 177.66 ± 5.01 |
| llama 13B F16 | 22.81 GiB | 12.25 B | tg16 | 3.31 ± 0.00 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp512+tg64 | 25.63 ± 0.03 |
  • backend ROCm
  • ngl: 999
  • n_ubatch 4096
  • fa: on
  • mmap: off
  • GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON (sketched below, after the table)
| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 18.60 ± 0.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 18.83 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 20.75 ± 0.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3 | 27.65 ± 2.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 33.04 ± 1.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 50.08 ± 2.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp12 | 49.36 ± 0.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 60.69 ± 1.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp24 | 63.38 ± 3.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 71.41 ± 1.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp48 | 67.93 ± 5.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 124.51 ± 1.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp96 | 140.86 ± 2.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 161.33 ± 2.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp192 | 173.46 ± 2.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 199.94 ± 5.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp384 | 220.50 ± 4.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 241.97 ± 4.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp768 | 259.88 ± 3.90 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1024 | 270.64 ± 1.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1536 | 275.05 ± 2.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 282.40 ± 1.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3072 | 271.99 ± 0.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4096 | 270.66 ± 0.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg16 | 18.98 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512+tg64 | 104.82 ± 0.97 |
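For completeness, the only change versus the earlier ROCm run is that environment variable; a minimal sketch under the same assumptions as the wrapper sketched above:

```python
# Same ROCm llama-bench run as sketched earlier, with llama.cpp's
# unified-memory path enabled through the environment.
import os
import subprocess

env = dict(os.environ, GGML_CUDA_ENABLE_UNIFIED_MEMORY="ON")
subprocess.run([
    "llama-bench", "-m", "gpt-oss-120b-mxfp4.gguf",  # placeholder path
    "-ub", "4096", "-ngl", "999", "-fa", "1", "-mmp", "0",
    "-p", "512", "-pg", "512,64",
], env=env, check=True)
```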

+15% gain… yes!

(BIOS 3.07.)


It also impacts anyone on a Ryzen Framework 13 (where all we have is the iGPU).

So you can run larger models on the iGPU and are not limited by the dGPU's RAM?

If so, is that at a usable speed?

Just upgraded to the Ryzen AI 9 HX 370 with its 890M iGPU.

Is the 890M already supported too? Or where can I request that?