Look, there is now a ROCm build with official support for the iGPU (780M+?):

[ci] Adding gfx1103 coverage by geomin12 · Pull Request #1854 · ROCm/TheRock · GitHub

I need more time to look at it.

What does this mean? ROCm has already worked fine on the FW16 since the beginning.

Fine… not really, with the iGPU:

  • it has not been “supported” by AMD
  • Fedora has had it enabled since FC42, but there are many crashes/instabilities.

TheRock is an official AMD build (even if it is only a preview for now).

For example, with my FW16 (128 GB of RAM and no dGPU):

llama.cpp built with rocm-7.10.0a20251025 gets:

  • backend CPU
  • threads 8
  • n_ubatch 4096
  • type_kv: bf16

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 13.43 ± 0.51 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 10.68 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 17.20 ± 1.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3 | 23.14 ± 1.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 26.34 ± 0.92 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 32.19 ± 0.11 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp12 | 33.76 ± 0.39 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 34.30 ± 0.13 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp24 | 34.60 ± 0.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 36.51 ± 0.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp48 | 37.35 ± 1.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 38.03 ± 2.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp96 | 39.72 ± 0.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 40.03 ± 0.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp192 | 39.52 ± 1.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 39.00 ± 0.69 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp384 | 38.80 ± 0.68 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 37.32 ± 0.16 |
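
For reference, the CPU numbers above come from llama-bench; here is a minimal sketch of the invocation. The model path is a placeholder, and flag spellings follow the current llama-bench help, so they may differ between versions:

```bash
# CPU-only run: 8 threads, 4096 micro-batch, bf16 KV cache.
# The model path is a placeholder, not the exact file used above.
./build/bin/llama-bench \
  -m gpt-oss-120b-mxfp4.gguf \
  -t 8 -ub 4096 \
  -ctk bf16 -ctv bf16 \
  -p 1,2,3,4,8,12,16,24,32,48,64,96,128,192,256,384,512
```
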
  • backend ROCm
  • ngl: 999
  • n_ubatch 4096
  • fa: on
  • mmap: off

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 17.69 ± 0.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 18.69 ± 0.57 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3 | 24.74 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 27.78 ± 0.31 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 37.84 ± 2.63 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp12 | 44.73 ± 3.98 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 52.02 ± 4.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp24 | 54.92 ± 2.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 60.35 ± 6.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp48 | 54.14 ± 0.80 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 114.08 ± 1.87 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp96 | 128.74 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 146.75 ± 1.92 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp192 | 162.75 ± 2.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 184.04 ± 1.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp384 | 202.39 ± 0.15 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 216.78 ± 1.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp768 | 230.29 ± 0.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1024 | 242.97 ± 1.37 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1536 | 249.27 ± 0.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 251.07 ± 1.70 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3072 | 238.61 ± 0.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4096 | 238.47 ± 0.17 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg16 | 17.83 ± 0.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512+tg64 | 96.45 ± 0.18 |
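
And a sketch of the matching ROCm run (again with a placeholder model path): `-ngl 999` offloads every layer, `-fa 1` turns flash attention on, `-mmp 0` disables mmap:

```bash
# iGPU run on the ROCm/HIP backend. On ROCm builds without native
# gfx1103 support, setting HSA_OVERRIDE_GFX_VERSION (e.g. 11.0.0) is a
# common workaround; not needed with a build that targets gfx1103.
./build/bin/llama-bench \
  -m gpt-oss-120b-mxfp4.gguf \
  -ngl 999 -ub 4096 -fa 1 -mmp 0 \
  -p 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 \
  -n 16 -pg 512,64
```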

But it can be faster if we take the time to create an optimized backend. For example, with only the CPU, the ik_llama.cpp fork gets:

ik_llama.cpp

  • backend CPU
  • threads 8
  • n_ubatch 4096
  • type_kv: bf16

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp1 | 12.74 ± 0.53 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp2 | 20.43 ± 0.47 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp3 | 24.47 ± 1.33 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp4 | 29.05 ± 0.51 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp8 | 39.80 ± 1.77 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp12 | 43.89 ± 1.01 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp16 | 48.19 ± 0.14 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp24 | 50.36 ± 1.30 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp32 | 57.38 ± 0.30 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp48 | 69.28 ± 1.89 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp64 | 76.33 ± 4.12 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp96 | 87.84 ± 2.44 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp128 | 97.41 ± 2.51 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp192 | 107.12 ± 1.62 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp256 | 116.30 ± 2.83 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp384 | 124.47 ± 2.08 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp512 | 126.85 ± 1.06 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp768 | 136.04 ± 2.22 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp1024 | 138.26 ± 1.67 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp1536 | 138.63 ± 1.32 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp2048 | 136.38 ± 0.79 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp3072 | 131.53 ± 0.57 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp4096 | 123.31 ± 1.45 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | tg16 | 13.39 ± 1.39 |
| gpt-oss ?B MXFP4 - 4.25 bpw | 59.02 GiB | 116.83 B | pp512+tg64 | 62.91 ± 1.21 |
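
If you want to try the fork yourself, it builds like upstream llama.cpp. A sketch, assuming it keeps the same CMake layout (check its README for the exact options):

```bash
# Clone and build the ik_llama.cpp fork for CPU.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# Then benchmark as before (placeholder model path).
./build/bin/llama-bench -m gpt-oss-120b-mxfp4.gguf -t 8 -ub 4096 -ctk bf16 -ctv bf16
```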

Ahh okay. I use the dGPU with ROCm. Never thought to try the iGPU lol.

I don’t have the dGPU… and the iGPU can use all the RAM, so in my case I can run large models like oss-120, etc. Pretty good for large MoE models.

Or mistral-nemo in FP16…

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1 | 2.82 ± 0.00 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1 | 2.81 ± 0.01 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp2 | 5.33 ± 0.17 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp3 | 7.54 ± 0.02 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp4 | 9.38 ± 0.00 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp8 | 15.66 ± 0.03 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp12 | 23.00 ± 0.31 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp16 | 30.82 ± 0.06 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp24 | 45.14 ± 0.08 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp32 | 58.77 ± 0.10 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp48 | 84.73 ± 0.38 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp64 | 106.35 ± 0.12 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp96 | 147.68 ± 0.22 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp128 | 173.62 ± 2.78 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp192 | 147.58 ± 0.67 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp256 | 171.93 ± 0.57 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp384 | 158.70 ± 3.12 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp512 | 157.58 ± 4.81 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp768 | 158.31 ± 3.40 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1024 | 175.72 ± 11.71 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp1536 | 178.50 ± 2.54 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp2048 | 171.71 ± 0.17 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp3072 | 173.16 ± 0.58 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp4096 | 158.70 ± 7.36 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | tg16 | 2.67 ± 0.05 |
| Mistral-Nemo-Instruct-2407 | 22.81 GiB | 12.25 B | pp512+tg64 | 21.19 ± 0.22 |

Or Mistral Small… but the tg rate is low.

Note: something looks wrong with my FW16; it only draws ~40 W and stays at a low temperature…

Seriously? How? The BIOS limits the maximum to 8 GB, so how are you able to use more?

Which BIOS are you on? 3.06-07 did that, and 3.05 would also do it in certain circumstances.

HIP (and llama.cpp, with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON environment variable) can allocate in system RAM on Linux (hipHostMalloc); there is no limit with that.
And since kernel 6.11 (or 6.12, I can’t remember), AMD changed the driver so that on an iGPU, device allocations can use VRAM+GTT on Linux (so no special code is needed to use the host allocation).
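
On an amdgpu system you can see both pools in sysfs; a quick way to check how much GTT (system-RAM-backed) memory the iGPU can address (paths assume the iGPU is the only card):

```bash
# VRAM (the BIOS carve-out) vs GTT (system RAM the GPU can map), in bytes.
cat /sys/class/drm/card*/device/mem_info_vram_total
cat /sys/class/drm/card*/device/mem_info_gtt_total
# Watch GTT fill up while a model is loaded:
watch -n1 cat /sys/class/drm/card*/device/mem_info_gtt_used
```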

As for the 40 W, I don’t know (I haven’t benchmarked recently; I use the Framework Desktop :wink: ), but after a restart (I don’t know if that was needed… and during the run I changed the USB port…) I’m back to 55/65 W.

With that:

  • backend ROCm
  • ngl: 999
  • n_ubatch 4096
  • fa: on
  • mmap: off

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1 | 3.26 ± 0.03 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1 | 3.28 ± 0.01 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp2 | 6.42 ± 0.07 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp3 | 9.06 ± 0.07 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp4 | 11.35 ± 0.11 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp8 | 17.11 ± 0.07 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp12 | 25.65 ± 0.17 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp16 | 33.91 ± 0.24 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp24 | 49.08 ± 0.56 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp32 | 64.24 ± 0.33 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp48 | 92.70 ± 1.08 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp64 | 116.14 ± 0.88 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp96 | 161.09 ± 0.54 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp128 | 189.96 ± 0.85 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp192 | 159.92 ± 0.53 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp256 | 184.93 ± 0.48 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp384 | 168.24 ± 0.70 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp512 | 174.95 ± 2.95 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp768 | 171.82 ± 1.88 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1024 | 186.54 ± 5.03 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp1536 | 192.09 ± 2.03 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp2048 | 184.93 ± 11.78 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp3072 | 187.04 ± 2.65 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp4096 | 177.66 ± 5.01 |
| llama 13B F16 | 22.81 GiB | 12.25 B | tg16 | 3.31 ± 0.00 |
| llama 13B F16 | 22.81 GiB | 12.25 B | pp512+tg64 | 25.63 ± 0.03 |

  • backend ROCm
  • ngl: 999
  • n_ubatch 4096
  • fa: on
  • mmap: off
  • GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON

| model | size | params | test | t/s |
| --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 18.60 ± 0.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1 | 18.83 ± 0.03 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2 | 20.75 ± 0.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3 | 27.65 ± 2.58 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4 | 33.04 ± 1.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp8 | 50.08 ± 2.96 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp12 | 49.36 ± 0.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp16 | 60.69 ± 1.54 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp24 | 63.38 ± 3.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp32 | 71.41 ± 1.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp48 | 67.93 ± 5.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp64 | 124.51 ± 1.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp96 | 140.86 ± 2.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp128 | 161.33 ± 2.23 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp192 | 173.46 ± 2.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp256 | 199.94 ± 5.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp384 | 220.50 ± 4.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512 | 241.97 ± 4.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp768 | 259.88 ± 3.90 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1024 | 270.64 ± 1.32 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp1536 | 275.05 ± 2.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 282.40 ± 1.73 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp3072 | 271.99 ± 0.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp4096 | 270.66 ± 0.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg16 | 18.98 ± 0.12 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp512+tg64 | 104.82 ± 0.97 |
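
For completeness, the unified-memory run is just the previous ROCm command with the environment variable set (llama.cpp’s docs spell the value as `1`; the model path remains a placeholder):

```bash
# Let the HIP backend fall back to host/unified allocations instead of
# failing when the device pool is exhausted.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-bench \
  -m gpt-oss-120b-mxfp4.gguf \
  -ngl 999 -ub 4096 -fa 1 -mmp 0 -pg 512,64
```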

+15% gain… yes!

(BIOS 3.07.)