AFAIK it’s 96GB in Windows, 110GB in Linux.
With the correct API there is no hard limit on Linux (well, only the size of the RAM: 128GB minus the RAM needed by other active programs…)
This is the case for other AMD APUs too (when the driver does not crash…)
8 or 4GB should be enough for the OS, which leaves 120/124GB for the LLM tensors.
For the AI/ML use case, maybe report results from https://www.localscore.ai/ with different models.
- 3 configs are interesting: CPU, sgemm GPU, and a full rebuild with HIP.
- Q6_K is fast and has good quality (Q4_K_M is good for benchmarks but has too many hallucinations for me…)
- bartowski (Bartowski) has many pre-quantized models; maybe pick a selection of different model sizes, from Llama 3B up to Mistral Large 123B…
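For the build configs above, a sketch of the llama.cpp CMake invocations (flag names follow current llama.cpp; the `gfx1151` target for a 40 CU Strix Halo iGPU is my assumption, check yours with `rocminfo`):

```shell
# Plain CPU build (the sgemm/tinyBLAS fast path is part of the CPU backend)
cmake -B build-cpu
cmake --build build-cpu --config Release -j

# Full HIP rebuild for the iGPU
# AMDGPU_TARGETS=gfx1151 is an assumption for this APU; adjust to your ROCm install
cmake -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
cmake --build build-hip --config Release -j
```

Keeping two separate build directories makes it easy to bench both backends against the same model files.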
There is bf16 CPU perf with llama.cpp too, which may be interesting to look at.
If needed, I can give some more specific commands to run for the different cases.
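As an example of the kind of commands I mean, using `llama-bench` (the standard llama.cpp benchmark tool; the model path and thread count are placeholders to adapt):

```shell
# CPU-only run (hypothetical model path; -t = number of threads)
./build-cpu/bin/llama-bench -m models/Llama-3-8B.Q6_K.gguf -t 16

# HIP build with all layers offloaded to the iGPU (-ngl 99 = offload everything)
./build-hip/bin/llama-bench -m models/Llama-3-8B.Q6_K.gguf -ngl 99
```

`llama-bench` reports prompt-processing and token-generation speed in the same table, which makes the CPU vs iGPU comparison easy.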
I don't know if I can finish the FP8 backend of llama.cpp, but if you have time: I am working on a special iGPU backend for llama.cpp. For now only FP16/BF16 is supported, and it is optimised for the Ryzen 7940HS iGPU, which has 12 CUs… I don't know what the best/correct config is for this 40 CU part… (GitHub - Djip007/llama.cpp at feature/igpu). It may also need the rocm-6.4 from fedora-43 (beta)…
And for the "cluster", some benchmarks with a big MoE like Llama 4 Maverick (if possible…) or the smaller Mixtral 8x22B (in bf16 quant?)