This is also something I’ve been interested in, especially with news of the patch @Wrybill_Plover linked; so I popped the linux-mainline kernel onto my Arch install (currently 6.10rc3-1) and compiled llama.cpp from the current HEAD as of today (172c825). Notably, since I’m on the 6.10 release candidate, I did not use the HIP_UMA flag. All these runs were made in performance mode.
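For anyone who wants to reproduce the “performance mode” part: I’m describing the general mechanism below rather than transcribing my exact commands, so treat it as a sketch (power-profiles-daemon, the ACPI sysfs knob, and the CPU governor are all ways to get there).

```
# Assumption: "performance mode" means the platform power profile, set via
# power-profiles-daemon; the sysfs knob and CPU governor are equivalents.
powerprofilesctl set performance
# or, directly through sysfs:
echo performance | sudo tee /sys/firmware/acpi/platform_profile
# and pinning the CPU frequency governor (cpupower is packaged on Arch):
sudo cpupower frequency-set -g performance
```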
llama-bench on the iGPU
```
# compiled with `HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_HIPBLAS=1 -j 8`
rufo@framework-linux (git)-[master]-% HSA_OVERRIDE_GFX_VERSION="11.0.0" ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |         test |             t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----------: | --------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 |        pp512 |   259.26 ± 0.70 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 |        tg128 |    10.69 ± 0.13 |

build: 172c8256 (3145)
HSA_OVERRIDE_GFX_VERSION="11.0.0" ./llama-bench -m   73.24s user 1.50s system 100% cpu 1:14.63 total
```
CPU (OpenBLAS)
```
# compiled with `HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_OPENBLAS=1 -j 8`
rufo@framework-linux (git)-[master]-% ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
| model                          |       size |     params | backend    | threads |         test |             t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----------: | --------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | BLAS       |       8 |        pp512 |    13.44 ± 0.07 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | BLAS       |       8 |        tg128 |     6.99 ± 0.20 |

build: 172c8256 (3145)
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf  3987.82s user 11.98s system 1247% cpu 5:20.56 total
```
CPU (no flags)
```
# compiled with `make -j 8`
rufo@framework-linux (git)-[master]-% ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
| model                          |       size |     params | backend    | threads |         test |             t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----------: | --------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |       8 |        pp512 |    44.67 ± 1.07 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |       8 |        tg128 |     7.72 ± 0.13 |

build: 172c8256 (3145)
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf  1535.49s user 9.35s system 794% cpu 3:14.55 total
```
So I’m observing quite a good speed bump on 6.10 even without the HIP_UMA flag enabled.
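If someone on a pre-6.10 kernel wants to compare against the UMA path, my understanding is that llama.cpp exposes it as a Makefile flag (LLAMA_HIP_UMA, which switches the HIP allocations to managed memory) — I didn’t rebuild with it for these numbers, so treat the exact flag name as my assumption for this revision:

```
# Assumption: LLAMA_HIP_UMA enables the HIP_UMA (managed-memory) path at this revision.
HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 -j 8
```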
EDIT: I noticed the other benches had a CPU backend and not BLAS, so I reran them with the default compiled backend. I may re-run them with clang using the command @Nils_Ponsard posted out of curiosity, but I have to run for the moment.
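(If anyone wants to try the clang run before I get back to it: I don’t have @Nils_Ponsard’s exact command in front of me, so this is only a generic sketch of overriding the compilers for a plain CPU build.)

```
# Generic clang rebuild -- not necessarily the exact command from the thread.
make clean
make CC=clang CXX=clang++ -j 8
```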
EDIT 2: Ugh, realized I had a rogue process running during the CPU run. Re-ran it with gcc and it came out about 50% faster… not going to bother with the OpenBLAS bench again, but I presume that would also be about 50% faster.
EDIT 3: clang seems to be about the same speed, at least with the flags make uses by default. Done with this set of experimentation for now.