Which processor do you have? I may do some testing this weekend if I have time.
7840U
What command did you use to build llama.cpp to obtain these numbers?
With:
Build command
cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
I only obtain these results:
iGPU result
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | pp 512 | 70.93 ± 1.07 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | tg 128 | 8.11 ± 0.19 |
build: b4e4b8a9 (2724)
That’s less than half the performance on the pp 512 test, and this was with the machine plugged into the wall on the high performance profile.
Here are my CPU results:
CPU results
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | pp 512 | 49.75 ± 0.69 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | tg 128 | 7.64 ± 0.13 |
I have two 32 GB DIMMs installed, running Arch Linux with kernel 6.8.7.
I also noticed, when running phi3 with ollama in high performance power mode, that a HIP_UMA-patched version I built (~12 t/s) is slower than the CPU version (~20 t/s). The model is also small enough to fit in 4 GB of VRAM, where I get around 26 t/s. This was with a simple prompt, no benchmark though.
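A more controlled way to compare CPU vs. partial vs. full offload on the same build is a llama-bench sweep over -ngl (the model filename below is just a placeholder, and note that -ngl 0 still goes through the ROCm build, so it is only a rough proxy for a pure CPU build):
# sweep offloaded layers: none, partial, everything
./llama-bench -m phi-3-mini-4k-instruct.Q4_K_M.gguf -ngl 0,16,99 -p 512 -n 128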
After that PR is merged (update HIP_UMA #7399 by Djip007 · Pull Request #7414 · ggerganov/llama.cpp · GitHub), I expect the same result (or really close) with VRAM or HIP_UMA, but we may need more tests.
This might be of interest: https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs
Looks like ROCm on 6.10-rc1 can now automatically allocate VRAM from the GTT. I only tried SD so far, not LLMs, but it worked without any need to change the standard packages or force GTT memory allocation.
Curious how llama.cpp performance would compare.
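If you want to see how much memory can actually be allocated that way, the amdgpu driver exposes GTT and VRAM counters in sysfs (a minimal sketch; card0 is an assumption, your iGPU may enumerate under another index):
# GTT = system RAM the GPU can borrow; VRAM = the carve-out set in the BIOS
cd /sys/class/drm/card0/device
grep . mem_info_gtt_total mem_info_gtt_used mem_info_vram_total mem_info_vram_used
Watching mem_info_gtt_used while llama-bench runs shows whether the weights really landed in GTT.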
This is also something I’ve been interested in, especially with news of the patch @Wrybill_Plover linked; so I put the linux-mainline kernel on my Arch install (currently 6.10rc3-1) and compiled llama.cpp from the current HEAD as of today (172c825). Notably, since I’m on the 6.10 release candidate, I did not use the HIP_UMA flag. All these runs were made in performance mode.
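For anyone wanting to try the same setup, the kernel swap is roughly the following (a sketch assuming the linux-mainline AUR package and GRUB; adjust for your AUR helper and bootloader):
yay -S linux-mainline linux-mainline-headers   # build and install the release-candidate kernel
sudo grub-mkconfig -o /boot/grub/grub.cfg      # regenerate the boot entries
# reboot into the new kernel, then verify:
uname -r                                       # should report 6.10-rc3 or newer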
llama-bench on the iGPU
# compiled with `HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_HIPBLAS=1 -j 8`
rufo@framework-linux (git)-[master]-% HSA_OVERRIDE_GFX_VERSION="11.0.0" ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | pp512 | 259.26 ± 0.70 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | tg128 | 10.69 ± 0.13 |
build: 172c8256 (3145)
HSA_OVERRIDE_GFX_VERSION="11.0.0" ./llama-bench -m 73.24s user 1.50s system 100% cpu 1:14.63 total
CPU (openBLAS)
# compiled with `HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_OPENBLAS=1 -j 8`
rufo@framework-linux (git)-[master]-% ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | BLAS | 8 | pp512 | 13.44 ± 0.07 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | BLAS | 8 | tg128 | 6.99 ± 0.20 |
build: 172c8256 (3145)
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf 3987.82s user 11.98s system 1247% cpu 5:20.56 total
CPU (no flags)
# compiled with `make -j 8`
rufo@framework-linux (git)-[master]-% ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | pp512 | 44.67 ± 1.07 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | tg128 | 7.72 ± 0.13 |
build: 172c8256 (3145)
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf 1535.49s user 9.35s system 794% cpu 3:14.55 total
So I’m observing quite a good speed bump without the HIP_UMA flag enabled on 6.10.
EDIT: I noticed the other benches had a CPU backend and not BLAS, so I reran them with the default compiled backend. I may re-run them with clang using the command @Nils_Ponsard posted out of curiosity, but I have to run for the moment.
EDIT 2: Ugh, realized I had a rogue process running during the CPU run. Re-ran it with gcc and it came out about 50% faster… not going to bother with the openBLAS bench again, but I presume that would also be about 50% faster.
EDIT 3: clang seems to be about the same speed, at least with the flags make uses by default. Done with this set of experimentation for now.
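For reference, the clang re-run is just the default CPU build with the compiler overridden on the make command line (a sketch; it assumes llama.cpp's Makefile honors the usual CC/CXX overrides):
make clean
make CC=clang CXX=clang++ -j 8   # plain CPU build, clang instead of gcc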
The only thing to keep in mind while comparing these results to mine is that I ran the tests with the Power Save PPD profile, on battery.
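For anyone re-running these numbers, matching the power profile first is a one-liner with power-profiles-daemon (assuming powerprofilesctl is installed, as it is by default on many distros):
powerprofilesctl list                 # show available profiles and the active one
powerprofilesctl set performance      # or power-saver to reproduce the on-battery numbers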
With Fedora 40, as of now (13/08/2024), we have ROCm 6.1 for gfx1103 and kernel 6.10, which allows use of GTT, so even the latest UMA patch is no longer needed.
With llamafile and the CPU only, you can get:
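To confirm the ROCm runtime actually picks up the iGPU, a quick check with rocminfo is enough (a minimal sketch; the Radeon 780M in these APUs should report gfx1103):
rocminfo | grep -i gfx    # the iGPU should show up as gfx1103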
./llamafile-0.8.12/bin/llamafile-bench -p "32,64,128,256,512" -n "16" -m "Mistral-7B-Instruct-v0.3.BF16.gguf,Mistral-7B-Instruct-v0.3.F16.gguf,Mistral-7B-Instruct-v0.3-Q6_K.gguf,Mistral-7B-Instruct-v0.3-Q8_0.gguf"
cpu_info | model_filename | size | test | t/s |
---|---|---|---|---|
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp32 | 57.01 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp64 | 64.43 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp128 | 68.48 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp256 | 87.93 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp512 | 82.44 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | tg16 | 3.96 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | pp32 | 38.11 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | pp64 | 52.96 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | pp128 | 51.56 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | pp256 | 58.12 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | pp512 | 57.87 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | tg16 | 3.98 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | pp32 | 46.23 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | pp64 | 47.29 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | pp128 | 53.08 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | pp256 | 50.50 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | pp512 | 50.93 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | tg16 | 7.30 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | pp32 | 75.26 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | pp64 | 80.90 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | pp128 | 74.80 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | pp256 | 85.64 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | pp512 | 81.34 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | tg16 | 9.42 |
And with the GPU via llama.cpp (llamafile-bench works only with the CPU, but we get similar speed):
make clean
# on fedora a rocm-blas is available for our GPU
module load rocm/gfx1103
make -j16 GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1103 [GGML_HIP_UMA=1]
./llama-bench -ngl 99 -p "32,64,128,256,512" -n "16" -m "Mistral-7B-Instruct-v0.3.F16.gguf,Mistral-7B-Instruct-v0.3.Q6_K.gguf,Mistral-7B-Instruct-v0.3.Q8_0.gguf"
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | pp32 | 105.65 ± 0.21 |
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | pp64 | 192.17 ± 0.55 |
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | pp128 | 303.08 ± 0.70 |
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | pp256 | 291.73 ± 1.05 |
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | pp512 | 263.90 ± 0.85 |
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | tg16 | 5.32 ± 0.00 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | pp32 | 200.73 ± 0.24 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | pp64 | 106.85 ± 0.15 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | pp128 | 187.87 ± 0.14 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | pp256 | 228.23 ± 0.32 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | pp512 | 239.72 ± 0.68 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | tg16 | 10.22 ± 0.00 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | pp32 | 158.99 ± 0.72 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | pp64 | 113.29 ± 0.19 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | pp128 | 199.07 ± 0.26 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | pp256 | 235.31 ± 0.65 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | pp512 | 241.80 ± 0.42 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | tg16 | 12.68 ± 0.06 |
For now llamafile is the fastest on CPU; on the GPU both have the same speed when using hipBLAS, and both crash from time to time. llamafile has a tinyBLAS backend for the GPU that does not crash, but it is slower.