VRAM allocation for the 7840U Frameworks

Which processor do you have? I may do some testing this weekend if I have time.

7840U

What command did you use to build llama.cpp to obtain these numbers?

with

Build command
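# AMDGPU_TARGETS=gfx1100: the 780M is really gfx1103, so the binary is run with HSA_OVERRIDE_GFX_VERSION=11.0.0
# LLAMA_HIPBLAS=ON enables the ROCm backend; LLAMA_HIP_UMA=ON allocates through unified memory (system RAM) instead of the fixed VRAM carve-out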
cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

I only obtain these results:

iGPU result
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | pp 512     |     70.93 ± 1.07 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | tg 128     |      8.11 ± 0.19 |

build: b4e4b8a9 (2724)

That’s less than half the performance on the pp 512 test, and this was with the laptop plugged into the wall in the high-performance profile.

Here are my CPU results:

CPU results
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | pp 512     |     49.75 ± 0.69 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | tg 128     |      7.64 ± 0.13 |

I have two 32 GB DIMMs installed and I’m running Arch Linux with kernel 6.8.7.

I also noticed, when running phi3 with ollama in high-performance power mode, that a HIP_UMA-patched version I built (~12 t/s) is slower than the CPU version (~20 t/s). The model is also small enough to fit in 4 GB of VRAM, where I get around 26 t/s. This was with a simple prompt though, not a proper benchmark.

Once that PR is merged (update HIP_UMA #7399 by Djip007 · Pull Request #7414 · ggerganov/llama.cpp · GitHub), I expect the same result (or really close) with VRAM or HIP_UMA, but we may need more testing.
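In the meantime, the gap can be measured directly by building and benchmarking both paths side by side. A rough sketch, assuming the Makefile accepts the same LLAMA_HIPBLAS / LLAMA_HIP_UMA switches as the CMake options above (the model path is just a placeholder):

# dedicated-VRAM path
make clean && make -j 8 LLAMA_HIPBLAS=1
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llama-bench -ngl 99 -m model.gguf
# unified-memory (HIP_UMA) path
make clean && make -j 8 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llama-bench -ngl 99 -m model.gguf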

This might be of interest: https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs

Looks like ROCm on 6.10-rc1 can now automatically allocate VRAM from the GTT. I only tried SD so far, not LLMs, but it worked without any need to change the standard packages or force GTT memory allocation.

Curious how llama.cpp performance would compare.
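One way to see which pool the weights actually land in is to watch the amdgpu memory counters in sysfs while a model is loaded; a rough sketch (the card index may differ on your system):

# dedicated VRAM carve-out vs. GTT (system RAM visible to the GPU), in bytes
cat /sys/class/drm/card0/device/mem_info_vram_total /sys/class/drm/card0/device/mem_info_gtt_total
# current usage, e.g. while llama-bench or SD is running
cat /sys/class/drm/card0/device/mem_info_vram_used /sys/class/drm/card0/device/mem_info_gtt_used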


This is also something I’ve been interested in, especially with news of the patch @Wrybill_Plover linked; so I popped on the linux-mainline kernel on my Arch install (currently 6.10rc3-1) and compiled llama.cpp from the current HEAD as of today (172c825). Notably, since I’m on the 6.10 release candidate, I did not use the HIP_UMA flag. All these runs were made in performance mode.

llama-bench on the iGPU
# compiled with `HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_HIPBLAS=1 -j 8`
rufo@framework-linux (git)-[master]-% HSA_OVERRIDE_GFX_VERSION="11.0.0" ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 |         pp512 |    259.26 ± 0.70 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 |         tg128 |     10.69 ± 0.13 |

build: 172c8256 (3145)
HSA_OVERRIDE_GFX_VERSION="11.0.0" ./llama-bench -m   73.24s user 1.50s system 100% cpu 1:14.63 total
CPU (openBLAS)
# compiled with `HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_OPENBLAS=1 -j 8`
rufo@framework-linux (git)-[master]-% ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | BLAS       |       8 |         pp512 |     13.44 ± 0.07 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | BLAS       |       8 |         tg128 |      6.99 ± 0.20 |

build: 172c8256 (3145)
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf  3987.82s user 11.98s system 1247% cpu 5:20.56 total
CPU (no flags)
# compiled with `make -j 8`
rufo@framework-linux (git)-[master]-% ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |       8 |         pp512 |     44.67 ± 1.07 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |       8 |         tg128 |      7.72 ± 0.13 |

build: 172c8256 (3145)
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf  1535.49s user 9.35s system 794% cpu 3:14.55 total

So I’m observing quite a good speed bump without the HIP_UMA flag enabled on 6.10.

EDIT: I noticed the other benches had a CPU backend and not BLAS, so I reran them with the default compiled backend. I may re-run them with clang using the command @Nils_Ponsard posted out of curiosity, but I have to run for the moment.

EDIT 2: Ugh, realized I had a rogue process on the CPU run :person_facepalming: Re-ran it w/gcc and it came out about 50% faster… not going to bother with the openBLAS bench again but presume that would also be about 50% faster.

EDIT 3: clang seems to be about the same speed, at least with the flags make used by default. Done with this set of experimentation for now :slight_smile:
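For anyone who wants to repeat the clang comparison, something along these lines should do it (a sketch, not the exact command used here; make command-line variables override the Makefile's default compiler):

make clean
make -j 8 CC=clang CXX=clang++
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf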


The only thing to keep in mind while comparing these results to mine is that I ran the tests with the Power Save PPD profile, on battery.


With Fedora 40 we now (13/08/2024) have ROCm 6.1 for gfx1103 and kernel 6.10, which allows use of GTT, so even the latest UMA patch is not needed :wink:

With llamafile and CPU only you can get:

./llamafile-0.8.12/bin/llamafile-bench -p "32,64,128,256,512" -n "16" -m "Mistral-7B-Instruct-v0.3.BF16.gguf,Mistral-7B-Instruct-v0.3.F16.gguf,Mistral-7B-Instruct-v0.3-Q6_K.gguf,Mistral-7B-Instruct-v0.3-Q8_0.gguf"
| cpu_info                    | model_filename                |      size | test  |   t/s |
| --------------------------- | ----------------------------- | --------: | ----- | ----: |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp32  | 57.01 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp64  | 64.43 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp128 | 68.48 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp256 | 87.93 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp512 | 82.44 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | tg16  |  3.96 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | pp32  | 38.11 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | pp64  | 52.96 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | pp128 | 51.56 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | pp256 | 58.12 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | pp512 | 57.87 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | tg16  |  3.98 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | pp32  | 46.23 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | pp64  | 47.29 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | pp128 | 53.08 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | pp256 | 50.50 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | pp512 | 50.93 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | tg16  |  7.30 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | pp32  | 75.26 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | pp64  | 80.90 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | pp128 | 74.80 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | pp256 | 85.64 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | pp512 | 81.34 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | tg16  |  9.42 |

And with the GPU / llama.cpp (llamafile-bench works only with the CPU, but we see similar speeds):

make clean
# on fedora a rocm-blas is available for our GPU
module load rocm/gfx1103 
make -j16 GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1103 [GGML_HIP_UMA=1]

./llama-bench -ngl 99 -p "32,64,128,256,512" -n "16" -m "Mistral-7B-Instruct-v0.3.F16.gguf,Mistral-7B-Instruct-v0.3.Q6_K.gguf,Mistral-7B-Instruct-v0.3.Q8_0.gguf"
| model         |      size |   params | backend | ngl | test  |           t/s |
| ------------- | --------: | -------: | ------- | --: | ----- | ------------: |
| llama 7B F16  | 13.50 GiB |   7.25 B | ROCm    |  99 | pp32  | 105.65 ± 0.21 |
| llama 7B F16  | 13.50 GiB |   7.25 B | ROCm    |  99 | pp64  | 192.17 ± 0.55 |
| llama 7B F16  | 13.50 GiB |   7.25 B | ROCm    |  99 | pp128 | 303.08 ± 0.70 |
| llama 7B F16  | 13.50 GiB |   7.25 B | ROCm    |  99 | pp256 | 291.73 ± 1.05 |
| llama 7B F16  | 13.50 GiB |   7.25 B | ROCm    |  99 | pp512 | 263.90 ± 0.85 |
| llama 7B F16  | 13.50 GiB |   7.25 B | ROCm    |  99 | tg16  |   5.32 ± 0.00 |
| llama 7B Q8_0 |  7.17 GiB |   7.25 B | ROCm    |  99 | pp32  | 200.73 ± 0.24 |
| llama 7B Q8_0 |  7.17 GiB |   7.25 B | ROCm    |  99 | pp64  | 106.85 ± 0.15 |
| llama 7B Q8_0 |  7.17 GiB |   7.25 B | ROCm    |  99 | pp128 | 187.87 ± 0.14 |
| llama 7B Q8_0 |  7.17 GiB |   7.25 B | ROCm    |  99 | pp256 | 228.23 ± 0.32 |
| llama 7B Q8_0 |  7.17 GiB |   7.25 B | ROCm    |  99 | pp512 | 239.72 ± 0.68 |
| llama 7B Q8_0 |  7.17 GiB |   7.25 B | ROCm    |  99 | tg16  |  10.22 ± 0.00 |
| llama 7B Q6_K |  5.54 GiB |   7.25 B | ROCm    |  99 | pp32  | 158.99 ± 0.72 |
| llama 7B Q6_K |  5.54 GiB |   7.25 B | ROCm    |  99 | pp64  | 113.29 ± 0.19 |
| llama 7B Q6_K |  5.54 GiB |   7.25 B | ROCm    |  99 | pp128 | 199.07 ± 0.26 |
| llama 7B Q6_K |  5.54 GiB |   7.25 B | ROCm    |  99 | pp256 | 235.31 ± 0.65 |
| llama 7B Q6_K |  5.54 GiB |   7.25 B | ROCm    |  99 | pp512 | 241.80 ± 0.42 |
| llama 7B Q6_K |  5.54 GiB |   7.25 B | ROCm    |  99 | tg16  |  12.68 ± 0.06 |

For now llamafile is the faster option on CPU. On GPU both have the same speed when using hipBLAS, and both crash from time to time. llamafile has a tinyBLAS backend for the GPU that does not crash, but it is slower :wink: