Which processor do you have? I may do some testing this weekend if I have time.
7840U
What command did you use to build llama.cpp to obtain these numbers?
With:
Build command
cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
I only obtain these results:
iGPU result
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | pp 512 | 70.93 ± 1.07 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | tg 128 | 8.11 ± 0.19 |
build: b4e4b8a9 (2724)
That’s less than half the performance on the pp 512 test, and this was with the machine plugged into the wall on the high performance profile.
Here are my CPU results:
CPU results
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | pp 512 | 49.75 ± 0.69 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | tg 128 | 7.64 ± 0.13 |
I have two 32 GB DIMMs installed, running Arch Linux with kernel 6.8.7.
I also noticed, when running phi3 with ollama in high performance power mode, that a HIP_UMA-patched version I built (~12 t/s) is slower than the CPU version (~20 t/s). The model is also small enough to fit in 4 GB of VRAM, where I get around 26 t/s. This was with a simple prompt, no benchmark though.
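A more controlled way to compare CPU vs. partial vs. full offload on the same build is a llama-bench sweep over -ngl (the model filename below is just a placeholder, and note that -ngl 0 still goes through the ROCm build, so it is only a rough proxy for a pure CPU build):
# sweep offloaded layers: none, partial, everything
./llama-bench -m phi-3-mini-4k-instruct.Q4_K_M.gguf -ngl 0,16,99 -p 512 -n 128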
After that PR is merged (update HIP_UMA #7399 by Djip007 · Pull Request #7414 · ggerganov/llama.cpp · GitHub), I expect the same result (or really close) with VRAM or HIP_UMA, but we may need more tests.
This might be of interest: https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs
Looks like ROCm on 6.10-rc1 can now automatically allocate VRAM from the GTT. I only tried SD so far, not LLMs, but it worked without any need to change the standard packages or force GTT memory allocation.
Curious how llama.cpp performance would compare.
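If you want to see how much memory can actually be allocated that way, the amdgpu driver exposes GTT and VRAM counters in sysfs (a minimal sketch; card0 is an assumption, your iGPU may enumerate under another index):
# GTT = system RAM the GPU can borrow; VRAM = the carve-out set in the BIOS
cd /sys/class/drm/card0/device
grep . mem_info_gtt_total mem_info_gtt_used mem_info_vram_total mem_info_vram_used
Watching mem_info_gtt_used while llama-bench runs shows whether the weights really landed in GTT.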
This is also something I’ve been interested in, especially with news of the patch @Wrybill_Plover linked; so I put the linux-mainline kernel on my Arch install (currently 6.10rc3-1) and compiled llama.cpp from the current HEAD as of today (172c825). Notably, since I’m on the 6.10 release candidate, I did not use the HIP_UMA flag. All these runs were made in performance mode.
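For anyone wanting to try the same setup, the kernel swap is roughly the following (a sketch assuming the linux-mainline AUR package and GRUB; adjust for your AUR helper and bootloader):
yay -S linux-mainline linux-mainline-headers   # build and install the release-candidate kernel
sudo grub-mkconfig -o /boot/grub/grub.cfg      # regenerate the boot entries
# reboot into the new kernel, then verify:
uname -r                                       # should report 6.10-rc3 or newer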
llama-bench on the iGPU
# compiled with `HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_HIPBLAS=1 -j 8`
rufo@framework-linux (git)-[master]-% HSA_OVERRIDE_GFX_VERSION="11.0.0" ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | pp512 | 259.26 ± 0.70 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | tg128 | 10.69 ± 0.13 |
build: 172c8256 (3145)
HSA_OVERRIDE_GFX_VERSION="11.0.0" ./llama-bench -m 73.24s user 1.50s system 100% cpu 1:14.63 total
CPU (openBLAS)
# compiled with `HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_OPENBLAS=1 -j 8`
rufo@framework-linux (git)-[master]-% ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | BLAS | 8 | pp512 | 13.44 ± 0.07 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | BLAS | 8 | tg128 | 6.99 ± 0.20 |
build: 172c8256 (3145)
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf 3987.82s user 11.98s system 1247% cpu 5:20.56 total
CPU (no flags)
# compiled with `make -j 8`
rufo@framework-linux (git)-[master]-% ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | pp512 | 44.67 ± 1.07 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | tg128 | 7.72 ± 0.13 |
build: 172c8256 (3145)
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf 1535.49s user 9.35s system 794% cpu 3:14.55 total
So I’m observing quite a good speed bump without the HIP_UMA flag enabled on 6.10.
EDIT: I noticed the other benches had a CPU backend and not BLAS, so I reran them with the default compiled backend. I may re-run them with clang using the command @Nils_Ponsard posted out of curiosity, but I have to run for the moment.
EDIT 2: Ugh, realized I had a rogue process running during the CPU run. Re-ran it with gcc and it came out about 50% faster… not going to bother with the openBLAS bench again, but I presume that would also be about 50% faster.
EDIT 3: clang seems to be about the same speed, at least with the flags make uses by default. Done with this set of experimentation for now.
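For reference, the clang re-run is just the default CPU build with the compiler overridden on the make command line (a sketch; it assumes llama.cpp's Makefile honors the usual CC/CXX overrides):
make clean
make CC=clang CXX=clang++ -j 8   # plain CPU build, clang instead of gcc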
The only thing to keep in mind while comparing these results to mine is that I ran the tests with the Power Save PPD profile, on battery.
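For anyone re-running these numbers, matching the power profile first is a one-liner with power-profiles-daemon (assuming powerprofilesctl is installed, as it is by default on many distros):
powerprofilesctl list                 # show available profiles and the active one
powerprofilesctl set performance      # or power-saver to reproduce the on-battery numbers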
With Fedora 40, as of now (13/08/2024), we have ROCm 6.1 for gfx1103 and kernel 6.10, which allows use of GTT, so even the latest UMA patch is no longer needed.
With llamafile and the CPU only, you can get:
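To confirm the ROCm runtime actually picks up the iGPU, a quick check with rocminfo is enough (a minimal sketch; the Radeon 780M in these APUs should report gfx1103):
rocminfo | grep -i gfx    # the iGPU should show up as gfx1103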
./llamafile-0.8.12/bin/llamafile-bench -p "32,64,128,256,512" -n "16" -m "Mistral-7B-Instruct-v0.3.BF16.gguf,Mistral-7B-Instruct-v0.3.F16.gguf,Mistral-7B-Instruct-v0.3-Q6_K.gguf,Mistral-7B-Instruct-v0.3-Q8_0.gguf"
cpu_info | model_filename | size | test | t/s |
---|---|---|---|---|
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp32 | 57.01 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp64 | 64.43 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp128 | 68.48 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp256 | 87.93 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp512 | 82.44 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | tg16 | 3.96 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | pp32 | 38.11 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | pp64 | 52.96 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | pp128 | 51.56 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | pp256 | 58.12 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | pp512 | 57.87 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16 | 13.50 GiB | tg16 | 3.98 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | pp32 | 46.23 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | pp64 | 47.29 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | pp128 | 53.08 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | pp256 | 50.50 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | pp512 | 50.93 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 | 7.17 GiB | tg16 | 7.30 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | pp32 | 75.26 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | pp64 | 80.90 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | pp128 | 74.80 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | pp256 | 85.64 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | pp512 | 81.34 |
AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K | 5.54 GiB | tg16 | 9.42 |
And with the GPU via llama.cpp (llamafile-bench works only with the CPU, but we get similar speed):
make clean
# on fedora a rocm-blas is available for our GPU
module load rocm/gfx1103
make -j16 GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1103 [GGML_HIP_UMA=1]
./llama-bench -ngl 99 -p "32,64,128,256,512" -n "16" -m "Mistral-7B-Instruct-v0.3.F16.gguf,Mistral-7B-Instruct-v0.3.Q6_K.gguf,Mistral-7B-Instruct-v0.3.Q8_0.gguf"
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | pp32 | 105.65 ± 0.21 |
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | pp64 | 192.17 ± 0.55 |
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | pp128 | 303.08 ± 0.70 |
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | pp256 | 291.73 ± 1.05 |
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | pp512 | 263.90 ± 0.85 |
llama 7B F16 | 13.50 GiB | 7.25 B | ROCm | 99 | tg16 | 5.32 ± 0.00 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | pp32 | 200.73 ± 0.24 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | pp64 | 106.85 ± 0.15 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | pp128 | 187.87 ± 0.14 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | pp256 | 228.23 ± 0.32 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | pp512 | 239.72 ± 0.68 |
llama 7B Q8_0 | 7.17 GiB | 7.25 B | ROCm | 99 | tg16 | 10.22 ± 0.00 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | pp32 | 158.99 ± 0.72 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | pp64 | 113.29 ± 0.19 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | pp128 | 199.07 ± 0.26 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | pp256 | 235.31 ± 0.65 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | pp512 | 241.80 ± 0.42 |
llama 7B Q6_K | 5.54 GiB | 7.25 B | ROCm | 99 | tg16 | 12.68 ± 0.06 |
For now llamafile is the fastest on CPU; on the GPU both have the same speed when using hipBLAS, and both crash from time to time. llamafile has a tinyBLAS backend for the GPU that does not crash, but it is slower.