VRAM allocation for the 7840U Frameworks

I’ve been wondering the same thing. Would be nice to have that support in the bios.

It does seem to be true that it will dynamically allocate RAM for the GPU as needed (or up to half of system RAM, maybe?), but a lot of the tools query the available memory up front and fail.

On the plus side, enough people are interested in APUs that work is happening on the shared-RAM issues. I saw this the other day:

The code is tiny, and I saw one report that it worked on a 6800HS, but it's definitely a bit of a hack, and it's specific to PyTorch.

I wouldn't want to count on Framework adding these options to the BIOS, so we'll probably have to cross our fingers that shared-RAM support gets more code soon.

I'm still unsure if the max dynamic allocation is half of system RAM, or if that can also be changed on individual laptops, and I already feel like I'm making a lot of tradeoffs to support Framework, haha.

Max dynamic allocation is half, which IIRC is a limitation of either the drivers or the OS (can’t remember which). I’ve seen people mention there’s a workaround to allow more, although I haven’t found it.

Edit: Here’s some discussion about this (including how to override it on Linux).
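
For reference, the workaround I've seen mentioned is raising the GTT limits with kernel parameters. A rough sketch only: the values below assume 64GB of RAM and a ~48GB target, and the parameter names/defaults can differ between kernel versions, so check your distro's docs before relying on it.

# Sketch: let the iGPU use more than half of system RAM as GTT.
# amdgpu.gttsize is in MiB, ttm.pages_limit is in 4 KiB pages (48 GiB here).
# Add to the kernel command line, e.g. in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.gttsize=49152 ttm.pages_limit=12582912"
# then regenerate the grub config and reboot, e.g.:
sudo grub-mkconfig -o /boot/grub/grub.cfg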

@Kyle_Reis
Would you have a link explaining how to make a program/app use dynamically shared VRAM?

It would be nice if we had more options for setting UMA in the BIOS (UMA_AUTO / UMA_GAME_OPTIMIZED).
It looks like there is a hard way to do it:

An A6000 or H100 is good for training and/or running large batch sizes of AI models (LLMs in this case), but for local inference of an LLM (i.e. only 1 batch), memory is mainly what's needed.

open-mixtral-8x7b works on a 16-core Ryzen 5950X (slowly). I can't test it, but I think it would run at a good speed on Zen 4 (AMD Ryzen 9 7950X3D). So I'm pretty sure it can be even faster with the RDNA3 iGPU of the 7840 (U/HS)
(and maybe even better once AMD allows use of the NPU … :crossed_fingers:)

Another example with an "old" APU (pre-RDNA GPU) and Stable Diffusion:
https://www.gabriel.urdhr.fr/2022/08/28/trying-to-run-stable-diffusion-on-amd-ryzen-5-5600g/#allocating-more-vram-to-the-igpu
It looks like some BIOSes allow reserving a user-defined amount of VRAM.

Indeed. And, for what it’s worth, we have a request to add that to the FW BIOS on the forums here: BIOS Feature Request: Add ability to specify UMA size on AMD APUs

By the way, llama.cpp now supports dynamic VRAM allocation on the APUs: ROCm AMD Unified Memory Architecture (UMA) handling by ekg · Pull Request #4449 · ggerganov/llama.cpp · GitHub

Unfortunately, it doesn’t look like there are any StableDiffusion implementations that do that as well.

And somehow the GPU still performs worse than the CPU itself XD

Really? I didn’t try this llama.cpp version. Do you mind sharing the results you’ve seen?

In my experience, the performance of StableDiffusion was much better on the iGPU than on the CPU alone. But I could only use it within the UMA memory limits, of course.

It was a while ago and I didn't store results. I was playing with Llama 2 70B and got around 2 tokens/s on the CPU and a bit over 1 on the iGPU. I did verify that it was using the GPU; amdgpu_top showed full load and appropriate VRAM usage. I don't really know what I am doing, though.

Dynamic memory allocation looked like it worked
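
For anyone who wants to double-check the same thing, the split between dedicated VRAM and GTT (shared RAM) can also be read from sysfs while the model is loaded. A rough sketch, assuming the amdgpu device is card0 (the index may differ on your system):

# values are in bytes; card index may differ on your system
cat /sys/class/drm/card0/device/mem_info_vram_used
cat /sys/class/drm/card0/device/mem_info_vram_total
cat /sys/class/drm/card0/device/mem_info_gtt_used
cat /sys/class/drm/card0/device/mem_info_gtt_total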

I’ve also tried llama.cpp, and yeah it does seem like it’s quite a bit slower on the GPU vs just on the CPU on the 7840U.

On the GPU with really large models I had the GPU crash (I forget which kernel/driver versions I had, though); I only did a short test.

In theory, with 96GB of memory I could run really large models, but they take a long time right now, and I haven't really found a use case to explore this further.
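
One thing that might help with the really large models is partial offload, i.e. keeping only some of the layers on the iGPU and the rest on the CPU, so the GPU allocation stays smaller. A rough sketch with llama-bench; the layer count and model path are placeholders:

# offload only ~20 layers to the iGPU instead of everything (-ngl 99);
# model path and layer count are placeholders, adjust to taste
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llama-bench -m models/70B/model.Q4_K_M.gguf -ngl 20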

I just ran some tests with a 7B model. The GPU version compiled with the LLAMA_HIP_UMA=ON option outperforms the CPU by an order of magnitude on prompt processing, ~172 t/s vs ~15 t/s (this is on battery, in the Power Save profile):

llama-bench using ROCm on the iGPU
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 llama-bench -m models/7B/llama-2-7b-chat.Q8_0.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | pp 512     |    171.59 ± 1.90 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | tg 128     |      8.58 ± 0.06 |

versus

llama-bench on the CPU
$ llama-bench -m models/7B/llama-2-7b-chat.Q8_0.gguf
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | pp 512     |     14.68 ± 0.53 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | tg 128     |      5.77 ± 0.03 |

build: unknown (0)

Compiled without the UMA option, the model doesn’t fit into memory:

llama-bench using ROCm on the iGPU, no dynamic VRAM allocation
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 llama-bench -m models/7B/llama-2-7b-chat.Q8_0.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6695.84 MiB on device 0: cudaMalloc failed: out of memory
main: error: failed to load model 'models/7B/llama-2-7b-chat.Q8_0.gguf'

Didn’t try 70B yet - not sure it’ll fit at all. I “only” have 64GB total.

Which processor do you have? I may do some testing this weekend if I have time.

7840U

What command did you use to build llama.cpp to obtain these numbers?

with

Build command
cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

I only obtain these results:

iGPU result
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | pp 512     |     70.93 ± 1.07 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | tg 128     |      8.11 ± 0.19 |

build: b4e4b8a9 (2724)

That's less than half the performance on the pp 512 test, and this was plugged into the wall in the high-performance profile.

Here are my CPU results:

CPU results
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | pp 512     |     49.75 ± 0.69 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | tg 128     |      7.64 ± 0.13 |

I have two 32GB DIMMs installed, running Arch Linux with kernel 6.8.7.

Also, I noticed when running phi3 with Ollama in high-performance power mode that a HIP_UMA-patched version I built (~12 t/s) is slower than the CPU version (~20 t/s). The model is also small enough to fit in 4GB of VRAM, and then I get around 26 t/s. This was with a simple prompt, no benchmark though.

Once that PR is merged (update HIP_UMA #7399 by Djip007 · Pull Request #7414 · ggerganov/llama.cpp · GitHub), I expect the same result (or really close) with VRAM as with HIP_UMA.

But we may need more tests.

This might be of interest: https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs

Looks like ROCm on 6.10-rc1 can now automatically allocate VRAM from the GTT. I only tried SD so far, not LLMs, but it worked without any need to change the standard packages or force GTT memory allocation.
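
If anyone wants to check whether their setup picks this up, a quick sanity check is the kernel version plus the memory pool sizes ROCm reports for the iGPU agent. Just a sketch; the grep pattern may need adjusting to your rocminfo output:

uname -r                             # should be 6.10 or newer
rocminfo | grep -i -A 4 "pool info"  # memory pools ROCm reports per agent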

Curious how llama.cpp performance would compare.

This is also something I’ve been interested in, especially with news of the patch @Wrybill_Plover linked; so I popped on the linux-mainline kernel on my Arch install (currently 6.10rc3-1) and compiled llama.cpp from the current HEAD as of today (172c825). Notably, since I’m on the 6.10 release candidate, I did not use the HIP_UMA flag. All these runs were made in performance mode.

llama-bench on the iGPU
# compiled with `HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_HIPBLAS=1 -j 8`
rufo@framework-linux (git)-[master]-% HSA_OVERRIDE_GFX_VERSION="11.0.0" ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 |         pp512 |    259.26 ± 0.70 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 |         tg128 |     10.69 ± 0.13 |

build: 172c8256 (3145)
HSA_OVERRIDE_GFX_VERSION="11.0.0" ./llama-bench -m   73.24s user 1.50s system 100% cpu 1:14.63 total
CPU (openBLAS)
# compiled with `HSA_OVERRIDE_GFX_VERSION="11.0.0" make LLAMA_OPENBLAS=1 -j 8`
rufo@framework-linux (git)-[master]-% ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | BLAS       |       8 |         pp512 |     13.44 ± 0.07 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | BLAS       |       8 |         tg128 |      6.99 ± 0.20 |

build: 172c8256 (3145)
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf  3987.82s user 11.98s system 1247% cpu 5:20.56 total
CPU (no flags)
# compiled with `make -j 8`
rufo@framework-linux (git)-[master]-% ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |       8 |         pp512 |     44.67 ± 1.07 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |       8 |         tg128 |      7.72 ± 0.13 |

build: 172c8256 (3145)
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf  1535.49s user 9.35s system 794% cpu 3:14.55 total

So I’m observing quite a good speed bump without the HIP_UMA flag enabled on 6.10.

EDIT: I noticed the other benches had a CPU backend and not BLAS, so I reran them with the default compiled backend. I may re-run them with clang using the command @Nils_Ponsard posted out of curiosity, but I have to run for the moment.

EDIT 2: Ugh, realized I had a rogue process on the CPU run :person_facepalming: Re-ran it w/gcc and it came out about 50% faster… not going to bother with the openBLAS bench again but presume that would also be about 50% faster.

EDIT 3: clang seems to be about the same speed, at least with the flags make used by default. Done with this set of experimentation for now :slight_smile:

The only thing to keep in mind while comparing these results to mine is that I ran the tests with the Power Save PPD profile, on battery.

With Fedora 40 we now (13/08/2024) have ROCm 6.1 for gfx1103 and kernel 6.10, which allows use of GTT, so even the latest UMA patch is not needed :wink:

With llamafile, on CPU only, you can get:

./llamafile-0.8.12/bin/llamafile-bench -p "32,64,128,256,512" -n "16" -m "Mistral-7B-Instruct-v0.3.BF16.gguf,Mistral-7B-Instruct-v0.3.F16.gguf,Mistral-7B-Instruct-v0.3-Q6_K.gguf,Mistral-7B-Instruct-v0.3-Q8_0.gguf"
| cpu_info                    | model_filename                |      size | test  |   t/s |
| --------------------------- | ----------------------------- | --------: | ----- | ----: |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp32  | 57.01 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp64  | 64.43 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp128 | 68.48 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp256 | 87.93 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | pp512 | 82.44 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.BF16 | 13.50 GiB | tg16  |  3.96 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | pp32  | 38.11 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | pp64  | 52.96 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | pp128 | 51.56 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | pp256 | 58.12 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | pp512 | 57.87 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3.F16  | 13.50 GiB | tg16  |  3.98 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | pp32  | 46.23 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | pp64  | 47.29 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | pp128 | 53.08 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | pp256 | 50.50 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | pp512 | 50.93 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q8_0 |  7.17 GiB | tg16  |  7.30 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | pp32  | 75.26 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | pp64  | 80.90 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | pp128 | 74.80 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | pp256 | 85.64 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | pp512 | 81.34 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-7B-Instruct-v0.3-Q6_K |  5.54 GiB | tg16  |  9.42 |

and with the GPU / llama.cpp (llamafile-bench works only on the CPU, but we get similar speed):

make clean
# on fedora a rocm-blas is available for our GPU
module load rocm/gfx1103 
make -j16 GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1103 [GGML_HIP_UMA=1]

./llama-bench -ngl 99 -p "32,64,128,256,512" -n "16" -m "Mistral-7B-Instruct-v0.3.F16.gguf,Mistral-7B-Instruct-v0.3.Q6_K.gguf,Mistral-7B-Instruct-v0.3.Q8_0.gguf"
| model         |      size | params | backend | ngl | test  |            t/s |
| ------------- | --------: | -----: | ------- | --: | ----- | -------------: |
| llama 7B F16  | 13.50 GiB | 7.25 B | ROCm    |  99 | pp32  |  105.65 ± 0.21 |
| llama 7B F16  | 13.50 GiB | 7.25 B | ROCm    |  99 | pp64  |  192.17 ± 0.55 |
| llama 7B F16  | 13.50 GiB | 7.25 B | ROCm    |  99 | pp128 |  303.08 ± 0.70 |
| llama 7B F16  | 13.50 GiB | 7.25 B | ROCm    |  99 | pp256 |  291.73 ± 1.05 |
| llama 7B F16  | 13.50 GiB | 7.25 B | ROCm    |  99 | pp512 |  263.90 ± 0.85 |
| llama 7B F16  | 13.50 GiB | 7.25 B | ROCm    |  99 | tg16  |    5.32 ± 0.00 |
| llama 7B Q8_0 |  7.17 GiB | 7.25 B | ROCm    |  99 | pp32  |  200.73 ± 0.24 |
| llama 7B Q8_0 |  7.17 GiB | 7.25 B | ROCm    |  99 | pp64  |  106.85 ± 0.15 |
| llama 7B Q8_0 |  7.17 GiB | 7.25 B | ROCm    |  99 | pp128 |  187.87 ± 0.14 |
| llama 7B Q8_0 |  7.17 GiB | 7.25 B | ROCm    |  99 | pp256 |  228.23 ± 0.32 |
| llama 7B Q8_0 |  7.17 GiB | 7.25 B | ROCm    |  99 | pp512 |  239.72 ± 0.68 |
| llama 7B Q8_0 |  7.17 GiB | 7.25 B | ROCm    |  99 | tg16  |   10.22 ± 0.00 |
| llama 7B Q6_K |  5.54 GiB | 7.25 B | ROCm    |  99 | pp32  |  158.99 ± 0.72 |
| llama 7B Q6_K |  5.54 GiB | 7.25 B | ROCm    |  99 | pp64  |  113.29 ± 0.19 |
| llama 7B Q6_K |  5.54 GiB | 7.25 B | ROCm    |  99 | pp128 |  199.07 ± 0.26 |
| llama 7B Q6_K |  5.54 GiB | 7.25 B | ROCm    |  99 | pp256 |  235.31 ± 0.65 |
| llama 7B Q6_K |  5.54 GiB | 7.25 B | ROCm    |  99 | pp512 |  241.80 ± 0.42 |
| llama 7B Q6_K |  5.54 GiB | 7.25 B | ROCm    |  99 | tg16  |   12.68 ± 0.06 |

For now llamafile is the fastest on CPU; on the GPU both have the same speed when using hipBLAS, and both crash from time to time. llamafile has a tinyBLAS backend for the GPU that does not crash, but it is slower :wink: