VRAM allocation for the 7840U Frameworks

I just ran some tests with a 7B model. The build compiled with the LLAMA_HIP_UMA=ON option (rough build commands below the tables) beats the CPU build by an order of magnitude on prompt processing, ~172 t/s vs ~15 t/s, and token generation goes from ~5.8 to ~8.6 t/s (all of this on battery, Power Save profile):

llama-bench using ROCm on the iGPU
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 llama-bench -m models/7B/llama-2-7b-chat.Q8_0.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | pp 512     |    171.59 ± 1.90 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | tg 128     |      8.58 ± 0.06 |

versus

llama-bench on the CPU
$ llama-bench -m models/7B/llama-2-7b-chat.Q8_0.gguf
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | pp 512     |     14.68 ± 0.53 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | tg 128     |      5.77 ± 0.03 |

build: unknown (0)

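For completeness, a rough sketch of the build commands. The exact option names depend on the llama.cpp version, and the LLAMA_HIPBLAS flag, gfx target and /opt/rocm clang paths below are the usual ROCm defaults rather than anything specific to my machine, so adjust for your install. The HSA_OVERRIDE_GFX_VERSION=11.0.0 in the runs above just makes ROCm treat the officially unsupported gfx1103 iGPU as gfx1100.

building llama.cpp with ROCm and unified memory (UMA)
$ cmake -B build -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON \
      -DAMDGPU_TARGETS=gfx1100 \
      -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang \
      -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
$ cmake --build build --config Release -j
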
Compiled without the UMA option, the ~6.7 GiB of weights have to fit into the fixed VRAM carve-out rather than shared system memory, and the allocation fails:

llama-bench using ROCm on the iGPU, no dynamic VRAM allocation
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 llama-bench -m models/7B/llama-2-7b-chat.Q8_0.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6695.84 MiB on device 0: cudaMalloc failed: out of memory
main: error: failed to load model 'models/7B/llama-2-7b-chat.Q8_0.gguf'

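As I understand it, that's because without UMA the HIP backend only allocates from the dedicated VRAM carve-out set in the BIOS, while the UMA build can spill into the shared GTT pool (ordinary system RAM). Both pool sizes are visible in amdgpu's sysfs counters (card0 may be a different index on your machine):

checking the VRAM carve-out vs. the shared GTT pool
$ cat /sys/class/drm/card0/device/mem_info_vram_total   # dedicated carve-out, in bytes
$ cat /sys/class/drm/card0/device/mem_info_gtt_total    # shared GTT pool, in bytes
$ cat /sys/class/drm/card0/device/mem_info_gtt_used     # GTT currently in use
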
Didn’t try a 70B model yet - not sure it’ll fit at all. I “only” have 64 GB total (rough size estimate below).
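
Scaling from the 7B Q8_0 numbers above gives a back-of-envelope estimate (pure arithmetic, nothing I've actually measured; real usage also needs room for the KV cache and compute buffers):

back-of-envelope 70B size estimates
$ echo "70 / 6.74 * 6.67" | bc -l   # ~69 GiB for a 70B at Q8_0 - too big for 64 GB
$ echo "69 * 4.85 / 8.5" | bc -l    # ~39 GiB at roughly 4.85 bits/weight (a Q4_K_M-style quant) - might fit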
