VRAM allocation for the 7840U Frameworks

I just ran some tests with a 7B model. The build compiled with the LLAMA_HIP_UMA=ON option (rough build commands below the tables) beats the CPU build by an order of magnitude on prompt processing, ~172 t/s vs ~15 t/s, and token generation goes from ~5.8 to ~8.6 t/s (all of this on battery, Power Save profile):

llama-bench using ROCm on the iGPU
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 llama-bench -m models/7B/llama-2-7b-chat.Q8_0.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | pp 512     |    171.59 ± 1.90 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | tg 128     |      8.58 ± 0.06 |

versus

llama-bench on the CPU
$ llama-bench -m models/7B/llama-2-7b-chat.Q8_0.gguf
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | pp 512     |     14.68 ± 0.53 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | tg 128     |      5.77 ± 0.03 |

build: unknown (0)

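For completeness, a rough sketch of the build commands. The exact option names depend on the llama.cpp version, and the LLAMA_HIPBLAS flag, gfx target and /opt/rocm clang paths below are the usual ROCm defaults rather than anything specific to my machine, so adjust for your install. The HSA_OVERRIDE_GFX_VERSION=11.0.0 in the runs above just makes ROCm treat the officially unsupported gfx1103 iGPU as gfx1100.

building llama.cpp with ROCm and unified memory (UMA)
$ cmake -B build -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON \
      -DAMDGPU_TARGETS=gfx1100 \
      -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang \
      -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
$ cmake --build build --config Release -j
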
Compiled without the UMA option, the ~6.7 GiB of weights have to fit into the fixed VRAM carve-out rather than shared system memory, and the allocation fails:

llama-bench using ROCm on the iGPU, no dynamic VRAM allocation
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 llama-bench -m models/7B/llama-2-7b-chat.Q8_0.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6695.84 MiB on device 0: cudaMalloc failed: out of memory
main: error: failed to load model 'models/7B/llama-2-7b-chat.Q8_0.gguf'

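As I understand it, that's because without UMA the HIP backend only allocates from the dedicated VRAM carve-out set in the BIOS, while the UMA build can spill into the shared GTT pool (ordinary system RAM). Both pool sizes are visible in amdgpu's sysfs counters (card0 may be a different index on your machine):

checking the VRAM carve-out vs. the shared GTT pool
$ cat /sys/class/drm/card0/device/mem_info_vram_total   # dedicated carve-out, in bytes
$ cat /sys/class/drm/card0/device/mem_info_gtt_total    # shared GTT pool, in bytes
$ cat /sys/class/drm/card0/device/mem_info_gtt_used     # GTT currently in use
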
Didn’t try a 70B model yet - not sure it’ll fit at all. I “only” have 64 GB total (rough size estimate below).
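
Scaling from the 7B Q8_0 numbers above gives a back-of-envelope estimate (pure arithmetic, nothing I've actually measured; real usage also needs room for the KV cache and compute buffers):

back-of-envelope 70B size estimates
$ echo "70 / 6.74 * 6.67" | bc -l   # ~69 GiB for a 70B at Q8_0 - too big for 64 GB
$ echo "69 * 4.85 / 8.5" | bc -l    # ~39 GiB at roughly 4.85 bits/weight (a Q4_K_M-style quant) - might fit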
