VRAM allocation for the 7840U Frameworks

Which processor do you have? I may do some testing this weekend if I have time.

7840U

What command did you use to build llama.cpp to obtain these numbers?

I built it with:

Build command
cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
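
As a sanity check (not required for the numbers below, just how I'd verify the setup), rocminfo should list the iGPU; on the 7840U it reports gfx1103, which is why the HSA_OVERRIDE_GFX_VERSION=11.0.0 override is used when running llama-bench.

rocminfo check
# Confirm the ROCm runtime sees the 780M and which ISA it reports (gfx1103 here)
/opt/rocm/bin/rocminfo | grep -E "Marketing Name|gfx"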

I only obtain these results:

iGPU result
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | pp 512     |     70.93 ± 1.07 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | ROCm       |  99 | tg 128     |      8.11 ± 0.19 |

build: b4e4b8a9 (2724)

That’s less than half the performance on the pp 512 test, and this was with the machine plugged into the wall on the high-performance power profile.

Here are my CPU results:

CPU results
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | pp 512     |     49.75 ± 0.69 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |          8 | tg 128     |      7.64 ± 0.13 |
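
For reference, a CPU-only run can be forced on the same ROCm build by disabling GPU offload; something along these lines (8 threads, -ngl 0) should be equivalent to the table above — a sketch, not necessarily the exact invocation.

CPU-only run
./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf -ngl 0 -t 8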

I have two 32 GB DIMMs installed and I'm running Arch Linux with kernel 6.8.7.

I also noticed, when running phi3 with ollama in high-performance power mode, that a HIP_UMA-patched version I built (~12 t/s) is slower than the CPU version (~20 t/s). The model is also small enough that I can fit it in 4 GB of VRAM, which gets me around 26 t/s. This was with a simple prompt, though, not a proper benchmark.
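
In case it's useful for comparing allocations, the amdgpu sysfs counters report the dedicated BIOS carve-out (VRAM) versus GTT (shared system memory) usage while a model is loaded; values are in bytes, and the card index may be card1 instead of card0 on other systems.

VRAM / GTT usage
# Dedicated VRAM (BIOS UMA carve-out) vs GTT usage, in bytes
grep . /sys/class/drm/card0/device/mem_info_vram_total \
       /sys/class/drm/card0/device/mem_info_vram_used \
       /sys/class/drm/card0/device/mem_info_gtt_used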