Which processor do you have? I may do some testing this weekend if I have time.
7840U
What command did you use to build llama.cpp to obtain these numbers?
With this build command:
cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
I only obtain these results:
iGPU result
$ HSA_OVERRIDE_GFX_VERSION=11.0.0 ./llama-bench -m ~/Downloads/llama-2-7b-chat.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | pp 512 | 70.93 ± 1.07 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | tg 128 | 8.11 ± 0.19 |
build: b4e4b8a9 (2724)
That’s less than half the performance on the pp 512 test, and this was with the laptop plugged into the wall and set to the high-performance power profile.
Here are my CPU results:
CPU results
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | pp 512 | 49.75 ± 0.69 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | tg 128 | 7.64 ± 0.13 |
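To put the two tables side by side, here is a minimal Python sketch that parses llama-bench result rows and prints the iGPU/CPU throughput ratio per test. The helper name `parse_bench_rows` is made up for illustration, and the parsing assumes the column layout shown in the tables above (test name in the second-to-last column, `mean ± stddev` in the last); the rows are copied verbatim from this post.

```python
def parse_bench_rows(text):
    """Map each test name (e.g. 'pp 512') to its mean tokens/s.

    Hypothetical helper: assumes the llama-bench markdown layout shown
    in this post, with 'mean ± stddev' in the last column.
    """
    results = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip header and separator rows, which carry no '±' value.
        if len(cells) < 2 or "±" not in cells[-1]:
            continue
        mean = float(cells[-1].split("±")[0])
        results[cells[-2]] = mean
    return results


# Rows copied from the iGPU (ROCm) and CPU tables in this post.
IGPU_ROWS = """
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | pp 512 | 70.93 ± 1.07 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | ROCm | 99 | tg 128 | 8.11 ± 0.19 |
"""
CPU_ROWS = """
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | pp 512 | 49.75 ± 0.69 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 8 | tg 128 | 7.64 ± 0.13 |
"""

igpu = parse_bench_rows(IGPU_ROWS)
cpu = parse_bench_rows(CPU_ROWS)

# Ratio > 1 means the iGPU run was faster than the CPU run.
speedup = {test: igpu[test] / cpu[test] for test in igpu}
for test, ratio in speedup.items():
    print(f"{test}: iGPU is {ratio:.2f}x the CPU rate")
```

On these numbers the iGPU comes out at roughly 1.43x the CPU for prompt processing (pp 512) and roughly 1.06x for token generation (tg 128).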
I have two 32 GB DIMMs installed and am running Arch Linux with kernel 6.8.7.
I also noticed, when running phi3 with ollama in high-performance power mode, that a HIP_UMA-patched version I built (~12 t/s) is slower than the CPU version (~20 t/s). The model is also small enough to fit in 4 GB of VRAM, where I get around 26 t/s. This was with a simple prompt, not a benchmark, though.