iGPU VRAM - How much can be assigned?

I don’t believe this is properly documented anywhere, but on Linux, amdgpu.gttsize has historically been the way to assign GTT for AMD APUs. That option is now deprecated; the appropriate way to size the pool today is via the TTM module options.

In BIOS, set GART (base VRAM) to 512MB or some other low value, then use a file like /etc/modprobe.d/amdgpu_llm_optimized.conf to pass module options. Here’s an example setting the max GTT pool to 120GB (with a 60GB pre-allocated page pool):

# Maximize GTT for LLM usage on 128GB UMA system
options amdgpu gttsize=120000
options ttm pages_limit=31457280
options ttm page_pool_size=15728640

Note: the ttm options are specified in 4KiB pages if you want to do your own math.
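To sanity-check the values above, the page counts are just GiB converted into 4KiB pages:

```shell
# GiB -> 4KiB pages: GiB * 1024 * 1024 (KiB) / 4 (KiB per page)
echo "pages_limit:    $(( 120 * 1024 * 1024 / 4 ))"   # 120GiB -> 31457280
echo "page_pool_size: $((  60 * 1024 * 1024 / 4 ))"   # 60GiB  -> 15728640
```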

Also, while it’s not massive, I’ve found a 5-10% memory bandwidth (MBW) boost from booting with amd_iommu=off (tested with memtest_vulkan and ROCm Bandwidth Test).
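If you want to try that, note amd_iommu=off is a kernel command-line parameter, not a module option, so it goes on the boot line rather than in modprobe.d. With GRUB (file path and existing flags here are typical-distro assumptions), that would look something like:

```shell
# /etc/default/grub - append amd_iommu=off to the existing kernel command line,
# then regenerate the config (update-grub, or grub2-mkconfig on some distros) and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"
```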

Some additional real-world usage notes if you are using something llama.cpp-based for inference:

  • by default llama.cpp sets mmap=on, which is marginally bad for Vulkan model loading once you go over 1/2 memory usage, but is catastrophic for ROCm model loading. You should use the appropriate command option (it differs between specific llama.cpp binaries) to disable mmap in general on Strix Halo
  • you should also be sure to use --ngl 99 (or 999 if the model has >99 layers). While memory is shared, 1) CPU memory bandwidth is roughly half of the GPU’s due to Strix Halo’s memory architecture (I’ve heard due to GMI link design, but the practical upshot is that you’re going to have much lower CPU vs GPU MBW) and 2) having the model in GPU address space is of course more efficient for GPU inferencing
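Putting both of those together, a launch might look like the following (binary name and model path are placeholders; llama-cli, llama-server, and llama-bench all take the same flags):

```shell
# Disable mmap and offload all layers to the iGPU (use -ngl 999 if the model has >99 layers)
llama-server -m /path/to/model.gguf --no-mmap -ngl 99
```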