Hey folks, I ordered the entry-level Framework Desktop with the Ryzen AI Max 385 and 32 GB RAM as a mini gaming PC for our living room. Usually I'd say that 12 to 16 GB VRAM and 16 to 20 GB system RAM fit well for mid-level gaming in Full HD. Now I'm asking myself: is it actually possible to assign that much "dedicated" VRAM to the iGPU, or are there restrictions?
The Strix Halo APU uses a unified memory model, much like Apple's M-series chips.
The amount of RAM that can be allocated to the iGPU is variable, from as low as 1 GB (citation needed, but I'm pretty sure that's the low end of the allocation) up to nearly the maximum available to the system. In practice, my understanding is that you can assign up to "system memory minus 8 GB" to the GPU when running Windows, and rather more under Linux when not using a desktop environment. (Citation needed; this is just my understanding from the video reviews and articles about the Framework Desktop and other Strix Halo devices.)
So a 16/16 GB split, as you suggest, should be more than achievable.
The BIOS has reserved about 4.36 GB (including the 2 GB VRAM carve-out), so that leaves me with about 123.64 GB, and I've allocated 122 GB.
My guess (on Linux; I'm not sure about Windows as I don't use it) is all of it, assuming Strix Halo uses the same or a similar memory-architecture concept as Phoenix.
Using Gentoo (Linux 6.16.0) with KDE Plasma 6.3.6 on Wayland.
Thanks for pointing to the site; I wasn't aware of it. So up to 24 GB of VRAM can be assigned to the iGPU on the basic machine with its 32 GB RAM under Windows. That sounds nice.
I don't believe this is properly documented anywhere, but on Linux, amdgpu.gttsize used to be the way to assign GTT for AMD APUs; that parameter is now deprecated. The appropriate way to assign things now is via TTM.
In the BIOS, GART (the base VRAM carve-out) can be set to 512 MB or similarly low, and you can use something like /etc/modprobe.d/amdgpu_llm_optimized.conf to load module options. Here's setting the max GTT pool to 120 GB (with 60 GB preallocated):
# Maximize GTT for LLM usage on 128GB UMA system
options amdgpu gttsize=120000
options ttm pages_limit=31457280
options ttm page_pool_size=15728640
Note: the ttm options are specified in 4 KiB pages, if you want to do your own math.
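To sanity-check those numbers, here's a tiny sketch of the GiB-to-pages conversion (the gib_to_pages helper is just mine for illustration; the kernel only wants the raw page counts):

```shell
# Convert a GiB target to 4 KiB pages for the ttm options above:
# GiB -> KiB (x 1024 x 1024), then divide by 4 KiB per page.
gib_to_pages() {
  echo $(( $1 * 1024 * 1024 / 4 ))
}

gib_to_pages 120   # -> 31457280 (pages_limit for a 120 GiB GTT cap)
gib_to_pages 60    # -> 15728640 (page_pool_size for a 60 GiB pool)
```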
Also, while it's not massive, I've found a 5-10% memory bandwidth (MBW) boost when setting amd_iommu=off (tested with memtest_vulkan and ROCm Bandwidth Test).
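For reference, on a GRUB-based distro that kernel parameter would go on the command line something like this (file path and regeneration command vary by distro; this is a sketch, not a tested recipe for every setup):

```shell
# /etc/default/grub -- append amd_iommu=off to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"

# Then regenerate the GRUB config and reboot, e.g.:
#   sudo update-grub                                # Debian/Ubuntu-style
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg     # Fedora/openSUSE-style
```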
Some additional real-world usage notes if you are using something llama.cpp based for inferencing:
by default llama.cpp sets mmap=on, which is marginally bad for Vulkan model loading once a model exceeds half of memory, but is catastrophic for ROCm model loading. You should use the appropriate command-line option (it differs between the llama.cpp binaries) to disable mmap in general on Strix Halo
you should also be sure to use -ngl 99 (or 999 if the model has more than 99 layers). While memory is shared, 1) CPU memory bandwidth is basically half that of the GPU due to Strix Halo's memory architecture (I've heard it's down to the GMI link design, but the practical part is that you're going to have much lower CPU vs GPU MBW), and 2) having the model in GPU address space is of course more efficient for GPU inferencing
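Putting both notes together, a typical invocation might look like this (the binary name and model path are placeholders for your own setup; --no-mmap and -ngl are the relevant llama.cpp flags):

```shell
# Disable mmap'd loading (slow/catastrophic under ROCm on Strix Halo)
# and offload all layers to the iGPU. Paths are placeholders.
llama-server \
  -m /models/your-model.gguf \
  --no-mmap \
  -ngl 99
```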
VRAM is dedicated memory for the iGPU. Yes, it surprised me too: it's faster than GTT. The reasons are simple: because it's allocated at system startup, it is one linear chunk, it is generally mapped as a one-to-one aperture, and the APU's memory-access subsystem is specifically tuned for GPU usage, including bypassing the cache.
GTT, on the other hand, even when used by the iGPU, is still “mapped” memory. For all practical purposes, this means it must translate addresses and that it may disrupt the TLB when the CPU uses it concurrently. The benefit is that if you’re not using your desktop primarily as a low-power server or LLM box sitting in the corner, GTT allows the CPU to use all that memory when the GPU isn’t using it.
This may require some empirical testing to see if there's any impact on modern AMD APUs like Strix Halo. A quick search (and a more thorough follow-up) suggests that if the mapping isn't being flushed and there isn't fragmentation, then there probably isn't a perf difference.
I'm running sweeps on my box at the moment, so I'll leave it to someone else to test if they care (although in practice, since you can't expand GART to where GTT goes, even if there is a difference, the tradeoff is that you can't access full memory from your GPU in that case).
I see no difference between preallocated GPU memory and unified memory. I tested this on my HP ZBook Ultra G1a 14 (also a Strix Halo APU). Preallocation is useful for older apps that just look at the memory the GPU reports; some applications otherwise think there isn't enough memory.