Just starting to get my hands dirty with these beasts, building a private AI cluster (3 Framework Desktops) with Proxmox.
First question: I see in the BIOS that VRAM can be set to AUTO, but I can't find anywhere whether that's a sane setting (leaving the OS to choose the amount of RAM based on demand) or whether it's better to set a fixed size here.
According to a really decent YouTuber who does a lot of local LLM reviewing and testing [this guy: https://www.youtube.com/@AZisk/videos and his tests/reviews of the various Strix Halo based systems over the past few months], AUTO runs much slower than a fixed allocation, and he has empirical data/metrics to back that up.
Counterpoint: since he runs one-shot performance tests, the slowdown may only occur during the test period. Once the system expands the memory footprint allocated to the GPU, it may become performant again. However, his evaluation period may never be long enough to see that.
OTOH [i.e., counterpoint to my counterpoint, lol], can LLMs dynamically expand into a growing memory pool? My experience is limited to LM Studio, and there I have to change the Context Length and GPU Offload manually, which always triggers a reload of the model. So not exactly dynamic expansion into a larger GPU memory allocation.
Nothing against Alex, his videos are entertaining enough and he's been learning/improving, but his tests reflect the state of the tools he's using (e.g., whatever the LM Studio defaults are) more than "actual" repeatable benchmarking. Even when he runs llama-bench he hits issues because he doesn't know how to pass the mmap flag (`--help` is a thing), much less the flags for testing the different Vulkan and ROCm backends, which driver versions are in use, etc.
I get that this may reflect how an end-user would approach things, but especially with his audience, it's a bit of a missed opportunity. In any case, I definitely wouldn't treat the results as scientific or make definitive claims from them; they're barely empirical, since they're not repeatable.
Easy ways to test for a perf difference:
- use rocm_bandwidth_test or memtest_vulkan and see if there's a difference in memory bandwidth (MBW) based on allocation (there is not)
- use llama-bench and set -r (repetitions) as high as you want to satisfy your desired margin of error, then compare
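For reference, a rough sketch of what that comparison could look like. Run each step once with VRAM=AUTO and once with a fixed BIOS allocation, then compare the numbers. The model path and token counts below are placeholders, not settings from the original post:

```shell
# 1) Raw memory bandwidth: compare the reported GB/s between BIOS settings.
rocm_bandwidth_test       # ROCm path
memtest_vulkan            # Vulkan path (also stress-tests the pool)

# 2) LLM throughput via llama.cpp's llama-bench:
#    -p/-n = prompt/generation token counts (placeholders here),
#    -ngl  = layers offloaded to the GPU,
#    -mmp  = mmap on/off (worth testing both),
#    -r    = repetitions; raise it for tighter error bars.
llama-bench -m ./model.gguf -p 512 -n 128 -ngl 99 -mmp 0 -r 20
llama-bench -m ./model.gguf -p 512 -n 128 -ngl 99 -mmp 1 -r 20
```

llama-bench prints mean and standard deviation per configuration, so a real difference between AUTO and a fixed allocation should show up well outside the error bars.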
The one situation where you might run into perf differences is if you have a significant amount of memory fragmentation. There's also the problem of memory contention if you're benchmarking with a GUI attached rather than headless. While it's valid to try to account for that, you still need to make it repeatable if you want to make any real claims…
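If you want headless, repeatable runs on a desktop install, one common approach (assuming a systemd-based distro; exact target/service names can vary) is to drop out of the graphical target before benchmarking so the compositor isn't competing for the shared memory pool:

```shell
# Switch to the non-graphical target so no desktop/compositor holds GPU
# memory or competes for bandwidth during the run (systemd distros).
sudo systemctl isolate multi-user.target

# ...run your benchmarks here (llama-bench, rocm_bandwidth_test, etc.)...

# Restore the desktop afterwards.
sudo systemctl isolate graphical.target
```

Running from an SSH session while the box sits at the multi-user target is the easiest way to keep conditions identical between runs.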