Just starting to get my hands dirty with these beasts, building a private AI cluster (3 Framework Desktops) with Proxmox.
First question: I see in the BIOS that VRAM can be set to AUTO, but I can't find anywhere whether that's a sane setting (leaving the OS to choose the amount of RAM based on demand) or whether it's better to set a fixed size here.
According to a really decent YouTuber who does a lot of local LLM reviewing and testing [this guy: https://www.youtube.com/@AZisk/videos and his tests/reviews of the various Strix Halo based systems over the past few months], AUTO runs much slower than a fixed allocation, and he has empirical data/metrics to back that up.
Counterpoint: since he runs one-shot performance tests, the slowdown may only occur during the test period. Once the system expands the memory footprint allocated to the GPU, it may become performant again. However, his evaluation period may never be long enough to see that.
OTOH [i.e., counterpoint to my counterpoint, lol], can LLMs dynamically expand into a growing memory pool? My experience is limited to LM Studio, and there I have to change the Context Length and GPU Offload manually, which always triggers a reload of the model. So not exactly dynamic expansion into a larger GPU memory allocation.
Nothing against Alex, his videos are entertaining enough and he's been learning/improving, but his tests reflect the state of the tools he's using (e.g., whatever the LM Studio defaults are) more than "actual" repeatable benchmarking. Even when he runs llama-bench he hits issues because he doesn't know how to pass the mmap flag (`--help` is a thing), much less the flags for testing the different Vulkan and ROCm backends, which driver versions are in use, etc.
I get that this may reflect how an end-user would approach things, but especially with his audience, it's a bit of a missed opportunity. In any case, I definitely wouldn't treat the results as scientific or make definitive claims from them; they're barely empirical, since they're not repeatable.
Easy ways to test for a perf difference:
- use rocm_bandwidth_test or memtest_vulkan and see if there's a difference in memory bandwidth (MBW) based on allocation (there is not)
- use llama-bench and set -r (repetitions) as high as you want to satisfy your desired margin of error, then compare
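For reference, a rough sketch of what that comparison could look like. Run each step once with VRAM=AUTO and once with a fixed BIOS allocation, then compare the numbers. The model path and token counts below are placeholders, not settings from the original post:

```shell
# 1) Raw memory bandwidth: compare the reported GB/s between BIOS settings.
rocm_bandwidth_test       # ROCm path
memtest_vulkan            # Vulkan path (also stress-tests the pool)

# 2) LLM throughput via llama.cpp's llama-bench:
#    -p/-n = prompt/generation token counts (placeholders here),
#    -ngl  = layers offloaded to the GPU,
#    -mmp  = mmap on/off (worth testing both),
#    -r    = repetitions; raise it for tighter error bars.
llama-bench -m ./model.gguf -p 512 -n 128 -ngl 99 -mmp 0 -r 20
llama-bench -m ./model.gguf -p 512 -n 128 -ngl 99 -mmp 1 -r 20
```

llama-bench prints mean and standard deviation per configuration, so a real difference between AUTO and a fixed allocation should show up well outside the error bars.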
The one situation where you might run into perf differences is if you have a significant amount of memory fragmentation. There's also the problem of memory contention if you're benchmarking with a GUI attached rather than headless. While it's valid to try to account for that, you still need to make it repeatable if you want to make any real claims…
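If you want headless, repeatable runs on a desktop install, one common approach (assuming a systemd-based distro; exact target/service names can vary) is to drop out of the graphical target before benchmarking so the compositor isn't competing for the shared memory pool:

```shell
# Switch to the non-graphical target so no desktop/compositor holds GPU
# memory or competes for bandwidth during the run (systemd distros).
sudo systemctl isolate multi-user.target

# ...run your benchmarks here (llama-bench, rocm_bandwidth_test, etc.)...

# Restore the desktop afterwards.
sudo systemctl isolate graphical.target
```

Running from an SSH session while the box sits at the multi-user target is the easiest way to keep conditions identical between runs.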