VRAM size must be specified by the BIOS. The kernel can artificially limit it if necessary (see amdgpu.vramlimit and amdgpu.vis_vramlimit) but can't make it any bigger.
Thanks a lot for the quick response, @Mario_Limonciello! I knew that the current driver parameters only allow limiting the sizes, but I was hoping there was a way to change this in the future.
Given that no alternatives exist, I really hope that the feature request will be heeded by the powers that be. It would be hugely appreciated.
I do not think that is realistic at all. The whole point is to do static allocation and reserve all that memory away from the CPU. There already is a way to do that much more flexibly, at runtime, with fully dynamic allocation in what the GPU considers "shared" memory. It's just that the software you are using is either not using a modern API that handles all of this, or is too stupid to comprehend that it is using an iGPU, where the distinction between "local" memory and "shared" memory is practically irrelevant.
So the change should and will eventually come from the software, which simply no longer has a need for statically allocated memory, similar to how Intel already dictates it.
AMD may have a vested interest in keeping their stuff (ROCm) from running on iGPUs so they can sell Pro GPUs with lots of memory. And other things are probably mostly developed by and for enterprise customers that do not care about using iGPUs, so they just do not want to spend development resources on something that would only make sense for iGPUs. But I do not think any software actually needs to use the UMA buffer for any sensible reason.
I wonder if it is possible to have a shim program that dynamically allocates a set amount of memory and then launches another one with spoofed static allocation. That would be quite a nice workaround if it were possible.
Sorry for going off topic. I found Windows + WSL2 + DirectML is currently the best way to utilize the iGPU for ML, and the shared memory works like a charm. Although DirectML has the problem of limited framework support and unsupported operators compared to ROCm.
Part of the reason I bought the FW13 is to try out AMD/ROCm and get a local LLM running. I tried to get ROCm running on Ubuntu 22.04, but it seems like gfx1103 isn't supported with ROCm 6. Have you had any luck getting ROCm working?
Did you try setting the HSA_OVERRIDE_GFX_VERSION environment variable? I haven't used ROCm 6 yet, but setting the override HSA_OVERRIDE_GFX_VERSION=11.0.0 works fine to get gfx1103 support on ROCm 5.6 and 5.7.
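In case it helps, here is a minimal sketch of that workaround from Python (assuming a ROCm build of PyTorch; the override has to be set before the HIP runtime initializes, so exporting it in the shell before launching works just as well):

    import os

    # Spoof the GFX target before torch loads the HIP runtime, so gfx1103 is
    # treated as gfx1100 (RDNA3), which has official support.
    os.environ["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"

    import torch

    print(torch.cuda.is_available())       # True if the override took effect
    print(torch.cuda.get_device_name(0))   # should report the gfx1103 iGPU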
I was able to use a couple of StableDiffusion web UIs on NixOS on the FW13 AMD, using an Ubuntu-based distrobox and following this guide: ROCm / HIP cookbook for any distro! Tested with Blender and Stable Diffusion on Tumbleweed with AMD Radeon 7600 - Open Chat - openSUSE Forums
The VRAM limit will likely be even more impactful for LLMs than it is for SD, however.
Can confirm, this is a major limitation for ML workloads.
In the BIOS, there are strings for a third VRAM setting, "UMA_SPECIFIED", and UMA sizes all the way from 64MB up to 16384MB. I've tried to modify the EFI variables associated with these parameters; however, they always seem to reset to UMA_AUTO if you modify them.
For third-party software like PyTorch, just change the source code. There already is a plugin that uses the most primitive API to just ask for any memory (the allocation seems to be about one line).
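I am not sure this is the exact plugin being referred to, but PyTorch's pluggable-allocator hook looks roughly like the sketch below; the gtt_alloc.so library and its gtt_alloc/gtt_free exports are hypothetical stand-ins for that one-line allocator asking the driver for shared/GTT memory (e.g. via hipMallocManaged):

    import torch

    # Route every "VRAM" allocation through a custom shared library instead of
    # the default caching allocator. Must be done before any allocation happens.
    allocator = torch.cuda.memory.CUDAPluggableAllocator(
        "./gtt_alloc.so",   # hypothetical library wrapping the driver's shared-memory allocation
        "gtt_alloc",        # exported malloc-style function
        "gtt_free",         # exported free-style function
    )
    torch.cuda.memory.change_current_allocator(allocator)

    x = torch.empty(1024, 1024, device="cuda")  # now served by the custom allocator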
For proprietary applications like games, I'd expect there is already a lot of shimming going on with the GPU driver itself or other translation layers, so putting another layer in between might very much get in the way of that. And if the only solution is to lie to a game that is no longer being maintained, then the driver is simply the one that should do that.
Also, the whole API model is based around the local memory having a fixed size, so you'd have to pick a fake fixed size to lie to the program about. And I do not know all the ways in which the more low-level APIs allow handling data that is supposedly already in local memory, which works vastly differently in the more generic API. I could imagine that this is not always possible to translate from low-level to high-level in performant ways.
Interesting. I didn’t see those in the HII database, using the UniversalAMDFormBrowser.
@Matt_Hartley , @Kieran_Levin , can a request for this please be logged with Insyde? I get that, ideally, ML libraries should be able to allocate VRAM dynamically from GTT, but, unfortunately, we’re not seeing much progress there. The BIOS fix looks like an easy stopgap, and will be very much appreciated!
Just looking at the strings in the .rsrc section, both the CbsSetupSmmPhx and CbsSetupDxePhx EFI binaries have strings for three settings: UMA_SPECIFIED, UMA_AUTO, and UMA_GAME_OPTIMIZED.
These correspond to values 1, 2, and 3 I believe, with 0 disabling.
There's also UMA Legacy vs. non-Legacy.
Using ifrextractor, there are tons of hidden menu options, but even in the hidden menu, UMA_SPECIFIED is not available.
The UMA FB size can be set to auto with 0xffffffff. It also supports values from 64M to 16GB just by setting those integers.
These variables are exposed via the AmdSetupPhx EFI variable, which is accessible from Linux with efivar.
The UMA mode is specified at offset 0x17c as a byte, the UMA legacy version at 0x17d as a byte (0, 1, or ff are valid), and the UMA FB size at 0x17e as 4 bytes.
There's also a UmaCarveOutDefault that sets the UMA mode. I've gotten the system to crash on boot by setting this to 0x1, since the 64MB frame buffer is insufficient to boot; it then automatically resets to 0x2.
If you look more into it, you'll see that, unfortunately, the plugin is only a proof of concept. It's not sufficient for getting a PyTorch-based application, such as one of the StableDiffusion web UIs, to work with dynamic VRAM allocation. But you're right in principle: modifying the ML frameworks and libraries to correctly work with shared memory on APUs would work.
Sorry, I’m really confused as to which API(s) you are referring to here. But, “picking a fixed size to lie about” shouldn’t be a problem, as long as it can be specified as a parameter somewhere. Even a boot time parameter would be acceptable, if not ideal.
Since the carve-out is handled during system init, before the kernel is loaded, any changes to those values should be done in the BIOS/UEFI, right? How were you able to modify that setting?
I just do not like this on principle, because there is no way for it to fail gracefully when you run out of memory. So you either have to be extremely conservative, blocking the use of all of that memory and only allowing a single user, similar to how the BIOS does it right now, or, when thin provisioning is in play, just like Linux handles normal memory, you run into stuff like the necessity for an OOM killer. And yes, that would work for everybody that is happy right now to just set UMA to a giant size and reboot. But it would be plain wasteful when it could just be done dynamically. PCs are just fundamentally multi-user and multi-process. You might not care that only a single process would be able to use that workaround at a time, but that is exactly why I do not think it would be an efficient use of development time.
I did not look that deeply into it, but yes, I saw that it may not work for certain use cases. As I understood it, this comes down to the plugin only providing allocate-by-size and deallocate functionality (each basically a one-liner) and nothing else. Certain software that expects to micromanage local memory might want to query how much memory is available (which is basically nonsense for this driver-managed kind of memory) in order to manually swap data in and out or determine what will still fit. Or it might try to run other operations that are simply unsupported on that type of driver-managed memory, or use it in ways that circumvent the guards the driver uses to ensure the memory is actually available when and where it is needed. Although this should not actually matter when only used with an iGPU that can just access all of system memory in a coherent way.
The efivar utility on Linux is able to read and set UEFI variables.
You can also directly read EFI variables from the sysfs file system, although you'll get a few extra bytes representing the attributes of the variable that way.
I was able to iteratively modify settings in the BIOS and dump the EFI variables to get an idea of which settings do what, plus use various reverse-engineering and UEFI tools.
I'm not sure it'll be possible to modify the UMA allocation purely with variable modifications, though; it may require a patched BIOS, and it's not yet clear to me what is overriding my modifications.
Pardon my ignorance, @the_artist, but does being able to write the variables through efivar mean that they are actually persisting beyond the current runtime state? Or might there be a number of them that are set by the BIOS on every boot, based on the values of the other ones?
In any case, very interesting results.
Some values persist, some values reset on boot, and some reset on failure. Changing some values also causes the BIOS to change other values.
There's also an initialization vector that resets much of the CBS values to defaults, although it must either be loading the BIOS-stored values from elsewhere or know to skip it.
I've tried altering just about everything that seems like it would adjust VRAM size with no luck. It's likely the BIOS code itself needs to be patched, since it keeps resetting UMA_SPECIFIED to UMA_AUTO on boot. I don't know where auto and game optimized pull their RAM settings from either, although I believe I've had a few boots where auto came up with 4GB VRAM instead of 512MB. I'm not exploring this too methodically yet, though; I was hoping for an easy win.
I did notice some variables in the ConIn section as well as a VGA section are changing as I'm messing with settings, so I may have some other settings to explore. I'm not too worried, since the BIOS seems to reset to factory defaults if it's unable to boot.
This tool might be helpful: [TOOL] SREP (SmokelessRuntimeEFIPatcher) - BIOS Modding Guides and Problems - Win-Raid Forum
Having got ComfyUI working today, I'd like to get it out of lowvram mode. I have 96GB installed and can only use 4GB of it in Stable Diffusion.
I'd really like to see us be able to carve out at least 16GB until PyTorch gets sorted to use GTT.
This would be great. I can volunteer for some testing. I have been looking for a way to do this myself with no luck.
@Kieran_Levin , is there a chance this will be raised with Insyde and/or AMD? Even keeping it a hidden option accessible through Smokeless would work…