BIOS Feature Request: Add ability to specify UMA size on AMD APUs

This came up a few times in other threads - putting it here for better visibility.

Many Linux applications, ROCm and PyTorch in particular, rely entirely on the UMA buffer for their VRAM needs instead of allocating it dynamically from shared memory. Allowing users to select a larger UMA frame buffer size in the BIOS would help immensely with using the FW13 AMD for ML/AI applications. Ideally, the options should cover allocating 16GB, if not more.

Providing such an option would make the FW13 AMD a great alternative to M[1-3]-based MacBooks for ML-related projects.

9 Likes

I would add that it would be amazing if that memory parameter could be exposed to the Linux kernel through a kernel interface, but I’m not sure it is possible to dynamically reallocate UMA buffers.

Generally speaking, I’m all for having more settings available to users in the BIOS. It doesn’t have to be something that the “average” user would touch, but for those who want to tweak their system to get more out of it, if it’s possible, then let’s hope Framework gives them the choice.

It’s probably a while away at best, given the priority of the Intel 12th Gen BIOS, security issues like LogoFAIL, and the bring-up of the FW16. Let’s hope this is an area in general where Framework can scale up their resources as they grow as a business.

2 Likes

There is at least the option to switch between “auto” (~512M) and “game optimized” (4096M) in the 3.03 BIOS - or do you mean more presets/the ability to choose freely?

Yes, I mean being able to select the UMA size, at least up to 16GB. It doesn’t have to be a free-form entry; it can be a set of presets. On some BIOSes for AMD systems, each successive preset doubles the size of the UMA frame buffer carveout compared to the previous one. I don’t believe that’s set in stone - it should be possible to have the firmware use any predefined values. If doubling the size is not a requirement, then options for 8GB, 12GB, 16GB, and 24GB would be ideal (in addition to the current ones for 512MB and 4GB).

An alternative to this would be exposing the carveout size selection as a kernel parameter to amdgpu, and having that value passed on to the firmware, effectively overriding the selection in the BIOS. Such an approach would be much better overall, as it would remove the dependency on a specific BIOS implementation and benefit more AMD APU-based systems. I’m not sure whether that’s a viable path, however. Maybe @Mario_Limonciello, or somebody else from AMD, could comment (or, perhaps, pass this to Alex Deucher)?

2 Likes

VRAM size must be specified by the BIOS. The kernel can artificially limit it if necessary (see amdgpu.vramlimit and amdgpu.vis_vramlimit) but can’t make it any bigger.
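To see how the BIOS carveout shows up on the Linux side, you can read the amdgpu memory pool sizes from sysfs: `mem_info_vram_total` is the static carveout and `mem_info_gtt_total` is the dynamic shared pool. This is a minimal sketch, assuming an amdgpu device at `/sys/class/drm/card0` (the card index may differ on your system).

```python
# Sketch: read the amdgpu VRAM/GTT pool sizes from sysfs to compare the
# static BIOS carveout (VRAM) with the dynamic shared pool (GTT).
# Assumes an amdgpu card at /sys/class/drm/card0; the index may differ.
from pathlib import Path

def read_mem_info(path):
    """Return the size in bytes reported by an amdgpu mem_info file, or None."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None

def fmt_gib(nbytes):
    """Format a byte count as GiB with two decimals."""
    return f"{nbytes / 2**30:.2f} GiB"

for name in ("mem_info_vram_total", "mem_info_gtt_total"):
    size = read_mem_info(f"/sys/class/drm/card0/device/{name}")
    print(name, "=", fmt_gib(size) if size is not None else "unavailable")
```

With the “game optimized” preset, `mem_info_vram_total` should report roughly 4 GiB regardless of how much of it is actually in use.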

3 Likes

Thanks a lot for the quick response, @Mario_Limonciello! I knew that the current driver parameters only allow limiting the sizes, but I was hoping there was a way to change this in the future.

Given that no alternatives exist, I really hope that the feature request will be heeded by the powers that be. It would be hugely appreciated.

I do not think that is realistic at all. The whole point is to do static allocation and reserve all that memory away from the CPU. There is already a way to do this much more flexibly, at runtime, with fully dynamic allocation in what the GPU considers “shared” memory. It’s just that the software you are using either is not using a modern API that handles all of this, or is too stupid to comprehend that it is running on an iGPU, where the distinction between “local” memory and “shared” memory is practically irrelevant.
So the change should, and eventually will, come from the software, which simply will no longer have a need for statically allocated memory - similar to how Intel already dictates it.

AMD may have a vested interest in keeping their stuff (ROCm) from running on iGPUs so they can sell Pro GPUs with lots of memory. And other things are probably mostly developed by and for enterprise customers that do not care about using iGPUs, so they just do not want to spend development resources on something that would only make sense for iGPUs. But I do not think any software actually needs to use the UMA buffer for any sensible reason.

I wonder if it would be possible to have a shim program that dynamically allocates a set amount of memory and then launches another program with a spoofed static allocation; that would be quite a nice workaround if it were possible.

1 Like

Sorry for going off topic. I found that Windows + WSL2 + DirectML is currently the best way to utilize the iGPU for ML, and the shared memory works like a charm, although DirectML has the problem of limited framework support and unsupported operators compared to ROCm.
Part of the reason I bought the FW13 is to try out AMD/ROCm and run a local LLM. I tried to get ROCm running on Ubuntu 22.04, but it seems gfx1103 isn’t supported with ROCm 6. Have you had any luck getting ROCm working?

Did you try setting the HSA_OVERRIDE_GFX_VERSION environment variable? I haven’t used ROCm 6 yet, but setting HSA_OVERRIDE_GFX_VERSION=11.0.0 works fine to get gfx1103 support on ROCm 5.6 and 5.7.
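One detail worth noting: the override has to be in the process environment before the ROCm runtime loads. A sketch of doing that from inside a Python script (setting the variable before importing torch, which is kept commented out here so the snippet stands alone):

```python
# Sketch: HSA_OVERRIDE_GFX_VERSION must be set before the ROCm runtime is
# loaded, i.e. before importing torch. Exporting it in the shell works too.
import os

# Report the gfx1103 iGPU as gfx1100 (11.0.0), which ROCm 5.6/5.7 support.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"

# Only import torch after the override is in place:
# import torch
# print(torch.cuda.is_available())
```

Equivalently, `export HSA_OVERRIDE_GFX_VERSION=11.0.0` in the shell before launching the application.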

I was able to use a couple of Stable Diffusion web UIs on NixOS on the FW13 AMD, using an Ubuntu-based distrobox and following this guide: ROCm / HIP cookbook for any distro! Tested with Blender and Stable Diffusion on Tumbleweed with AMD Radeon 7600 - Open Chat - openSUSE Forums

The VRAM limit will likely be even more impactful for LLMs than it is for SD, however.

Can confirm, this is a major limitation for ML workloads.

In the BIOS, there are strings for a third VRAM setting, “UMA_SPECIFIED”, and UMA sizes all the way from 64MB up to 16384MB. I’ve tried to modify the EFI variables associated with these parameters, but they always seem to reset to UMA_AUTO if you modify them.

For third-party software like PyTorch, just change the source code. There is already a plugin that uses the most primitive API to just ask for any memory (the allocation seems to be about one line).

For proprietary applications like games, I’d expect there is already a lot of shimming going on in the GPU driver itself or in other translation layers, so putting another layer in between might very much get in the way of that. And if the only solution is to lie to a game that is no longer being maintained, then the driver is simply the one that should do the lying.

Also, the whole API model is based around local memory having a fixed size, so you’d have to pick a fake fixed size to lie to the program about. And I do not know all the ways in which the lower-level APIs allow handling data that is supposedly already in local memory; that works very differently in the more generic API. I could imagine that it is not always possible to translate from low-level to high-level in a performant way.

Interesting. I didn’t see those in the HII database, using the UniversalAMDFormBrowser.

@Matt_Hartley, @Kieran_Levin, can a request for this please be logged with Insyde? I get that, ideally, ML libraries should be able to allocate VRAM dynamically from GTT, but, unfortunately, we’re not seeing much progress there. The BIOS fix looks like an easy stopgap and would be very much appreciated!

Just looking at the strings in the .rsrc section, both the CbsSetupSmmPhx and CbsSetupDxePhx EFI binaries have strings for three settings: UMA_SPECIFIED, UMA_AUTO, and UMA_GAME_OPTIMIZED.

These correspond to values 1, 2, and 3, I believe, with 0 disabling UMA.

There’s also UMA Legacy vs. non-Legacy.
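The mode values inferred above can be captured as a small lookup table. Note the numeric mapping (0 disabled, 1 through 3 as below) is this thread’s reverse-engineered guess, not anything documented by AMD:

```python
# Sketch: UMA mode values as reverse engineered in this thread (unofficial).
UMA_MODES = {
    0: "UMA_DISABLED",
    1: "UMA_SPECIFIED",
    2: "UMA_AUTO",
    3: "UMA_GAME_OPTIMIZED",
}

def uma_mode_name(value):
    """Map a raw mode byte to its name, or mark it unknown."""
    return UMA_MODES.get(value, f"unknown (0x{value:02x})")

print(uma_mode_name(2))  # UMA_AUTO, the ~512M default preset
```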

Using ifrextractor, there are tons of hidden menu options, but even in the hidden menu, UMA_SPECIFIED is not available.
The UMA FB size can be set to auto with 0xffffffff. It also supports values from 64MB to 16GB just by setting those integers.

These variables are exposed via the AmdSetupPhx EFI variable, which is accessible from Linux with efivar.

The UMA mode is specified at offset 0x17c as a byte, the UMA legacy version at 0x17d as a byte (0, 1, or 0xff are valid), and the UMA FB size at 0x17e as 4 bytes.
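For anyone poking at a dump of that variable, here is a sketch of decoding those three fields. The offsets are the ones reported above from reverse engineering (not official documentation), and little-endian byte order is an assumption based on what is usual for UEFI payloads:

```python
# Sketch: decode the UMA fields from a raw dump of the AmdSetupPhx variable.
# Offsets (0x17c mode byte, 0x17d legacy byte, 0x17e 4-byte FB size) come
# from this thread's reverse engineering; little-endian order is assumed.
import struct

def parse_uma_fields(blob: bytes):
    """Return (mode, legacy, fb_size) decoded from an AmdSetupPhx dump."""
    mode = blob[0x17C]
    legacy = blob[0x17D]
    (fb_size,) = struct.unpack_from("<I", blob, 0x17E)
    return mode, legacy, fb_size

# Synthetic dump: mode 1 (UMA_SPECIFIED), legacy 0, FB size 4096 (i.e. 4GB
# if the unit is MB, which is a guess based on the presets seen in the BIOS).
blob = bytearray(0x200)
blob[0x17C] = 0x01
blob[0x17D] = 0x00
struct.pack_into("<I", blob, 0x17E, 4096)
print(parse_uma_fields(bytes(blob)))  # (1, 0, 4096)
```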

There’s also a UmaCarveOutDefault that sets the UMA mode. I’ve gotten the system to crash on boot by setting this to 0x1, since the 64MB frame buffer is insufficient to boot; it then automatically resets to 0x2.

If you look more into it, you’ll see that, unfortunately, the plugin is only a proof of concept. It’s not sufficient for getting a PyTorch-based application, such as one of the Stable Diffusion web UIs, to work with dynamic VRAM allocation. But you’re right in principle: modifying the ML frameworks and libraries to work correctly with shared memory on APUs would do it.

Sorry, I’m really confused as to which API(s) you are referring to here. But “picking a fixed size to lie about” shouldn’t be a problem, as long as it can be specified as a parameter somewhere. Even a boot-time parameter would be acceptable, if not ideal.

Since the carveout is handled during system init, before the kernel is loaded, any changes to those values should be done in the BIOS/UEFI, right? How were you able to modify that setting?

I just do not like this on principle, because there is no way for it to fail gracefully when you run out of memory. So you either have to be extremely conservative, blocking the use of all of that memory and only allowing a single user (similar to how the BIOS does it right now), or, when thin provisioning is in play (just like Linux handles normal memory), you run into things like the necessity of an OOM killer. And yes, that would work for everybody who is happy right now to just set UMA to a giant size and reboot. But it would be plain wasteful when it could just be done dynamically. PCs are fundamentally multi-user and multi-process. You might not care that only a single process can use that workaround at a time, but that is why I do not think it would be an efficient use of development time.

I did not look that deeply into it, but yes, I saw that it may not work for certain use cases. As I understand it, this comes down to the plugin only providing allocate and deallocate functionality (each basically a one-liner) and nothing else. Certain software that expects to micromanage local memory might want to query how much memory is available (which is basically nonsense for this driver-managed kind of memory) in order to manually swap data in and out or determine what will still fit. Or it might try to run other operations that are simply unsupported on that type of driver-managed memory, or use it in ways that circumvent the guards the driver uses to ensure the memory is actually available when and where it is needed. Although this should not actually matter when it is only used with an iGPU that can access all of system memory coherently.

The efivar utility on Linux is able to read and set UEFI variables.

You can also read EFI variables directly from sysfs, although that way you’ll get a few extra bytes at the start representing the variable’s attributes.

I was able to iteratively modify settings in the BIOS and dump the EFI variables to get an idea of which settings do what, using various reverse-engineering and UEFI tools along the way.

I’m not sure it’ll be possible to modify the UMA allocation purely with variable modifications, though; it may require a patched BIOS, and it’s not yet clear to me what is overriding my modifications.

Pardon my ignorance, @the_artist, but does being able to write the variables through efivar mean that they actually persist beyond the current runtime state? Or might a number of them be set by the BIOS, based on the values of the others, on every boot?

In any case, very interesting results.