BIOS Feature Request: Add ability to specify UMA size on AMD APUs

Xal · May 17, 2024, 3:07pm

Stable Diffusion works miles better with the extra RAM. 512x512 takes seconds instead of minutes, and I can do 1080x1080 in about a minute. Granted, the tests I have done used simple prompts and were not benchmarked, but it uses about 8GB and works with no issues. I stopped using --novram. It is a similar situation with LM Studio or plain llama.cpp (although llama.cpp added support for iGPUs via a flag).

Wrybill_Plover · May 18, 2024, 2:54am

I was able to use Smokeless to successfully set a 16GB UMA Frame Buffer as well! The main difference I have noticed so far is being able to do some outpainting in InvokeAI, which previously always ran out of memory. I haven’t had a chance to try larger models or images yet.

However, what is strange is that I’m not seeing the boot entries you’re seeing, @Xal, either on the BIOS Boot Manager screen or in the efibootmgr output. Like you, I was using the Beta version, booting it from a FAT32-formatted USB. Not sure what we’re doing differently.

Wrybill_Plover · May 18, 2024, 9:36am

More great news: I was able to follow the advice from here, and use force-host-alloction-APU to get Stable Diffusion to dynamically allocate VRAM from the GTT, without the need to do anything with the UMA.

SDXL models worked: generating 1024x1024 images didn’t run out of memory or crash, and the speed was similar to what I saw after successfully using Smokeless, maybe just a tad slower. That is about 1.9it/s for SD-1 at 512x512, and for SDXL about 3.5s/it at 1024x1024 and ~1.3it/s at 512x512.
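Since the figures above mix it/s and s/it, a small conversion helps when comparing them. This is only a sketch: the two units are reciprocals of each other, and the function names are made up for illustration.

```python
def seconds_per_iteration(its_per_second: float) -> float:
    """Convert a rate in it/s to a duration in s/it (the two are reciprocals)."""
    if its_per_second <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / its_per_second


def iterations_per_second(seconds_per_it: float) -> float:
    """Convert a duration in s/it to a rate in it/s."""
    if seconds_per_it <= 0:
        raise ValueError("duration must be positive")
    return 1.0 / seconds_per_it


# Figures quoted in the post above.
for label, rate in [("SD-1 512x512", 1.9), ("SDXL 512x512", 1.3)]:
    print(f"{label}: {rate} it/s = {seconds_per_iteration(rate):.2f} s/it")
print(f"SDXL 1024x1024: 3.5 s/it = {iterations_per_second(3.5):.2f} it/s")
```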

Xal · May 18, 2024, 8:29pm

Thanks, this is very useful. It covers most of my needs. I guess the one thing pending would be legacy apps that can’t be updated to accommodate iGPUs.

Xal · May 18, 2024, 8:32pm

try adding them and see if something happens:

Boot0003* UEFI Misc Device      VenHw(77e79a1e-e1fb-491f-a7c1-fa1b5412532a){auto_created_boot_option}
      dp: 01 04 14 00 1e 9a e7 77 fb e1 1f 49 a7 c1 fa 1b 54 12 53 2a / 7f ff 04 00
    data: 4e ac 08 81 11 9f 59 4d 85 0e e2 1a 52 2c 59 b2
Boot0004* UEFI Misc Device 2    VenHw(1c54c333-24ff-4506-a9d6-0a624e09ae7e){auto_created_boot_option}
      dp: 01 04 14 00 33 c3 54 1c ff 24 06 45 a9 d6 0a 62 4e 09 ae 7e / 7f ff 04 00
    data: 4e ac 08 81 11 9f 59 4d 85 0e e2 1a 52 2c 59 b2
Boot0005* UEFI Misc Device 3    VenHw(8f1c1ac6-fbc0-4dab-a8be-b412a13c8b45){auto_created_boot_option}
      dp: 01 04 14 00 c6 1a 1c 8f c0 fb ab 4d a8 be b4 12 a1 3c 8b 45 / 7f ff 04 00
    data: 4e ac 08 81 11 9f 59 4d 85 0e e2 1a 52 2c 59 b2
Boot0006* UEFI Misc Device 4    VenHw(7517821f-d9e1-44c1-a75a-d054cef3f8f8){auto_created_boot_option}
      dp: 01 04 14 00 1f 82 17 75 e1 d9 c1 44 a7 5a d0 54 ce f3 f8 f8 / 7f ff 04 00
    data: 4e ac 08 81 11 9f 59 4d 85 0e e2 1a 52 2c 59 b2
Boot0007* UEFI Misc Device 5    VenHw(e0ba9b98-dd2d-4434-bb94-599cc9e4305d){auto_created_boot_option}
      dp: 01 04 14 00 98 9b ba e0 2d dd 34 44 bb 94 59 9c c9 e4 30 5d / 7f ff 04 00
    data: 4e ac 08 81 11 9f 59 4d 85 0e e2 1a 52 2c 59 b2

I booted from Smokeless a few times in a row, and I also spent a few minutes going through the menus. Maybe that was enough time for it to add the entries? I will check my history to link the post where some people saw the same behavior I did.

Adding the entries should work if my assumptions hold: that no binaries are involved, and that “data” is the combo for the hidden menu (as on other motherboards).

Xal · May 19, 2024, 6:53pm

Did you get a libstdc++ error at some point when compiling? Also, did you ever get a malloc thingy to be ignored by Python?

Wrybill_Plover · May 19, 2024, 9:39pm

No. The only thing I needed to compile was forcegttalloc.c, I compiled it with hipcc following the instructions on the GitHub page, and it compiled without any errors or warnings.

However, I did get a libstdc++ loading error at first, when I tried running InvokeAI through the provided invoke.sh script with the LD_PRELOAD variable set:

📦[user@rocm invokeai]$ LD_PRELOAD=../force-host-alloction-APU/libforcegttalloc.so HSA_OVERRIDE_GFX_VERSION=11.0.0 ./invoke.sh 
dirname: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory

Instead of troubleshooting the path references in the script, I just activated the venv manually, and executed invokeai-web directly, like this:

(.venv) 📦[user@rocm invokeai]$ LD_PRELOAD=../force-host-alloction-APU/libforcegttalloc.so HSA_OVERRIDE_GFX_VERSION=11.0.0 invokeai-web

It managed to find the libraries just fine then.
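The same launch can be sketched from Python, which makes it easy to check that the preload library exists before starting. The library path and GFX version are the ones from the commands above; the helper names are hypothetical. Note that LD_PRELOAD is inherited by every process the launched command starts, which is consistent with the error above coming from dirname inside invoke.sh rather than from InvokeAI itself.

```python
import os
import subprocess
from pathlib import Path

# Path taken from the commands above; adjust it to your own checkout.
PRELOAD_LIB = Path("../force-host-alloction-APU/libforcegttalloc.so")


def build_env(preload_lib: Path, gfx_version: str = "11.0.0", base=None) -> dict:
    """Return a copy of the environment with the preload and GFX override set."""
    env = dict(os.environ if base is None else base)
    env["LD_PRELOAD"] = str(preload_lib)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_version
    return env


def launch(command: list[str], preload_lib: Path = PRELOAD_LIB) -> int:
    """Run `command` with the preload applied; refuse if the library is missing."""
    if not preload_lib.is_file():
        raise FileNotFoundError(f"preload library not found: {preload_lib}")
    # Resolve to an absolute path so the child does not depend on its cwd.
    env = build_env(preload_lib.resolve())
    return subprocess.run(command, env=env).returncode


# Example (run from inside the activated venv):
#     launch(["invokeai-web"])
```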

As for the malloc question, I’m not sure what you mean. I have only tried this with InvokeAI so far, not directly from Python, and didn’t get any memory-related errors (that I noticed).

Xal · May 20, 2024, 2:01am

I can’t get it to compile within the virtual environment. With sudo it can’t find libstdc++, even when I pass the directory with -L. Without sudo it gives me a permission error that won’t go away, even when using a temp folder or changing the permissions to allow everyone.

I can compile outside the environment, but of course that is no good: when I put the result in the environment it gets ignored, as it can’t find the libraries in the right place.

I’m sure I’m drowning in a puddle. I guess I’m too tired; I will retry everything in the next few days.

Are you using a container? (Anything else aside from the Python venv?)

Wrybill_Plover · May 20, 2024, 3:15am

Yes. I’m using distrobox, which I set up mostly following the guide I mentioned here. I didn’t need to add anything to the system to compile the module: I just got the source from the force-host-alloction-APU repo, and used the hipcc already installed in the distrobox.

I don’t think you need sudo for anything other than the initial installation of ROCm or other system-wide packages. I didn’t use sudo for anything at all this time around, since I already had the distrobox set up a few months ago, when I first experimented with InvokeAI on the Framework.

Xal · May 20, 2024, 5:23am

Finally managed it. For whatever reason the linker was not finding the library, so I ended up just making a symlink to trick it. This solution seems to work better than the Smokeless UMA solution. Even with --highvram (UMA) it kept crashing: as it moved models in and out of memory, the iGPU would reset when VRAM got low (depending on what I was doing), which made the whole thing get stuck.

Wrybill_Plover · May 20, 2024, 10:21am

Yes, I’m pretty happy with it too, so far. Really thankful to Carlos Segura for putting the memory allocation module together!

Speaking of crashing, I did have the entire Hyprland session crash on me a few times, as I was running generations on different models and resolutions. But it could well be just the general instability of the other components involved. In my experience, Stable Diffusion implementations tend to be not particularly stable, as a rule…

Wrybill_Plover · May 29, 2024, 8:54am

Even more good news! Automatic native VRAM allocation from GTT in the AMDKFD driver (which is used by ROCm, etc.) on APUs was pushed to the upcoming Linux 6.10, and is already present in 6.10-rc1!

Phoronix reported on it here: https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs.

I just tried 6.10-rc1, and was able to run SDXL generations out of the box, with no changes to the tool set, or the dynamic library loading tricks mentioned above!

Xal · June 2, 2024, 3:04am

I’m glad to read this. Thanks for pointing it out.

On SD crashing: yes, some of the components cause that. I have also observed that it will crash if Python thinks there is not enough VRAM (even with the workarounds). Although the RAM used with the module is the system’s, the whole thing will just crash and reset the graphics and session if there is not “enough” VRAM. The most stable setup I have been able to run is with the UMA set to 16GB plus the allocation module. With this I once ran a batch of 100+ images at 1024x1024.

Batch might not be the right word as I queued a bunch of images. Been trying to get some art concepts done as I am getting into game dev again.

Djip · August 13, 2024, 3:13pm

I can confirm it works! Fedora 40 now has kernel 6.10.3, and it works out of the box.

If needed, you can change the default GTT size with a kernel parameter (the default is 1/2 of RAM):

# for 16 GB (the value is in MiB):
amdgpu.gttsize=16384
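The value of amdgpu.gttsize is given in MiB, so 16384 corresponds to 16 GiB. A minimal sketch of the arithmetic, plus a read-back of what the driver currently reports, is below. It assumes the GPU is card0 (it may be card1 on some systems) and that the amdgpu sysfs file mem_info_gtt_total is present; the helper names are hypothetical.

```python
from pathlib import Path


def gttsize_param(gib: int) -> str:
    """Build the kernel parameter for a GTT size given in GiB (the value is in MiB)."""
    if gib <= 0:
        raise ValueError("size must be positive")
    return f"amdgpu.gttsize={gib * 1024}"


def read_gtt_total_gib(card: str = "card0"):
    """Return the GTT size reported by amdgpu in GiB, or None if it cannot be read."""
    path = Path("/sys/class/drm") / card / "device" / "mem_info_gtt_total"
    try:
        # The file holds a single integer: the GTT size in bytes.
        return int(path.read_text().strip()) / 1024**3
    except (OSError, ValueError):
        return None


print(gttsize_param(16))
total = read_gtt_total_gib()
print("current GTT:", "unknown" if total is None else f"{total:.1f} GiB")
```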

Justin_Weiss · August 31, 2024, 9:42pm

How have you been running this? I’ve tried ComfyUI and stable-diffusion-webui off and on over the last few months, and have never got it to generate more than one image before crashing my entire session, even using kernel 6.10.7 and ROCm 6.1.

Without specifying HSA_OVERRIDE_GFX_VERSION, I get “HIP error: invalid device function.” If I specify HSA_OVERRIDE_GFX_VERSION=11.0.0, it will rarely generate an image but will usually flash the screen black and crash the whole desktop session, needing a restart. Looking through github issues, I thought it was just that there was no combination of kernel / pytorch / ROCm that would work on AMD APUs, but it sounds like it’s working for some people?

Wrybill_Plover · September 1, 2024, 4:06am

Yes, it worked for me on AMD APUs, under more than one configuration (different laptops, different amount of system RAM and RAM utilization settings, different distros, kernel versions, etc.). A few different Web UIs as well, although I still haven’t got to trying out Comfy.

I mentioned how I got a working setup on a FW13 AMD here. That was a while ago, but the same thing works with a newer version of ROCm as well. And, with a newer kernel, PyTorch is no longer limited to seeing only the RAM reserved by the UMA buffer window as VRAM.
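One way to see how much memory PyTorch believes it has is to ask it directly. The sketch below assumes a ROCm build of PyTorch, which exposes the GPU through the torch.cuda namespace; what the reported total reflects (the UMA carve-out or GTT-backed memory) depends on the kernel and ROCm versions in use. The helper names are hypothetical.

```python
try:
    import torch  # ROCm builds of PyTorch expose the GPU through torch.cuda
except ImportError:
    torch = None


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


def report_visible_vram() -> str:
    """Describe the free and total device memory as seen by PyTorch."""
    if torch is None:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch does not see a GPU"
    free, total = torch.cuda.mem_get_info()  # both values are in bytes
    name = torch.cuda.get_device_name(0)
    return f"{name}: {to_gib(free):.1f} GiB free of {to_gib(total):.1f} GiB"


print(report_visible_vram())
```

If the total matches the UMA buffer size rather than something closer to the GTT size, the allocation change discussed above is probably not in effect for that environment.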

Could you tell us more about your system and settings? Maybe we’ll be able to spot what is causing the crashes. Or, try to follow the instructions linked in that post, and see where that gets you.

Justin_Weiss · September 1, 2024, 6:21pm

Sure thing. I followed that post to see if it would help, or at least make things more consistent.

I’m on an Arch host (6.10.7-arch1-1), running AUTOMATIC1111 in an Ubuntu 24.04 distrobox.

I just added amdgpu.sg_display=0 to the kernel parameters, which seems to have made things more stable. After a GPU reset, the system is much more likely to recover when I have that parameter. I have UMA set to game mode, but it didn’t seem to make a difference either way.

512x512 generates pretty consistently without problems. 768x768 and higher will usually cause a GPU reset sometime during generation – the screen goes blank, KWin tells me it had to restart, and SD stops generating until I kill and restart it (before I added the sg_display parameter, this would usually hang the whole system and show video corruption)

The distrobox is running ROCm 6.2, but I’ve seen the same symptoms since 5.7. I’m running with HSA_OVERRIDE_GFX_VERSION=11.0.0 ./webui.sh

I also tried the iommu=soft kernel parameter, but it didn’t seem to make a difference. I also tried setting /sys/class/drm/card0/device/power_dpm_force_performance_level to high, but that caused the entire machine to shut off when I started a generation. I also ran memtest86+, which passed.
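When comparing setups it helps to confirm which of these settings are actually active. The sketch below reads /proc/cmdline and the sysfs file named above. It assumes the display parameter is spelled amdgpu.sg_display on the command line and that the GPU is card0; the helper names are hypothetical, and the simple whitespace split does not handle quoted values containing spaces.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_text_or_none(path: str):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


params = parse_cmdline(read_text_or_none("/proc/cmdline") or "")
for name in ("amdgpu.sg_display", "iommu", "amdgpu.gttsize"):
    print(f"{name} = {params.get(name, '<not set>')}")

dpm_path = "/sys/class/drm/card0/device/power_dpm_force_performance_level"
print("dpm level:", read_text_or_none(dpm_path))
```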

dmesg shows this during a reset:

[ 1214.308511] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 1214.308516] amdgpu 0000:c1:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[ 1214.308518] amdgpu 0000:c1:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 1214.308520] amdgpu 0000:c1:00.0: amdgpu: Failed to evict queue 1
[ 1214.308526] amdgpu: Failed to evict process queues
[ 1214.308527] amdgpu: Failed to quiesce KFD
[ 1214.308550] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[ 1214.398324] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:231
[ 1214.465471] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:223
[ 1216.474195] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 1216.474203] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1218.478556] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 1218.478567] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1220.483074] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 1220.483084] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1220.755651] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[ 1220.757538] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[ 1220.792892] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 1220.793481] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[ 1220.793599] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[ 1220.796664] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[ 1220.798728] [drm] DMUB hardware initialized: version=0x08004000
[ 1220.810819] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:223
[ 1220.813168] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:231
[ 1220.815548] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:239
[ 1220.817927] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:247
[ 1221.308284] [drm] kiq ring mec 3 pipe 1 q 0
[ 1221.310941] amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 1221.311527] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 1221.311529] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 1221.311530] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 1221.311531] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 1221.311533] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 1221.311533] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 1221.311534] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 1221.311536] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 1221.311537] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 1221.311538] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 1221.311539] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 1221.311541] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 1221.311542] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ 1221.313645] amdgpu 0000:c1:00.0: amdgpu: recover vram bo from shadow start
[ 1221.313646] amdgpu 0000:c1:00.0: amdgpu: recover vram bo from shadow done
[ 1221.313662] amdgpu 0000:c1:00.0: amdgpu: GPU reset(7) succeeded!

It’s a similar error to this one, which is why I thought it wasn’t possible to get to work on these APUs.
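To compare crashes across kernel and ROCm versions, the reset sequence in the dmesg output above can be summarized mechanically. This is a rough sketch that relies only on the message strings visible in that log; the function name and the returned structure are made up for illustration.

```python
import re

# "[ 1214.308511] message" -> timestamp in seconds since boot, then the message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")
# Matches the final "GPU reset(7) succeeded!" line, not the intermediate
# "GPU reset succeeded, trying to resume" line.
RESET_DONE_RE = re.compile(r"GPU reset\(\d+\) succeeded")


def summarize_resets(lines):
    """Count MES timeouts and pair each 'GPU reset begin' with its final success."""
    mes_timeouts = 0
    resets = []  # list of (begin_ts, done_ts)
    begin_ts = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        ts, msg = float(match.group(1)), match.group(2)
        if "MES failed to respond" in msg:
            mes_timeouts += 1
        elif "GPU reset begin" in msg:
            begin_ts = ts
        elif RESET_DONE_RE.search(msg) and begin_ts is not None:
            resets.append((begin_ts, ts))
            begin_ts = None
    return {"mes_timeouts": mes_timeouts, "resets": resets}


# A few lines copied from the log above.
SAMPLE = [
    "[ 1214.308511] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE",
    "[ 1214.308550] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!",
    "[ 1216.474195] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE",
    "[ 1220.792892] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume",
    "[ 1221.313662] amdgpu 0000:c1:00.0: amdgpu: GPU reset(7) succeeded!",
]

summary = summarize_resets(SAMPLE)
print("MES timeouts:", summary["mes_timeouts"])
for begin, done in summary["resets"]:
    print(f"reset began at {begin:.3f}s and took {done - begin:.2f}s")
```

To run it on a live system, the saved output of dmesg can be read into a list of lines and passed to summarize_resets in place of SAMPLE.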

Wrybill_Plover · September 15, 2024, 1:46am

I am not using Arch these days, but to investigate this, I did some testing on Arch directly, as well as in Ubuntu 22.04 and 24.04 distrobox environments on NixOS, and also directly on NixOS. Admittedly, it’s a bit of an apples to oranges to peaches (?) comparison, on top of not being particularly rigorous. But here are my observations, for what it’s worth:

The Ubuntu 22.04 distrobox with Python 3.10 and ROCm 5.6, under NixOS, seemed to be the most stable. InvokeAI (the version I installed a while ago - not sure about the most up-to-date one) was fine generating images up to 1024x1024 and using models up to 7.2GB in size (the largest ones I had that my InvokeAI supported). I was able to crash it, eventually, but I would say it was almost dependable, for text2image inference.

ComfyUI was easier to crash, but still handled those same cases OK. With larger models, such as Flux and Flux-derived ones, it was a mixed success: some crashed on loading, some in VAE, and some successfully produced 1024x1024 images.

Forge was more successful with Flux than Comfy, but still crashed on loading the larger sizes. The error messages on crashes were similar to the ones listed by @Justin_Weiss, and in the AMD bug report he linked.

The Ubuntu 24.04 distrobox with ROCm 6.1 and bare NixOS with ROCm 6.0 were less stable still. 6.1 was probably the worst. 6.0 felt more usable, and I was able to generate 1024x1024 images with (moderately) large SDXL models in ComfyUI, although not reliably. 6.1 would crash almost immediately.

On Arch, the same pattern remained: ROCm 6.0 was more stable than 6.1, and ComfyUI was able to generate larger images with the SDXL and other ~6-7GB models, but would also crash every now and then.

All in all, the state of memory handling throughout the amdgpu/ROCm/SD stack, and the reliability and fit of the different component versions still leave much room for improvement. Hopefully, the queue-related bug report referenced above will be addressed - although it’s not clear if the problem is, indeed, in amdgpu, or elsewhere in the stack. Newer versions of ROCm seem to handle issues worse than the older ones on this platform, but, perhaps it’s also a matter of configuration - I used everything with default settings - or maybe of a version match with other components…

I did not have amdgpu.sg_display=0 set on either NixOS or Arch: I removed that a couple of kernel versions ago, and adding it back, at least on NixOS, didn’t improve the stability. My system is an FW13 with an AMD 7840U and 64GB of RAM. The kernel versions were 6.10.7 on NixOS and 6.10.8 on Arch.

Also, a useful tip: if your inference is constantly crashing, or is suddenly running very slow, regardless of what you do, clean the ~/.config/miopen folder. I spent several hours trying to understand why my ComfyUI setup, which had previously worked under ROCm 6.0, started to run excruciatingly slow after a crash, before finally tracing it to the memory cache in that folder. Ironically, the presence of the cache files does make loading larger models more stable - so, if you have to delete them, running an inference with a smaller model first might help to load a larger model later…
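Since the cache files also seem to help stability, a cautious variant of this tip is to move the folder aside instead of deleting it, so it can be restored if things get worse. A minimal sketch follows; the function name and the backup naming scheme are made up for illustration.

```python
import shutil
import time
from pathlib import Path


def move_cache_aside(cache_dir: Path):
    """Rename cache_dir to a timestamped backup next to it.

    Returns the backup path, or None if the cache folder does not exist.
    """
    cache_dir = cache_dir.expanduser()
    if not cache_dir.is_dir():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = cache_dir.with_name(f"{cache_dir.name}.bak-{stamp}")
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.move(str(cache_dir), str(backup))
    return backup


# Example, using the folder named in the post above:
#     move_cache_aside(Path("~/.config/miopen"))
```

The cache folder is expected to be recreated on the next run; if the backup turns out not to be needed, it can be deleted by hand.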

Nick_Heyart · September 25, 2024, 2:36am

I can also vouch for this method: I was able to allocate 8GB of VRAM and play games with higher texture quality on Linux.

For anyone else doing this, go to GFX Config, set IGPU config to UMA specified, then set the framebuffer size to the desired amount of VRAM. Do not just save settings and boot back into the OS after this, since that will trigger the BIOS to re-apply the 4GB setting. You need to back out into the main menu, then use the boot manager to select your OS and boot it that way.

Yan-Fa_Li · October 16, 2024, 7:21am

I would love this feature. Then folks could run LLMs or Stable Diffusion on a Framework; what a great selling point, especially now that 48GiB modules are officially supported. Let’s use that RAM!