BIOS Feature Request: Add ability to specify UMA size on AMD APUs

Yes, it worked for me on AMD APUs, under more than one configuration (different laptops, different amounts of system RAM and RAM utilization settings, different distros, kernel versions, etc.). A few different Web UIs as well, although I still haven't gotten around to trying out Comfy.

I mentioned how I got a working setup on a FW13 AMD here. That was a while ago, but the same thing works with a newer version of ROCm as well. And with a newer kernel, there's no longer the limitation of PyTorch only seeing the RAM reserved by the UMA buffer window as VRAM.
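If you want to check how much memory the GPU can actually use, the amdgpu sysfs counters report the UMA carve-out (VRAM) and the GTT window separately. Something like this should work, although the card index may differ on your machine:

# UMA carve-out configured in the BIOS, in bytes
cat /sys/class/drm/card0/device/mem_info_vram_total
# GTT, i.e. system RAM the GPU can map on top of that
cat /sys/class/drm/card0/device/mem_info_gtt_total
# or, with the ROCm tools installed:
rocm-smi --showmeminfo vram gtt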

Could you tell us more about your system and settings? Maybe we’ll be able to spot what is causing the crashes. Or, try to follow the instructions linked in that post, and see where that gets you.

Sure thing. I followed that post to see if it would help, or at least make things more consistent.

I'm on an Arch host (kernel 6.10.7-arch1-1), running AUTOMATIC1111 in an Ubuntu 24.04 distrobox.

I just added amdgpu.sg_display=0 to the kernel parameters, which seems to have made things more stable: after a GPU reset, the system is much more likely to recover with that parameter. I have UMA set to game mode, but it didn't seem to make a difference either way.
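In case it's useful to anyone, this is roughly how I added it (assuming GRUB; adjust for systemd-boot or whatever bootloader you use):

# /etc/default/grub: append the parameter to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.sg_display=0"

# regenerate the config and reboot
sudo grub-mkconfig -o /boot/grub/grub.cfg

# verify it took effect after the reboot
grep -o 'amdgpu.sg_display=0' /proc/cmdline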

512x512 generates pretty consistently without problems. 768x768 and higher will usually cause a GPU reset sometime during generation: the screen goes blank, KWin tells me it had to restart, and SD stops generating until I kill and restart it. (Before I added the sg_display parameter, this would usually hang the whole system and show video corruption.)

The distrobox is running ROCm 6.2, but I've seen the same symptoms since 5.7. I'm launching with HSA_OVERRIDE_GFX_VERSION=11.0.0 ./webui.sh.
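For anyone trying to reproduce this, the full launch looks roughly like this (the distrobox name is just what I called mine; the 780M in the 7840U reports as gfx1103, which is why the gfx1100 override is needed):

# enter the container and check what ROCm sees
distrobox enter ubuntu-24.04
rocminfo | grep gfx

# launch A1111 with the override
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./webui.sh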

I also tried the iommu=soft kernel parameter, which didn't seem to make a difference. Setting /sys/class/drm/card0/device/power_dpm_force_performance_level to high would cause the entire machine to shut off when I started a generation. A memtest86+ run passed.
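For reference, this is the sort of command I mean for the performance level (card0 may be a different index on other systems, and on mine forcing it to high made the machine power off mid-generation, so use it with care):

# force the GPU clocks to the highest level (resets to auto on reboot)
echo high | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level

# put it back without rebooting
echo auto | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level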

dmesg shows this during a reset:

[ 1214.308511] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 1214.308516] amdgpu 0000:c1:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[ 1214.308518] amdgpu 0000:c1:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 1214.308520] amdgpu 0000:c1:00.0: amdgpu: Failed to evict queue 1
[ 1214.308526] amdgpu: Failed to evict process queues
[ 1214.308527] amdgpu: Failed to quiesce KFD
[ 1214.308550] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[ 1214.398324] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:231
[ 1214.465471] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:223
[ 1216.474195] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 1216.474203] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1218.478556] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 1218.478567] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1220.483074] amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[ 1220.483084] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 1220.755651] [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[ 1220.757538] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[ 1220.792892] amdgpu 0000:c1:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 1220.793481] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[ 1220.793599] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[ 1220.796664] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[ 1220.798728] [drm] DMUB hardware initialized: version=0x08004000
[ 1220.810819] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:223
[ 1220.813168] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:231
[ 1220.815548] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:239
[ 1220.817927] amdgpu 0000:c1:00.0: [drm] REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line:247
[ 1221.308284] [drm] kiq ring mec 3 pipe 1 q 0
[ 1221.310941] amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 1221.311527] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 1221.311529] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 1221.311530] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 1221.311531] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 1221.311533] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 1221.311533] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 1221.311534] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 1221.311536] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 1221.311537] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 1221.311538] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 1221.311539] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 1221.311541] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 1221.311542] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ 1221.313645] amdgpu 0000:c1:00.0: amdgpu: recover vram bo from shadow start
[ 1221.313646] amdgpu 0000:c1:00.0: amdgpu: recover vram bo from shadow done
[ 1221.313662] amdgpu 0000:c1:00.0: amdgpu: GPU reset(7) succeeded!

It's a similar error to this one, which is why I thought it wasn't possible to get this working on these APUs.

I am not using Arch these days, but to investigate this, I did some testing on Arch directly, as well as in Ubuntu 22.04 and 24.04 distrobox environments on NixOS, and also directly on NixOS. Admittedly, it's a bit of an apples-to-oranges-to-peaches comparison, on top of not being particularly rigorous. But here are my observations, for what they're worth:

The Ubuntu 22.04 distrobox with Python 3.10 and ROCm 5.6, under NixOS, seemed to be the most stable. InvokeAI (the version I installed a while ago - not sure about the most up-to-date one) was fine generating images up to 1024x1024 and using models up to 7.2GB in size (the largest ones I had that my InvokeAI supported). I was able to crash it eventually, but I would say it was almost dependable for text2image inference.

ComfyUI was easier to crash, but still handled those same cases OK. With larger models, such as Flux and Flux-derived ones, success was mixed: some crashed on loading, some in the VAE stage, and some successfully produced 1024x1024 images.

Forge was more successful with Flux than Comfy, but still crashed when loading the larger models. The error messages on crashes were similar to the ones listed by @Justin_Weiss and in the AMD bug report he linked.

The Ubuntu 24.04 distrobox with ROCm 6.1 and bare NixOS with ROCm 6.0 were less stable still. 6.1 was probably the worst. 6.0 felt more usable, and I was able to generate 1024x1024 images with (moderately) large SDXL models in ComfyUI, although not reliably; 6.1 would crash almost immediately.

On Arch, the same pattern held: 6.0 was more stable than 6.1, and ComfyUI was able to generate larger images with SDXL and other ~6-7GB models, though it would also crash every now and then.

All in all, the state of memory handling throughout the amdgpu/ROCm/SD stack, and the reliability and fit of the different component versions, still leave much room for improvement. Hopefully the queue-related bug report referenced above will be addressed, although it's not clear whether the problem is actually in amdgpu or elsewhere in the stack. Newer versions of ROCm seem to handle these issues worse than the older ones on this platform, but perhaps it's also a matter of configuration (I used everything with default settings), or of a version match with other components…

I did not have amdgpu.sg_display=0 set on either NixOS or Arch: I removed it a couple of kernel versions ago, and adding it back, at least on NixOS, didn't improve stability. My system is a FW13 with an AMD 7840U and 64GB of RAM. The kernel versions were 6.10.7 on NixOS and 6.10.8 on Arch.

Also, a useful tip: if your inference is constantly crashing, or suddenly running very slowly regardless of what you do, clean out the ~/.config/miopen folder. I spent several hours trying to understand why my ComfyUI setup, which previously worked fine under ROCm 6.0, started to run excruciatingly slowly after a crash, before finally tracing it to the cache in that folder. Ironically, the presence of the cache files does make loading larger models more stable, so if you have to delete them, running an inference with a smaller model first might help to load a larger model later…
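Concretely, the cleanup is just this (the second path is my assumption based on MIOpen's default kernel-cache location and may not exist on your setup; both folders are rebuilt on the next run, which is why the first generation afterwards is slow):

# remove the per-user MIOpen databases and caches
rm -rf ~/.config/miopen
rm -rf ~/.cache/miopen

# then run one generation with a small model to repopulate the cache
# before trying the larger models again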
