[GUIDE] Running Stable Diffusion on AMD 7840U

bud · January 9, 2024, 1:08am

I recently tried running Stable Diffusion to test a stubborn eGPU, and while that still isn’t working, I did manage to get it running on the AMD Framework’s iGPU. I thought I would share because it is pretty nifty, and I did a lot of unnecessary things along the way. I thought it could save other people some trouble if they were interested.

Your mileage may vary depending on your distro, so adapt as needed, but this is what I did.

The main struggle is the amount of RAM available to the iGPU. There are two ways to address this, plus a way to check the result:
a. Enable game mode in BIOS, which will allot 4GB of RAM as VRAM for the iGPU.
b. Trim down any VRAM-hogging programs. I noticed my browser was the biggest culprit, even with only 1 empty tab open. If you have any Electron apps, those will probably be big problems too. I decided to set one browser to be CPU-only, and use that while using the iGPU. I used Firefox, which can be set to avoid the GPU by opening Settings, scrolling to near the bottom, and unchecking both “Performance” checkboxes. This doesn’t make for a great multimedia browsing experience, but I have another browser for that.
c. You can check VRAM usage with radeontop (which we will install in a later step). You’ll need roughly 2.5GB free, though if you get close to filling it up, the iGPU seems more likely to crash. I have 320MB used at the moment, mostly by GNOME if I had to guess.
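
As a rough alternative to eyeballing radeontop, the headroom check in (c) could be sketched in Python. This assumes the amdgpu driver exposes mem_info_vram_total and mem_info_vram_used under /sys/class/drm/card*/device/ (the card index varies, and with an eGPU attached the first match may not be the iGPU). The 2.5GB threshold is just the rough figure from above, and the function names are made up for illustration.

```python
from pathlib import Path
from typing import Optional, Tuple

GIB = 1024 ** 3
NEEDED_BYTES = int(2.5 * GIB)  # rough figure from the guide, not a hard limit


def free_vram_bytes(total: int, used: int) -> int:
    """Free VRAM in bytes, clamped so it is never negative."""
    return max(total - used, 0)


def has_headroom(total: int, used: int, needed: int = NEEDED_BYTES) -> bool:
    """True if at least `needed` bytes of VRAM are free."""
    return free_vram_bytes(total, used) >= needed


def read_amdgpu_vram() -> Optional[Tuple[int, int]]:
    """Return (total, used) bytes from the first card with amdgpu counters, else None."""
    for dev in sorted(Path("/sys/class/drm").glob("card[0-9]*/device")):
        total_file = dev / "mem_info_vram_total"
        used_file = dev / "mem_info_vram_used"
        if total_file.is_file() and used_file.is_file():
            return int(total_file.read_text()), int(used_file.read_text())
    return None


if __name__ == "__main__":
    counters = read_amdgpu_vram()
    if counters is None:
        print("No amdgpu VRAM counters found")
    else:
        total, used = counters
        free_gib = free_vram_bytes(total, used) / GIB
        print(f"free VRAM: {free_gib:.2f} GiB, enough: {has_headroom(total, used)}")
```

With 4GB allotted and 320MB in use, this reports enough headroom. It says nothing about how close to full the iGPU can get before it becomes unstable.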
Decide where to install. I am using Silverblue and Distrobox, and decided to make a container for this. I used Ubuntu 22.04 because this is supposed to have ideal compatibility with ROCm for AI on AMD hardware, though I’m not sure how much it matters. Much to my surprise, I did not have to deal with /dev/kfd permissions or anything. Simply:

distrobox create --name igpu --home ~/podhome/igpu --image ubuntu:22.04
distrobox enter igpu

(I like to keep the home directory separate).

Install some things if you don’t have them:

sudo apt install git radeontop

Set up a Python virtual environment. This assumes you already have Python. Version 3.11 in my case but anything recent should work:

cd ~
python -m venv pyenv
./pyenv/bin/pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.7
git clone https://github.com/comfyanonymous/ComfyUI.git
./pyenv/bin/pip install -r ComfyUI/requirements.txt
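
Before going further, it may be worth checking that the nightly ROCm build of PyTorch actually sees the iGPU. Below is a minimal sketch to run with ./pyenv/bin/python. It assumes the HSA_OVERRIDE_GFX_VERSION=11.0.0 override used later in this guide is needed here too, and describe_torch_gpu is just a name I picked. ROCm builds of PyTorch report the GPU through the torch.cuda API.

```python
import os

# Assumption: the same override used when launching ComfyUI later in the guide.
# It has to be in the environment before torch initializes the ROCm runtime.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")


def describe_torch_gpu() -> str:
    """Summarize what this interpreter's PyTorch can see, as one line of text."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this interpreter"
    hip = getattr(torch.version, "hip", None)  # None on non-ROCm builds
    if not torch.cuda.is_available():
        return f"torch {torch.__version__} (hip={hip}): no GPU visible"
    return f"torch {torch.__version__} (hip={hip}): {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(describe_torch_gpu())
```

If this reports no GPU, the problem is in the PyTorch/ROCm layer rather than in ComfyUI.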

Then download a Stable Diffusion checkpoint, such as Copax TimeLessXL - SDXL1.0 - V8 | Stable Diffusion Checkpoint | Civitai. mv it into ComfyUI/models/checkpoints/.
Now cd into ComfyUI (the ../pyenv path below is relative to it) and run ComfyUI with options that minimize the amount of VRAM that gets used:

HSA_OVERRIDE_GFX_VERSION=11.0.0 ../pyenv/bin/python main.py --novram --cpu-vae

Note that we use HSA_OVERRIDE_GFX_VERSION=11.0.0 because the 780M iGPU is gfx1103 (version 11.0.3), which ROCm does not support. In my experience, using the override to tell ROCm to pretend it is gfx1100 seems to work without issue.
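
To make the gfx1103 vs. 11.0.3 relationship concrete, here is a small sketch of how a gfx target name maps to the dotted version string the override expects. It assumes the usual naming scheme where the last two characters are the minor version and the stepping (read as hex digits, so gfx90a would be 9.0.10) and everything before them is the major version. gfx_to_version is a hypothetical helper, not part of ROCm.

```python
def gfx_to_version(target: str) -> str:
    """Convert a target name like 'gfx1103' to a dotted version like '11.0.3'."""
    if not target.startswith("gfx") or len(target) < 6:
        raise ValueError(f"unexpected gfx target name: {target!r}")
    digits = target[3:]
    # Assumed layout: <major...><minor><stepping>, minor and stepping one hex digit each.
    major, minor, stepping = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(stepping, 16)}"


if __name__ == "__main__":
    print("780M reports:", gfx_to_version("gfx1103"))
    print("override value:", gfx_to_version("gfx1100"))
```

The override value is simply the version string of the supported target that ROCm is told to pretend it is running on.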

Mostly done! You can go to http://localhost:8188 (plain http, ComfyUI does not serve https by default) and see the ComfyUI interface. You can try a simple workflow like this one: { "last_node_id": 10, "last_link_id": 18, "nodes": [ { "id" - Pastebin.com
I said mostly done because you may experience a little “warm-up” issue. Often, when generating the first image, the screen will go black for a second, return for a second, black again for a second, and then return. This is the iGPU crashing and resetting, as seen in dmesg. It successfully resets, but after that Stable Diffusion will be hung and usually you’ll need to restart it. Typically it works on the second try.

Inference is much faster on the iGPU. After the first image (which is slower due to model loading), generating an image with the workflow linked above took 195.3 seconds. With the --cpu ComfyUI option, the same workflow took 1215.28 seconds. I have energy-saving settings on so it could be a bit faster, but in any case the iGPU is over 6x faster.
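
For anyone comparing their own numbers, the speedup is just the ratio of the two timings. A quick sketch using the figures above (the variable names are mine):

```python
# Timings from the runs described above, in seconds.
igpu_seconds = 195.3
cpu_seconds = 1215.28

# How many times faster the iGPU run was than the CPU-only run.
speedup = cpu_seconds / igpu_seconds
print(f"iGPU speedup over CPU: {speedup:.2f}x")  # prints 6.22x
```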

I only wish there were a way to further increase the RAM available to the GPU, as it could be faster (and more stable) if everything, including the VAE (variational autoencoder), could be offloaded to the iGPU. 4GB seems like an arbitrary limit. I guess I also wish I could use the AI co-processor that is built into this chip yet sits idle.

Bonus: I also updated my kernel, though I doubt this was necessary for the iGPU (it was needed for an eGPU to connect). In any case, so I don’t forget: on Silverblue one can download the Rawhide kernels without debugging enabled from here: Index of /pub/alt/rawhide-kernel-nodebug/x86_64 (I used kernel, kernel-core, kernel-modules, kernel-modules-core, and kernel-modules-extra). Then overlay with:

sudo rpm-ostree override replace ./kernel*.rpm

Sample image from above workflow (I know nothing about using Stable Diffusion, just here to test some ROCm functionality, so yes it is bad):

Kyle_Reis · January 9, 2024, 2:16am

The system will dynamically adjust how much RAM is allocated to the iGPU depending on how much is actually needed. Up to half of the system RAM will be allocated to the iGPU (i.e. if you have 32 GB of RAM, then 16 GB is available to the iGPU).
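
A quick way to see what that “up to half” figure works out to on a given machine is to halve MemTotal from /proc/meminfo. This is only a sketch of the arithmetic. The one-half ratio is the figure stated above rather than something the code measures, and the helper names are hypothetical.

```python
import re


def mem_total_kib(meminfo_text: str) -> int:
    """Extract MemTotal (in kB as reported by the kernel) from /proc/meminfo text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    return int(match.group(1))


def igpu_ceiling_kib(total_kib: int) -> int:
    """Assumed ceiling for dynamic iGPU allocation: half of system RAM."""
    return total_kib // 2


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            total = mem_total_kib(handle.read())
    except (OSError, ValueError) as error:
        print(f"could not read MemTotal: {error}")
    else:
        print(f"system RAM: {total / 1024 ** 2:.1f} GiB")
        print(f"assumed iGPU ceiling: {igpu_ceiling_kib(total) / 1024 ** 2:.1f} GiB")
```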

The problem is that some programs (mainly older and/or poorly written programs) don’t understand that more RAM will be allocated to the iGPU when needed, so they freak out because they see that the amount of RAM currently allocated to the iGPU isn’t enough for what the program will need in the future.

UMA Game Optimized mode (which is a bad name IMO) is simply a workaround to force a minimum of 4 GB to be allocated to the iGPU even when it’s not needed. For most of the programs that can’t cope with dynamically allocated RAM, having 4 GB is enough to prevent them from freaking out and throwing an error. The main exception (as you have experienced) is certain AI/ML programs.

Allowing more than 4 GB to be set in the BIOS would be a workaround. However, the proper solution is for the programs to be updated to fix this.

Shiroudan · January 9, 2024, 2:26am

It’s impressive you got it to run on an AMD GPU at all! I couldn’t manage after trying for a couple days (mostly pytorch issues I believe).

Out of curiosity, what are your resolution and iteration settings here?

Loell_Framework · January 9, 2024, 11:11am

Thanks for sharing this one @bud , marking as guide.

bud · January 9, 2024, 3:34pm

Indeed, that might be a better solution. Unfortunately PyTorch is one of those older and/or poorly written programs when it comes to ROCm. I did experiment a bit with editing the PyTorch tools/amd_build/build_amd.py script and recompiling it, as my understanding is that it should be easy to use hipMallocManaged instead of hipMalloc on the C++ side to use UMA. Did not succeed (yet) though.

1024x1024, with 25 iterations. It is around 6 seconds/iteration on the iGPU vs. over 30 seconds/iteration on CPU. The VAE decode on CPU takes an additional 30-60 seconds of that total time, so more iterations are definitely possible without a huge increase in time, if you are patient.
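
As a back-of-the-envelope model, the total time is roughly iterations times seconds per iteration, plus the VAE decode on CPU. A sketch using the rounded figures above (estimate_seconds is a made-up helper, and it ignores model loading):

```python
def estimate_seconds(iterations: int, seconds_per_iteration: float, vae_seconds: float) -> float:
    """Rough total: sampling time plus a fixed VAE decode cost."""
    return iterations * seconds_per_iteration + vae_seconds


# 25 iterations at about 6 s each, with the VAE decode taking 30 to 60 s.
low = estimate_seconds(25, 6, 30)
high = estimate_seconds(25, 6, 60)
print(f"25 iterations: {low:.0f} to {high:.0f} seconds")

# Doubling the iterations does not double the total, since the VAE cost is fixed.
print(f"50 iterations: {estimate_seconds(50, 6, 30):.0f} to {estimate_seconds(50, 6, 60):.0f} seconds")
```

The measured 195.3 seconds falls inside the 180 to 210 second range this gives for 25 iterations.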

Wrybill_Plover · January 11, 2024, 1:41pm

I created this request a while back: BIOS Feature Request: Add ability to specify UMA size on AMD APUs

But if someone succeeds in getting PyTorch to use dynamic VRAM allocation from the GTT, with full support for something like ComfyUI or InvokeAI, that would be awesome.

Please let us know if (when?) you do!