[GUIDE] Running Stable Diffusion on AMD 7840U

bud · January 9, 2024, 1:08am

I recently tried running Stable Diffusion to test a stubborn eGPU, and while that still isn’t working, I did manage to get it working on the AMD Framework iGPU. I thought I would share because it is pretty nifty, and because I did a lot of unnecessary things along the way. I thought I could save other people the trouble if they were interested.

Your mileage may vary depending on your distro, so adapt as needed, but this is what I did.

The main struggle is the amount of RAM available to the iGPU. There are two ways to address this (a and b), plus a way to check the result (c):
a. Enable game mode in BIOS, which will allot 4GB RAM as VRAM for the iGPU.
b. Trim down any VRAM-hogging programs. I noticed my browser was the biggest culprit, even with only 1 empty tab open. If you have any Electron apps those will probably be big problems too. I decided to set one browser to be CPU-only, and use that while using the iGPU. I used Firefox, which can be set to avoid the GPU by opening Settings, scrolling to near the bottom, and unchecking both “Performance” checkboxes. This doesn’t make for a great multimedia browsing experience, but I have another browser for that.
c. You can check VRAM usage with radeontop (which we will install in a later step). You’ll need roughly 2.5GB free, though if you get close to filling it up it seems the iGPU is more likely to crash. I have 320MB used at the moment, mostly by GNOME if I had to guess.
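For a scriptable alternative to radeontop, the amdgpu kernel driver also exposes memory counters under sysfs. The following is only a minimal sketch, assuming the iGPU appears under /sys/class/drm/card*/device and that the driver provides the mem_info_vram_total and mem_info_vram_used files with values in bytes. The function names and the 2.5 GB threshold check are illustrative and not part of the original setup.

```python
from pathlib import Path

# Rough requirement quoted in the guide (about 2.5 GB of free VRAM).
NEEDED_BYTES = int(2.5 * 1024**3)


def read_vram(device_dir):
    """Return (total, used) VRAM in bytes from an amdgpu sysfs device directory."""
    device_dir = Path(device_dir)
    total = int((device_dir / "mem_info_vram_total").read_text().strip())
    used = int((device_dir / "mem_info_vram_used").read_text().strip())
    return total, used


def find_amdgpu_devices(root="/sys/class/drm"):
    """List device directories that expose amdgpu VRAM counters (assumed layout)."""
    return sorted(
        p.parent for p in Path(root).glob("card*/device/mem_info_vram_total")
    )


if __name__ == "__main__":
    for dev in find_amdgpu_devices():
        total, used = read_vram(dev)
        free = total - used
        status = "ok" if free >= NEEDED_BYTES else "probably too little"
        print(f"{dev}: {used / 2**20:.0f} MiB used of {total / 2**20:.0f} MiB ({status})")
```

On a machine without an amdgpu device the script simply prints nothing.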
Decide where to install. I am using Silverblue and Distrobox, and decided to make a container for this. I used Ubuntu 22.04 because this is supposed to have ideal compatibility with ROCm for AI on AMD hardware, though I’m not sure how much it matters. Much to my surprise, I did not have to deal with /dev/kfd permissions or anything. Simply:

distrobox create --name igpu --home ~/podhome/igpu --image ubuntu:22.04
distrobox enter igpu

(I like to keep the home directory separate).

Install some things if you don’t have them:

sudo apt install git radeontop

Set up a Python virtual environment. This assumes you already have Python. Version 3.11 in my case but anything recent should work:

cd ~
python -m venv pyenv
./pyenv/bin/pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.7
git clone https://github.com/comfyanonymous/ComfyUI.git
./pyenv/bin/pip install -r ComfyUI/requirements.txt

Then download a Stable Diffusion checkpoint, such as “Copax TimeLess - XPlus_4” from Civitai, and mv it into ComfyUI/models/checkpoints/.
Now run ComfyUI with options that maximally limit the amount of VRAM that gets used:

cd ComfyUI
HSA_OVERRIDE_GFX_VERSION=11.0.0 ../pyenv/bin/python main.py --novram --cpu-vae

Note that we use HSA_OVERRIDE_GFX_VERSION=11.0.0 because the 780M iGPU is gfx1103 (version 11.0.3), which ROCm does not support. In my experience, using the override to tell ROCm to pretend it is gfx1100 works without issue.
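Before launching ComfyUI, it can help to confirm that the PyTorch build in the venv actually sees the iGPU. This is a minimal sketch, not part of the original steps: it assumes the override has to be in the environment before torch is imported, and it relies on ROCm builds of PyTorch reporting the GPU through the torch.cuda namespace. The gpu_report helper is a made-up name for illustration.

```python
import os

# Assumption: the override must be set before the ROCm runtime starts,
# so set it before importing torch. 11.0.0 is the value used in the guide.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")


def gpu_report():
    """Collect a few facts about the PyTorch install without raising."""
    report = {
        "torch_installed": False,
        "hip_version": None,
        "gpu_available": False,
        "device_name": None,
    }
    try:
        import torch
    except (ImportError, OSError):
        return report
    report["torch_installed"] = True
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    report["hip_version"] = getattr(torch.version, "hip", None)
    # ROCm builds of PyTorch reuse the torch.cuda namespace.
    report["gpu_available"] = torch.cuda.is_available()
    if report["gpu_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Saved under a name of your choice (for example check_gpu.py), it would be run with ./pyenv/bin/python check_gpu.py from the home directory. If gpu_available comes back False, ComfyUI is unlikely to use the iGPU either.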

Mostly done! You can go to http://localhost:8188 and see the ComfyUI interface. You can try a simple workflow like this one: { "last_node_id": 10, "last_link_id": 18, "nodes": [ { "id" - Pastebin.com
I said “mostly done” because you may experience a little “warm-up” issue. Often, when generating the first image, the screen will go black for a second, return for a second, go black again for a second, and then return. This is the iGPU crashing and resetting, as seen in dmesg. It successfully resets, but after that Stable Diffusion will be hung and usually you’ll need to restart it. Typically it works on the second try.

Inference is much faster on the iGPU. After the first image (which goes slower due to model loading), it took 195.3 seconds to generate an image with the workflow linked above. Using the --cpu ComfyUI option, the same workflow took 1215.28 seconds. I have energy-saving settings on, so it could be a bit faster, but in any case the iGPU is over 6x faster.
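The “over 6x” figure follows directly from the two timings. As a quick check, using only the numbers reported above:

```python
igpu_seconds = 195.3    # iGPU time reported above
cpu_seconds = 1215.28   # --cpu time reported above

speedup = cpu_seconds / igpu_seconds
print(f"iGPU speedup: {speedup:.2f}x")  # prints: iGPU speedup: 6.22x
```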

I only wish there was a way to further increase RAM available to the GPU, as it could be faster (and more stable) if everything including the VAE (variational autoencoder) could be offloaded to the iGPU. 4GB seems like an arbitrary limit. I guess I also wish I could use the AI co-processor built into this chip yet sitting idly by.

Bonus: I also updated my kernel, though I doubt this was necessary for the iGPU (it was needed for the eGPU to connect). In any case, so I don’t forget: on Silverblue one can download the Rawhide kernels without debugging enabled from here: Index of /pub/alt/rawhide-kernel-nodebug/x86_64 (I used kernel, kernel-core, kernel-modules, kernel-modules-core, and kernel-modules-extra). Then overlay with:

sudo rpm-ostree override replace ./kernel*.rpm

Sample image from above workflow (I know nothing about using Stable Diffusion, just here to test some ROCm functionality, so yes it is bad):

Kyle_Reis · January 9, 2024, 2:16am

The system will dynamically adjust how much RAM is allocated to the iGPU depending on how much is actually needed. Up to half of the system RAM will be allocated to the iGPU (i.e., if you have 32 GB of RAM then 16 GB is available to the iGPU).

The problem is that some programs (mainly older and/or poorly written programs) don’t understand that more RAM will be allocated to the iGPU when needed, so they freak out because they see that the current amount of RAM allocated to the iGPU isn’t enough for what the program will need in the future.

UMA Game Optimized mode (which is a bad name IMO) is simply a workaround to force a minimum of 4 GB to be allocated to the iGPU even when it’s not needed. For most of the programs that can’t cope with dynamically allocated RAM, having 4 GB is enough to prevent them from freaking out and throwing an error. The main exception (as you have experienced) is certain AI/ML programs.

Allowing for more than 4 GB to be set in the BIOS would be a workaround, however the proper solution is for the programs to be updated to fix this.

Shiroudan · January 9, 2024, 2:26am

It’s impressive you got it to run on an AMD GPU at all! I couldn’t manage after trying for a couple days (mostly pytorch issues I believe).

Out of curiosity, what are your resolution and iteration settings here?

Loell_Framework · January 9, 2024, 11:11am

Thanks for sharing this one @bud , marking as guide.

bud · January 9, 2024, 3:34pm

Indeed, that might be a better solution. Unfortunately PyTorch is one of those older and/or poorly written programs when it comes to ROCm. I did experiment a bit with editing the PyTorch tools/amd_build/build_amd.py script and recompiling it, as my understanding is that it should be easy to use hipMallocManaged instead of hipMalloc on the C++ side to use UMA. Did not succeed (yet) though.

1024x1024, with 25 iterations. It is around 6 seconds/iteration on the iGPU vs. over 30 seconds/iteration on CPU. The VAE decode on CPU takes an additional 30-60 seconds of that total time, so more iterations is definitely possible without a huge increase in time, if you are patient.
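These per-iteration figures line up with the total reported in the first post. Here is a rough sketch of the arithmetic, assuming time scales linearly with the number of iterations (an assumption, not something measured here):

```python
S_PER_ITER_IGPU = 6          # "around 6 seconds/iteration" on the iGPU
VAE_DECODE_S = (30, 60)      # extra CPU VAE decode time, low and high bound


def estimate_total(iterations):
    """Return (low, high) estimated seconds, assuming time scales linearly."""
    sampling = iterations * S_PER_ITER_IGPU
    return sampling + VAE_DECODE_S[0], sampling + VAE_DECODE_S[1]


if __name__ == "__main__":
    low, high = estimate_total(25)
    measured = 195.3         # iGPU total reported in the first post
    print(f"25 iterations: {low}-{high} s estimated, {measured} s measured")
    low50, high50 = estimate_total(50)
    print(f"50 iterations: {low50}-{high50} s estimated")
```

With 25 iterations the estimate is 180 to 210 seconds, which brackets the measured 195.3 seconds. Doubling to 50 iterations would land around 330 to 360 seconds under the same assumption.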

Wrybill_Plover · January 11, 2024, 1:41pm

I created this request a while back: BIOS Feature Request: Add ability to specify UMA size on AMD APUs

But if someone succeeds in getting PyTorch to use dynamic VRAM allocation from the GTT with full support for something like ComfyUI or InvokeAI, that would be awesome.

Please let us know if (when?) you do!

Kieran_Levin · January 31, 2025, 9:22am

You can try increasing the GTT pool with something like:

/etc/modprobe.d/increase_amd_memory.conf

# Otherwise it's capped to only half the RAM
# gttsize is in MB
options amdgpu gttsize=90000
# 4k per page, 90GB total
options ttm pages_limit=22500000
options ttm page_pool_size=22500000
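The numbers above are sized for a machine with a lot of RAM, so they need scaling for smaller systems. The sketch below shows how the page counts relate to the GTT size, assuming gttsize is interpreted in MiB and the ttm limits count 4096-byte pages. Both are assumptions based on the “in MB” and “4k per page” notes in the post, and the helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # bytes; assumes the usual 4 KiB page size on x86_64


def ttm_pages_for(gtt_mib):
    """Number of 4 KiB pages that cover gtt_mib mebibytes."""
    return gtt_mib * 1024 * 1024 // PAGE_SIZE


def pages_to_gib(pages):
    """Size in GiB covered by a given page count."""
    return pages * PAGE_SIZE / 1024**3


if __name__ == "__main__":
    print(ttm_pages_for(90000))              # pages matching gttsize=90000 exactly
    print(round(pages_to_gib(22500000), 1))  # size implied by the value in the post
```

Under these assumptions, gttsize=90000 corresponds to 23040000 pages, while the rounder 22500000 used in the post covers about 85.8 GiB. The two are close enough that the difference is unlikely to matter in practice.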

Jim_Hauxwell · March 21, 2025, 1:34am

Has anyone had ComfyUI running on the 780M recently? I’m having no luck at all and am wondering if it’s just me.

Wrybill_Plover · March 24, 2025, 8:34pm

What does your software stack look like? And what issues are you experiencing?

Jim_Hauxwell · March 27, 2025, 11:55pm

I’m running Fedora 43 (rawhide)

Everything is installed and I can get to the UI. When I run a Flux model, it uses the DualClipLoader, and at this point it SIGSEGVs. If I set HSA_OVERRIDE_GFX_VERSION to various versions around 11.0.x I get different types of crashes, but am unable to get further. I just wonder if I need to wait for AMD to release a correctly supported ROCm driver for this device, or whether there’s some magic to get it going.

Wrybill_Plover · April 9, 2025, 2:34am

Sorry about the delay.

What helped me in seemingly similar situations was deleting everything in ~/.config/miopen/. Maybe give that a try?
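If you would rather see what is in that directory before wiping it, something like the following works. This is just a sketch: clear_directory is a hypothetical helper, it defaults to a dry run, and it only assumes the ~/.config/miopen path mentioned above.

```python
import shutil
from pathlib import Path


def clear_directory(path, dry_run=True):
    """Delete everything inside path (but keep path itself). Returns the entries."""
    path = Path(path).expanduser()
    if not path.is_dir():
        return []
    entries = sorted(path.iterdir())
    if not dry_run:
        for entry in entries:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return entries


if __name__ == "__main__":
    # Dry run by default: only prints what would be removed.
    for entry in clear_directory("~/.config/miopen"):
        print("would remove", entry)
```

Calling clear_directory("~/.config/miopen", dry_run=False) performs the actual deletion.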

Ron_McMillian · June 17, 2025, 2:39am

I was able to get ComfyUI working today with openwebui/ollama in Docker after poking at it for a few days, so it does work (technically). I’m dealing with what seem like random crashes under load, but GTT seems to be working correctly. I’m hoping changing to game mode helps a bit.
