Stable Diffusion / ROCm / PyTorch Setup

Hi!

I’m trying to get Stable Diffusion running on my FW16 (with the 7700S), but I’m having some trouble. I’ve tried to follow this guide (Installing ROCm / HIPLIB on Ubuntu 22.04 - #2 by cepth), using ROCm 5.7 on Ubuntu 22.04, but whenever I try running something CUDA-related, I get RuntimeError: No HIP GPUs are available. I’m a bit new to CUDA/torch/ML in general, so I’m not familiar with the details.

It doesn’t seem that gfx1102 is officially supported in this (or any?) version of ROCm, but I was wondering if there were any unofficial workarounds or if anyone had managed to sort it out on their own machine.

Thanks!

Can you try installing ROCm 6.x? And, are your environment variables properly set up?

I’ve run Stable Diffusion (A1111) on the 7700S with no problems. What SD implementation are you trying to install? If you’re using A1111, be sure to follow the ROCm-specific instructions.

I’ll try ROCm 6.1 now (though it wasn’t working earlier for me).

I have a script to set some environment variables, and it currently looks like this (I use fish):

set -x HSA_OVERRIDE_GFX_VERSION 11.0.2
set -x HCC_AMDGPU_TARGET gfx1102
set -x PYTORCH_ROCM_ARCH gfx1102
set -x AMDGPU_TARGETS gfx1102
set -x TRITON_USE_ROCM ON

set -x ROCM_PATH /opt/rocm-5.7.0
set -x ROCR_VISIBLE_DEVICES 0
set -x HIP_VISIBLE_DEVICES 0
set -x USE_CUDA 0

I am trying to use the A1111 version, and I’m (trying to) follow the ROCm-specific instructions (under the Running natively header).

I’ve got it to work with setting the device id to 1100 instead of 1103

In my guide, I set all the device related versions to 11.0.0/gfx1100 (see step 7).

As you noted, there’s no official support in ROCm for any consumer cards except the RX 7900 XTX/XT (which are gfx1100).

It’s just not going to work if you set it to 11.0.2/gfx1102.
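For anyone landing here later, a minimal sketch of the override described above, written in Python rather than fish. The variable names are the ones from the scripts in this thread. The helper names gfx_to_hsa_version and rocm_env are hypothetical, and the rule for turning gfx1100 into 11.0.0 (last two digits are minor and stepping, the rest is the major version) is an assumption that matches the values used here.

```python
def gfx_to_hsa_version(gfx_target):
    """Turn a target such as 'gfx1100' into '11.0.0' (assumed naming rule)."""
    digits = gfx_target[3:] if gfx_target.startswith("gfx") else gfx_target
    if len(digits) < 3 or not digits.isdigit():
        raise ValueError(f"unexpected target name: {gfx_target!r}")
    major, minor, stepping = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor)}.{int(stepping)}"


def rocm_env(gfx_target="gfx1100", rocm_version="6.1.0"):
    """Build one consistent set of variables so no entry disagrees with another."""
    return {
        "HSA_OVERRIDE_GFX_VERSION": gfx_to_hsa_version(gfx_target),
        "HCC_AMDGPU_TARGET": gfx_target,
        "PYTORCH_ROCM_ARCH": gfx_target,
        "AMDGPU_TARGETS": gfx_target,
        # Assumes the versioned install directory layout seen in this thread.
        "ROCM_PATH": f"/opt/rocm-{rocm_version}",
    }


if __name__ == "__main__":
    # Print fish-style lines that could be pasted into a setup script.
    for name, value in rocm_env().items():
        print(f"set -x {name} {value}")
```

Generating every value from a single target string makes it harder to end up with a mix of gfx1100 and gfx1102 entries.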

Ah cool! Hadn’t put two and two together and realised you wrote that guide. Thanks :)

I’ve also tried ROCm 6.1 now, as well as using 1100/11.0.0 for all the env vars, but still no luck as of yet. I’ve noticed that clinfo is showing 0 devices, so maybe that’s where the problem is. I’ll check tomorrow.

When you run rocm-smi, does anything come up?

Additionally, if you’re installing or switching between versions of ROCm, be sure to reboot after installation.
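A quick host-level check can narrow things down before touching PyTorch. The sketch below is an assumption-laden illustration, not part of the guide: it assumes the ROCm compute interface is exposed as /dev/kfd and that the current user needs read/write access to it (commonly granted through the render and video groups). The function name rocm_host_report is hypothetical.

```python
import os
import shutil


def rocm_host_report(kfd_path="/dev/kfd"):
    """Collect a few yes/no facts about the ROCm install on this machine."""
    return {
        # The kernel driver's compute device node (assumed location).
        "kfd_exists": os.path.exists(kfd_path),
        # Without read/write access, tools may report zero devices.
        "kfd_read_write": os.access(kfd_path, os.R_OK | os.W_OK),
        # Whether the command-line tools mentioned in this thread are on PATH.
        "rocm_smi_on_path": shutil.which("rocm-smi") is not None,
        "clinfo_on_path": shutil.which("clinfo") is not None,
    }


if __name__ == "__main__":
    for key, value in rocm_host_report().items():
        print(f"{key}: {value}")
```

If kfd_exists is False after a reboot, the driver side of the install is the first thing to look at.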

I’ve tried with ROCm 6.1(.0) and the following env vars:

set -x HSA_OVERRIDE_GFX_VERSION 11.0.0
set -x HCC_AMDGPU_TARGET gfx1100
set -x PYTORCH_ROCM_ARCH gfx1100
set -x AMDGPU_TARGETS gfx1100
set -x TRITON_USE_ROCM ON

set -x ROCM_PATH /opt/rocm-6.1.0
set -x ROCR_VISIBLE_DEVICES 0
set -x HIP_VISIBLE_DEVICES 0
set -x USE_CUDA 0

I’ve been (mostly) following the instructions in the README for A1111, except instead of the final command I’m installing torch in the venv manually from the ROCm repo with pip3 install torch==2.1.2 torchvision==0.16.1 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1/, and then running launch.py with the arguments suggested. I’m still getting errors that indicate that my GPU isn’t being detected somehow.

When I run rocm-smi, I get:

=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device  [Model : Revision]    Temp    Power    Partitions      SCLK    MCLK     Fan    Perf  PwrCap       VRAM%  GPU%
        Name (20 chars)       (Edge)  (Avg)    (Mem, Compute)                                                       
========================================================================================================================
0       [0x0007 : 0xc1]       33.0°C  24.0W    N/A, N/A        803Mhz  96Mhz    29.8%  auto  100.0W         0%   0% 
        0x7480                                                                                                      
1       [0x0005 : 0xc1]       35.0°C  21.048W  N/A, N/A        None    1000Mhz  0%     auto  Unsupported   83%   5% 
        0x15bf                                                                                                      
========================================================================================================================
================================================= End of ROCm SMI Log ==================================================

Couple of questions:

  1. When you installed ROCm, are you sure you specified the correct use cases? I.e. sudo amdgpu-install --usecase=graphics,rocm,hip,mllib. Are you using the amdgpu-install method, or another one?
  2. I’m also a Fish shell user, but A1111’s default launcher is going to use the Bash shell. Note how the first line (aka the “shebang line”) specifies Bash shell. You’re going to have to add these environment variables to your .bashrc file, because it’s the Bash shell that’s actually executing the launch script.
  3. If you get the chance, please run that PyTorch benchmarking suite I mentioned.

  1. Yes, I used exactly amdgpu-install --usecase=graphics,rocm,hip,mllib and rebooted afterwards.
  2. With set -x the environment variables should propagate to child processes anyway, but I added to the bashrc to be sure, ran again and the same error occurred.
  3. I got RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx. In general, I seem to be getting errors whenever torch._C._cuda_init() is run.

Thanks for your help so far, it’s much appreciated :)
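The "Found no NVIDIA driver" message is the kind of error a CUDA build of PyTorch produces, so it is worth checking which build is actually imported. A minimal sketch, assuming it is run with the same interpreter that launches the web UI; the function name torch_gpu_report is hypothetical. On ROCm builds PyTorch reuses the torch.cuda namespace, and torch.version.hip is set, while on other builds it is None.

```python
import importlib.util


def torch_gpu_report():
    """Report which PyTorch build is importable and whether it sees a GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    return {
        "installed": True,
        "torch_version": torch.__version__,
        # None here suggests a non-ROCm (for example CUDA or CPU) build.
        "hip_version": getattr(torch.version, "hip", None),
        # ROCm devices are reported through the torch.cuda API as well.
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_gpu_report().items():
        print(f"{key}: {value}")
```

A hip_version of None together with nvidia-* packages in pip list points at the wrong wheel rather than at the driver.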

"With set -x the environment variables should propagate to child processes anyway, but I added to the bashrc to be sure, ran again and the same error occurred."

I’m not sure this is the case. If you run the env command in Fish shell, and then explicitly run env in Bash shell, you’ll find the outputs differ.

After you added the env variables to .bashrc, did you run source ~/.bashrc? More generally, could you confirm that running env from within Bash shell produces the relevant env variables?


Are you sure that you’re installing the ROCm version of PyTorch within the venv that A1111 uses? To confirm (assuming you’re using Fish):

  1. Navigate to the directory where you’ve cloned A1111
  2. Activate the venv with source venv/bin/activate.fish
  3. Run python3 -m pip list

Can you post the output?
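The same check can be done from inside Python, which removes any doubt about which interpreter and environment are in use. This is a sketch under the assumption that it is run with the python3 that the activated venv provides; venv_report is a hypothetical name.

```python
import sys
from importlib import metadata


def venv_report():
    """Show which interpreter is running and which torch it would import."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None
    return {
        "interpreter": sys.executable,
        # Inside a venv, sys.prefix points at the venv and differs from base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        "torch_version": torch_version,
    }


if __name__ == "__main__":
    for key, value in venv_report().items():
        print(f"{key}: {value}")
```

If in_venv is False, or interpreter does not point inside the A1111 directory, any pip install just run went somewhere else.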

"More generally, could you confirm that running env from within Bash shell produces the relevant env variables?"

When I spawn a bash shell from my fish shell (by typing bash in fish), then the env variables are all available.
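That matches how exported variables behave in general: a child process inherits the environment of the process that starts it. A small sketch of the same idea in Python, with a Python child standing in for the Bash launcher; child_sees is a hypothetical helper name.

```python
import os
import subprocess
import sys


def child_sees(name, value):
    """Export a variable in this process and ask a child process to read it back."""
    # Comparable to `set -x NAME value` in fish or `export NAME=value` in bash.
    os.environ[name] = value
    child = subprocess.run(
        [
            sys.executable,
            "-c",
            "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], ''))",
            name,
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return child.stdout == value


if __name__ == "__main__":
    print(child_sees("HSA_OVERRIDE_GFX_VERSION", "11.0.0"))
```

The outputs of env can still differ between shells, since each shell adds variables of its own, but exported values set in the parent carry over.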


The output I get from python3 -m pip list is:

Package                   Version
------------------------- ------------
accelerate                0.21.0
aenum                     3.1.15
aiofiles                  23.2.1
aiohttp                   3.9.5
aiosignal                 1.3.1
altair                    5.3.0
antlr4-python3-runtime    4.9.3
anyio                     3.7.1
async-timeout             4.0.3
attrs                     23.2.0
blendmodes                2022
certifi                   2024.6.2
charset-normalizer        3.3.2
clean-fid                 0.1.35
click                     8.1.7
clip                      1.0
contourpy                 1.2.1
cycler                    0.12.1
deprecation               2.1.0
diskcache                 5.6.3
einops                    0.4.1
exceptiongroup            1.2.1
facexlib                  0.3.0
fastapi                   0.94.0
ffmpy                     0.3.2
filelock                  3.15.4
filterpy                  1.4.5
fonttools                 4.53.0
frozenlist                1.4.1
fsspec                    2024.6.1
ftfy                      6.2.0
gitdb                     4.0.11
GitPython                 3.1.32
gradio                    3.41.2
gradio_client             0.5.0
h11                       0.12.0
httpcore                  0.15.0
httpx                     0.24.1
huggingface-hub           0.23.4
idna                      3.7
imageio                   2.34.2
importlib_resources       6.4.0
inflection                0.5.1
Jinja2                    3.1.4
jsonmerge                 1.8.0
jsonschema                4.22.0
jsonschema-specifications 2023.12.1
kiwisolver                1.4.5
kornia                    0.6.7
lark                      1.1.2
lazy_loader               0.4
lightning-utilities       0.11.3.post0
llvmlite                  0.43.0
MarkupSafe                2.1.5
matplotlib                3.9.0
mpmath                    1.3.0
multidict                 6.0.5
networkx                  3.3
numba                     0.60.0
numpy                     1.26.2
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.20.5
nvidia-nvjitlink-cu12     12.5.40
nvidia-nvtx-cu12          12.1.105
omegaconf                 2.2.3
open-clip-torch           2.20.0
opencv-python             4.10.0.84
orjson                    3.10.5
packaging                 24.1
pandas                    2.2.2
piexif                    1.1.3
Pillow                    9.5.0
pillow-avif-plugin        1.4.3
pip                       22.0.2
protobuf                  3.20.0
psutil                    5.9.5
pydantic                  1.10.17
pydub                     0.25.1
pyparsing                 3.1.2
python-dateutil           2.9.0.post0
python-multipart          0.0.9
pytorch-lightning         1.9.4
pytz                      2024.1
PyWavelets                1.6.0
PyYAML                    6.0.1
referencing               0.35.1
regex                     2024.5.15
requests                  2.32.3
resize-right              0.0.2
rpds-py                   0.18.1
safetensors               0.4.2
scikit-image              0.21.0
scipy                     1.14.0
semantic-version          2.10.0
sentencepiece             0.2.0
setuptools                69.5.1
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.1
spandrel                  0.1.6
starlette                 0.26.1
sympy                     1.12.1
tifffile                  2024.6.18
timm                      1.0.7
tokenizers                0.13.3
tomesd                    0.1.3
toolz                     0.12.1
torch                     2.3.1
torchdiffeq               0.2.3
torchmetrics              1.4.0.post0
torchsde                  0.2.6
torchvision               0.18.1
tqdm                      4.66.4
trampoline                0.1.2
transformers              4.30.2
triton                    2.3.1
typing_extensions         4.12.2
tzdata                    2024.1
urllib3                   2.2.2
uvicorn                   0.30.1
wcwidth                   0.2.13
websockets                11.0.3
yarl                      1.9.4

I did end up running the webui with --skip-torch-cuda-test just to check whether that was working, and I think at least a few of the packages were installed then.

Right, I’ve tried again from scratch with ROCm 6.0.0, updating the env vars appropriately, making sure to reboot after reinstallation, etc.

Now when I run python3 -m pip list, I get:

Package             Version
------------------- --------------
filelock            3.13.1
fsspec              2024.2.0
Jinja2              3.1.3
MarkupSafe          2.1.5
mpmath              1.3.0
networkx            3.2.1
numpy               1.26.3
pillow              10.2.0
pip                 22.0.2
pytorch-triton-rocm 2.3.1
setuptools          59.6.0
sympy               1.12
torch               2.3.1+rocm6.0
torchaudio          2.3.1+rocm6.0
torchvision         0.18.1+rocm6.0
typing_extensions   4.9.0

This now includes pytorch-triton-rocm and torch 2.3.1+rocm6.0, but the problem persists.

So I think I have a sense of what’s going wrong.

The first pip list output you sent was with the venv activated. That one shows the wrong (non-ROCm) version of PyTorch (listed as torch) though.

The second pip list output has the correct torch version, but given the lack of any of the normal packages installed with A1111 (like gradio, which provides the UI), I’m guessing the second pip list was run with no venv activated?

When the venv is activated, your shell (in Fish) will show:

(venv) username@machine-name ...

Try activating the venv from within the A1111 directory, and then install the ROCm versions of torch and torchvision.
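The reasoning above can be turned into a small check over pasted pip list output. This is only a sketch: summarize_pip_list is a hypothetical helper, and it assumes that a ROCm wheel carries a +rocm tag in its version string, as the second listing in this thread does.

```python
def summarize_pip_list(text):
    """Pull out the signals discussed above from `pip list` output."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blanks, the header row, and the dashed separator row.
        if len(parts) != 2 or parts[0] == "Package" or set(line.strip()) <= {"-", " "}:
            continue
        packages[parts[0].lower()] = parts[1]

    torch_version = packages.get("torch")
    return {
        "torch_version": torch_version,
        # Assumed convention: ROCm builds are tagged like 2.3.1+rocm6.0.
        "rocm_build": torch_version is not None and "+rocm" in torch_version,
        # nvidia-* wheels are pulled in by the CUDA build of torch.
        "nvidia_wheels": sorted(n for n in packages if n.startswith("nvidia-")),
        # gradio present suggests this is the environment A1111 actually uses.
        "has_gradio": "gradio" in packages,
    }


if __name__ == "__main__":
    sample = """Package            Version
------------------ --------
gradio             3.41.2
nvidia-cublas-cu12 12.1.3.1
torch              2.3.1
"""
    print(summarize_pip_list(sample))
```

The target state is rocm_build and has_gradio both True in the same listing, with no nvidia-* wheels.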

I should probably provide a final update on this (for those who find this on Google when trying to solve a similar problem…).

The problem was that I didn’t have ROCm installed properly. Make sure to restart between installation changes, make sure that the versions are all consistent, and it should be fine. I now have everything working with 6.0 and can run A1111 and ComfyUI with no problems.

Thank you for all your help cepth!

Glad to hear it!

ROCm 6.2 (just released) officially supports Ubuntu 24.04 now as well.