I’m trying to get Stable Diffusion running on my FW16 (with the 7700S), but I’m having some trouble. I’ve tried to follow this guide (Installing ROCm / HIPLIB on Ubuntu 22.04 - #2 by cepth), using ROCm 5.7 on Ubuntu 22.04, but whenever I try running anything CUDA-related I get RuntimeError: No HIP GPUs are available. I’m a bit new to CUDA/torch/ML in general, so I’m not familiar with the details.
It doesn’t seem that gfx1102 is officially supported in this (or any?) version of ROCm, but I was wondering if there were any unofficial workarounds or if anyone had managed to sort it out on their own machine.
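For reference, the quick sanity check I’ve been running (assuming I’m reading the torch docs right) is:
# should print True and a non-zero count if torch can see the 7700S through HIP
python3 -c "import torch; print(torch.cuda.is_available())"
python3 -c "import torch; print(torch.cuda.device_count())"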
Can you try installing ROCm 6.x? And are your environment variables properly set up?
I’ve run Stable Diffusion (A1111) on the 7700S with no problems. Which SD implementation are you trying to install? If you’re using A1111, be sure to follow the ROCm-specific instructions.
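For reference, a rough sketch of how I launch it natively on my 7700S (the path and flags are just what I happen to use, not gospel):
cd ~/stable-diffusion-webui              # wherever you cloned A1111
source venv/bin/activate.fish            # drop the .fish suffix if you’re in bash
set -x HSA_OVERRIDE_GFX_VERSION 11.0.0   # common workaround: report the gfx1102 dGPU as gfx1100
python launch.py --precision full --no-half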
I’ll try ROCm 6.1 now (though it wasn’t working earlier for me).
I have a script to set some environment variables, and it currently looks like this (I use fish):
set -x HSA_OVERRIDE_GFX_VERSION 11.0.2
set -x HCC_AMDGPU_TARGET gfx1102
set -x PYTORCH_ROCM_ARCH gfx1102
set -x AMDGPU_TARGETS gfx1102
set -x TRITON_USE_ROCM ON
set -x ROCM_PATH /opt/rocm-5.7.0
set -x ROCR_VISIBLE_DEVICES 0
set -x HIP_VISIBLE_DEVICES 0
set -x USE_CUDA 0
I’m trying to use the A1111 version, and I’m (trying to) follow the ROCm-specific instructions (under the “Running natively” header).
Ah cool! Hadn’t put two and two together and realised you wrote that guide. Thanks
I’ve also tried ROCm 6.1 now, as well as using gfx1100 / 11.0.0 for all the env vars, but still no luck as of yet. I’ve noticed that clinfo seems to be showing 0 devices, so maybe that’s where the problem is; I’ll check tomorrow.
I’ve tried with ROCm 6.1(.0) and the following env vars:
set -x HSA_OVERRIDE_GFX_VERSION 11.0.0
set -x HCC_AMDGPU_TARGET gfx1100
set -x PYTORCH_ROCM_ARCH gfx1100
set -x AMDGPU_TARGETS gfx1100
set -x TRITON_USE_ROCM ON
set -x ROCM_PATH /opt/rocm-6.1.0
set -x ROCR_VISIBLE_DEVICES 0
set -x HIP_VISIBLE_DEVICES 0
set -x USE_CUDA 0
I’ve been (mostly) following the instructions in the A1111 README, except that instead of the final command I’m installing torch into the venv manually from the ROCm repo with pip3 install torch==2.1.2 torchvision==0.16.1 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1/, and then running launch.py with the suggested arguments. I’m still getting errors that indicate my GPU isn’t being detected somehow.
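To double-check that it’s actually the ROCm wheel that ends up in the venv, I’ve been running (roughly) the following, on the understanding that the ROCm builds report a +rocm version suffix and a non-None torch.version.hip:
# with the A1111 venv active
pip show torch | grep -i version
python -c "import torch; print(torch.__version__, torch.version.hip)"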
When I run rocm-smi, I get:
=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device  [Model : Revision]  Temp    Power    Partitions      SCLK    MCLK     Fan    Perf  PwrCap       VRAM%  GPU%
        Name (20 chars)     (Edge)  (Avg)    (Mem, Compute)
========================================================================================================================
0       [0x0007 : 0xc1]     33.0°C  24.0W    N/A, N/A        803Mhz  96Mhz    29.8%  auto  100.0W       0%     0%
        0x7480
1       [0x0005 : 0xc1]     35.0°C  21.048W  N/A, N/A        None    1000Mhz  0%     auto  Unsupported  83%    5%
        0x15bf
========================================================================================================================
================================================= End of ROCm SMI Log ==================================================
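Since clinfo was showing 0 devices, I’ve also been poking at rocminfo to see what the runtime itself reports (not sure I’m interpreting it correctly):
# both GPUs show up in rocm-smi, so this checks whether the HSA runtime
# (which is what HIP/torch ultimately go through) can enumerate them too
rocminfo | grep -E 'Marketing Name|gfx'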
When you installed ROCm, are you sure you specified the correct use cases, i.e. sudo amdgpu-install --usecase=graphics,rocm,hip,mllib? And are you using the amdgpu-install method, or another one?
I’m also a Fish shell user, but A1111’s default launcher is going to use the Bash shell. Note how its first line (aka the “shebang” line) specifies Bash. You’re going to have to add these environment variables to your .bashrc file, because it’s the Bash shell that’s actually executing the launch script.
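Something like this in ~/.bashrc should do it (just translating your fish set -x lines into bash; the values are the ones you posted, not a recommendation):
# bash equivalents of the fish set -x lines, so the launch script (which runs under bash) sees them
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_ROCM_ARCH=gfx1100
export ROCM_PATH=/opt/rocm-6.1.0
export HIP_VISIBLE_DEVICES=0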
If you get the chance, please run that PyTorch benchmarking suite I mentioned.
Yes, I used exactly amdgpu-install --usecase=graphics,rocm,hip,mllib and rebooted afterwards.
With set -x the environment variables should propagate to child processes anyway, but I added them to .bashrc to be sure, ran it again, and got the same error.
I got RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx. In general, I seem to be getting errors whenever torch._C._cuda_init() is run.
Thanks for your help so far, it’s much appreciated
With set -x the environment variables should propagate to child processes anyway, but I added them to .bashrc to be sure, ran it again, and got the same error.
I’m not sure this is the case. If you run the env command in Fish shell, and then explicitly run env in Bash shell, you’ll find the outputs differ.
After you added the env variables to .bashrc, did you run source ~/.bashrc? More generally, could you confirm that running env from within Bash shell produces the relevant env variables?
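Something along these lines is a quick (if crude) way to check:
# start an interactive bash so ~/.bashrc gets sourced, then dump the relevant variables
bash -ic 'env | grep -E "HSA|ROCM|HIP|PYTORCH|AMDGPU"'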
Are you sure that you’re installing the ROCm version of PyTorch within the venv that A1111 uses? To confirm (assuming you’re using Fish):
Navigate to the directory where you’ve cloned A1111
Activate the venv with source venv/bin/activate.fish
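With the venv active, something like this should show which Python and torch are actually being picked up:
# the python in use should live under the A1111 directory, and torch should be the +rocm build
which python
pip list | grep -i torch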
I did end up running the webui with --skip-torch-cuda-test just to check whether that was working, and I think at least a few of the packages were installed then.
The first pip list output you sent was with the venv activated. That one shows the wrong (non-ROCm) version of PyTorch (listed as torch) though.
The second pip list output has the correct torch version, but given the lack of any of the normal packages installed with A1111 (like gradio, which provides the UI), I’m guessing the second pip list was run with no venv activated?
When the venv is activated, your shell (in Fish) will show:
(venv) username@machine-name ...
Try activating the venv from within the A1111 directory, and then install the ROCm versions of torch and torchvision.
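i.e. roughly the following (reusing the install command you posted; adjust the path to wherever you cloned A1111):
cd ~/stable-diffusion-webui              # your A1111 checkout
source venv/bin/activate.fish
pip uninstall -y torch torchvision       # clear out the non-ROCm wheels first
pip3 install torch==2.1.2 torchvision==0.16.1 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1/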
I should probably provide a final update on this (for those who find this on Google when trying to solve a similar problem…).
The problem was that I didn’t have ROCm installed properly. Make sure to reboot between installation changes and that the versions are all consistent, and it should be fine. I now have everything working with 6.0 and can run A1111 and ComfyUI with no problems.
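For anyone checking their own setup, the quick confirmation I use now is:
# prints the dGPU name if torch can see it over HIP, otherwise a warning
python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no HIP device visible')"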