Interesting new data point today, though not the one I expected. I went the other direction and installed Windows on my Framework to see whether the eGPU would work as expected, and it does! It seems to work perfectly for C++ machine learning libraries.
That seems to rule out hardware as the issue. Something about Linux doesn’t like the eGPU…
It might be related to this: Unable to run with 7900 xtx · Issue #2746 · ROCm/ROCm · GitHub (which is not eGPU-specific) except that sometimes it happens even with low VRAM utilization, not the 17+ GB they report. Now that I know it isn’t the hardware, I will see if I can figure out exactly what crashes it, since I have been able to run very very small programs on the eGPU (e.g., sum of a vector with 1k elements).
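For anyone curious, the kind of tiny test that did run was roughly this. It's a minimal sketch rather than my exact code, with made-up names (vecAdd, etc.):

```cpp
// Minimal HIP sanity test: add two small vectors on the eGPU.
// Compile with: hipcc vec_sum.cpp -o vec_sum
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;  // ~1k elements, like the tiny tests that worked
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n, 0.0f);

    float *da, *db, *dc;
    hipMalloc(&da, n * sizeof(float));
    hipMalloc(&db, n * sizeof(float));
    hipMalloc(&dc, n * sizeof(float));
    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);

    // Every element should be 3.0 if the eGPU actually executed the kernel.
    printf("hc[0] = %f, hc[%d] = %f\n", hc[0], n - 1, hc[n - 1]);

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```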
UPDATE: I upgraded the kernel from 6.7-rc7 → 6.7.0 and now it all works perfectly? I really did not expect much to change, given how few kernel changes I'd heard were landing over the holidays. But indeed it works! I've stress tested it with matrix sums up to 23 GB, the gpu-burn HIP example, and finally by running LLMs up to 23888 MB of VRAM utilization, and it all works just as well as on Windows now.
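In case it helps anyone reproduce the VRAM side of that test, here is a rough sketch of the kind of large-allocation check I mean (not my exact code; the 1 GiB headroom is an arbitrary assumption):

```cpp
// Rough VRAM stress sketch: query free memory, then allocate most of it.
// Compile with: hipcc vram_stress.cpp -o vram_stress
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    size_t freeB = 0, totalB = 0;
    hipMemGetInfo(&freeB, &totalB);
    printf("free: %zu MiB / total: %zu MiB\n", freeB >> 20, totalB >> 20);

    // Grab most of the free VRAM, leaving some headroom for the runtime.
    const size_t headroom = 1ull << 30;
    size_t want = (freeB > headroom) ? freeB - headroom : freeB / 2;

    void* buf = nullptr;
    hipError_t err = hipMalloc(&buf, want);
    printf("allocating %zu MiB: %s\n", want >> 20, hipGetErrorString(err));

    if (err == hipSuccess) {
        hipMemset(buf, 0, want);   // touch the whole allocation
        hipDeviceSynchronize();
        hipFree(buf);
    }
    return 0;
}
```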
Could even mark this as resolved now! I mean, we’ll see if I’m being hasty, but it looks very very good so far.
I upgraded to ROCm 6 almost the day it came out, which did fix the PCIe atomics issues, so that I was able to print from within kernels even without compiling with -mprintf-kind=buffered. So I don't think that was it, but I will admit I had deleted my previous container and started with a new one after upgrading to kernel 6.7, so it is entirely possible that I reinstalled something better in the new container. I'm not sure which --usecase= options I had selected during the earlier install, for example…
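For context, the printf test I mean is just this kind of thing (a sketch; on my setup, before the ROCm 6 / atomics fix, it only produced output when built with the buffered mode):

```cpp
// Kernel-side printf test. Before the PCIe atomics fix, the default printf
// path didn't work for me and I had to compile with:
//   hipcc -mprintf-kind=buffered printf_test.cpp -o printf_test
// After upgrading to ROCm 6, a plain build worked as well:
//   hipcc printf_test.cpp -o printf_test
#include <hip/hip_runtime.h>

__global__ void hello() {
    printf("hello from block %d, thread %d\n",
           (int)blockIdx.x, (int)threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();
    hipDeviceSynchronize();  // make sure kernel printf output is flushed
    return 0;
}
```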
It also turns out PyTorch nightly still doesn't work, but I haven't investigated much there since my main use case is C++, which now seems to work perfectly. I will eventually, though, since whatever fails is likely some HIP call in the PyTorch backend anyway.
Thanks again for all your help, Mario. I would not have ventured into trying new kernels otherwise. I probably would have just returned the GPU!
Update: As of kernel 6.8.8 and ROCm 6.1.0, it works beautifully even with PyTorch (nightly version, anyway)!
PyTorch was the last part not really working, and it is a bit hard to disentangle kernel vs. ROCm vs. PyTorch updates because I upgraded from kernel 6.8.4 → 6.8.8, ROCm 6.0.3 → 6.1, and PyTorch nightly to a more recent nightly, all at the same time. But in any case, it works now (the last issue I observed was this bug).
A new oddity is that the laptop usually won't boot with the eGPU plugged in. I noticed this after a recent kernel upgrade. It boots roughly 1 in 10 times, which seems to be related to whether the eGPU gets position 2 of 3 or position 3 of 3 in the rocm_agent_enumerator output (i.e., HIP_VISIBLE_DEVICES=0 vs. 1). It used to show up in position #2 most of the time, though occasionally #3, and now, in the rare cases it does boot, it is always #3, so I suspect it is failing in the #2 cases. No matter, though: hot-plugging is working better than ever (it used to not play nice with ROCm), so I just unplug it before starting/restarting and plug it back in after boot.
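Besides rocm_agent_enumerator, a quick way to see which index the eGPU ended up with on the HIP side is something like this (sketch; the 7900 XTX should report gfx1100, which distinguishes it from the iGPU):

```cpp
// List HIP devices to see where the eGPU landed and which
// HIP_VISIBLE_DEVICES index it would correspond to.
// Compile with: hipcc list_devices.cpp -o list_devices
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        printf("device %d: %s (%s), %zu MiB VRAM\n",
               i, prop.name, prop.gcnArchName, prop.totalGlobalMem >> 20);
    }
    return 0;
}
```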
In conclusion, I would finally recommend a 7900 XTX eGPU for small-scale machine learning on Linux, as it is terrific compute per dollar and now not difficult to use. I'm running Ubuntu 22.04 in a Distrobox container on a Fedora 40 Silverblue host, if anyone is curious. Transfer to/from the eGPU is a bit slow over the 1x PCIe link, but once the data are there, training and running machine learning models is very fast.
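If anyone wants to measure that transfer bottleneck on their own setup, here's a rough timing sketch with HIP events (the 1 GiB size is just an illustrative choice; pageable host memory is used, so peak numbers would need hipHostMalloc-pinned buffers):

```cpp
// Rough host->device bandwidth check over the eGPU link.
// Compile with: hipcc bw_test.cpp -o bw_test
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t bytes = 1ull << 30;        // 1 GiB transfer
    std::vector<char> host(bytes, 1);       // pageable host buffer
    void* dev = nullptr;
    hipMalloc(&dev, bytes);

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start, 0);
    hipMemcpy(dev, host.data(), bytes, hipMemcpyHostToDevice);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    double gib = bytes / double(1ull << 30);
    printf("H2D: %.1f ms (~%.2f GiB/s)\n", ms, gib / (ms / 1000.0));

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipFree(dev);
    return 0;
}
```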
It hasn’t yet. It will probably go into 6.10-rc1 and then be backported to stable after that. But if you want to grab it sooner, you can confirm it really fixes things.