System: Ryzen™ AI Max+ 395 - 128GB
OS: Ubuntu 24.04 LTS
Drives: WD_BLACK SN850X 8000GB & WD_BLACK SN850X 2000GB
Display: Running X11 and not Wayland.
This system has been rock solid running on kernel 6.17.0-19 and previous kernels for months now. But with kernels 6.17.0-20 and 6.17.0-22 it hangs at random. Sometimes it reboots on its own a minute or so after the hang; other times I have to remove power and reboot.
There are no artifacts left over from the crash that I can find. The logs all seem to stop at the point of the crash. It doesn’t appear to be a hardware problem, as it runs flawlessly on 6.17.0-19.
Curious whether anybody else is seeing this behavior, and whether anybody knows the cause and a cure.
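For anyone who wants to check what the journal captured before a hang like this, here is a minimal Python sketch that pulls the tail of the previous boot's kernel log. It assumes persistent journaling is enabled and that `journalctl` is on the PATH; the helper names are made up for illustration. A hard hang may never flush its last messages to disk, so an empty or clean tail doesn't rule anything out.

```python
import subprocess


def previous_boot_cmd(lines: int = 50) -> list[str]:
    """Build a journalctl command for the end of the previous boot's kernel log."""
    # -k: kernel messages only, -b -1: previous boot, -n: last N lines
    return ["journalctl", "-k", "-b", "-1", "-n", str(lines), "--no-pager"]


def previous_boot_tail(lines: int = 50) -> str:
    """Run the command and return whatever journalctl printed (stdout, else stderr)."""
    result = subprocess.run(
        previous_boot_cmd(lines), capture_output=True, text=True, check=False
    )
    return result.stdout or result.stderr
```

After rebooting from a hang, `print(previous_boot_tail(100))` shows the last 100 kernel lines the journal managed to record.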
I’m not running Wayland because this system runs VMware Workstation (latest version) very heavily, and Workstation doesn’t work so hot under Wayland.
I don’t really have a way of doing this. This is my daily driver and VMware Workstation doesn’t cooperate with Wayland in a usable fashion.
I will try that debug mask tomorrow morning when I boot up and see how things go. Sometimes the system will run fine for several days before it locks up. I’ll make the change and keep an eye on things.
Sorry this is happening to you. This should be a supplier or retailer problem, not something consumers like us should have to deal with.
3.0.4 firmware didn’t fix it.
3.0.5 firmware didn’t fix it.
Wayland didn’t fix it.
Kernel flags didn’t fix it.
Upgrading to 26.04 with kernel 7 didn’t fix it.
We waited 4 months to go from 3.0.4 to 3.0.5 and we get “memory improvements” and “fixed boot time.” Meanwhile, the GPU and fabric flood events continue to stream in, and those of us who bought the machine to actually use it are left with a $4000 paperweight.
Here’s the list of flags I’ve been asked to throw at this. None of them fixed it. YMMV:
GPU crashes:
amdgpu.dcdebugmask=0x10
amdgpu.gpu_recovery=1
amdgpu.mes=0
Fabric floods:
pcie_ports=native
pcie_ecrc=on
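Before concluding that a flag didn't help, it's worth confirming that it actually reached the running kernel. Here is a small Python sketch that compares the flags listed above against a kernel command line such as the contents of `/proc/cmdline`. The function names and the grouping dictionary are made up for illustration; only the flag strings come from the list.

```python
from pathlib import Path

# Flags copied verbatim from the list above; the group labels follow the post.
SUGGESTED_FLAGS = {
    "GPU crashes": ["amdgpu.dcdebugmask=0x10", "amdgpu.gpu_recovery=1", "amdgpu.mes=0"],
    "Fabric floods": ["pcie_ports=native", "pcie_ecrc=on"],
}


def flag_status(cmdline: str, flags: dict[str, list[str]] = SUGGESTED_FLAGS) -> dict[str, bool]:
    """Map each suggested flag to True if it appears as a whole token in cmdline."""
    active = set(cmdline.split())
    return {flag: flag in active for group in flags.values() for flag in group}


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text()
```

On the affected machine, `flag_status(read_cmdline())` returns a dictionary showing which flags are active for the current boot. The match is on exact tokens, so `amdgpu.mes=1` would not count as `amdgpu.mes=0`.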
My “most stable” combination was kernel 6.17.0-19 on firmware 3.0.3. I didn’t get much time on anything below 3.0.3 because the system automatically updated when I got it, and I haven’t spent the additional time to roll back. But, to the point, I shouldn’t have to.
Yup, on days where I can’t afford downtime I’m running 6.17.0-19 because it has not crashed on me yet.
I’m assuming this is a kernel or kernel firmware issue, as it’s happening on more than just the FW Desktop according to reports I’m finding on the Internet.
I’m still running firmware 3.0.3, as I tend to wait a while before updating firmware versions.
I guess I’m sort of stuck at a point in time now until somebody figures out what’s going on.
I’d ask the Ubuntu/Canonical people, given this behavior. At a guess, some new backport probably broke something. When they backport so much, there’s a lot of room for things to go sideways.
Keep in mind Chromium will be in non-obvious places: any Electron application could potentially trigger it too if it happens to use hardware video decode for something.
Hmm… I’m pretty sure every time this has happened there has been a video playing in Chrome. I’ll probably try disabling accelerated video decode in Chrome and try running a newer kernel over the weekend.
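One way to run that test without touching Chrome's settings is to start the browser with hardware video decode switched off from the command line. Here is a minimal Python sketch that builds such a launch command. It assumes the Chromium switch `--disable-accelerated-video-decode` is honored by your build, and the binary name `google-chrome` is a guess that varies by install. Electron apps generally accept Chromium switches too, but that isn't guaranteed for every app.

```python
import subprocess

# Chromium command-line switch; assumed to apply to the installed build.
NO_HW_DECODE = "--disable-accelerated-video-decode"


def launch_cmd(binary: str = "google-chrome", *extra: str) -> list[str]:
    """Build an argv list that starts `binary` with hardware video decode disabled."""
    return [binary, NO_HW_DECODE, *extra]


def launch(binary: str = "google-chrome", *extra: str) -> subprocess.Popen:
    """Start the browser in the background and return its process handle."""
    return subprocess.Popen(launch_cmd(binary, *extra))
```

`launch()` starts the browser with the switch applied; `chrome://gpu` should then show whether video decode is still hardware accelerated.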
@Mario_Limonciello, I wrote up something here the other day that hints at Chromium as well (Slack, running in Electron). Don’t know if this helps. Saw you were active in that GitLab thread. I don’t think this provides any smoking gun, but it is supporting evidence (I think).