Hi,
I’m posting this as a sort of meta issue, since I can’t really pin it down to a single problem or cause. I’ll try to keep this updated, since the Debian 13 release is getting closer (the first freeze stage starts in a month) and more people are likely to upgrade and perhaps run into the same issues.
The past year
I got the laptop last year around this time, and although I’m happy with the device overall, the GPU keeps giving me trouble, seemingly with no end in sight.
It started very badly: weekly crashes that led to filesystem corruption and an unbootable system. But it turned out the problems were caused by old amdgpu firmware in Debian 12. After sorting that out and updating to BIOS 3.05, it got much better, so I summarized the important steps regarding firmware updates on the Debian wiki.
Then I had about 6 perfect months with no crashes (which means it is unlikely my hardware is defective; memtest86+ also finishes clean) – until one day the machine froze again. And then again a few weeks later. I must have done some partial upgrades through backports in the meantime (I don’t remember for sure), possibly introducing a regression. As the hardware is still relatively new, I thought more bugs were probably being resolved, and that it would be better to do a full upgrade.
So a month ago I updated to Debian 13 (trixie / testing, kernel 6.12.12, amdgpu firmware 20241210, Mesa 24.3.4-3, Xfce 4.20, xserver-xorg 1:7.7+24) – and it got even worse.
Current problems
Sometimes it takes days between crashes, but at the other extreme, two days ago I experienced 3 to 4 different types of crashes in a single day:
- sudden freeze: happens most often while watching video or when the machine is under load; the screen simply freezes (in one case followed by image corruption – a few large blocks of the now static image were moved to a different place). Only the mouse cursor still moves. As if to make a point, I got one as I was preparing this post – luckily the text was saved in a text editor and not in a web form.
- black screen: after blindly typing my password into xscreensaver while an external display wakes up, I find that only the cursor is visible (and still moves) and everything else is black. Possibly the same as “sudden freeze”, but happening while the screen is locked and therefore black? I ran umrgui over SSH and the only thing out of the ordinary was on the Buffer Objects tab: normally it shows around 4 copies of my desktop and one smaller bitmap with the cursor, but now it had only the desktop bitmap (also black). Triggering amdgpu_gpu_recover did not help (after recovery the screen stayed black).
- white screen: the external screen is frozen and the laptop screen turns white. This happened to me only once, 40 minutes after I rebooted the laptop from the “black screen” freeze. Unlike the “black screen” crash, triggering GPU recovery did help and everything worked fine after that.
- HW accelerated video playback (4K AV1, possibly others) now also results in a GPU hang – here I wanted to write “but it automatically recovers a few seconds later”, but when I went to confirm that, the recovery triggered but failed to recover. So “usually recovers”, but can also hang completely.
- Apart from all that, I found out that my NVMe drive also randomly drops PCIe speed a few days after boot (running sudo lspci -vv -s 02:00.0|grep LnkSta: shows “LnkSta: Speed 2.5GT/s (downgraded), Width x2 (downgraded)”). But that may or may not be an issue on the drive’s side; I have no way to tell.
Workarounds
Starting with the SSD issue, I found that instead of a full reboot, you can force the link speed to be renegotiated by resetting only the NVMe drive:
# cd /sys/devices/pci0000:00/0000:00:02.4/0000:02:00.0/
# echo bus > reset_method
# echo 1 > reset
I did not find any less “brute force” way to do it, but it seems to work well for me with no side effects (though your mileage may vary – I have no way to tell if all drives can take this gracefully). If it works well for you, you could set it up as an hourly cron job. (Don’t forget to change the PCI device path if you came here through search and have a different laptop.)
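A minimal sketch of what such an hourly job could look like, assuming the same PCI address and sysfs path as above. The function names are hypothetical, and the reset is only triggered when lspci actually reports the link as downgraded, so a healthy drive is left alone:

```shell
#!/bin/sh
# Sysfs path and PCI address are the ones from this post; adjust for your machine.
DEV_PATH=/sys/devices/pci0000:00/0000:00:02.4/0000:02:00.0
DEV_ADDR=02:00.0

# Reads "lspci -vv" output on stdin; succeeds if the LnkSta line
# reports a downgraded speed or width.
link_is_downgraded() {
    grep 'LnkSta:' | grep -q 'downgraded'
}

# Resets the drive over the bus only when the link is downgraded.
# Must run as root (lspci -vv and the sysfs writes both need it).
reset_if_downgraded() {
    if lspci -vv -s "$DEV_ADDR" | link_is_downgraded; then
        echo bus > "$DEV_PATH/reset_method"
        echo 1 > "$DEV_PATH/reset"
    fi
}
```

To use it from cron, add a call to reset_if_downgraded at the end of the file and run the script hourly from root's crontab. The same caveat applies as for the manual reset: there is no guarantee every drive takes this gracefully.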
For the GPU side, I tried some older known workarounds (booting with amdgpu.sg_display=0 and setting VRAM allocation to “gaming mode”), with no difference.
Sometimes the automatic GPU recovery kicks in (mainly with the accelerated video crash), but if it doesn’t, you may try to trigger it manually (as root) over SSH, or perhaps by switching to a TTY (if it still works):
# cat /sys/kernel/debug/dri/0000\:c1\:00.0/amdgpu_gpu_recover
If it does not help, kiss your unsaved data goodbye and reboot (having all sysrq flags required for SysRq+REISUB enabled may be helpful – it’s a safer way to hard-reboot than just holding the power button).
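As a rough sketch of how to check those flags: kernel.sysrq is a bitmask, and according to the kernel's SysRq documentation the REISUB sequence needs keyboard control (4), sync (16), remount read-only (32), process signalling (64) and reboot (128). The helper names below are hypothetical:

```shell
#!/bin/sh
# Bits of kernel.sysrq needed for REISUB:
#   4 = keyboard control (R), 64 = signal processes (E, I),
#   16 = sync (S), 32 = remount read-only (U), 128 = reboot (B)
REISUB_MASK=$((4 + 16 + 32 + 64 + 128))

# Succeeds if the given kernel.sysrq value allows the full REISUB sequence.
sysrq_allows_reisub() {
    value=$1
    # The value 1 is special: it enables every SysRq function.
    [ "$value" -eq 1 ] && return 0
    [ $((value & REISUB_MASK)) -eq "$REISUB_MASK" ]
}

# Prints the given value with the missing REISUB bits added.
sysrq_with_reisub() {
    value=$1
    if [ "$value" -eq 1 ]; then
        echo 1
    else
        echo $((value | REISUB_MASK))
    fi
}
```

For example, sysrq_allows_reisub "$(cat /proc/sys/kernel/sysrq)" tells you whether the running system is ready. A value such as 438 lacks the process signalling bit, so E and I would be ignored; sysrq_with_reisub 438 prints the value to set instead, which you could apply with sysctl or persist in a file under /etc/sysctl.d/.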
Trying to reset the GPU the same way as the SSD just breaks things even further, so do not bother.
Fixes and outlook
I found several threads with similar problems (one even popped up today, before I finished this post, and the log seems similar to the “white screen” freeze), but they rarely reach a conclusion, or end up suspecting a hardware issue that is never confirmed. The only fix I can think of is downgrading to an older kernel or firmware release, but with time between crashes ranging from minutes to weeks, it’s pretty hard to conclusively find a combination that works (and won’t be obsolete in a few months due to lack of security updates).
Although most of the bugs seem to be related to the GPU and I have a netconsole pointed to another machine, I have no idea if the cause lies with the kernel, the amdgpu driver, Mesa, the BIOS, or something else. So while I would like to at least report the issues, I’m not sure where to send my logs (except perhaps here – see the following post). I also have umr installed, but apart from running umrgui and looking at pretty graphs I don’t really know how to get anything useful from it. If anyone could give me some pointers in this regard, I can try to collect more data when the next crash comes.
I generally try to be patient with Linux hardware support (especially considering Debian’s “always out of date” status), but at this point the 7640U/7840U is almost two years old, and even with a new kernel and firmware it’s seemingly not getting any better. Is there an expectation that the iGPU will eventually be stable, or is the platform inherently “fragile” in some way that makes it hard to add support for newer platforms (these days probably Zen 5) without breaking the older ones?
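Until it is clear where the logs should go, one low-effort option is to save the GPU-related kernel messages after each crash. This is only a sketch: the function name is hypothetical, and reading the previous boot's log assumes the systemd journal is stored persistently.

```shell
#!/bin/sh
# Keeps only kernel log lines that mention the GPU stack (amdgpu or drm).
# Reads log text on stdin and writes the matching lines to stdout.
gpu_kernel_lines() {
    grep -iE 'amdgpu|drm'
}
```

After a reboot, journalctl -k -b -1 --no-pager | gpu_kernel_lines > gpu-previous-boot.log would then collect what the kernel logged before the hang; the same filter can be applied to whatever the netconsole receiver captured.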
Thanks!