[TRACKING] Hard freezing on Fedora 36 with the new 12th gen system

@nrp

Had my first hard lock-up earlier (though I have previously seen similar but brief lockups that ended in under 20 seconds). This time it was a full lock, with the screen still showing a frozen display (i.e. it didn’t blank), and the CPU fan kicked up to high after about 1 minute of sitting locked; after this, there was a noticeable temperature ramp through the chassis over maybe 1.5 to 2 minutes. Forced power off with a long button press, due to concern about heat.

  1. Fedora 36
  2. 5.18.18-200.fc36.x86_64
  3. I did set that exact ALS sensor workaround that you mention, via the grubby method found in Fedora install guide. Otherwise, kernel untouched.
  4. SN850
  5. Just browsing w/ Firefox (maybe there was a terminal open on another workspace, but nothing running in it, if so.) Maybe 10 or so tabs, some sites with probably mildly heavy javascript (Indeed, browser client of Discord, nothing else with any weight at all.)

Possibly an unrelated issue, but I’ve been encountering brief freezes where the system locks up for 1-2 seconds and then recovers, on Arch lxde, using xorg. The journalctl error is similar to the logs reported by @M4X here:

Aug 23 23:02:27 arichard kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:0:00000000
Aug 23 23:02:27 arichard kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Aug 23 23:02:27 arichard kernel: i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
Aug 23 23:02:27 arichard kernel: i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
Aug 23 23:02:27 arichard kernel: i915 0000:00:02.0: [drm] HuC authenticated
Aug 23 23:02:27 arichard kernel: i915 0000:00:02.0: [drm] GuC submission enabled
Aug 23 23:02:27 arichard kernel: i915 0000:00:02.0: [drm] GuC SLPC enabled

However, the logs I posted lack the timeout, i.e. it seems to restart successfully and continue instead of freezing permanently. AFAICT, the brief freezes no longer occur if I downgrade to the LTS kernel (5.19.3 → 5.15.62).

  • Distro: Arch Linux with lxde (xorg)
  • Kernel: Brief freezes on 5.19.3; no issues so far on 5.15.62
  • Unusual kernel line: n/a
  • SN850
  • Circumstances: far higher rate of freezing when I have many internet tabs open.
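
For anyone who wants to compare their own logs against the ones in this thread, here is a rough sketch of a shell function that sorts kernel log lines into "hang that recovered" versus "hang where the reset timed out". The function name classify_i915_log is made up for illustration; the strings it matches are simply the ones from the journalctl excerpts posted here, so other kernel versions may word things differently.

```shell
# classify_i915_log: read kernel log lines on stdin and report which kind
# of i915 hang they show. The name is hypothetical; the matched strings are
# taken from the journalctl excerpts quoted in this thread.
classify_i915_log() (
    log=$(cat)
    if ! printf '%s\n' "$log" | grep -q 'GPU HANG'; then
        echo "no-hang"
    elif printf '%s\n' "$log" | grep -q 'reset request timed out'; then
        echo "hang-reset-timeout"
    else
        echo "hang-recovered"
    fi
)

# Typical use, checking the kernel messages of the previous boot:
#   journalctl -k -b -1 | classify_i915_log
```

A result of hang-recovered would correspond to the brief 1-2 second freezes, while hang-reset-timeout matches the logs from the permanent lock-ups.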

I’m not sure how helpful this will be for others, but I haven’t had any issues for 2 days now, so perhaps it will be.

Looking into the behavior I was seeing and testing multiple DEs, for me at least the problems seemed centered around Gnome.

Looking into what was out there, at least in Manjaro, there seem to be some issues between the “Gnome 4x UI Enhancements” and “Dash to Dock” extensions.

I disabled both and have had a much better experience so far.

Knock on wood, but so far I have had no more lock-ups, no more few-second freezes, and resuming from deep sleep comes back up correctly without any sort of weird flickering, glitches or crashes inside of Gnome.

So currently the only problem I have left is that sometimes I reboot and the external monitor is detected, and other times I have to unplug and re-plug the dock’s USB-C cable.

  1. Fedora 36
  2. 5.18.19-200.fc36
  3. Haven’t messed with this yet.
  4. SN850
  5. Usually happens in Gnome settings or the system settings app in Plasma. On Gnome (which I gave up on based on personal preferences) the lockups would be complete: no keyboard input registers, and it requires a hard power reset to recover. On Plasma you can still get to a virtual console and poke around (and do things like kill the desktop session so you can start another one). I have had some other random display locks on the device while in a Plasma session, usually when I was using an Electron app. These were different in that the entire screen would lock for maybe 10-20 seconds, then unfreeze like nothing happened. I haven’t been able to reliably reproduce those freezes.

Has anyone tried the stuff outlined here yet?

I just went through their steps. I need some time to see if I can get it to lock up, but I’m keeping my fingers crossed.

Seems like I resolved the problem (so far one day without any crash, even with VA-API enabled).

My solution: add this kernel param:

i915.request_timeout_ms=60000

If it still doesn’t work, add this:

i915.request_timeout_ms=60000 intel_iommu=off i915.reset=0 i915.enable_psr=0 i915.enable_fbc=0 i915.disable_power_well=1 i915.enable_guc=0

If it still doesn’t work, run this on boot:

echo 400 | sudo tee /sys/class/drm/card0/gt_min_freq_mhz;
echo 10000 | sudo tee /sys/class/drm/card0/engine/rcs0/preempt_timeout_ms;
echo 10000 | sudo tee /sys/class/drm/card0/engine/rcs0/heartbeat_interval_ms;
echo 1000 | sudo tee /sys/class/drm/card0/engine/rcs0/stop_timeout_ms;
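
For the "run this on boot" step, a slightly more defensive variant might look like the sketch below. It writes the same four values as the commands above, but skips any file that is missing or not writable, since card numbering and the available engines can differ between machines. The function name and the optional base-directory argument are additions for illustration, not part of the original fix, and it has to run as root to write to the real sysfs paths.

```shell
# apply_i915_tunables [BASE]: write the four values from the commands above
# into the i915 sysfs files under BASE (default: /sys/class/drm/card0).
# Hypothetical helper; must run as root when BASE is the real sysfs path.
apply_i915_tunables() (
    base=${1:-/sys/class/drm/card0}
    while read -r value relpath; do
        target="$base/$relpath"
        if [ -w "$target" ]; then
            echo "$value" > "$target"
        else
            echo "skipping $target (missing or not writable)" >&2
        fi
    done <<'EOF'
400 gt_min_freq_mhz
10000 engine/rcs0/preempt_timeout_ms
10000 engine/rcs0/heartbeat_interval_ms
1000 engine/rcs0/stop_timeout_ms
EOF
)
```

Passing a different base directory (for example a scratch directory) makes it possible to dry-run the function without touching the GPU settings.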
Old answer

I’ve been working on this issue for over a month now. What I observed:

  • It hangs more frequently if you use a HW-accelerated decoder (VA-API, for example)
  • Disabling GuC submission does make it a bit more stable (add kernel param i915.enable_guc=2). See Intel graphics - ArchWiki
  • I remapped Alt+F12 to trigger a session logout with the gnome-session-quit --force command, so whenever the GPU hangs, I don’t need to restart the whole computer.
  • Since the logout method above works fine, I suspect that GNOME does not handle GPU context reset correctly. Let’s hope that it will be fixed by GNOME.
  • I added these kernel params: i915.enable_psr=0 i915.enable_fbc=0 i915.disable_power_well=1 intel_iommu=igfx_off. It seems to be a bit more stable.
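
When experimenting with several kernel parameters like these, it can be worth confirming after a reboot that each one really made it onto the running kernel's command line. Below is a minimal sketch of such a check; the helper name has_kernel_param is hypothetical, and it only does a whole-word comparison against /proc/cmdline (it does not tell you whether the driver accepted the value).

```shell
# has_kernel_param PARAM [FILE]: succeed if PARAM appears as a whole
# space-separated word in FILE (default: /proc/cmdline, the command line
# of the running kernel). Hypothetical helper name.
has_kernel_param() (
    param=$1
    file=${2:-/proc/cmdline}
    set -f    # no filename expansion while splitting the line into words
    for word in $(cat "$file"); do
        [ "$word" = "$param" ] && exit 0
    done
    exit 1
)

# Example:
#   has_kernel_param i915.enable_guc=2 && echo "GuC setting is active"
```

Because the match is on whole words, a parameter with a missing or extra character is reported as absent rather than silently treated as present.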

My system: Framework laptop i7-1260P, 16GBx2 RAM, Samsung SSD 980 Pro 1TB, Fedora 36 with stock kernel 5.18.17
P.S.: I tried the Xanmod kernel 5.19; nothing changed.


For the record and to help lead to a discovery of the solution:

My wife’s laptop is also freezing, but while running Windows 11; no blue screen appears and no errors are logged. No firmware updates are available for the SSD either. Running on a 12th gen, i7-1260P, 32GB RAM, 1TB SN850. I’ve run memory tests and all of the Windows tests/scans for fixing errors.

For the record, I am using a 12th gen 1240P (Samsung 980). I have had it for 24 hours and played extensively with it under Ubuntu MATE 22.04.
I had a stuttering mouse at one point, the only strange thing I saw; I ended up resetting the BIOS parameters.
I played with various kernels and compositors, changed the number of CPUs from the BIOS and Linux, switched off boost and optimised with powertop… played a game… and never had a complete lock-up or freeze.
It really feels solid, like my ThinkPad… for the first 24h.

Have you experienced any over current protection behaviour with USB devices so far on the 12th gen board?

This has just happened to me on Fedora Silverblue. It has only happened once thus far.

  1. Fedora 36 Silverblue
  2. kernel 5.19.4
  3. sudo grubby --update-kernel=ALL --args="module_blacklist=hid_sensor_hub"
  4. SK Hynix Gold P31
  5. Open GNOME Settings (was downloading stuff in the background)

Out of curiosity, did you try the Gnome changes I made above?
I haven’t had any more issues since then and I had found a bug for how those extensions interacted with each other.

I have no extensions

  1. Fedora 36
  2. Linux fedora 5.19.4-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Aug 25 17:42:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  3. Freezing started when I rebooted with hid_sensor_hub blacklisted, so I reverted the change and it no longer occurs. No other kernel parameters have been added
  4. SK Hynix P41
  5. Gnome Settings, when modifying the settings for the touchpad

Wanted to update here, in case it is useful to anyone: About a week ago, I switched my UI scaling in GNOME display settings back to 100%, and have not noticed any lockups since, be they soft or hard. I had previously been scaling to 125% after enabling the option to do so, by issuing gsettings set org.gnome.mutter experimental-features "['scale-monitor-framebuffer']". Currently on 7 days uptime.

Surprisingly, the forum’s search function doesn’t show hits in this topic for “scale” or “scaling”… It does seem like a large portion of people posting here are using GNOME, on various distros, so it might be something worth trying for folks… It is an experimental feature, after all.
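
For folks who are not sure whether they ever turned this option on, a quick check could look like the sketch below. The small helper fractional_scaling_enabled is a made-up name; it just looks for the feature string in whatever gsettings prints. The gsettings commands in the comments use the same schema and key as the command quoted above.

```shell
# fractional_scaling_enabled: succeed if the gsettings value passed on stdin
# lists the 'scale-monitor-framebuffer' experimental feature.
# Hypothetical helper name.
fractional_scaling_enabled() {
    grep -q "'scale-monitor-framebuffer'"
}

# Check the current value:
#   gsettings get org.gnome.mutter experimental-features | fractional_scaling_enabled && echo "enabled"
# Go back to the default value of the key:
#   gsettings reset org.gnome.mutter experimental-features
```

Note that resetting the key clears every experimental feature listed there, not only the scaling one, so check the current value first if you have set others.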

@nrp

  1. Arch Linux running GNOME 42.3 (on Wayland 1.26)
  2. Linux 5.19.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 31 Aug 2022 22:09:40 +0000 x86_64 GNU/Linux
  3. kernel line contains nvme.noacpi=1. Additionally, /etc/modprobe.d has a config file to blacklist hid_sensor_hub
  4. Sabrent Rocket 4.0 2TB with fw version RKT401.3
  5. Was trying to change settings in gnome-control-center while listening to music. A few other windows were up in the background (browser, terminal, nautilus). The music player is foobar2000 v2.0-beta-3 (32-bit), a Windows application running through Wine 7.0 via flatpak.

When I pulled up GNOME’s settings app, the system completely froze. The music I was listening to kept playing even though the GPU hung and graphics were frozen. Within seconds, the CPU fan started getting loud and was going at full blast. I couldn’t regain control, so I had to hold down power until the system shut down ungracefully. Here are the journalctl entries right when the crash happened:

Sep 01 16:34:28 OverARCHing-Framework kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:84dffffb, in gnome-control-c [17179]
Sep 01 16:34:28 OverARCHing-Framework kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Sep 01 16:34:28 OverARCHing-Framework kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 01 16:34:28 OverARCHing-Framework kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 01 16:34:28 OverARCHing-Framework kernel: i915 0000:00:02.0: [drm] gnome-control-c[17179] context reset due to GPU hang
Sep 01 16:34:28 OverARCHing-Framework kernel: i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
Sep 01 16:34:28 OverARCHing-Framework kernel: i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
Sep 01 16:34:28 OverARCHing-Framework kernel: i915 0000:00:02.0: [drm] HuC authenticated
Sep 01 16:34:28 OverARCHing-Framework kernel: i915 0000:00:02.0: [drm] GuC submission enabled
Sep 01 16:34:28 OverARCHing-Framework kernel: i915 0000:00:02.0: [drm] GuC SLPC enabled

I doubt this has anything at all to do with the SSD; this seems to be an issue specific to GNOME on Wayland with the i915 GPU driver. Where in the graphics stack this is happening isn’t clear, but the stopped heartbeat on rcs0 followed by rcs0 reset request timed out issue has shown up in Google searches going back at least 5 years in one form or another.


Definitely seems to be something with GNOME. Had the issue on Fedora 36 but not when I switched to using Fedora 36’s KDE spin.

Happened again. Exactly identical conditions to trigger it as my previous post above: playing music while navigating settings in gnome-control-center, same kernel/OS/DM/WM/Wayland.

Sep 02 20:19:37 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:84dffffb, in gnome-control-c [30624]
Sep 02 20:19:37 kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Sep 02 20:19:37 kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 02 20:19:37 kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Sep 02 20:19:37 kernel: i915 0000:00:02.0: [drm] gnome-control-c[30624] context reset due to GPU hang
Sep 02 20:19:37 kernel: i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
Sep 02 20:19:37 kernel: i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
Sep 02 20:19:37 kernel: i915 0000:00:02.0: [drm] HuC authenticated
Sep 02 20:19:37 kernel: i915 0000:00:02.0: [drm] GuC submission enabled
Sep 02 20:19:37 kernel: i915 0000:00:02.0: [drm] GuC SLPC enabled

This time there are some slight differences from the previous time I triggered this freeze:

  1. kernel line contains the additional parameter i915.request_timeout_ms=60000
  2. modprobe.d config file blacklisting hid_sensor_hub was disabled (meaning the module was loaded normally this time around)

So the i915.request_timeout_ms param does nothing @ngxson, at least not on its own, and hid_sensor_hub has nothing to do with the issue either. I haven’t tested the other params yet, but I’ve enabled the sysrq unraw command (set kernel.sysrq=4 in a config file under /etc/sysctl.d/).

Next time this happens I’m going to attempt to grab control of the keyboard (alt+prtscr+r) and then ctrl+c to attempt to kill the display manager and everything spawned by it to give control back to PID1. At the very least, it will enable a semi-graceful ctrl+alt+del reboot. (If that doesn’t work, then I’ll set kernel.sysrq=132 to enable an ungraceful soft reboot, but hopefully we won’t need to go that far.)

==EDIT== This happened, and the unraw plus sigint trick worked, but sigint (ctrl+c) can cause corruption. It’s better to try switching to a different TTY and then switching back after unraw.
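
Since the sysrq values 4 and 132 mentioned above are bitmasks, here is a small sketch that decodes a value into the function groups it enables. The bit meanings are the ones listed in the kernel's sysrq documentation; the function name decode_sysrq and the short labels are made up for illustration.

```shell
# decode_sysrq MASK: print the SysRq function groups enabled by a
# kernel.sysrq bitmask, one per line. Bit meanings follow the kernel's
# sysrq documentation; the function name and labels are hypothetical.
decode_sysrq() (
    mask=$1
    if [ "$mask" -eq 0 ]; then echo "disabled"; exit 0; fi
    if [ "$mask" -eq 1 ]; then echo "all functions enabled"; exit 0; fi
    while read -r bit label; do
        if [ $(( mask & bit )) -ne 0 ]; then echo "$label"; fi
    done <<'EOF'
2 console-loglevel
4 keyboard-control (SAK, unraw)
8 debugging-dumps
16 sync
32 remount-read-only
64 signal-processes
128 reboot/poweroff
256 nice-RT-tasks
EOF
)

# Decode the value currently in effect:
#   decode_sysrq "$(cat /proc/sys/kernel/sysrq)"
```

So 4 allows only the keyboard functions (which include unraw), and 132 is 128 + 4, adding the reboot/poweroff group on top of that.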


Something seems to be in the works (I just googled, don’t fully understand any of it really):
https://lore.kernel.org/all/DM4PR11MB5971A43B5E78F34B30EA5E1587729@DM4PR11MB5971.namprd11.prod.outlook.com/t/


yikes…
