[TRACKING] Hard freezing on Fedora 36 with the new 12th gen system

KevSlashNull · November 29, 2022, 8:16pm

This has also happened to me about once a week since I bought the Framework.

Ubuntu 22.04.1
5.15.0-53-generic #59-Ubuntu SMP
module_blacklist=hid_sensor_hub
Samsung SSD 980 PRO 1 TB
The freezes happen randomly in Firefox, but yesterday I installed Kerbal Space Program (KSP), which seems to reliably freeze the laptop a few minutes after the game launches.

Random GPU hang in Firefox/VS Code/normal usage:

Okt 16 12:01:37 kevs-framework kernel: Asynchronous wait on fence 0000:00:02.0:gnome-shell[2584]:4fee timed out (hint:intel_atomic_commit_ready [i915])
Okt 16 12:01:41 kevs-framework kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:0:00000000
Okt 16 12:01:41 kevs-framework kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0

GPU hang while playing KSP:

Nov 29 20:52:35 kevs-framework kernel: Asynchronous wait on fence 0000:00:02.0:gnome-shell[3148]:cd6e timed out (hint:intel_atomic_commit_ready [i915])
Nov 29 20:52:39 kevs-framework kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:84dffffb, in KSP.x86_64 [5367]
Nov 29 20:52:39 kevs-framework kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Nov 29 20:52:39 kevs-framework kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Nov 29 20:52:39 kevs-framework kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Nov 29 20:52:39 kevs-framework kernel: i915 0000:00:02.0: [drm] Renderer[5929] context reset due to GPU hang
Nov 29 20:52:39 kevs-framework kernel: i915 0000:00:02.0: [drm] KSP.x86_64[5367] context reset due to GPU hang

The random hang usually recovers after 30-60 seconds of waiting, but the hang while playing KSP (12:1:84dffffb) does not and requires a forced shutdown and reboot.
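To compare hangs across logs like the two above, the ecode and the blamed process can be pulled out of the "GPU HANG" lines. This is only a rough sketch: the regular expression is inferred from the log lines quoted in this thread rather than from the i915 source, and find_gpu_hangs is a made-up helper name.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not taken from the i915 source). The ", in <process> [<pid>]" suffix is
# optional because some hangs are not attributed to a process.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-f]+:[0-9a-f]+:[0-9a-f]+)"
    r"(?:, in (?P<proc>\S+) \[(?P<pid>\d+)\])?"
)

def find_gpu_hangs(log_text):
    """Return a list of (ecode, process, pid) tuples found in log_text."""
    hangs = []
    for line in log_text.splitlines():
        m = HANG_RE.search(line)
        if m:
            pid = int(m.group("pid")) if m.group("pid") else None
            hangs.append((m.group("ecode"), m.group("proc"), pid))
    return hangs

sample = """\
Okt 16 12:01:41 kevs-framework kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:0:00000000
Nov 29 20:52:39 kevs-framework kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:84dffffb, in KSP.x86_64 [5367]
"""

for ecode, proc, pid in find_gpu_hangs(sample):
    print(ecode, proc or "(no process reported)", pid)
```

Feeding it saved journal or dmesg output gives a quick list of which ecodes occurred and which process, if any, was blamed.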

vhx · November 30, 2022, 10:54am

Looks like you need to update your kernel to >=6.0.9. The ecode 12:0:00000000 hang should be resolved once that’s done. Not sure if it’ll fix KSP, but worth a shot!

Just a reminder: we’re running the latest 12th gen Intel hardware. We’re not going to find the required support or bug fixes in old kernels. This was a main reason why I moved to Fedora many years ago: much newer kernels for the latest hardware.
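A quick way to check a kernel release string against the 6.0.9 minimum suggested above is to compare version tuples. This is a minimal sketch; kernel_version_tuple and meets_minimum are illustrative names, and the two release strings are the ones reported in this thread.

```python
import re

def kernel_version_tuple(release):
    """Parse the leading 'major.minor[.patch]' of a `uname -r` style string."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level is treated as 0.
    return tuple(int(part or 0) for part in m.groups())

def meets_minimum(release, minimum=(6, 0, 9)):
    """True if the release is at least the given (major, minor, patch)."""
    return kernel_version_tuple(release) >= minimum

for release in ("5.15.0-53-generic", "6.0.10-300.fc37.x86_64"):
    print(release, meets_minimum(release))
```

On a running Linux system, platform.release() from the standard library returns the same string as uname -r and can be passed straight to meets_minimum.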

real_or_random · November 30, 2022, 4:00pm

Was this a hard freeze or did the system come back? What was the ecode (or better the full log)?

Matt_Hartley · November 30, 2022, 5:21pm

Just to reiterate my own experiences:

So much this.

PDXTabs · November 30, 2022, 7:25pm

FWIW I’ve just been running vanilla mainline Linux kernels on my Framework with Ubuntu 22.04. You can find them here: Index of /mainline

Instructions here: How to Install the Latest Linux Kernel on Ubuntu & Linux Mint?

Nicholas_La_Roux · December 1, 2022, 1:39am

Quick update here: I’m still experiencing freezes that automatically recover after about 10 seconds on Fedora 37 with the 6.0.10 kernel (latest).

KevSlashNull · December 1, 2022, 3:10pm

Thanks for the help @vhx! I’ve installed kernel 6.0.9 on my Framework (yes, 12th gen) and it seems to have fixed the ecode 12:0:00000000 hang, although I’ll only know for sure in a few weeks. As for KSP, I played it yesterday evening for an hour or two with no freezes!

vhx · December 1, 2022, 3:24pm

I assume it’s generating an ecode; what is it? Running dmesg | grep -i ecode is probably the easiest way to find out.

egalanos · December 3, 2022, 5:44am

Whilst I haven’t been having GPU issues under F37 due to my relatively simple usage, seeing the ongoing posts made me think I should mention the debugging resources I had on my list of things to try in case the problem persisted.

Increase the level of logging with additional kernel command line parameters:
- drm.debug=0xe
  - Run modinfo drm to see the options
- log_buf_len=4M
- Source: https://01.org/linuxgraphics/documentation/bugs-and-debugging/tips-may-help-solve-your-issue-less-time
Capturing errors
- Prepare by installing igt-gpu-tools
- Capture error dumps:
  - cat /sys/class/drm/card*/error | gzip > gpu-error.gz
  - Source: https://01.org/linuxgraphics/documentation/how-get-gpu-error-state
- Run intel_error_decode to then decode an error dump
Online resources
- Issues tracker: Issues · drm / intel · GitLab
- https://01.org/linuxgraphics/documentation/bugs-and-debugging (look at the sections in the left sidebar)
- https://01.org/linuxgraphics/documentation/development/how-debug-suspend-resume-issues

Hope that is helpful…
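
For anyone who would rather script the error dump capture, here is a rough Python equivalent of the cat ... | gzip one-liner from the list above. It is only a sketch: capture_gpu_error is a made-up helper name, the default output file name simply mirrors the one-liner, and reading the sysfs files may require root.

```python
import glob
import gzip
import shutil

def capture_gpu_error(pattern="/sys/class/drm/card*/error",
                      out_path="gpu-error.gz"):
    """Concatenate every file matching pattern into one gzip archive.

    Returns the list of files that were read (empty if nothing matched).
    """
    paths = sorted(glob.glob(pattern))
    with gzip.open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # Stream the contents so large dumps are not held in memory.
                shutil.copyfileobj(src, out)
    return paths

# Example (may need root to read the sysfs files):
#   capture_gpu_error()
```

The resulting archive is what would then be unpacked and passed to intel_error_decode, as described in the list above.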

Nicholas_La_Roux · December 3, 2022, 11:48am

Just captured the error from a freeze (removed patch). It occurred instantly after resuming from sleep.

~ 
❯ dmesg | grep -i ecode
[    1.261967] pci 0000:00:02.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    3.120876] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem

vhx · December 3, 2022, 5:07pm

@Nicholas_La_Roux - that appears to be normal and part of the bootup (dmesg is showing 1.26 and 3.12 seconds since the last reboot). I get those on my laptop too without any stability issues.

Based on your output, it looks like your freezing isn’t generating an ecode, so something else is going on, imo.
When you next get the freezing, try looking at dmesg (maybe dmesg | tail -n 30 for the most recent 30 lines) to see if that returns anything useful.
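
The vgaarb lines show up in that search because a case-insensitive match on "ecode" also hits the substring inside "decodes". A small sketch of a whole-word filter that avoids those false positives (ecode_lines is an illustrative name; the sample lines are copied from earlier in this thread):

```python
import re

# \b keeps "ecode" from matching inside "decodes" or "olddecodes".
ECODE_RE = re.compile(r"\becode\b", re.IGNORECASE)

def ecode_lines(dmesg_text):
    """Return only the lines that contain 'ecode' as a whole word."""
    return [line for line in dmesg_text.splitlines() if ECODE_RE.search(line)]

sample = """\
[    1.261967] pci 0000:00:02.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    3.120876] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
Okt 16 12:01:41 kevs-framework kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:0:00000000
"""

for line in ecode_lines(sample):
    print(line)
```

On the command line, adding grep's -w (whole word) flag, as in dmesg | grep -iw ecode, should narrow the match in the same way.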

Nicholas_La_Roux · December 4, 2022, 7:53am

As requested, here’s a 30 line log from a fresh freeze after waking from sleep.

~ 
❯ dmesg | tail -n 30 
[  170.400447] printk: Suspending console(s) (use no_console_suspend to debug)
[  170.738650] PM: suspend devices took 0.338 seconds
[  170.780242] ACPI: EC: interrupt blocked
[15093.792913] ACPI: EC: interrupt unblocked
[15093.964695] i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
[15093.964700] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
[15093.980823] i915 0000:00:02.0: [drm] HuC authenticated
[15093.981699] i915 0000:00:02.0: [drm] GuC submission enabled
[15093.981702] i915 0000:00:02.0: [drm] GuC SLPC enabled
[15093.982556] i915 0000:00:02.0: [drm] GuC RC: enabled
[15093.999625] nvme nvme0: 16/0/0 default/read/poll queues
[15094.258581] PM: resume devices took 0.302 seconds
[15094.258599] OOM killer enabled.
[15094.258601] Restarting tasks ... 
[15094.262029] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[15094.262273] done.
[15094.262293] random: crng reseeded on system resumption
[15094.262874] mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
[15094.295923] PM: suspend exit
[15095.160030] usb 3-9: reset full-speed USB device number 4 using xhci_hcd
[15095.473015] usb 3-9: reset full-speed USB device number 4 using xhci_hcd
[15097.335864] wlp166s0: authenticate with 38:8b:59:e1:9c:e4
[15097.344328] wlp166s0: Invalid HE elem, Disable HE
[15097.358014] wlp166s0: send auth to 38:8b:59:e1:9c:e4 (try 1/3)
[15097.449025] wlp166s0: authenticated
[15097.456286] wlp166s0: associate with 38:8b:59:e1:9c:e4 (try 1/3)
[15097.462186] wlp166s0: RX AssocResp from 38:8b:59:e1:9c:e4 (capab=0x1011 status=0 aid=8)
[15097.470861] wlp166s0: associated
[15097.511095] wlp166s0: Limiting TX power to 20 (20 - 0) dBm as advertised by 38:8b:59:e1:9c:e4
[15097.511192] IPv6: ADDRCONF(NETDEV_CHANGE): wlp166s0: link becomes ready

vhx · December 4, 2022, 3:50pm

@Nicholas_La_Roux nothing that looks abnormal to me. The resume seems successful with no problems: no ‘GPU BUG’ or ecode number.

So… a few things I’d look at next to hopefully get more information:

- Run memtest on the RAM.
- Disable one of the NVMe power management features. I can’t remember any specifics, but possibly the pcie_aspm=off boot option; there could be more I’m unaware of.
- Check for firmware updates, specifically an update for your NVMe drive.

pcie_aspm=off is only for tracking down your issue. It shouldn’t be used as a long-term solution since it disables the entire PCIe Active State Power Management system, most likely reducing battery life and possibly increasing heat slightly.
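
To confirm that a boot option such as pcie_aspm=off actually took effect, or to remember to remove it afterwards, the kernel command line can be inspected. The sketch below is simplified and whitespace-split only (quoted values containing spaces are not handled); parse_cmdline is a made-up helper and the sample command line is invented for illustration.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params

# Invented example; on a real system read the text from /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet pcie_aspm=off"

params = parse_cmdline(sample)
print("pcie_aspm" in params, params.get("pcie_aspm"))
```

On a live system, the input would come from open("/proc/cmdline").read().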

I’d say faulty RAM would cause more issues than just resume problems, so that’s less likely, but it’s a typical go-to for system stability issues. At the moment, power management feels more likely given that your issues are resume-specific.

It might be worth asking in case others have similar hardware or know of a fix: what NVMe (model and size) and RAM (model, size and speed) are you using?

Kelby_Faessler · December 4, 2022, 11:47pm

@Matt_Hartley I think you’re correct that the hard freezing is a Chrome/Chromium issue. I currently seem to be able to reproduce the hard freeze by reopening tabs from my last Chrome session. Writing this on Firefox.

Not sure how to get vainfo on Fedora, but my setup is:

12th Gen Intel
Fedora 37
Wayland
kernel 6.0.10-300.fc37.x86_64

Logs from my last boot where it hung are:

Dec 04 17:59:17 fedora kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:849f3c04, in chrome [4627]
Dec 04 17:59:17 fedora kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 04 17:59:17 fedora kernel: i915 0000:00:02.0: [drm] chrome[4627] context reset due to GPU hang
Dec 04 17:59:17 fedora kernel: i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
Dec 04 17:59:17 fedora kernel: i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
Dec 04 17:59:17 fedora kernel: i915 0000:00:02.0: [drm] HuC authenticated
Dec 04 17:59:17 fedora kernel: i915 0000:00:02.0: [drm] GuC submission enabled

Kelby_Faessler · December 4, 2022, 11:57pm

Here are some logs from where it hung but recovered. Not sure if this gives any extra info but I thought the crash annotation line was interesting as it mentions VAAPI, which I got hits for when googling vainfo

Dec 04 18:49:01 fedora kernel: Asynchronous wait on fence 0000:00:02.0:gnome-shell[1941]:19aac timed out (hint:intel_atomic_commit_ready [>
Dec 04 18:49:04 fedora kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:0:00000000
Dec 04 18:49:04 fedora kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 04 18:49:04 fedora kernel: i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
Dec 04 18:49:04 fedora kernel: i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
Dec 04 18:49:04 fedora kernel: i915 0000:00:02.0: [drm] HuC authenticated
Dec 04 18:49:04 fedora kernel: i915 0000:00:02.0: [drm] GuC submission enabled
Dec 04 18:49:04 fedora kernel: i915 0000:00:02.0: [drm] GuC SLPC enabled
Dec 04 18:49:04 fedora firefox.desktop[3355]: Crash Annotation GraphicsCriticalError: |[0][GFX1-]: glxtest: VA-API test failed: failed to initialise VAAPI connection. (t=0.236754) |[1][GFX1-]: GFX: RenderThread detected a device reset in PostUpdate (t=2779.62) [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate

Kelby_Faessler · December 5, 2022, 12:11am

More logs after another reboot, restart of chrome with same tabs. This time it recovered after several seconds of complete freeze (e.g. 3-10 seconds)

Dec 04 19:04:15 fedora google-chrome.desktop[3363]: [3358:3358:1204/190415.602027:ERROR:interface_endpoint_client.cc(694)] Message 0 rejected by interface blink.mojom.WidgetHost
Dec 04 19:04:16 fedora google-chrome.desktop[3363]: [3476:3476:1204/190416.629418:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 1 times!
Dec 04 19:04:16 fedora google-chrome.desktop[3363]: [3476:3476:1204/190416.633467:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 2 times!
Dec 04 19:04:16 fedora google-chrome.desktop[3363]: [3476:3476:1204/190416.637816:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 3 times!
Dec 04 19:04:16 fedora google-chrome.desktop[3363]: [3476:3476:1204/190416.852305:ERROR:shared_image_factory.cc(575)] Could not find SharedImageBackingFactory with params: usage: Gles2|Raster|DisplayRead|Scanout, format: BGRA_8888, share_between_threads: 0, gmb_type: shared_memory
Dec 04 19:04:16 fedora google-chrome.desktop[3363]: [3476:3476:1204/190416.857867:ERROR:shared_image_factory.cc(575)] Could not find SharedImageBackingFactory with params: usage: Gles2|Raster|DisplayRead|Scanout, format: BGRA_8888, share_between_threads: 0, gmb_type: shared_memory
Dec 04 19:04:16 fedora google-chrome.desktop[3363]: [3476:3476:1204/190416.863965:ERROR:shared_image_factory.cc(575)] Could not find SharedImageBackingFactory with params: usage: Gles2|Raster|DisplayRead|Scanout, format: BGRA_8888, share_between_threads: 0, gmb_type: shared_memory
Dec 04 19:04:16 fedora google-chrome.desktop[3363]: [3476:3476:1204/190416.872283:ERROR:shared_image_factory.cc(575)] Could not find SharedImageBackingFactory with params: usage: Gles2|Raster|DisplayRead|Scanout, format: BGRA_8888, share_between_threads: 0, gmb_type: shared_memory
Dec 04 19:04:16 fedora google-chrome.desktop[3363]: [3476:3476:1204/190416.877785:ERROR:shared_image_factory.cc(575)] Could not find SharedImageBackingFactory with params: usage: Gles2|Raster|DisplayRead|Scanout, format: BGRA_8888, share_between_threads: 0, gmb_type: shared_memory
Dec 04 19:04:16 fedora google-chrome.desktop[3363]: [3476:3476:1204/190416.893201:ERROR:shared_image_factory.cc(575)] Could not find SharedImageBackingFactory with params: usage: Gles2|Raster|DisplayRead|Scanout, format: BGRA_8888, share_between_threads: 0, gmb_type: shared_memory
Dec 04 19:04:18 fedora chronyd[915]: Selected source 81.21.65.168 (2.fedora.pool.ntp.org)
Dec 04 19:04:22 fedora gnome-character[3211]: JS LOG: Characters Application exiting
Dec 04 19:04:27 fedora kernel: Asynchronous wait on fence 0000:00:02.0:gnome-shell[1866]:bdc timed out (hint:intel_atomic_commit_ready [i915])
Dec 04 19:04:31 fedora kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:849f3c04, in chrome [3476]
Dec 04 19:04:31 fedora kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 04 19:04:31 fedora kernel: i915 0000:00:02.0: [drm] chrome[3476] context reset due to GPU hang
Dec 04 19:04:31 fedora kernel: i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
Dec 04 19:04:31 fedora kernel: i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
Dec 04 19:04:31 fedora google-chrome.desktop[3363]: [3476:3476:1204/190431.843449:ERROR:shared_context_state.cc(859)] SharedContextState context lost via ARB/EXT_robustness. Reset status = GL_GUILTY_CONTEXT_RESET_KHR
Dec 04 19:04:31 fedora google-chrome.desktop[3363]: [3476:3476:1204/190431.843828:ERROR:gpu_service_impl.cc(988)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
Dec 04 19:04:31 fedora kernel: i915 0000:00:02.0: [drm] HuC authenticated
Dec 04 19:04:31 fedora kernel: i915 0000:00:02.0: [drm] GuC submission enabled
Dec 04 19:04:31 fedora kernel: i915 0000:00:02.0: [drm] GuC SLPC enabled
Dec 04 19:04:31 fedora google-chrome.desktop[3363]: [3723:1:1204/190431.850353:ERROR:command_buffer_proxy_impl.cc(325)] GPU state invalid after WaitForGetOffsetInRange.
Dec 04 19:04:31 fedora google-chrome.desktop[3363]: [3685:1:1204/190431.850573:ERROR:command_buffer_proxy_impl.cc(325)] GPU state invalid after WaitForGetOffsetInRange.
Dec 04 19:04:31 fedora google-chrome.desktop[3363]: [3358:3358:1204/190431.862238:ERROR:gpu_process_host.cc(990)] GPU process exited unexpectedly: exit_code=8704
Dec 04 19:04:31 fedora google-chrome.desktop[3363]: libva error: vaGetDriverNameByIndex() failed with unknown libva error, driver_name = (null)
Dec 04 19:04:32 fedora google-chrome.desktop[3363]: [4127:4127:1204/190432.000425:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 1 times!
Dec 04 19:04:32 fedora google-chrome.desktop[3363]: [4127:4127:1204/190432.001932:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 2 times!
Dec 04 19:04:32 fedora google-chrome.desktop[3363]: [4127:4127:1204/190432.002491:ERROR:gl_surface_presentation_helper.cc(260)] GetVSyncParametersIfAvailable() failed for 3 times!

vhx · December 5, 2022, 8:37am

Running dnf provides vainfo shows that you need the libva-utils package. The Firefox Hardware acceleration page on the Fedora Project Wiki should help more.

Your other info is interesting: you’re on kernel 6.0.10 but still seeing the ecode 12:0:00000000 error. I’ve not seen it since I upgraded to the 6.0.9 kernel.

FWIW, I’ve tried VSCodium (Chromium-based, AFAIK) for around 12 hours (some hours in use, mostly idle in the background) with no ecode issues, so the problem seems to be Chrome/Chromium- or app-specific, based on other responses in this thread.

As you say, there are hints that missing VA-API libs could be related. It’s worth a shot getting libva set up. When you run vainfo, check for the VLD and EncSlice output lines, which indicate VA-API is working.
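
The VLD/EncSlice check can be scripted against saved vainfo output. This is a rough sketch: the two sample lines are illustrative of the usual "VAProfile... : VAEntrypoint..." layout, not output captured from anyone’s machine in this thread, and vaapi_entrypoints and vaapi_summary are made-up helper names.

```python
def vaapi_entrypoints(vainfo_output):
    """Collect the entrypoint names from 'VAProfile... : VAEntrypoint...' lines."""
    found = set()
    for line in vainfo_output.splitlines():
        profile, sep, entrypoint = line.partition(":")
        if sep and profile.strip().startswith("VAProfile"):
            found.add(entrypoint.strip())
    return found

def vaapi_summary(vainfo_output):
    """Report whether any decode (VLD) and encode (EncSlice) entrypoints appear."""
    names = vaapi_entrypoints(vainfo_output)
    return {
        "VLD": any(name.endswith("VLD") for name in names),
        "EncSlice": any(name.endswith("EncSlice") for name in names),
    }

# Illustrative sample only (assumed layout, not real output from this thread).
sample = """\
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
"""

print(vaapi_summary(sample))
```

If vainfo fails to initialise, no profile lines are printed and both values come back False.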

Elmo · December 5, 2022, 1:17pm

It was a hard freeze; I had to power down and restart. I forgot to take logs but will next time it happens.

Rubion · December 5, 2022, 4:09pm

Hello,

I received the Framework laptop last week. I’m really happy with it, but I have issues similar to those discussed in this topic. I run Arch, GNOME, kernel 6.0.11. I also run into an issue that I don’t see discussed here: when I toggle the nightlight on and off, the top of the display flickers white.

I’ve made a post about it here, mentioning this post, among others. So far I’ve not received a reply.

Do you also see this behaviour regarding nightlight?

Matt_Hartley · December 5, 2022, 5:08pm

Replied. Try the suggestions there to rule out hardware.