OK, I’ve finally found a reproducible way to trigger the complete system freeze, and it’s… an amdgpu bug, because of course it is.
Symptoms are always:
- Any sound playing continues to play
- Screen goes dark, keyboard is completely unresponsive (e.g. Caps Lock doesn’t toggle)
- EC is up and running, can change keyboard backlight
- Open network sockets and the network hardware stay up (SSH sessions remain established)
- Can force an emergency sync over SSH but can’t shut down cleanly; it always needs a hard power-off
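One way to do the emergency sync from an SSH session is the kernel's magic SysRq interface. This is a minimal sketch, assuming a root shell on the frozen machine; the function name and the overridable path argument are just for illustration, not something from my setup.

```python
from pathlib import Path

# Standard procfs location of the magic SysRq trigger; writing needs root.
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def emergency_sync(trigger: Path = SYSRQ_TRIGGER) -> None:
    """Write the SysRq 's' command, which asks the kernel to sync all
    mounted filesystems. The path is a parameter only so the function
    can be pointed at a scratch file when trying it out."""
    trigger.write_text("s")


# Usage on the affected machine (as root, over SSH):
#   emergency_sync()
```

This only flushes dirty data to disk. It does nothing to recover the GPU, so a hard power-off is still needed afterwards.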
The freeze happens to me at least daily, but if I run through a PowerPoint deck with videos I can always get it to trigger within 10 minutes - so unfortunately I have to use my old laptop for any presentations.
It doesn’t seem to matter if I have freeworld or stock Mesa.
Fedora 40 is completely vanilla and up-to-date.
The freeze sometimes just happens even when there’s no video content playing. It’s a matter of time, but having video playing will make it want to come out and play sooner.
It’s always happened with an external display (via a TB4 dock or the HDMI card or a USB-C dongle with HDMI) - I haven’t tested it on just the internal display but will try that tomorrow.
I can’t seem to reproduce this on the internal screen alone. Both at my desk (TB4 dock to monitor) and in the lounge (FW HDMI card to TV), where the issue can be tripped, the display is connected via an HDMI port. I wonder if HDMI is the cause, or at least increases the risk of the problem happening? I should test via the DP expansion card during the week.
I had dmesg streaming into a serial console to capture the crash (since it never flushes the logs to disk):
[ 134.055722] rc rc0: DP-4 as /devices/pci0000:00/0000:00:08.1/0000:c1:00.0/rc/rc0
[ 134.055774] input: DP-4 as /devices/pci0000:00/0000:00:08.1/0000:c1:00.0/rc/rc0/input15
[ 134.062390] usb 1-1: New USB device found, idVendor=32ac, idProduct=0002, bcdDevice= 0.00
[ 134.062398] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 134.062401] usb 1-1: Product: HDMI Expansion Card
[ 134.062404] usb 1-1: Manufacturer: Framework
[ 134.062407] usb 1-1: SerialNumber: 11AD1D004095401821270B00
[ 134.142059] hid-generic 0003:32AC:0002.0005: hiddev96,hidraw0: USB HID v1.11 Device [Framework HDMI Expansion Card] on usb-0000:c1:00.3-1/input1
[ 727.853581] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=89703, emitted seq=89705
[ 727.854033] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[ 727.854374] amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
[ 733.669549] amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
[ 733.669561] amdgpu 0000:c1:00.0: amdgpu: Failed to disable gfxoff!
[ 736.107506] ------------[ cut here ]------------
[ 736.107514] WARNING: CPU: 0 PID: 929 at drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn314/dcn314_smu.c:159 dcn314_smu_send_msg_with_param+0x108/0x1b0 [amdgpu]
[ 736.108015] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer uhid nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr bnep sunrpc binfmt_misc vfat fat snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_core snd_hda_codec_realtek snd_compress mt7921e snd_hda_codec_generic ac97_bus snd_hda_codec_hdmi intel_rapl_msr mt7921_common snd_pcm_dmaengine snd_hda_intel intel_rapl_common mt792x_lib snd_pci_ps snd_intel_dspcfg mt76_connac_lib cros_ec_lpcs snd_intel_sdw_acpi mt76 edac_mce_amd cros_ec snd_rpl_pci_acp6x snd_hda_codec snd_acp_pci snd_hda_core btusb mac80211 snd_acp_legacy_common snd_hwdep snd_pci_acp6x kvm_amd btrtl snd_seq btintel hid_sensor_als hid_sensor_trigger snd_pci_acp5x btbcm libarc4 snd_seq_device kvm btmtk
[ 736.108124] hid_sensor_iio_common irqbypass bluetooth cfg80211 snd_pcm snd_rn_pci_acp3x snd_acp_config industrialio_triggered_buffer snd_timer snd_soc_acpi snd kfifo_buf industrialio amd_pmf wmi_bmof pcspkr soundcore snd_pci_acp3x thunderbolt rapl rfkill amdtee k10temp i2c_piix4 amd_sfh tee platform_profile amd_pmc joydev loop nfnetlink zram dm_crypt amdgpu amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec gpu_sched nvme drm_suballoc_helper drm_buddy nvme_core drm_display_helper crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni video ucsi_acpi polyval_generic hid_sensor_hub hid_multitouch ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 typec_ucsi ccp sp5100_tco cec typec nvme_auth wmi i2c_hid_acpi i2c_hid serio_raw ip6_tables ip_tables fuse i2c_dev
[ 736.108230] CPU: 0 PID: 929 Comm: kworker/u32:16 Not tainted 6.8.9-300.fc40.x86_64 #1
[ 736.108235] Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.05 03/29/2024
[ 736.108238] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[ 736.108252] RIP: 0010:dcn314_smu_send_msg_with_param+0x108/0x1b0 [amdgpu]
[ 736.108659] Code: be 93 62 01 00 5d 41 5c 41 5d e9 13 a8 de ff 44 89 ea 48 c7 c6 28 19 3a c1 48 c7 c7 30 d5 ef c0 e8 5d 1e e7 e5 e9 48 ff ff ff <0f> 0b 48 8b 3b b9 80 84 1e 00 44 89 e2 89 ee e8 94 5c df ff eb b5
[ 736.108664] RSP: 0018:ffffb52c81667878 EFLAGS: 00010246
[ 736.108669] RAX: 0000023b6f69d5e6 RBX: ffff993991873800 RCX: 0000000000000000
[ 736.108672] RDX: 0000000000008bfe RSI: 00000000000080ac RDI: 0000023b6f6949e8
[ 736.108676] RBP: 000000000000000d R08: 0000000000000000 R09: ffffb52c816677f0
[ 736.108678] R10: 0000000000000000 R11: 0000000000000908 R12: 0000000000000000
[ 736.108680] R13: 0000000000000000 R14: ffff993981d34488 R15: ffff993b1e380908
[ 736.108683] FS: 0000000000000000(0000) GS:ffff9947e1a00000(0000) knlGS:0000000000000000
[ 736.108686] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 736.108689] CR2: 00005605866c7cc8 CR3: 0000000da8428000 CR4: 0000000000f50ef0
[ 736.108692] PKRU: 55555554
[ 736.108694] Call Trace:
[ 736.108699] <TASK>
[ 736.108701] ? dcn314_smu_send_msg_with_param+0x108/0x1b0 [amdgpu]
[ 736.109157] ? __warn+0x81/0x130
[ 736.109166] ? dcn314_smu_send_msg_with_param+0x108/0x1b0 [amdgpu]
[ 736.109555] ? report_bug+0x16f/0x1a0
[ 736.109566] ? handle_bug+0x3c/0x80
[ 736.109571] ? exc_invalid_op+0x17/0x70
[ 736.109575] ? asm_exc_invalid_op+0x1a/0x20
[ 736.109585] ? dcn314_smu_send_msg_with_param+0x108/0x1b0 [amdgpu]
[ 736.109965] ? dcn314_smu_send_msg_with_param+0xae/0x1b0 [amdgpu]
[ 736.110307] link_set_dpms_off+0xfe/0x9d0 [amdgpu]
[ 736.110710] ? srso_alias_return_thunk+0x5/0xfbef5
[ 736.110716] ? generic_reg_set_ex+0xa8/0xf0 [amdgpu]
[ 736.111088] ? srso_alias_return_thunk+0x5/0xfbef5
[ 736.111091] ? optc31_set_drr+0x128/0x1d0 [amdgpu]
[ 736.111458] dcn31_reset_hw_ctx_wrap+0x218/0x440 [amdgpu]
[ 736.111843] dce110_apply_ctx_to_hw+0x4e/0x320 [amdgpu]
[ 736.112247] dc_commit_state_no_check+0x5f3/0x1910 [amdgpu]
[ 736.112598] dc_commit_streams+0x299/0x580 [amdgpu]
[ 736.112953] ? srso_alias_return_thunk+0x5/0xfbef5
[ 736.112967] dm_suspend+0x214/0x270 [amdgpu]
[ 736.113352] amdgpu_device_ip_suspend_phase1+0x9c/0x1a0 [amdgpu]
[ 736.113614] amdgpu_device_ip_suspend+0x29/0x70 [amdgpu]
[ 736.113913] amdgpu_device_pre_asic_reset+0xcd/0x430 [amdgpu]
[ 736.114165] amdgpu_device_gpu_recover+0x442/0xd00 [amdgpu]
[ 736.114412] ? __drm_err+0x7d/0xa0
[ 736.114421] amdgpu_job_timedout+0x187/0x270 [amdgpu]
[ 736.114769] ? __cancel_work_timer+0x103/0x1a0
[ 736.114778] drm_sched_job_timedout+0x73/0x100 [gpu_sched]
[ 736.114791] process_one_work+0x16f/0x330
[ 736.114797] worker_thread+0x273/0x3c0
[ 736.114804] ? __pfx_worker_thread+0x10/0x10
[ 736.114808] kthread+0xe5/0x120
[ 736.114813] ? __pfx_kthread+0x10/0x10
[ 736.114818] ret_from_fork+0x31/0x50
[ 736.114824] ? __pfx_kthread+0x10/0x10
[ 736.114828] ret_from_fork_asm+0x1b/0x30
[ 736.114837] </TASK>
[ 736.114840] ---[ end trace 0000000000000000 ]---
[ 741.164969] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 741.165339] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 741.311959] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 741.312271] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 741.458918] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 741.459215] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 741.605745] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 741.606038] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 741.752558] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 741.752852] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 741.899463] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 741.899758] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 742.046273] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 742.046580] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 742.193156] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 742.193445] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 742.340028] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 742.340316] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 742.342053] amdgpu 0000:c1:00.0: amdgpu: MODE2 reset
[ 748.164756] amdgpu 0000:c1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
[ 748.164762] amdgpu 0000:c1:00.0: amdgpu: Mode2 reset failed!
[ 748.164766] amdgpu 0000:c1:00.0: amdgpu: ASIC reset failed with error, -62 for drm dev, 0000:c1:00.0
[ 748.164788] amdgpu 0000:c1:00.0: amdgpu: GPU reset(1) failed
[ 748.164820] [drm] Skip scheduling IBs!
[ 748.164822] amdgpu 0000:c1:00.0: amdgpu: GPU reset end with ret = -62
[ 748.164842] [drm] Skip scheduling IBs!
[ 748.164844] [drm] Skip scheduling IBs!
[ 748.164852] [drm] Skip scheduling IBs!
[ 748.164857] [drm] Skip scheduling IBs!
[ 748.164860] [drm] Skip scheduling IBs!
[ 748.164864] [drm] Skip scheduling IBs!
[ 748.164868] [drm] Skip scheduling IBs!
[ 748.164874] [drm] Skip scheduling IBs!
[ 748.164877] [drm] Skip scheduling IBs!
[ 748.164881] [drm] Skip scheduling IBs!
[ 748.164884] [drm] Skip scheduling IBs!
[ 748.164891] [drm] Skip scheduling IBs!
[ 748.164826] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -62
[ 748.165366] [drm] Skip scheduling IBs!
[ 748.216063] [drm] Skip scheduling IBs!
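To make the sequence in a capture like this easier to see, here is a rough sketch that pulls the key reset events out of a dmesg dump. The marker strings are copied from the log above; the labels, the function name and the regex are my own, and other kernel versions may word these messages differently.

```python
import re

# Matches the "[ 727.853581] message" prefix that dmesg prints.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Substrings seen in the log above, mapped to short labels (labels are mine).
# Matching is case-sensitive, so "MODE2 reset" and "Mode2 reset failed!"
# are distinct markers.
MARKERS = {
    " timeout, signaled seq=": "ring timeout",
    "GPU reset begin!": "reset begin",
    "MODE2 reset": "mode2 reset",
    "Mode2 reset failed!": "mode2 failed",
    "GPU Recovery Failed": "recovery failed",
}


def reset_timeline(log_text):
    """Return (timestamp, label) pairs for each marker line, in file order."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        for needle, label in MARKERS.items():
            if needle in msg:
                events.append((ts, label))
                break
    return events


# Usage, assuming the serial capture was saved to a file of your choosing:
#   for ts, label in reset_timeline(open("capture.log").read()):
#       print(f"{ts:10.3f}  {label}")
```

On the capture above, that puts roughly 20 seconds between the sdma0 ring timeout (727.85) and the driver giving up on recovery (748.16).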
So basically, because of the AMD graphics flakiness, it’s still not quite ready to daily drive on Linux.
I’ll try booting Windows from an expansion card and see if it’s more stable - that should at least let me know if I’ve just happened to lose the silicon lottery or if amdgpu code is still pretty buggy.
EDIT: some more system/version info:
sudo dnf info amd-gpu-firmware
Last metadata expiration check: 0:32:05 ago on Sat 18 May 2024 11:05:17.
Installed Packages
Name : amd-gpu-firmware
Version : 20240513
Release : 1.fc40
Architecture : noarch
Size : 19 M
Source : linux-firmware-20240513-1.fc40.src.rpm
Repository : @System
From repo : updates
Summary : Firmware for AMD GPUs
URL : http://www.kernel.org/
License : Redistributable, no modification permitted
Description : Firmware for AMD amdgpu and radeon GPUs.
journalctl -b -k --grep amdgpu
May 18 11:19:33 kronk kernel: [drm] amdgpu kernel modesetting enabled.
May 18 11:19:33 kronk kernel: amdgpu: Virtual CRAT table created for CPU
May 18 11:19:33 kronk kernel: amdgpu: Topology: Add CPU node
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: Fetched VBIOS from VFCT
May 18 11:19:33 kronk kernel: amdgpu: ATOM BIOS: 113-PHXGENERIC-001
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: vgaarb: deactivate vga console
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: Trusted Memory Zone (TMZ) feature enabled
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: VRAM: 4096M 0x0000008000000000 - 0x00000080FFFFFFFF (4096M used)
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
May 18 11:19:33 kronk kernel: [drm] amdgpu: 4096M of VRAM memory ready
May 18 11:19:33 kronk kernel: [drm] amdgpu: 30033M of GTT memory ready.
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: Will use PSP to load VCN firmware
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: RAS: optional ras ta ucode is not available
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: RAP: optional rap ta ucode is not available
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: SMU is initialized successfully!
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
May 18 11:19:33 kronk kernel: amdgpu: HMM registered 4096MB device memory
May 18 11:19:33 kronk kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
May 18 11:19:33 kronk kernel: kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
May 18 11:19:33 kronk kernel: amdgpu: Virtual CRAT table created for GPU
May 18 11:19:33 kronk kernel: amdgpu: Topology: Add dGPU node [0x15bf:0x1002]
May 18 11:19:33 kronk kernel: kfd kfd: amdgpu: added device 1002:15bf
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: SE 1, SH per SE 2, CU per SH 6, active_cu_number 12
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
May 18 11:19:33 kronk kernel: [drm] Initialized amdgpu 3.57.0 20150101 for 0000:c1:00.0 on minor 1
May 18 11:19:33 kronk kernel: fbcon: amdgpudrmfb (fb0) is primary device
May 18 11:19:33 kronk kernel: amdgpu 0000:c1:00.0: [drm] fb0: amdgpudrmfb frame buffer device
May 18 11:19:36 kronk kernel: snd_hda_intel 0000:c1:00.1: bound 0000:c1:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])