Objectionable GPU reset crashes desktop environment

distro: NixOS 24.11 “stable”, up to date
kernel: 6.6.85
model: AMD Ryzen™ 7040 Series
bios: 0.0.3.5

screen went black, the session restarted to the login screen, and all unsaved application data was lost

action taken: updated BIOS to 0.0.3.7 with fwupdmgr, restarted, reported.
reasoning: creating a record to compare against a possible future event, for myself or others

$ journalctl -k
Apr 09 10:56:22 nixos kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=5471474, emitted seq=5471476
Apr 09 10:56:22 nixos kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process .firefox-wrappe pid 119869 thread .firefox-w:cs0 pid 119982
Apr 09 10:56:22 nixos kernel: amdgpu 0000:c1:00.0: amdgpu: GPU reset begin!
Apr 09 10:56:22 nixos kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 09 10:56:22 nixos kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 09 10:56:22 nixos kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 09 10:56:22 nixos kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 09 10:56:22 nixos kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 09 10:56:22 nixos kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 09 10:56:23 nixos kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 09 10:56:23 nixos kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 09 10:56:23 nixos kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 09 10:56:23 nixos kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 09 10:56:23 nixos kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 09 10:56:23 nixos kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 09 10:56:23 nixos kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 09 10:56:23 nixos kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 09 10:56:23 nixos kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 09 10:56:23 nixos kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 09 10:56:23 nixos kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
Apr 09 10:56:23 nixos kernel: [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
Apr 09 10:56:23 nixos kernel: [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
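
For anyone who wants to compare their own log against this one, here is a minimal sketch of a filter that counts the three message types seen above. The function name amdgpu_reset_summary is made up for this example, and the patterns are copied from this particular log, so other kernel versions may word the messages differently.

```shell
# Hypothetical helper: summarise amdgpu timeout/reset messages from a kernel log on stdin.
# The patterns are taken from the log lines in this post and are not an official list.
amdgpu_reset_summary() {
    awk '
        /amdgpu_job_timedout.*ring .* timeout/ { timeouts++ }   # ring timeout that triggers the reset
        /amdgpu: GPU reset begin!/             { resets++ }     # start of the GPU reset
        /MES failed to response/               { mes++ }        # MES errors during the reset
        END {
            printf "ring timeouts: %d\n", timeouts
            printf "GPU resets: %d\n", resets
            printf "MES failures: %d\n", mes
        }
    '
}

# Usage (kernel messages from the journal):
#   journalctl -k | amdgpu_reset_summary
```

Counting rather than printing keeps the result short enough to paste into a report and makes two incidents easy to compare at a glance.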

Still no progress; I cannot reliably reproduce the issue. I have also seen other symptoms, such as spontaneous reboots and the system freezing completely after some time of use or after hibernation.

I did contact support. They wanted me to test on a live Ubuntu 24.04 LTS, which is rather poor advice in my opinion. I did that and, to no surprise, could not reproduce the issue, even though I tried using it today.

Next, after I found that some other people had similar issues with AMD models, I:

  • switched the kernel to 6.12.23 with boot.kernelPackages = pkgs.linuxPackages_6_12; (this is the earliest version that did not throw the error “was removed because it reached its end of life upstream”; pkgs.linuxPackages_latest would currently result in 6.14.2)
  • changed the BIOS setting Memory Optimizer to Gaming
$ nix-shell -p glxinfo
[nix-shell:~]$ sudo glxinfo | grep -E -i 'device|memory'
    Device: AMD Radeon 780M (radeonsi, gfx1103_r1, LLVM 18.1.8, DRM 3.61, 6.12.23) (0x15bf)
    Video memory: 8192MB
    Unified memory: no
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 6407 MB, largest block: 6407 MB
    VBO free aux. memory - total: 43984 MB, largest block: 43984 MB
    Texture free memory - total: 6407 MB, largest block: 6407 MB
    Texture free aux. memory - total: 43984 MB, largest block: 43984 MB
    Renderbuffer free memory - total: 6407 MB, largest block: 6407 MB
    Renderbuffer free aux. memory - total: 43984 MB, largest block: 43984 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 8192 MB
    Total available memory: 52342 MB
    Currently available dedicated video memory: 6407 MB
    GL_AMD_multi_draw_indirect, GL_AMD_pinned_memory, 
    GL_EXT_framebuffer_sRGB, GL_EXT_memory_object, GL_EXT_memory_object_fd, 
    GL_NVX_gpu_memory_info, GL_NV_alpha_to_coverage_dither_control, 
    GL_AMD_pinned_memory, GL_AMD_query_buffer_object, 
    GL_EXT_gpu_program_parameters, GL_EXT_gpu_shader4, GL_EXT_memory_object, 
    GL_EXT_memory_object_fd, GL_EXT_multi_draw_arrays, 
    GL_MESA_window_pos, GL_NVX_gpu_memory_info, GL_NV_ES1_1_compatibility, 
    GL_EXT_instanced_arrays, GL_EXT_map_buffer_range, GL_EXT_memory_object, 
    GL_EXT_memory_object_fd, GL_EXT_multi_draw_arrays, 
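
To confirm after a reboot that the new kernel is really the one running, a small check like the following may help. This is only a sketch: kernel_at_least is a name invented for this example, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# Hypothetical helper: succeed if kernel release $1 is at least version $2 (e.g. 6.12).
# Assumes sort -V (version sort) is available, as in GNU coreutils.
kernel_at_least() {
    release=${1%%-*}    # strip any suffix such as "-generic"
    wanted=$2
    # The lower of the two versions sorts first; if that is the wanted one, we are new enough.
    lowest=$(printf '%s\n%s\n' "$wanted" "$release" | sort -V | head -n 1)
    [ "$lowest" = "$wanted" ]
}

# Usage:
#   kernel_at_least "$(uname -r)" 6.12 && echo "running 6.12 or newer"
```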

Probably not related to this case, and I did not even get any dmesg output:

KDE Neon, Kernel 6.11
Four times now in the last few weeks, after the laptop had been running for a long time with a lot of hibernation, the “fps” of the desktop environment suddenly dropped to 5 or so; it feels like a very sluggish game. Reading about your GPU crashing entirely, it might be related to that. Like you, I cannot really reproduce the issue, and the intervals between occurrences varied. I am unsure whether Xorg falls back to some kind of software rendering, which would explain the sluggish behavior.
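
One way to test the software-rendering guess the next time it happens is to look at the "OpenGL renderer string" line that glxinfo prints. Mesa's software rasterizers identify themselves there as llvmpipe or softpipe. A rough sketch, with renderer_kind being a made-up helper name that reads glxinfo output on stdin:

```shell
# Hypothetical helper: classify the renderer reported by glxinfo (read from stdin).
# Prints "software" for Mesa's software rasterizers, "hardware" for anything else,
# and "unknown" if no renderer line was found.
renderer_kind() {
    line=$(grep -i 'OpenGL renderer string' | head -n 1)
    case "$line" in
        *llvmpipe*|*softpipe*|*swrast*) echo "software" ;;
        "")                             echo "unknown" ;;
        *)                              echo "hardware" ;;
    esac
}

# Usage (while the desktop is sluggish):
#   glxinfo | renderer_kind
```

If this prints "hardware" while the desktop is crawling, the fallback theory is less likely and the slowdown is probably happening inside the GPU driver itself.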