Hibernate image mismatch: arch specific data

tldr:

[   11.210276] PM: Loading and decompressing image data (3778097 pages)...
[   11.210577] Hibernate inconsistent memory map detected!
[   11.210862] PM: hibernation: Image mismatch: architecture specific data
[   11.211144] PM: hibernation: Read 15112388 kbytes in 0.01 seconds (1511238.80 MB/s)
[   11.213154] PM: Error -1 resuming
[   11.213158] PM: hibernation: Failed to load image, recovering.
[   11.214414] PM: hibernation: Basic memory bitmaps freed
[   11.214796] OOM killer enabled.
[   11.215309] Restarting tasks: Starting
[   11.215948] Restarting tasks: Done
[   11.216429] PM: hibernation: resume failed (-1)

Recently I’ve started experiencing these faults. Only 1 out of 5 hibernations resumes successfully.

When hibernating, my laptop was running on battery with all peripherals detached (external monitor, USB hub and other devices).

AMD Ryzen AI 9 HX 370 w/ Radeon 890M
6.19.8-061908-generic
Ubuntu 24.04.4 LTS

Anyone else experienced similar things?

Looks like it has something to do with the recent kernel.

I downgraded to 6.18.20-061820-generic and the last several hibernation cycles were all fine.

The only issue I am experiencing right now is that the kernel sometimes refuses to hibernate, stating that there are not enough free pages. That’s strange, since my swap partition is large enough to fit all the RAM and usually only ~40% of the RAM is occupied.

Update: Looks like the original issue was definitely caused by the recent kernel. After I downgraded I was able to hibernate a dozen times or so and everything went OK, with the exception of failed attempts like this:

апр 02 15:58:41 fw13 kernel: PM: hibernation: Creating image:
апр 02 15:58:41 fw13 kernel: PM: hibernation: Need to copy 6767988 pages
апр 02 15:58:41 fw13 kernel: PM: hibernation: Normal pages needed: 6767988 + 1024, available pages: 5758682
апр 02 15:58:41 fw13 kernel: PM: hibernation: Not enough free memory
апр 02 15:58:41 fw13 kernel: PM: hibernation: Error -12 creating image
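
To put the numbers from that log into perspective, here is a small back-of-the-envelope calculation. It only uses figures quoted in this thread; the page size is inferred from the resume log in the first post (15112388 kbytes for 3778097 pages), and the variable names are made up for illustration. As far as I understand, this check compares against free RAM available for the in-memory copy of the image rather than against swap size, which may be why a large swap partition does not help here.

```python
# Figures copied from the "Creating image" log lines above.
pages_to_copy = 6_767_988      # "Need to copy 6767988 pages"
reserve_pages = 1_024          # the "+ 1024" in "Normal pages needed"
available_pages = 5_758_682    # "available pages: 5758682"

# Page size inferred from the resume log in the first post:
# 15112388 kbytes were read for 3778097 pages.
page_kib = 15_112_388 // 3_778_097

needed_pages = pages_to_copy + reserve_pages
shortfall_pages = needed_pages - available_pages


def pages_to_gib(pages: int) -> float:
    """Convert a page count to GiB using the inferred page size."""
    return pages * page_kib / (1024 * 1024)


print(f"page size: {page_kib} KiB")
print(f"needed:    {needed_pages} pages ({pages_to_gib(needed_pages):.2f} GiB)")
print(f"available: {available_pages} pages ({pages_to_gib(available_pages):.2f} GiB)")
print(f"shortfall: {shortfall_pages} pages ({pages_to_gib(shortfall_pages):.2f} GiB)")
```

So this particular attempt was a bit under 4 GiB short of what the kernel wanted.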

Most of the time, flushing the VM cache and then hibernating again worked… until today.
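
For anyone who wants to try the same workaround, a minimal sketch of "flush, then hibernate" could look like this. It assumes the flush is done through `/proc/sys/vm/drop_caches` and the hibernation through `/sys/power/state` (the exact commands used above were not stated), it must run as root, and the function names are purely illustrative.

```python
import os
from pathlib import Path

# Standard kernel interfaces; both writes require root.
DROP_CACHES = Path("/proc/sys/vm/drop_caches")
POWER_STATE = Path("/sys/power/state")


def flush_vm_caches(drop_caches: Path = DROP_CACHES) -> None:
    """Write dirty data to disk, then ask the kernel to drop clean caches."""
    os.sync()
    # 3 = drop the page cache plus dentries and inodes.
    drop_caches.write_text("3\n")


def hibernate(power_state: Path = POWER_STATE) -> None:
    """Request suspend-to-disk; returns after the machine has resumed."""
    power_state.write_text("disk\n")


def flush_then_hibernate(drop_caches: Path = DROP_CACHES,
                         power_state: Path = POWER_STATE) -> None:
    """Free as much reclaimable memory as possible, then hibernate."""
    flush_vm_caches(drop_caches)
    hibernate(power_state)
```

Calling `flush_then_hibernate()` with the defaults hibernates the machine for real, so the paths are parameters mainly so the logic can be dry-run against ordinary files first.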

The last hibernation attempt failed as well, but this time the kernel did not recover from it: the failure cascaded through amdgpu. Afterwards the laptop was still running more or less OK (I could hear notification sounds), but the screen was dead and I wasn’t able to reboot it gracefully.

апр 02 15:58:41 fw13 kernel: amdgpu 0000:c1:00.0: amdgpu: MES failed to respond to msg=MISC
апр 02 15:58:41 fw13 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to change_config.
апр 02 15:58:41 fw13 kernel: amdgpu 0000:c1:00.0: amdgpu: resume of IP block <gfx_v11_0> failed -110
апр 02 15:58:41 fw13 kernel: amdgpu 0000:c1:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
апр 02 15:58:41 fw13 kernel: amdgpu 0000:c1:00.0: PM: dpm_run_callback(): pci_pm_thaw returns -110
апр 02 15:58:41 fw13 kernel: amdgpu 0000:c1:00.0: PM: failed to recover async: error -110
апр 02 15:58:51 fw13 kernel: amdgpu 0000:c1:00.0: amdgpu: ring sdma0 timeout, signaled seq=3718249, emitted seq=>
апр 02 15:58:51 fw13 kernel: amdgpu 0000:c1:00.0: amdgpu: Starting sdma0 ring reset
апр 02 15:58:51 fw13 kernel: amdgpu 0000:c1:00.0: amdgpu: reset sdma queue (0:0:0)
апр 02 15:58:51 fw13 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to wait on sdma queue reset done
апр 02 15:58:51 fw13 kernel: amdgpu 0000:c1:00.0: amdgpu: failed to reset legacy queue
апр 02 15:58:51 fw13 kernel: amdgpu 0000:c1:00.0: amdgpu: Ring sdma0 reset failed
апр 02 15:58:51 fw13 kernel: Call Trace:
апр 02 15:58:51 fw13 kernel:  <TASK>
апр 02 15:58:51 fw13 kernel:  ? cancel_delayed_work_sync+0x4a/0x80
апр 02 15:58:51 fw13 kernel:  gfx_v11_0_hw_fini+0x2f/0x150 [amdgpu]
апр 02 15:58:51 fw13 kernel:  gfx_v11_0_suspend+0xe/0x20 [amdgpu]
апр 02 15:58:51 fw13 kernel:  amdgpu_ip_block_suspend+0x27/0x60 [amdgpu]
апр 02 15:58:51 fw13 kernel:  amdgpu_device_ip_suspend_phase2+0x1bc/0x3c0 [amdgpu]
апр 02 15:58:51 fw13 kernel:  amdgpu_device_ip_suspend+0x4c/0x90 [amdgpu]
апр 02 15:58:51 fw13 kernel:  amdgpu_device_pre_asic_reset+0xf4/0x540 [amdgpu]
апр 02 15:58:51 fw13 kernel:  amdgpu_device_asic_reset+0x45/0x1fa [amdgpu]
апр 02 15:58:51 fw13 kernel:  amdgpu_device_gpu_recover.cold+0x241/0x2d6 [amdgpu]

@Mario_Limonciello maybe the above would be interesting to you. The whole kernel log is here.

If it’s reproducible would you be able to bisect? It would go a long way to solving it.

LLMs are really good at helping automate bisects too.

I definitely experienced it several times over the past months, but I’m afraid it’s too hard to trigger the crash on demand. Last time it took ~10 cycles.

That’s part of why I think an LLM might be able to shine. You can tell it to do this overnight and try 10 times at each commit. It can schedule a wake using a wakeup timer.

Then it’s not hands on at all for you. Just set it and let it chug overnight, and if it’s not done by morning do it again the next night.
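
A rough sketch of the per-commit check described above, under a few assumptions: `rtcwake -m disk -s N` is used for the hibernate-and-wake cycle, the kernel log of the current boot is read with `journalctl -k -b 0`, and a commit counts as bad as soon as one cycle shows the failure messages from the first post. The helper names and the `"good"`/`"bad"`/`"undecided"` labels are made up for illustration; building and booting each bisect step is not covered.

```python
import subprocess

# Substrings taken from the failing resume log in the first post.
FAILURE_MARKERS = (
    "PM: hibernation: Image mismatch",
    "PM: hibernation: resume failed",
)


def resume_failed(kernel_log: str) -> bool:
    """True if the log contains one of the known resume failure messages."""
    return any(marker in kernel_log for marker in FAILURE_MARKERS)


def rtcwake_command(wake_after_s: int) -> list[str]:
    """Command that hibernates now and wakes the machine after N seconds."""
    return ["rtcwake", "-m", "disk", "-s", str(wake_after_s)]


def current_boot_kernel_log() -> str:
    """Kernel messages of the current boot, as plain text."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "0", "--no-pager"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def verdict(cycle_logs: list[str], cycles_required: int = 10) -> str:
    """Classify a commit from the kernel logs collected after each cycle."""
    if any(resume_failed(log) for log in cycle_logs):
        return "bad"        # one failed resume is enough
    if len(cycle_logs) >= cycles_required:
        return "good"       # survived the requested number of cycles
    return "undecided"      # keep cycling
```

One cycle would then be `subprocess.run(rtcwake_command(120), check=True)` as root, followed by appending `current_boot_kernel_log()` to the list passed to `verdict`. Note that a failed resume ends in a normal boot, so whatever drives the loop has to be started again at boot and keep its state on disk; that part is left out here.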