Random hard freezes fw13 amd7840u win11

DavidL · August 5, 2024, 3:38am

My FW13 i5 had several freezes on Saturday morning which provided an opportunity to make some systematic observations of the problem, with these results.

As mentioned previously, freezes tend to occur in episodes separated by problem-free periods. On Saturday mid-morning random screen responses became delayed by sub-second to maybe 2-sec “freezes” and the system eventually froze completely after a minute or so, recovered after ~15 minutes, and froze again.

That has all the characteristics of electrical noise, maybe causing spurious interrupts.

The system always does recover as M_R stated. However only the integrated monitor was unresponsive. The time-of-day clock apparently continued to be updated correctly (it showed the correct time when the screen came back to life after 15 mins), the screen darkened and increased brightness when the outboard mouse or the trackpad was moved during this time, and the spreadsheet in use beforehand continued to run normally afterwards. Nudging the mouse during a freeze moved the cursor when the monitor recovered.

The FW13 had an unused HDMI adapter, so the empty HDMI cable socket was electrically floating. Taking a hint from the How-To-Geek article linked above, I removed it and rebooted.

So far, two full days later, no problems. If it runs for a month without problems I’ll declare problem solved.

But the floating adapter connector doesn’t seem like a good idea, either electrically or since it’s exposed to dust on desk surfaces, etc. Is it possible to buy a dummy cover?

DavidL · August 5, 2024, 8:56am

Yes, that’s definitely on the list and it has certainly worked for some users. However the highly episodic nature of the problem suggests to me there’s an underlying cause which the updated firmware may handle better.

Brian_Gregory · August 5, 2024, 9:49pm

You make it sound like you may not even have installed the Framework provided BIOS and drivers:

Josh · August 6, 2024, 12:21am

Thanks, you are right. I did not download these.

DavidL · August 8, 2024, 1:24am

I had a single freeze last night after 4 1/2 days, so I’ve reinstated the HDMI adapter and will update the BIOS ASAP.

But it would be nice to know why updating the BIOS seems to fix the problem rather than just doing it and hoping for the best! This still doesn’t feel to me like a common-or-garden program bug because it’s too randomly episodic.

The 3.05 update notes four fixes, only one of which (the thermal issue with Linux) could possibly be environmental or hardware related and it’s winter here. However the retimer update might do the trick: it could account for the fact it’s only the integrated display which freezes, and freezes evidently happen much more frequently on faster Ryzen FWs where I guess signal timing is more critical.

Two questions though…

The update notes show two links to the same Linux BIOS 3.05 update but one carries the comment “You must be running 3.05 or later to apply this update using EFI.” which is obviously circular. What is really meant?

And am I correct in thinking that the Linux BIOS update includes the shell? Or is the shell provided by the O/S like the drivers?

h91 · August 27, 2024, 8:46pm

What’s the general stability of the 3.05 bios? I’m still running 3.03b, and the experience of previous bluescreens has made me reluctant to update since the system is stable now.

Robin_How · August 29, 2024, 10:32pm

I have had zero lock up issues for months since being on 3.05.

Alexander_Johnson · October 18, 2024, 6:29pm

Is anyone else still having these issues? I continue to get a lagging cursor that eventually turns into a full hard lockup. It has blue screened in the past with the DPC WATCHDOG VIOLATION error but lately the freezes are simply hard locks without it ever moving to a BSOD.

Memtest86 passes. I’ve tried reinstalling Windows and the drivers, swapping RAM around, running with and without expansion cards, etc. Nothing seems to solve it.

I’m still discussing this with support. It’s frustrating how unreliable the laptop has been.

Brian_Gregory · October 18, 2024, 9:02pm

And you’re definitely on BIOS 3.05 ?

Alexander_Johnson · October 18, 2024, 9:15pm

Yes, updated to that as soon as I got the laptop back in May. During my ongoing troubleshooting thread with support I verified it’s on 3.05 as well.

Geektime · October 23, 2024, 1:09pm

I recently had a serious event. I was working as usual with the laptop hooked up to an external monitor, bluetooth keyboard and mouse attached through the external monitor’s usb hub. I had a dozen or so PDFs open, three or four firefox tabs open, Word open. Computer hard froze. It happens, although nothing now for many months. I hadn’t updated the AMD drivers to the latest set released earlier this month, but had the April, 2024, BIOS 3.05 update installed.

So I pressed the power button to restart and the Windows recovery environment opened up. I then proceeded for the next few hours to attempt the usual Windows 11 recovery steps and my Bitlocker recovery code, but to no avail. In truth, I had turned off the restore points to conserve drive space.

I then went to the reset stage, using a USB key created on my desktop computer. But a reset saving my personal files was not even possible. It would attempt to reset, but get to 1% of the job and then stop, indicate that the changes were being undone, and I would be returned to the recovery environment. I worked with a friend who is an IT person at Microsoft and we weren’t able to resolve why this reset option wasn’t available.

So I had to do a clean install of Windows. Not the end of the world, as I have a NAS backup, but of course worrisome.

I’ve since turned on the restore point for the laptop, and everything is working well since the fresh install.

I’d appreciate any thoughts. Thanks.

guitarnono · October 24, 2024, 10:51am

Hello, I’m getting a lot of BSODs here (CLOCK_WATCHDOG_TIMEOUT and others) since the end-of-September/early-October Windows 11 updates.
I’m keeping a precious backup from when it was OK, in case the next updates don’t fix the issue…

AkechiShiro · October 26, 2024, 1:34pm

Hello,

I’m on NixOS with Linux 6.11.5 (AMD 7840U) and I’ve had two issues. I had a hard freeze, checked dmesg, and saw this:

[37693.948431] clocksource: timekeeping watchdog on CPU8: Marking clocksource 'tsc' as unstable because the skew is too large:
[37693.948457] clocksource:                       'hpet' wd_nsec: 503361740 wd_now: 5bbcd559 wd_last: 5b4edc21 mask: ffffffff
[37693.948466] clocksource:                       'tsc' cs_nsec: 503904732 cs_now: 70f2521498fa cs_last: 70f1ef26ae21 mask: ffffffffffffffff
[37693.948472] clocksource:                       Clocksource 'tsc' skewed 542992 ns (0 ms) over watchdog 'hpet' interval of 503361740 ns (503 ms)
[37693.948479] clocksource:                       'tsc' is current clocksource.
[37693.948536] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[37693.948842] clocksource: Checking clocksource tsc synchronization from CPU 8 to CPUs 0,6,9,12-15.
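
For anyone wanting to sanity-check the numbers in that excerpt, here is a minimal Python sketch that re-derives the reported skew from the two interval fields. It only assumes that the "skewed" figure is the difference between `cs_nsec` (interval measured by the TSC) and `wd_nsec` (same interval measured by the HPET watchdog), which is what the log lines themselves show. The helper name `interval_ns` is made up for illustration.

```python
import re

# Two lines copied verbatim from the dmesg excerpt above.
LOG = """\
[37693.948457] clocksource:                       'hpet' wd_nsec: 503361740 wd_now: 5bbcd559 wd_last: 5b4edc21 mask: ffffffff
[37693.948466] clocksource:                       'tsc' cs_nsec: 503904732 cs_now: 70f2521498fa cs_last: 70f1ef26ae21 mask: ffffffffffffffff
"""


def interval_ns(log: str, field: str) -> int:
    """Return the decimal integer that follows '<field>: ' in the log text.

    Hypothetical helper, not part of any kernel tooling.
    """
    match = re.search(rf"{field}: (\d+)", log)
    if match is None:
        raise ValueError(f"{field} not found in log text")
    return int(match.group(1))


wd_nsec = interval_ns(LOG, "wd_nsec")  # interval according to the HPET watchdog
cs_nsec = interval_ns(LOG, "cs_nsec")  # same interval according to the TSC

skew_ns = cs_nsec - wd_nsec                 # how far the two clocks disagree
skew_ppm = skew_ns / wd_nsec * 1_000_000    # disagreement relative to the interval

print(f"skew: {skew_ns} ns over {wd_nsec / 1e6:.0f} ms ({skew_ppm:.0f} ppm)")
```

The difference comes out to 542992 ns, matching the "skewed 542992 ns" line, which is roughly 0.1% of the ~503 ms interval. This only confirms the arithmetic; it says nothing about which of the two clocks was actually off.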

I’ve also encountered what I think is an amdgpu bug under 6.11.3, which made the laptop extremely slow:

[385344.422201] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
[385345.259430] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
[385345.521140] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
[385345.798798] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
[385346.057242] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data

I tried to suspend and resume the laptop, but this did not fix the issue under 6.11.3. I did see these additional lines:

[385554.065285] PM: suspend entry (s2idle)
[385554.072972] Filesystems sync: 0.007 seconds
[385554.098957] Freezing user space processes
[385554.101760] Freezing user space processes completed (elapsed 0.002 seconds)
[385554.101765] OOM killer disabled.
[385554.101766] Freezing remaining freezable tasks
[385554.102918] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[385554.102924] printk: Suspending console(s) (use no_console_suspend to debug)
[385555.144924] queueing ieee80211 work while going to suspend
[385557.028247] ACPI: EC: interrupt blocked
[385596.855897] amd_pmc AMDI0009:00: Last suspend didn't reach deepest state
[385596.927936] ACPI: EC: interrupt unblocked
[385596.979090] clocksource: timekeeping watchdog on CPU11: hpet wd-wd read-back delay of 260019ns
[385596.979100] clocksource: wd-tsc-wd read-back delay of 3732876ns, clock-skew test skipped!
[385597.129649] [drm] PCIE GART of 512M enabled (table at 0x00000080FFD00000).
[385597.129847] amdgpu 0000:c1:00.0: amdgpu: SMU is resuming...
[385597.133052] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[385597.140000] nvme nvme0: D3 entry latency set to 10 seconds
[385597.143115] nvme nvme0: 16/0/0 default/read/poll queues
[385599.763083] amdgpu 0000:c1:00.0: [drm] *ERROR* dpia_query_hpd_status: for link(5) dpia(0) failed with status(0), current_hpd_status(0) new_hpd_status(0)
[385600.026274] amdgpu 0000:c1:00.0: [drm] *ERROR* dpia_query_hpd_status: for link(5) dpia(0) failed with status(0), current_hpd_status(0) new_hpd_status(0)
[385600.289362] amdgpu 0000:c1:00.0: [drm] *ERROR* dpia_query_hpd_status: for link(6) dpia(1) failed with status(0), current_hpd_status(0) new_hpd_status(0)
[385600.551335] amdgpu 0000:c1:00.0: [drm] *ERROR* dpia_query_hpd_status: for link(6) dpia(1) failed with status(0), current_hpd_status(0) new_hpd_status(0)
[385600.814635] amdgpu 0000:c1:00.0: [drm] *ERROR* dpia_query_hpd_status: for link(7) dpia(2) failed with status(0), current_hpd_status(0) new_hpd_status(0)
[385601.077735] amdgpu 0000:c1:00.0: [drm] *ERROR* dpia_query_hpd_status: for link(7) dpia(2) failed with status(0), current_hpd_status(0) new_hpd_status(0)
[385601.341270] amdgpu 0000:c1:00.0: [drm] *ERROR* dpia_query_hpd_status: for link(8) dpia(3) failed with status(0), current_hpd_status(0) new_hpd_status(0)
[385601.604714] amdgpu 0000:c1:00.0: [drm] *ERROR* dpia_query_hpd_status: for link(8) dpia(3) failed with status(0), current_hpd_status(0) new_hpd_status(0)
[385608.993905] amdgpu 0000:c1:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[385608.993914] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[385608.993919] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[385608.993922] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[385608.993926] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[385608.993929] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[385608.993932] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[385608.993935] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[385608.993939] amdgpu 0000:c1:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[385608.993942] amdgpu 0000:c1:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[385608.993945] amdgpu 0000:c1:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[385608.993949] amdgpu 0000:c1:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[385608.993952] amdgpu 0000:c1:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[385609.269013] [drm] ring gfx_32788.1.1 was added
[385609.269649] [drm] ring compute_32788.2.2 was added
[385609.270274] [drm] ring sdma_32788.3.3 was added
[385609.270302] [drm] ring gfx_32788.1.1 ib test pass
[385609.270330] [drm] ring compute_32788.2.2 ib test pass
[385609.270435] [drm] ring sdma_32788.3.3 ib test pass
[385609.803325] OOM killer enabled.
[385609.803327] Restarting tasks ... done.
[385609.807725] random: crng reseeded on system resumption
[385610.410274] PM: suspend exit
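
When reporting logs like these, it can help to boil the repeated `*ERROR*` lines down to a count per reporting function. Below is a rough Python sketch of that idea, run against a few lines copied from the excerpts above. The names `SAMPLE`, `ERROR_RE` and `summarize_errors` are hypothetical, and the regex only assumes the `[timestamp] ... *ERROR* function_name:` layout visible in these particular lines.

```python
import re
from collections import Counter

# A few lines copied from the excerpts above. In practice the text could come
# from a saved dmesg capture instead (capturing it is not shown here).
SAMPLE = """\
[385344.422201] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
[385345.259430] amdgpu 0000:c1:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
[385597.133052] amdgpu 0000:c1:00.0: amdgpu: SMU is resumed successfully!
[385599.763083] amdgpu 0000:c1:00.0: [drm] *ERROR* dpia_query_hpd_status: for link(5) dpia(0) failed with status(0), current_hpd_status(0) new_hpd_status(0)
"""

# Captures the timestamp and the function name that follows "*ERROR* ".
ERROR_RE = re.compile(r"^\[\s*(\d+\.\d+)\].*\*ERROR\* (\w+):")


def summarize_errors(text):
    """Count *ERROR* lines per function and note when each first appeared."""
    counts = Counter()
    first_seen = {}
    for line in text.splitlines():
        m = ERROR_RE.match(line)
        if m is None:
            continue  # not an *ERROR* line in the expected layout
        timestamp, func = float(m.group(1)), m.group(2)
        counts[func] += 1
        first_seen.setdefault(func, timestamp)
    return counts, first_seen


counts, first_seen = summarize_errors(SAMPLE)
for func, n in counts.most_common():
    print(f"{func}: {n} line(s), first at {first_seen[func]:.6f}s")
```

A summary like this is only a convenience for a bug report; the full log is still what a maintainer will want to see.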

I wasn’t able to reproduce the issue on 6.11.5 yet.

Would you have any idea what this could be due to @Mario_Limonciello ?

Mario_Limonciello · October 27, 2024, 12:49pm

All 3 sound like symptoms of some sort of firmware problem. If you’re able to reproduce you should report it to framework.

_ce · November 7, 2024, 3:22pm

I’m having the same problem with my Framework 13" that was delivered in Sep 2024.

Framework Laptop 13 DIY Edition (AMD Ryzen™ 7040 Series)
System: AMD Ryzen™ 5 7640U - 2.8k Display

Depending on the apps I use, I’m having daily complete lockups on Windows 11 with the latest BIOS and drivers when using some graphics-intensive apps, including Microsoft PowerPoint and Adobe Photoshop. The lockups mostly, if not always, seem to happen during sudden intensive tasks rather than sustained heavy ones: specifically, saving a large file or copy-pasting large pictures seems to trigger a lockup.

System freezes completely without any chance to recover, display stays on. After a minute or so, the system will automatically reboot. The system event log doesn’t contain any entries and there is no BSOD to be seen.

I’ve already removed the HDMI extension and disabled the PCIe idle setting in the BIOS to make sure this is not related to either of those.

Kaukov · December 1, 2024, 10:37pm

I’ve been having random power cuts and restarts the past 2 months and they’re super random. However I’ve narrowed it down to iGPU memory usage mostly. And all issues started appearing after FW replaced my 7840U mainboard under warranty. My old one was manufactured in 2023, my new one - in 2024.

I’m using the Crucial 96GB DDR5 kit at 5600MHz and it worked flawlessly with the old motherboard.

I’ve also tested multiple OS - Windows 11 Pro, ArchLinux, Fedora 40 and 41 (Gnome and KDE), Gentoo, NixOS. All of them experience a complete power off and reboot of the machine randomly. When I set the iGPU memory to Gaming in the BIOS, the crashes happen more frequently.

And now with the most recent Linux kernel 6.12 the issue also happens super frequently. I ran a few memtests (except the super slow ones) and everything was fine. I also reseated the RAM a couple of times - still the same thing.

And now that I’m reading about everyone else’s issues, I’m both happy and sad I’m not the only one.

I really hope a firmware or software update can fix the issue and not another mainboard replacement.

Kaukov · December 7, 2024, 8:41pm

Update:
The second module had dirty connectors (idk how that happened). After cleaning it, the laptop booted without issues. I’ll now test it for a few days and see if it crashes. If not, I’ll test with both modules again and hope everything will be fine.

I can’t edit my comment so I’m replying instead.

After digging deeper and testing the RAM again, it might be because of a faulty module. I started testing the modules one by one in each slot. The first module worked perfectly in both slots. Then, after inserting the second module in the first slot, the system doesn’t boot and I get a POST code - 11000101 → 0xC5 → Restore system configuration stage 1

I have no idea what it actually means, but the system won’t boot even after more than 30 minutes left in this state.
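
The binary-to-hex step for that POST code can be checked with a couple of lines of Python. This is just a sketch: it assumes the bit pattern is written most-significant bit first, as quoted, and `decode_post_bits` is a made-up helper name. It does not look up what the code means.

```python
def decode_post_bits(bits: str) -> str:
    """Convert a string of 0/1 characters (MSB first, assumed) to a hex code."""
    value = int(bits, 2)      # parse the pattern as a base-2 number
    return f"0x{value:02X}"   # format as two upper-case hex digits


post_code = decode_post_bits("11000101")
print(post_code)  # prints 0xC5
```

So 11000101 is 128 + 64 + 4 + 1 = 197, which is 0xC5, agreeing with the conversion in the post.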

I’ve contacted Crucial and will be waiting for a response next week. So most probably not Framework’s fault.

The laptop is also stable with the single working module - I haven’t had any crashes or freezes.

Kaukov · December 17, 2024, 8:09pm

Last update:

It seems the issue only occurs when I have both memory sticks inserted. If only one stick is slotted in any of the slots the system is stable, but I get way less FPS in games. When both sticks are present, I get decent FPS, but the laptop shuts down in less than 5 minutes from launching the game.

I created a support ticket a month or more ago and still haven’t received a response so I might create a new one.

But both sticks run flawlessly on their own, and MemTest86+ passes with flying colors. Even with both sticks installed together, all MemTest86+ tests pass, yet the pair fails in real-world scenarios.

I’m now certain it’s a mainboard issue.

_ce · December 19, 2024, 11:25pm

Still seeing occasional crashes, though due to different daily use of my laptop in the last month (mostly coding and compiling, so I’m guessing not much heavy GPU work) it’s been about a month since the last crash.

I’ve noticed that my last crash in Nov as well as the last 2 crashes actually created a minidump indicating a DPC_WATCHDOG_VIOLATION. That doesn’t tell much but it does indicate a hanging driver for whatever reason.

The last 2 minidumps seemed to involve winhvr.sys (WSL was running both times) but the Nov dump involved mmcss.sys. I’ve been advised to “RAM shuffle” my RAM modules, which I’ll try in any case.

For what it’s worth: memory configuration is “Kingston DDR5 SODIMM FURY Impact 2x16GB 5600”, SSD is “WD Black SN770 1TB M.2 SSD”, driver is the latest pack from Framework (Unified AMD Ryzen 7040 Series Driver Bundle 2024-10-02), BIOS is 3.05 and Link State Power Management in advanced power options is set to Off.