Framework 13 AMD crashing on different OSs

voltaaa · May 8, 2024, 10:42am

Hey everyone,
I’ve owned a Framework 13 AMD for 2 months (and have loved it so far…), but two weeks ago it started crashing randomly (multiple times a day) during use.

Specs:
AMD Ryzen 7 7840U
DDR5-5600 - 64GB (2 x 32GB)
WD_BLACK SN850X NVMe - M.2 2280 - 2TB

OSs:
I have Windows 10, Fedora 40 KDE and Pop! OS 22.04 installed in parallel and the crashes happen on all of them.

Symptoms:
Generally the OSs work fine, except the usual suspend issues on Linux (sometimes not waking up but rebooting instead or black screen after waking up).
When the OSs crash, it mostly happens while watching videos or gaming. Temperatures are completely fine: below 60°C, with no fans going crazy or similar. I run the energy-saving power profiles most of the time anyway, even when connected to power.

Crashing on Windows means:

flickering screen with artifacts (like for a faulty graphics driver)
instant reboot
no event logs for the crash except “kernel power 41”, no warnings beforehand or similar

Crashing on Linux is consistent between Pop and Fedora and means:

Screen goes black, sometimes chopped audio artifacts
instant reboot
no indicator in journalctl, nothing suspicious in the last log entries before the crashes happen

What I did already:

running memtester multiple times, testing 50GB of the 64GB RAM (the rest was already allocated); no issues were found
installing latest drivers for the OSs, performing firmware/chipset and BIOS update, issue still present

Any ideas how I can narrow it down? I fear it’s a hardware issue but I’m wondering how I would debug it properly.

Many thanks in advance!

Chad_Nelson · May 8, 2024, 10:58am

Welcome to the community!

It does sound like a hardware issue, but it may well be a correctable one. I would suggest trying the method outlined here first; if that fails, scroll up a little to see my original post on how to force it. That technique seems to solve a number of hardware-related problems, large and small.

If the problem persists, you’ll likely need to talk to Framework’s support team.

Good luck! And please let us know how/whether that works for you.

Jaroslaw_Fedewicz · May 8, 2024, 11:00am

Interesting, I have a non-Framework laptop using the same AMD platform, and I’m observing the same kind of problems. RAM is ok as tested by memtest86+, nothing in the logs, just insta-reboot, vaguely reminiscent of what happens when some kind of overvoltage/overcurrent kicks in.

My system is GPD Win with a 7840HS inside, so yeah, but I’m just hoping against all hope the fix for the Framework could apply to my system, too.

I, however, have been so far unable to pin this on any particular graphics usage or CPU load (mine can reboot even when idle).

The only very finicky idea that’s on my mind right now is to set up continuous monitoring of all possible sensors and immediately sending them over the network so that hopefully a spike in temperature/voltage can be seen as close to the crash as possible.
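A minimal sketch of that kind of monitoring, assuming a Linux system that exposes its sensors through the hwmon sysfs interface (/sys/class/hwmon). The function names, the 0.5 s interval, and the receiving host and port are illustrative assumptions, not taken from any existing tool. Each sample is sent as one UDP datagram, so the last reading before a reset is already on another machine when the laptop goes down.

```python
import glob
import os
import socket
import time


def read_hwmon(root="/sys/class/hwmon"):
    """Return {"chip/file": raw_value} for every readable *_input file under root."""
    readings = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "*_input"))):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)  # fall back to hwmonN
        try:
            with open(path) as f:
                value = int(f.read().strip())
        except (OSError, ValueError):
            continue  # some sensors return an error when read; skip them
        readings[f"{chip}/{os.path.basename(path)}"] = value
    return readings


def format_line(readings, now=None):
    """Format one sample as 'timestamp key=value key=value ...'."""
    ts = time.time() if now is None else now
    fields = " ".join(f"{k}={v}" for k, v in sorted(readings.items()))
    return f"{ts:.3f} {fields}"


def stream(host, port, interval=0.5):
    """Send one sample per interval to host:port over UDP until interrupted."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(format_line(read_hwmon()).encode(), (host, port))
        time.sleep(interval)


# Example (hypothetical receiver address): stream("192.168.1.10", 9999)
```

The values are the raw hwmon units (millidegrees Celsius for temp*_input, millivolts for in*_input). UDP is used on purpose: nothing is buffered waiting for an acknowledgement, at the cost of possibly losing a sample.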

I have tried setting up netconsole for the Linux kernel, but there wasn’t anything useful in the kernel log at the time of reboot. No warnings, no nothing. So very likely a firmware/hardware level protection kicking in.
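For anyone wanting to repeat the netconsole attempt, here is a rough sketch of a receiver for the second machine. It assumes netconsole is sending to its default target port 6666; the function names and the log path are made up for illustration. It only timestamps and stores whatever arrives, so it cannot show anything the kernel never got the chance to print.

```python
import socket
import time


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_line(sock, now=None):
    """Block for one datagram and return 'timestamp sender message'."""
    data, (sender, _port) = sock.recvfrom(65535)
    ts = time.time() if now is None else now
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return f"{ts:.3f} {sender} {text}"


def log_forever(sock, path):
    """Append every received message to path, flushing after each line."""
    with open(path, "a", buffering=1) as log:
        while True:
            log.write(receive_line(sock) + "\n")


# Example (hypothetical path): log_forever(open_listener(), "netconsole.log")
```

The receive timestamp gives a rough lower bound on when the machine was still alive, which can be lined up against the time of the reboot.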

voltaaa · May 18, 2024, 7:53am

Thanks to both of you for your input and apologies for the late response.

Thanks, Chad! Re-inserting the extension cards as described in your linked post seems to have helped. I have not seen a single issue since then. I did not use the laptop very heavily in the last week because I was on holiday, but the crashes/freezes have not happened again so far. I’ll push the machine a bit more in the next weeks and keep an eye on how it behaves. But I’m hopeful this might have been it.

Jaroslaw, sorry, that this is probably not helping you with your issue. I hope you get it sorted out.

Thanks again for your support and keep up being an awesome community.

McAllaster · July 9, 2024, 6:53pm

Running into the same issue (instant reboot) with my FW13 7840U 64GB RAM, journalctl has no indication of a crash. Running openSUSE Tumbleweed using the latest firmware/bios.

I’m curious if this is load-dependent. I was having this issue on Gnome frequently until I disabled file indexing. However, with KDE I get it more often (even with file indexing turned off) which has caused me to start looking for solutions.

I assume it’s also worth noting that I’m almost always running my laptop docked to a Caldigit TB4 with a high refresh rate 4k display connected, so obviously there’s a bit more stress involved than with the laptop undocked.

Trebor_Redins · August 5, 2024, 8:54pm

Wondering if there are any new ideas about what might be causing this. I have the same issue that I believe only happens when I am plugged into an Aorus gaming box eGPU.

voltaaa · August 6, 2024, 9:49am

Hey everyone, quick update from me:

The instant reboot issues were gone for about a week and then returned. I needed to reach out to FW support and meanwhile got my mainboard and CPU replaced.
It was not the greatest customer experience since it took me weeks to convince them that it is a hardware issue but eventually they sent me replacement parts and after replacing the board I - so far - did not have a single instant reboot in 1 1/2 months.

Running dual boot with Fedora 40 Plasma and Win 11 right now (with other distros/OSs you will not get support at all) and it finally feels like I’m having a reliable machine in hands.
I hope you guys get your issues sorted out.

Paul_Brown · October 25, 2024, 11:31pm

I’ve been dealing with a very similar issue on my Framework 13 AMD and wanted to share my experience so far. I’ve been going in circles with support, and it’s been incredibly frustrating. Here’s a rundown of my specs, symptoms, and only some of the things I’ve tried:

My Setup:

Processor: AMD Ryzen 7 7840U
RAM: 64GB DDR5-5600 (tested with both official Framework RAM and third-party modules)
OS: Ubuntu 24.04 (clean stock install, also had the issue with 22.04)

Symptoms:

Random Restarts: The laptop will suddenly restart without warning, sometimes under load like video playback and sometimes under minor load like web browsing, with no clear error logs or journalctl entries before the crash. Sometimes these crashes happen after days of normal usage. Sometimes it happens multiple times a day.
No Prior Issues on Intel: I previously used an Intel mainboard without any stability problems, so this seems specific to the AMD setup.

What I’ve Tried:

RAM Testing: I’ve done extensive testing with memtest86+, trying multiple configurations:

Tested both third-party and official Framework RAM.
Swapped RAM sticks between slots, used single sticks, and ran memtest86+ with each configuration. All tests passed without errors.

Firmware, BIOS, and Driver Updates: I’ve made sure everything is up-to-date:

Updated to the latest BIOS version recommended for AMD.
Installed the latest kernel and AMD-specific drivers for Ubuntu.

USB-C and Dock Testing:

Tried multiple USB-C docks and hubs (Anker 555, Dell D6000, Vava, Caldigit TS3 Plus).
Tested with and without external monitors attached, and even with no peripherals connected.
Crashes still happen randomly, often while docked, but also at times with no external connections.

Testing with Different Chargers:

Switched between a 96W MacBook charger and a 61W model to see if charging affected stability, but the crashes continued on both.

Miscellaneous Troubleshooting:

I tried running with the bezel removed to rule out display issues.
I also tried different USB-C expansion slots, as well as charging from various ports to see if it changed the crash frequency.

Framework Support suggested using a Live USB environment for extended testing to rule out software issues, but since this is my only laptop, it’s not practical to operate from a Live CD for multiple days. This seems like overkill since I’ve already done a fresh install. Given the level of testing I’ve already completed, I’m becoming more convinced this is a hardware or electrical issue with the AMD mainboard.

This issue has caused me to cancel my preorders for the upgraded display and a Framework 16. I can no longer recommend getting an AMD Framework laptop; this is by far the most unstable laptop I’ve ever had.

Andrew_B · October 28, 2024, 12:51am

This sounds identical to what I’m going through. I’ve tried so many configurations of USB peripherals, but it doesn’t seem to make a difference; any type of charge massively increases the chance of a random crash. RAM sticks, SSD, WiFi card and expansion cards have all been ruled out as the cause. I was able to quickly replicate the issue running the Fedora 40 live USB.

Which batch did you get yours from? Seems like there’s a number of these popping up this month, I’m also convinced there’s a hardware/electrical fault in the mainboard and likely from the September batch.

Paul_Brown · October 28, 2024, 1:10am

Mine’s from Batch 2 and shipped late October of last year. Framework support has now asked me to provide more logs, current system details, and to reset the BIOS to defaults again. I’ve been dealing with this for about a year, and while the laptop had a lot more issues initially (like display artifacting with hard freeze crashes), some of those were actually resolved with BIOS or firmware updates over time. I put up with the random crashes for a while, hoping they’d eventually be fixed too, but I’ve recently lost work from these sudden restarts, and I’m starting to lose patience.

I’ve been a big supporter of Framework and have already upgraded my mainboard twice, but this AMD mainboard has been really unstable from the start. At this point, I’m also convinced this is a hardware or electrical issue specific to the board, and it’s not something I can troubleshoot much further on my end.

Let’s hope we can get some clarity from Framework soon—please keep me posted on any progress on your end!

Andrew_B · October 28, 2024, 1:26am

Damn, it’s concerning that you have a much older AMD unit with the same problem; I was hoping it might just be a bad batch. I’ll have to seriously consider swapping to Intel in the hopes of improved stability.

lbkNhubert · October 28, 2024, 2:09am

It feels like they should swap out the mainboard at this point.

olenananas · October 28, 2024, 11:47am

Same(-ish?) thing happening to me. Seems like it’s more likely to happen under load, but I’ve had it happen before the OS has even loaded as well. Graphics glitches appear on the screen, then a total freeze, and any audio that was playing repeats what is probably the last buffer sent to the device.

olenananas · October 28, 2024, 1:34pm

After running memtest86+ for multiple loops, it seems like there are some issues.

Interestingly, it doesn’t fail at the same addresses, nor on nearly every test loop, and it very rarely fails on the first loop. The errors seem to come in batches; at worst, I got over 300 errors.

I am going to try and swap my two RAM sticks around and run the memtest overnight.

Isaac_Harper · October 28, 2024, 2:05pm

Dealing with this issue also,

Support eventually replaced 1 stick of RAM they decided was bad. On the first boot afterwards I had a crash within 20 minutes.

I reached back out and haven’t heard back in a few days, wondering what will be tried next.

Mine is also an earlier batch

olenananas · October 28, 2024, 2:27pm

Ok, swapped the two RAM sticks around, still get the same addresses for errors, usually in the 0x000700000000-0x0007FFFFFFFF range.
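To put that range in more familiar terms, here is a quick conversion of the two addresses to GiB. This treats them as plain physical addresses as reported by memtest86+; it makes no assumption about which stick or channel backs that region, since interleaving and remapping can spread a physical range across both.

```python
GIB = 1 << 30  # bytes per GiB


def range_in_gib(start, end_inclusive):
    """Return (start, end) of an inclusive physical address range in GiB."""
    return start / GIB, (end_inclusive + 1) / GIB


lo, hi = range_in_gib(0x000700000000, 0x0007FFFFFFFF)
print(f"{lo:g} GiB to {hi:g} GiB")  # prints: 28 GiB to 32 GiB
```

So the failing addresses sit in a 4 GiB window around the middle of the 64GB address space, and that window stayed put when the sticks were swapped.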

Starting to look like a motherboard issue to me.

Adrian_Joachim · October 28, 2024, 2:43pm

Have you tried cleaning the contacts on the DIMM with alcohol? DDR5 (especially as SO-DIMM) is extremely sensitive to clean contacts.

James3 · October 28, 2024, 2:44pm

@olenananas
I don’t know how AMD arranges RAM addresses vs RAM chips.
It might be interleaved. I.e. byte 0 on RAM chip1, byte 1 on RAM chip2, byte 2 on RAM chip1, byte 3 on RAM chip2.
I believe ECC is used at L1 Cache, L2 Cache, L3 Cache, and on the DDR5 RAM chip.
I don’t believe ECC is used on the data lines between the CPU and the RAM chip.
In which case, it might be useful to find out how AMD arranges the RAM addresses, then look in the errors for the offsets of the mismatching data, and work out from there which RAM chip has the problem. Or, as you said, if the mismatch does not move with the RAM chip, it must be a motherboard-based problem.
I have always thought the ECC that also covers the data lines between chips is the way to go, but Intel and AMD only use that on Servers as far as I know.
The cause might also be RF interference. Do you have any mobile phones, washing machines or other electrical items near the laptop when doing the memtest?

I found out how the AMD 7840 maps physical addresses to RAM chips. One RAM chip is channel A, the other is channel B. The least significant bit (LSB) of the 64-bit address selects channel A or B.
So, it is actually as I describe above. Byte 0 → Channel A, Byte 1 → Channel B etc.

Looking at the data from the pics you posted:
1e21 679d 38c6 5c6b < Expected
e121 679d c7c6 5c6b < Found

0741 6ba0 6cf4 4a2c < Expected
f841 6ba0 93f4 4a2c < Found

cc7b c3f8 0b75 4987 < Expected
337b c3f8 f475 4987 < Found

dcb4 2aee 0ad6 c2eb < Expected
23b4 2aee f5d6 c2eb < Found

The errors are all when LSB=0, and OK when LSB=1. In each pair the mismatching bytes are the first byte of the first and third groups (e.g. 1e vs e1 and 38 vs c7 in the first pair).
So, this points to a fault on one of the DDR5 RAM channels.
So, if you swap the RAM chips, the errors should move from being on LSB=0 to bad on LSB=1.
But, looking at some values further down, the LSB=1 are bad.
E.g.
Via CPU 3 == LSB=0 bad
Via CPU 9 == LSB=1 bad.

There is not a big enough sample size there though.
I therefore suspect a motherboard change is needed: the fault might even be in the CPU, and the CPU is soldered to the MB.
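The expected/found pairs quoted above can also be compared bit by bit with XOR, which makes the pattern easier to see than reading hex by eye. This is only a sketch over the four pairs from the post, grouped into 16-bit values the same way they are quoted; it assumes nothing about byte order in memory or about which channel a given byte belongs to.

```python
# (expected, found) pairs copied from the memtest86+ output quoted above.
pairs = [
    ("1e21 679d 38c6 5c6b", "e121 679d c7c6 5c6b"),
    ("0741 6ba0 6cf4 4a2c", "f841 6ba0 93f4 4a2c"),
    ("cc7b c3f8 0b75 4987", "337b c3f8 f475 4987"),
    ("dcb4 2aee 0ad6 c2eb", "23b4 2aee f5d6 c2eb"),
]


def xor_groups(expected, found):
    """XOR each 16-bit group; a set bit marks a bit that differs."""
    return [int(e, 16) ^ int(f, 16)
            for e, f in zip(expected.split(), found.split())]


for expected, found in pairs:
    print(" ".join(f"{x:04x}" for x in xor_groups(expected, found)))
# each of the four lines prints: ff00 0000 ff00 0000
```

In all four samples the same two byte positions are wrong, and each wrong byte differs from the expected one in all eight bits, while the other bytes match exactly. Four samples is still a small set, so this narrows down the pattern rather than the cause.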

olenananas · October 28, 2024, 3:12pm

I’m still going to run through the memtests with each stick installed on its own, in each of the two slots. That should give enough info to see if this actually is a mobo problem or if one of the sticks is faulty (which seems a bit unlikely, as I’d expect far more consistent failures in that case).

And frankly, if running electrical items near FW laptop blows up memtest, the laptop (or any laptop where that is enough to cause memory failures) is not fit for market, and I’m pretty sure Framework engineers would agree.

Adrian_Joachim · October 28, 2024, 3:35pm

Solid approach.

It’s less that running them disturbs a properly working laptop and more that it causes a marginal one to fail.

I am really looking forward to LPCAMM. DDR5 SO-DIMMs are running pretty close to the limit even when they are working; those traces are just too damn long.