Framework 13 AMD crashing on different OSs

Paul_Brown · October 28, 2024, 1:10am

Mine’s from Batch 2 and shipped late October of last year. Framework support has now asked me to provide more logs, current system details, and to reset the BIOS to defaults again. I’ve been dealing with this for about a year, and while the laptop had a lot more issues initially (like display artifacting with hard freeze crashes), some of those were actually resolved with BIOS or firmware updates over time. I put up with the random crashes for a while, hoping they’d eventually be fixed too, but I’ve recently lost work from these sudden restarts, and I’m starting to lose patience.

I’ve been a big supporter of Framework and have already upgraded my mainboard twice, but this AMD mainboard has been really unstable from the start. At this point, I’m also convinced this is a hardware or electrical issue specific to the board, and it’s not something I can troubleshoot much further on my end.

Let’s hope we can get some clarity from Framework soon—please keep me posted on any progress on your end!

Andrew_B · October 28, 2024, 1:26am

Damn, it’s concerning that you have a much older AMD unit with the same problem; I was hoping it might just be a bad batch. I’ll have to seriously consider swapping to Intel in the hopes of improved stability.

lbkNhubert · October 28, 2024, 2:09am

It feels like they should swap out the mainboard at this point.

olenananas · October 28, 2024, 11:47am

Same(-ish?) thing happening to me. It seems more likely to happen under load, but I’ve also had it happen before the OS has even loaded. I get graphics glitches on the screen and a total freeze, and any audio that was playing repeats what is probably the last buffer sent to the device.

olenananas · October 28, 2024, 1:34pm

After running memtest86+ for multiple loops, it seems like there are some issues.

Interestingly, it doesn’t fail at the same addresses each time, and it doesn’t fail on anywhere near every test loop. It also very rarely fails on the first loop. The errors seem to come in batches, and at worst I got over 300 errors.

I am going to try and swap my two RAM sticks around and run the memtest overnight.

Isaac_Harper · October 28, 2024, 2:05pm

Dealing with this issue also.

Support eventually replaced one stick of RAM that they decided was bad; on the first boot afterwards I had a crash within 20 minutes.

I reached back out and haven’t heard back in a few days, so I’m wondering what will be tried next.

Mine is also an earlier batch.

olenananas · October 28, 2024, 2:27pm

Ok, swapped the two RAM sticks around, still get the same addresses for errors, usually in the 0x000700000000-0x0007FFFFFFFF range.

Starting to look like a motherboard issue to me.
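
For scale, the address range quoted above can be converted into sizes with a few lines of Python. This is only a sketch of the arithmetic: the helper name `describe_range` is made up for illustration, and because physical address maps contain holes (for example for device memory), the result should not be read as "the top 4 GiB of installed RAM". The thread does not state how much RAM is installed.

```python
GIB = 1 << 30  # bytes per GiB


def describe_range(start: int, end_inclusive: int) -> tuple:
    """Return (size, start offset, end offset) of a physical range, in GiB."""
    size = end_inclusive - start + 1
    return (size / GIB, start / GIB, (end_inclusive + 1) / GIB)


if __name__ == "__main__":
    # Range as reported in the post above.
    size, lo, hi = describe_range(0x000700000000, 0x0007FFFFFFFF)
    print(f"{size:g} GiB window, from {lo:g} GiB up to {hi:g} GiB")
```

So the failing addresses all fall inside a single 4 GiB window of the physical address space, starting at the 28 GiB mark.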

Adrian_Joachim · October 28, 2024, 2:43pm

Have you tried cleaning the contacts on the DIMM with alcohol? DDR5 (especially as SO-DIMM) is extremely sensitive to dirty contacts.

James3 · October 28, 2024, 2:44pm

@olenananas
I don’t know how AMD arranges RAM addresses vs RAM chips.
It might be interleaved. I.e. byte 0 on RAM chip1, byte 1 on RAM chip2, byte 2 on RAM chip1, byte 3 on RAM chip2.
I believe ECC is used in the L1, L2, and L3 caches, and on the DDR5 RAM chip itself.
I don’t believe ECC is used on the data lines between the CPU and the RAM chip.
In that case, it might be useful to find out how AMD arranges the RAM addresses, then look at the offsets of the mismatching data in the errors and work out from there which RAM chip has the problem. Or, as you said, if the mismatch does not move with the RAM chip, it must be a motherboard-based problem.
I have always thought the ECC that also covers the data lines between chips is the way to go, but Intel and AMD only use that on Servers as far as I know.
The cause might also be RF interference. Do you have any mobile phones, washing machines, or other electrical items near the laptop when running the memtest?

I found out how the AMD 7840 maps physical address to RAM chip. One RAM chip is channel A, the other is Channel B. The LSB bit of the 64 bit address selects the Channel A/B.
So, it is actually as I describe above. Byte 0 → Channel A, Byte 1 → Channel B etc.

Looking at the data from the pics you posted:
1e21 679d 38c6 5c6b < Expected
e121 679d c7c6 5c6b < Found

0741 6ba0 6cf4 4a2c < Expected
f841 6ba0 93f4 4a2c < Found

cc7b c3f8 0b75 4987 < Expected
337b c3f8 f475 4987 < Found

dcb4 2aee 0ad6 c2eb < Expected
23b4 2aee f5d6 c2eb < Found

The errors are all when LSB=0, and OK when LSB=1. In each pair, only the first byte of the first group and the first byte of the third group differ.
So, this points to a fault on one of the DDR5 RAM channels.
So, if you swap the RAM chips, the errors should move from being on LSB=0 to bad on LSB=1.
But, looking at some values further down, the LSB=1 are bad.
E.g.
Via CPU 3 == LSB=0 bad
Via CPU 9 == LSB=1 bad.

There is not a big enough sample size there though.
I therefore suspect a Motherboard change is needed, as it might even be the CPU having the bug, but the CPU is soldered on the MB.
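
The comparison of the expected/found pairs above can be reproduced mechanically. The sketch below XORs each pair and lists which bytes differ. The function names are made up for illustration, and it assumes the leftmost printed byte is offset 0 within the 8-byte word, which the memtest screenshots as quoted here do not confirm. Under that assumption, and under the address mapping described in this post (not independently verified), an even byte index corresponds to LSB=0.

```python
# Expected/found pairs copied from the post above (16-bit groups as printed).
PAIRS = [
    ("1e21 679d 38c6 5c6b", "e121 679d c7c6 5c6b"),
    ("0741 6ba0 6cf4 4a2c", "f841 6ba0 93f4 4a2c"),
    ("cc7b c3f8 0b75 4987", "337b c3f8 f475 4987"),
    ("dcb4 2aee 0ad6 c2eb", "23b4 2aee f5d6 c2eb"),
]


def xor_mask(expected: str, found: str) -> str:
    """XOR two space-separated hex strings group by group."""
    return " ".join(
        f"{int(e, 16) ^ int(f, 16):04x}"
        for e, f in zip(expected.split(), found.split())
    )


def differing_bytes(expected: str, found: str) -> list:
    """Indices of differing bytes (0 = leftmost printed byte, an assumption)."""
    e = bytes.fromhex(expected.replace(" ", ""))
    f = bytes.fromhex(found.replace(" ", ""))
    return [i for i, (a, b) in enumerate(zip(e, f)) if a != b]


if __name__ == "__main__":
    for expected, found in PAIRS:
        idx = differing_bytes(expected, found)
        # i & 1 is the address LSB under the mapping described in the post.
        print(xor_mask(expected, found), idx, [i & 1 for i in idx])
```

For all four pairs the mask comes out as `ff00 0000 ff00 0000`, so the two bad bytes are exact bitwise complements of the expected values rather than single flipped bits, and both sit at even byte indices (0 and 4).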

olenananas · October 28, 2024, 3:12pm

I’m still going to run the memtests with each stick installed on its own in each slot. That should give enough info to see whether this actually is a mobo problem or whether one of the sticks is faulty (which seems a bit unlikely, as I’d expect far more consistent failures in that case).

And frankly, if running electrical items near FW laptop blows up memtest, the laptop (or any laptop where that is enough to cause memory failures) is not fit for market, and I’m pretty sure Framework engineers would agree.

Adrian_Joachim · October 28, 2024, 3:35pm

Solid approach.

It’s less that running them disturbs a properly working one but that it causes a marginal one to fail.

I am really looking forward to lpcamm, ddr5 sodimms are running pretty close to the limit even if they are working, those traces are just too damn long.

olenananas · October 28, 2024, 4:24pm

James3 wrote: “I found out how the AMD 7840 maps physical address to RAM chip. One RAM chip is channel A, the other is Channel B. The LSB bit of the 64 bit address selects the Channel A/B.”

Oo, very nice. I can check this with the rest of my results. Thanks a bunch!

olenananas · October 28, 2024, 4:26pm

Adrian_Joachim wrote: “It’s less that running them disturbs a properly working one but that it causes a marginal one to fail.”

Yea, this seems fair.

Adrian_Joachim wrote: “I am really looking forward to lpcamm, ddr5 sodimms are running pretty close to the limit even if they are working, those traces are just too damn long.”

I wish consumer CPUs/memory would get ECC by default. Doubt that’s going to happen anytime soon.

Adrian_Joachim · October 28, 2024, 5:39pm

Same, especially since there aren’t a whole lot of parts missing. Unfortunately it seems like you can’t have ECC in LPCAMM and you can’t have LPDDR in CAMM (which can have ECC), so that sucks. Of the two, I personally would go with LPDDR, but it sucks that you have to choose.

Paul_Brown · November 2, 2024, 8:44pm

Quick update: Framework support recently asked me to reset the mainboard by pressing the center button 10 times, which I did, but the random restarts are still happening. I also tried removing the internal Wi-Fi adapter and using a USB dongle, but that didn’t help either.

Although I’ve already run memtest86+ overnight twice without any issues, I might try it again given all the recent focus on RAM testing. However, since Framework said they’d escalate this issue, I expected more concrete feedback on the logs I submitted—instead, I’ve only received more general troubleshooting steps.

At this point, I’ve exhausted most options on my end, and everything still suggests a hardware or mainboard fault. Hoping they’ll take the next step soon. I’ll keep everyone posted.

FW_NT · November 3, 2024, 9:57am

Hi, I think we have a discussion of the same issue in this post. I went through a roughly two-month diagnostic process with Framework support, which led to a mainboard replacement.

The point is that it could be almost anything other than the mainboard, so you have to test everything before concluding that the mainboard is faulty.

To help, I gave a summary of the steps I went through here.

Hoping this helps you find out faster what you have, and eventually shortens the support process.

olenananas · November 6, 2024, 6:51pm

I did some more investigation. I’m a bit stumped.

In my case, either of the RAM chips alone in either memory slot works just fine, hours of memtest86+ with no problems.

But if I plug both of them in at the same time, I start seeing memtest86+ errors similar to those I reported previously. I had the laptop running much more stably by blacklisting some of the memory addresses, but it ended up being a temporary workaround (as in, it just doesn’t work anymore, at least not with the address ranges I put in).
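
The post does not say which mechanism or which address ranges were used for the blacklisting. As one possible illustration, assuming a Linux system, the kernel’s `memmap=size$start` boot parameter marks a physical region as reserved so the kernel will not use it. The sketch below only formats such an argument; the helper name is hypothetical, and the example range is the one from the earlier memtest results, not necessarily what was actually blacklisted.

```python
def memmap_reserve_arg(start: int, end_inclusive: int) -> str:
    """Format a Linux `memmap=size$start` argument reserving [start, end]."""
    if end_inclusive < start:
        raise ValueError("end must not be below start")
    size = end_inclusive - start + 1
    return f"memmap={size:#x}${start:#x}"


if __name__ == "__main__":
    # Illustrative range only, taken from the earlier memtest report.
    print(memmap_reserve_arg(0x000700000000, 0x0007FFFFFFFF))
```

Note that the `$` usually has to be escaped in bootloader configuration files, and that reserving a whole 4 GiB window like this gives up a lot of usable memory, so narrower ranges would normally be preferred if the failing addresses are stable.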

Paul_Brown · November 17, 2024, 6:40am

Another update: Framework finally sent me a replacement mainboard, and I’ve been using it for a few days without any random restarts.

Andrew_B · November 17, 2024, 11:19pm

I just got my replacement mainboard fitted too, early days but so far so good.

@James3 posted some interesting info about the EC firmware in my thread about the same crashing issue; just cross-posting for anyone else following along: Framework 13 AMD Hard Crashing Issue - #20 by James3

Edit: still crashing with the replacement mainboard
Edit2: not crashing with my 5600MHz now fitted

chrols · November 18, 2024, 9:15am

My AMD 13 7840U seems to have developed the same issue. Random reboots which can happen any time, though much more likely during heavy loads.

During light loads it can be weeks between reboots. During heavy loads I would guess it’s typically around two or three hours on average before it reboots or locks up.

journalctl gives no indication of any issue. Memtest86+ and ectool panicinfo do not indicate that anything is wrong either.

@Paul_Brown: Quite interested to hear how things turn out with the motherboard replacement. Did you do anything to directly implicate the motherboard or was it simply that you had ruled out everything else?

I quite like the laptop overall, but the fact that it can randomly decide to throw away my work is of course not welcome.