@Deuce Correct, key questions. Hopefully Framework can answer more clearly soon. I suspect AMD may need to assist and fix things on the firmware side too. Assuming it can be fixed via microcode/firmware.
That’s not true. It really depends on the implementation. I’ve read that many newer ASRock (consumer) boards do correction, but reporting depends on the CPU (PRO only), whereas ASUS boards actually report errors with all supported CPUs.
I have many machines with Ryzen + ECC UDIMMs, and one that I run 24/7 had ONE corrected ECC error logged (FreeBSD) in over 3 years. It’s apparently super rare, especially for light use like mine. Note this machine runs a Ryzen 3 PRO 2200GE.
kernel: MCA: Bank 15, Status 0x9c2040000000011b
kernel: MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
kernel: MCA: Vendor "AuthenticAMD", ID 0x810f10, APIC ID 0
kernel: MCA: CPU 0 COR GCACHE LG RD error
kernel: MCA: Address 0x40000002f012b80
kernel: MCA: Misc 0xd01b0fff01000000
I painstakingly decoded it by referencing AMD document 56255:
o Bank 15 = UMC
o LG is Level-Generic
o Status:
  o ErrorCode
    o Error Code: Memory
    o Transaction Type: Generic
    o Memory Transaction Type: Generic Read (RD)
    o Cache Level: Generic (LG)
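For anyone who wants to repeat the decode without paging through the PDF, here is a minimal Python sketch that pulls the same fields out of the Status value in the log above. The flag bits 63 to 57 and the low 16-bit memory error code layout (0000 0001 RRRR TTLL) are the standard x86 machine-check layout. The CECC/UECC positions (bits 46 and 45) are my assumption from AMD's public documentation, so double-check them against document 56255. The names `decode_status`, `FLAGS`, and `AMD_FLAGS` are just made up for this example.

```python
# Sketch: decode the MCA Status word from the FreeBSD log line above.
STATUS = 0x9C2040000000011B

# Standard x86 MCi_STATUS flag bits.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}
# Assumption: AMD-specific bit positions, verify against AMD document 56255.
AMD_FLAGS = {46: "CECC", 45: "UECC"}

# Field values of the memory error code, 0000 0001 RRRR TTLL.
RRRR = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TT = {0: "Instruction", 1: "Data", 2: "Generic"}
LL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_status(status):
    """Return the set flags and, for memory errors, the decoded error code."""
    flags = [name for bit, name in {**FLAGS, **AMD_FLAGS}.items()
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}
    if (code >> 8) == 0x01:  # upper byte 0x01 marks a memory error code
        result["kind"] = "Memory"
        result["request"] = RRRR.get((code >> 4) & 0xF, "?")
        result["transaction"] = TT.get((code >> 2) & 0x3, "?")
        result["level"] = LL.get(code & 0x3, "?")
    return result


if __name__ == "__main__":
    for key, value in decode_status(STATUS).items():
        print(key, value)
```

UC is clear and (if the bit 46 assumption holds) CECC is set, which matches the "COR" in the kernel's own summary line, and the RD / Generic / LG fields match "GCACHE LG RD".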
Also note, the same silicon dies are used for various products, including EPYC Embedded (e.g. 3451), so the chip itself is fully capable.
The fact that RAM errors (and corrections) are so rare makes the feature less essential. Other anecdotal reports also indicate it being a once-in-a-year event. If Windows sfc /scannow is detecting errors, I wouldn’t immediately blame RAM. It’s Windows, after all.
Does ECC help against Rowhammer type attacks? If so, it should be mandatory!
ECC makes Rowhammer more difficult, but doesn’t protect against it: Rowhammer Data Hacks Are More Dangerous Than Anyone Feared | WIRED
DDR5 RAM implements ECC on the memory dice, where Rowhammer happens, so it’s already mandatory for the part affected by Rowhammer.
If you’re the target of a Rowhammer attack, the Framework Laptop is probably not for you.
That’s not the question. I’d consider any new hardware buggy if it is susceptible to long-known attacks like Rowhammer. Since it is not exactly cheap, I do expect the designers not to cut corners in the safety department, at least not as far as foreseeable dangers to data integrity go.
I use my backups less than once a year, but don’t find them any less essential for that. While it’s axiomatic that things work when they’re working, having no way to know that things aren’t working leads to all sorts of misdiagnoses when there’s a problem.
I would love ECC support. I own a mini PC with an embedded AMD CPU that supports ECC memory. Detecting an error and crashing is the best-case scenario when memory gets corrupted. The worst-case scenario is the error going undetected and corrupting application state and storage. Anecdotally, I have frequently run into faulty memory in servers with 64 GB or more of memory, especially when they run processes that use tons of heap memory. This blog post from a JVM developer offers lots of technical insight into how most bug reports he receives turn out to be cases of faulty memory. One paragraph I would like to emphasize from the blog post is the following.
As the industry, we know that memory errors are common. An old, but widely cited paper ballparks the incidence rate at 25 events per Gbit per year. If we boldly assume the uniformity, then on a 128 GB machine, that amounts to about 3 events per hour!
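The arithmetic behind that "about 3 events per hour" figure is easy to check. A quick sketch, assuming (as the quote does) a uniform 25 events per Gbit per year and a 365-day year; the function name is just for illustration:

```python
# Back-of-the-envelope check of the quoted error rate.
EVENTS_PER_GBIT_PER_YEAR = 25      # figure taken from the quote above
HOURS_PER_YEAR = 365 * 24          # assumes a 365-day year


def events_per_hour(capacity_gbytes, rate=EVENTS_PER_GBIT_PER_YEAR):
    """Expected memory error events per hour for a given RAM capacity in GB."""
    capacity_gbits = capacity_gbytes * 8
    return capacity_gbits * rate / HOURS_PER_YEAR


if __name__ == "__main__":
    print(round(events_per_hour(128), 2))  # 2.92
```

So 128 GB is 1024 Gbit, which gives 25,600 events per year, or a bit under 3 per hour. The result scales linearly with capacity, and it stands or falls with the uniformity assumption.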
Anecdotally, I have witnessed this on servers and my work laptop, where a segfault occurs in a JVM process using gigabytes of memory, and a memory test reveals that one of my RAM sticks is faulty.
With that said, I think ECC support is a “nice to have” feature. But I also think implementing ECC support is outside of Framework’s control, since it ultimately depends on Intel or AMD deciding to allow this feature in consumer hardware (and historically, Intel has opposed ECC support in consumer hardware, so our only hope is AMD allowing this). The mini PC I mentioned at the start of my post is actually an “industrial PC”, not intended for everyday consumers.
Overall: “nice to have” feature, but I doubt Framework can do much about it since it’s up to Intel or AMD to decide.
Yes, ultimately it’s up to AMD in this case. But they do list ECC on their spec sheet as supported if the platform OEM chooses to support it. And we know the chiplets are capable.
We know the answer is not so clear cut as the platform OEM (here Framework and their partners) needs the proper firmware from AMD for the Framework-specific BIOS versions.
Since Framework is AMD’s direct customer here, we need to hear back from them on whether AMD will help or not. And hopefully Framework has not done something silly like failing to route the proper ECC signal lines on the mainboard.
Very unlikely for Framework to say. Regardless of how the work on ECC is going, Framework has worked with AMD on the Framework Laptop 16. It is not good (for Framework) to then turn around and say something like “There are issues with ECC, and they can’t be solved without adequate support from AMD, which we aren’t getting”, in other words, anyone who wants ECC should go blame AMD.
We must just assume that Framework is working on enabling ECC, and any perceived delay may not even be their fault. But we don’t get to know for sure. Other boards may have AMD ECC support, but we don’t know how much assistance from AMD it took to get it working, or how much Framework is getting.
And I’d like to remind everyone that the Framework Laptop 16 isn’t even shipping yet, and won’t be for a while.
Yeah, no one expects Framework to be unprofessional and say negative things about AMD. No one is asking for that.
But that’s different than updating on relative priorities, progress, etc. The latter would be nice, when appropriate.
@Framework could we please get an update on the current status/knowledge, and what the future plan/timeline might be if ECC support hasn’t been demonstrated yet? It’s a very important feature for some workloads. Thank you.
Afaik they aren’t even sure if it is actually supported on the platform (despite it being officially supported) - which implies to me they tried, and it didn’t work immediately, and they haven’t yet been able to get clarification from AMD on why that is and how to get it working. Or something entirely different, but that’s the feeling I got.
Asking here is unlikely to get you an answer from the team, as this is a user forum, and if Framework could give an update they would have. If you really want an official answer you should send a message to support, but you are unlikely to get any new information that way.
Not sure I saw it in this thread, so adding. It seems that some Framework documentation on ECC can be found in the FAQ regarding memory.
For what it’s worth, I crave ECC as well.
I’ve not seen any ECC modules rated above 4800 MT/s, so that’d impact performance.
There’s some basic error correction on the actual DDR5 memory chips now, which is part of the spec. Just not full ECC. Which is an improvement.
There seems to be this, which looks a lot like 5600 unbuffered ECC SODIMMs and would fit the bill. I have not checked other manufacturers.
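To put a rough number on the speed difference being discussed, here is a small sketch comparing theoretical peak bandwidth for DDR5-4800 and DDR5-5600. It assumes 64 data bits per module (the extra ECC bits carry no payload) and ignores latency, timings, and real-world efficiency, so treat it as an upper bound only. `peak_bandwidth_gbs` is a made-up helper name.

```python
def peak_bandwidth_gbs(transfer_rate_mts, data_width_bits=64):
    """Theoretical peak bandwidth in GB/s for one module.

    transfer_rate_mts: transfers per second in millions (e.g. 4800 for DDR5-4800)
    data_width_bits:   data bits per transfer (assumed 64, ECC bits excluded)
    """
    bytes_per_transfer = data_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    slow = peak_bandwidth_gbs(4800)
    fast = peak_bandwidth_gbs(5600)
    print(f"DDR5-4800: {slow:.1f} GB/s")
    print(f"DDR5-5600: {fast:.1f} GB/s")
    print(f"4800 is {100 * (1 - slow / fast):.1f}% lower at peak")
```

On paper that is 38.4 GB/s against 44.8 GB/s per module, roughly a 14% gap at peak. How much of that shows up in practice depends heavily on the workload.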
Not really, the on-die-ECC essentially just corrects errors caused by higher speed and density in the memory chips for DDR5. I think best case expectation would be slightly better resilience, but probably nothing to write home about.
This is probably an artifact of AMD’s “take it or leave it” stance on ECC for “consumer” SKUs. At least on desktop there are lots of non-PRO processors working just fine with ECC on many AM4 (probably now AM5 too) boards. AFAIK the only way to get “official”/“validated” ECC support is Threadripper and higher SKUs. Nevertheless, the hardware should be capable at least down to all desktop SKUs except non-PRO APUs (Ryzen xxxxG). I use ECC with a desktop Ryzen 7 3700X, and when I overclocked the RAM to force errors, error detection/correction seemed to work as expected. Actually, I currently run it with a heavy overclock (Samsung B-die) because ECC gives me (enough) peace of mind to do this. I would never run normal RAM this much over spec. Thinking about it, I would prefer not to run non-ECC at all.
Practically every internal high-speed data bus has at least some form of error detection (SATA, PCIe, etc.), but the main memory bus somehow does not on many/most consumer systems.
Yeah really. It’s better than no error correction. Corrects errors on the chips on the DIMM, but not in transmission. At work when we see memory errors in the server kit, it’s usually the former, not the latter.
As a standard feature I approve.