ECC support?

I’ll agree with you on that, but as you mentioned, it may be slight or marginal. For casual users, I think data security on SSDs is more of a concern or at the forefront.

2 Likes

And how can you ensure that, if every single bit going to the SSD must go through possibly failure-inducing non-ECC RAM? How would you even detect such errors?

I have collected lots of data over the years and there are at least dozens of files I have seen with some kind of corruption (most easily found in compressed video/images). I have not found such a file from the timespan where my NAS has used ECC yet.
That might have been software bugs, disk errors, file system shenanigans or any number of causes but I take the added safety of ECC for the 10% to 20% additional cost on RAM any day over not even being able to discover errors at all.

My NAS is currently a cheap Intel Celeron on a workstation chipset mATX board; that whole system without disks cost me roughly 250€ years ago. There is no way I’m going back to non-ECC on anything that runs longer than a few hours at a time.

3 Likes

I guess there is some (a lot?) more cost in testing the whole platform (all possible combinations of CPU, MB, RAM) to guarantee that it functions correctly. Plus added support costs due to more support requests from people not knowing how it works.

1 Like

If you had read what I linked in my post, you would have seen that AMD likely didn’t do any of that with their Ryzen CPUs, and just left support there. Hence the “unofficially supported”.

3 Likes

On a NAS, where the whole point is to reliably store data, ECC has the strongest argument for it. My point was more about SSD failure rates in laptops, as you usually only have 1 drive in them.

2 Likes

Actually the point was not really the safety of data on the SSD but the way it gets there.
Every time you read/use/store any data on your SSD, it moves through your RAM. If that RAM causes undetectable errors and the errors are not in critical code paths, you won’t know the data is now broken. The SSD will store that broken data happily and will do its best to ensure it stays exactly the way it was transferred to the SSD.
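To make that concrete, here is a minimal Python sketch. It only simulates a RAM error by flipping one bit in a buffer before writing it to a temporary file; the helper names (`flip_bit`, `store_and_reload`) and the sample data are made up for illustration and no real memory fault is involved.

```python
import hashlib
import os
import tempfile


def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Flip a single bit in place, standing in for an undetected RAM error."""
    buf[bit_index // 8] ^= 1 << (bit_index % 8)


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def store_and_reload(data: bytes) -> bytes:
    """Write data to a temporary file and read it back, like a round trip to disk."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


original = b"holiday photo bytes " * 1000
good_hash = sha256(original)      # checksum taken before the error happened

in_ram = bytearray(original)
flip_bit(in_ram, 12345)           # simulated single-bit error while in RAM

on_disk = store_and_reload(bytes(in_ram))

# The storage layer did its job perfectly: it returned exactly what it was given.
disk_preserved_corruption = on_disk == bytes(in_ram)
# Only a checksum computed before the flip can tell that something is wrong.
differs_from_original = sha256(on_disk) != good_hash

print("disk returned what RAM handed over:", disk_preserved_corruption)
print("data still matches the original:   ", not differs_from_original)
```

Note that a checksum computed after the flip (for example by the file system at write time) would match the broken data just fine, which is why the error has to be caught in RAM.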

I agree that the impact on a NAS is probably bigger since it often serves multiple users and stores more data. But in principle this applies to anything storing and working with any data.

Actually, thinking about it, laptops also often go a really long time between reboots because they spend a lot of their time in standby with the RAM powered. Errors can accumulate over time this way, and the probability of hitting relevant data grows.

4 Likes

You’re right, didn’t read so far.
TBH I didn’t even know we’re talking about unofficial support rather than official :see_no_evil:

1 Like

There will be some cost incurred by supporting ECC (for OEMs, not AMD). At the very least, the additional traces on the board need to be considered in board design, and a bit of initialization code must be implemented in the BIOS. Some additional settings (on/off, scrubbing) are useful too, but not really required. If the support is unofficial/just enough to get it working, I suspect the cost could be quite low, but that is pure speculation on my part.

1 Like

DDR5 is a marked improvement - it’s got built-in data checking that’ll correct single-bit errors on the memory module for the first time.

1 Like

DDR5 needs the internal on-die error checking to counter the negative effects of higher speeds and smaller structure sizes.
This on-die ECC does NOT allow for detection/logging of errors and was also used in some late-production DDR4 memory to make it usable. DDR5 just made it standard and “marketable”.

DDR5 actually improves ECC on UDIMMs by allowing 8 bits of parity for each sub channel. This means 8 bits of parity per 32 bits of data instead of 8 bits per 64 bits of data as it was for DDR4. This is optional for UDIMMs; both are possible. SO-DIMMs AFAIK only support 4 bits per sub channel.
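For anyone who wants the arithmetic spelled out, here is a small Python sketch. The 64/8 and 32/8 figures are the ones from the post above; treating the extra bits as SECDED (single error correct, double error detect) check bits and using the textbook Hamming bound are my assumptions, and real memory controllers may use different codes.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error-correcting Hamming code)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Hamming check bits plus one overall parity bit for double-error detection."""
    return hamming_check_bits(data_bits) + 1


# (data bits, extra bits) per ECC word, as described in the post
layouts = {
    "DDR4 ECC UDIMM (64-bit channel)": (64, 8),
    "DDR5 ECC UDIMM (32-bit sub channel)": (32, 8),
}

overheads = {}
for name, (data_bits, extra_bits) in layouts.items():
    overheads[name] = extra_bits / data_bits
    print(
        f"{name}: {extra_bits} extra bits per {data_bits} data bits "
        f"({overheads[name]:.1%} overhead), "
        f"textbook SECDED minimum: {secded_check_bits(data_bits)} bits"
    )
```

So under these assumptions, 8 bits per 64 is exactly the SECDED minimum, while 8 bits per 32 leaves one bit more than the minimum and doubles the relative overhead.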

9 Likes

Whoops! It was late, and my tired brain completely missed that.

1 Like

In my years of advising on hardware purchases (~1k users), ECC has not been important to almost any of them. I’m surprised this community values ECC this much.

1 Like

People who understand computers understand how important ECC is. Sure, it may not be an issue for your gaming PC to crash once a week or so, but ECC would prevent that and allow you to know the cause.

6 Likes

ECC can’t prevent Windows or drivers crashing by itself, which is the source of most Windows-related BSODs. ECC also can’t prevent straight-up hardware failure that’s not on RAM.

Some other evidence for ECC not being particularly relevant:

We’ve observed uncorrectable ECC error instances exactly 6 times over the last 5 years in a bit over a million server installations. Every one of these >1 ECC errors was tracked down to faulty hardware and not “sunspots”, “glitches”, or any such folklore. These are multi-bit ECC errors, so obviously single-bit ECC errors would be far more common, but in every instance of multi-bit ECC error, when the hardware was subsequently tested directly, each one resulted in a constant >2bit error (meaning it was always a cascading failure at the time once that memory address block was accessed). Summary of anecdotal evidence supporting very low memory error rate: in >1mil ECC server installs, 6 of 6 multi-bit ECC errors were due to faulty hardware that would be found immediately upon boot up if POST testing was set to “FULL”.

Contrary to what others have (incorrectly) stated here, all single-bit ECC errors are corrected on-the-fly. Any multi-bit ECC errors result in an immediate kernel panic so that the impact is isolated to service availability but never data corruption.
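As an illustration of "single-bit corrected, multi-bit detected", here is a toy SECDED code in Python: an extended Hamming(8,4) code protecting 4 data bits. This is a sketch, not what any particular memory controller implements (real ECC DIMMs protect much wider words), and a code like this is only guaranteed to detect double-bit errors, not every possible multi-bit error. The `encode`/`decode` names are made up for this example.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    code = [0] * 8
    # Data bits live at the non-power-of-two positions.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    # Each parity bit p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        parity = 0
        for pos in range(1, 8):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) % 2  # overall parity over the 7 Hamming bits
    return code


def decode(code: list):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid codeword
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat as a single-bit error and fix it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1
print(decode(one_flip))   # the single flipped bit is repaired

two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))  # flagged instead of silently returning wrong data
```

The "uncorrectable" branch is the case where a server would raise a machine check or panic rather than keep working with bad data.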

ECC errors are so uncommon, to combat potential memory corruption on my non-ECC Optiplex Micro 3050 Proxmox server (running 6x16TB SATA drives in RZ2) I have simply scheduled a nightly reboot and always fully test memory on boot.

5 Likes

Now you’re just being silly for the sake of it. The subject is ECC. Obviously, ECC will detect and correct some failures in your RAM. I don’t use Windows, so I can’t help you with any of that, and none of that has anything to do with ECC RAM.

Also, your wall of random ECC statistics doesn’t agree with my real-world experience. In my profession, we see memory errors quite often among our many servers. In any case, it’s always better to have the option to use it if possible.

11 Likes

Let me quote the most relevant parts:

Which would not even have been detectable without ECC.

Yes, single bit errors happen.

They are not corrected without ECC and do not automatically cause kernel panics without ECC.

Prevention seems to need active steps like rebooting regularly which would not be necessary with ECC.

I do not take your post as a particularly good reason for not having ECC :slight_smile: .
In conclusion, that all amounts to: errors happen and ECC is useful. Whether your data is worth it to you is your decision. I would like hardware vendors to leave that choice to me. No real server vendor will even sell servers without ECC except in the absolute bargain bin tier of hardware.

11 Likes

In my years of driving, a seat belt has never been important. I’m surprised people value seat belts so much.

FWIW I have seen memory errors caught by ECC on my workstation for my job (at the time).

12 Likes

It becomes more relevant the more memory you use. Most applications do not use a lot of memory, and some memory processes, like heap stacks, cause more errors than others. A mostly idle server, a laptop that is only used for browsing the internet, and other devices with low memory use (especially low use of critical memory) likely have no real use for it.

Developers and people in other occupations with high computer utilization see a lot more of this. These same people are also a lot more picky about their computers and a lot more frustrated with designed-to-fail devices, especially on the corporate side. Hence why it is very popular here. Normal computer users would just get a good-value laptop. Not something this expensive.

11 Likes

ECC error logging just told me that the overclock on my desktop PC’s RAM is not stable any more. About 1 corrected error every hour. I configured that overclock about a year ago with extensive (days of memory load) and error-free stability testing. So either my memory has degraded, my power supply is less stable than before, or any of a plethora of other reasons could be the cause.
But this example shows how ECC is a useful addition even for non-professionally used systems. Without ECC, these hourly errors would have slowly corrupted my data, and it probably would need to get a lot worse for me to actually notice something being wrong with the hardware. Instead of knowing there are corrected errors happening, I would see software crash every now and then, maybe my checksumming file system would catch some incorrectly checksummed files, or stuff like that. Everything easily attributable to software bugs, updates or whatever else changed. Hardware is usually the last thing I think about in such cases.
One could make the argument that it is my fault for overclocking the RAM, but think about how many people run XMP/EXPO profiles on non-ECC memory. Typical advice is to do some rounds of memtest++ and call it a day for such configs. My RAM does not even run close to the settings that the same Samsung B-die memory was used for in “gaming” sticks.
I’m happy to be in the know on this potential hardware problem and can now dial the overclock a bit back or give it a bit more voltage and just keep an eye on my error logs.
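In case someone wants to keep an eye on their own counters: below is a minimal Python sketch, assuming a Linux system with an EDAC driver loaded for the memory controller, where the kernel exposes per-controller `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc/`. The function name `read_edac_counts` is made up for this example, and on other operating systems or without EDAC support it will simply find nothing.

```python
from pathlib import Path

# Assumption: Linux EDAC sysfs layout; nothing is found if EDAC is not loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict:
    """Return {controller: {"corrected": n, "uncorrected": m}} for each memory controller."""
    counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts


for name, c in read_edac_counts().items():
    print(f"{name}: {c['corrected']} corrected, {c['uncorrected']} uncorrected")
```

Running something like this from a scheduled job and comparing against the previous values is one simple way to notice a rising corrected-error count before it turns into crashes.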

17 Likes

I’m amused how invested some people are in telling others that they do not need ECC. Is it really worth evangelizing?

16 Likes