[TRACKING] WD_BLACK SN850 sudden death

Looks like it’s my turn now :cry: .

I have an 11th gen Framework 13, received in March 2022. It had an SN850 (1TB) in the internal slot. The machine was fine for a year and a half.

The other day, I closed the lid to go for a walk. I came back 2 hours later, and the machine was off. I turned it on and got the dreaded “Default Boot Device Missing or Boot Failed” prompt. I cannot tell what exactly happened in the meantime, since the machine was configured to enter s2idle, and then hibernate 60 minutes later — maybe it crashed in the process, or maybe it did hibernate. But after powering on, the SSD is gone.

More precisely, neither the BIOS nor a Linux kernel booted from external storage can initialize it (so it doesn’t show up in the boot menu, and the BIOS stops with the error above). Linux can see that there is a device on the bus using lspci (which reports the model, but not the serial), but the nvme driver notes that it gave up on initialization, and the block device never shows up. This happens both in my Framework and if I transplant the SSD into a spare desktop I have on hand. The drive has apparently entered a state it cannot recover from.
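In case anyone wants to check for the same symptoms, this is roughly what I ran (a sketch; device paths and the exact output will vary):

# the controller still enumerates on the PCIe bus
lspci -nn | grep -i 'non-volatile'

# the nvme driver attaches, then gives up during initialization
dmesg | grep -i nvme

# and the block device never appears
ls /dev/nvme*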

Framework BIOS is 3.17. SSD firmware was up to date as of a few months ago (I cannot recall the exact version, but it was updated sometime in 2022, and no newer firmware was released after that). The SSD was about 1% spent according to nvme smart-log, with about 25TBW out of 600. The machine runs Arch Linux, and the only setting I can think of that could be connected to this is /sys/module/pcie_aspm/parameters/policy, which I had set to powersupersave.
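For reference, this is the knob in question (whether it is actually related is pure speculation on my part):

# the active policy is the one in brackets
cat /sys/module/pcie_aspm/parameters/policy
default performance powersave [powersupersave]

# switch back to the default; does not persist across reboots
echo default | sudo tee /sys/module/pcie_aspm/parameters/policy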

I found several threads, both here and on the alien site, mentioning similar symptoms, but I never found a resolution. I don’t really care about the SSD itself, but I very much do care about the data it apparently ate. I got a new SSD, emphatically not from WD, and I managed to cobble together most of my digital life from backups. But unless I did something totally wrong, and unless I manage to wake the drive up and recover the data, I’d like to warn any passers-by looking for advice on SSDs to avoid Western Digital. Given their tendency for sudden and complete loss of data, I’d say their SSDs are entirely unfit for purpose.

So, this is a sample of one out of how many units of this model ever manufactured?


SSDs in general have always had a few inherent risks of total data loss if something fails, and nothing is immune to defects or other errors.

One WD drive dying, without proper postmortem, falls into the “correlation does not imply causation” bucket.


On the one hand I’m — obviously — pretty emotional about this right now, and you are both right. Failures can always randomly happen.

On the other hand, I think that you are both being disingenuous. Problems with this SSD are easily googlable, and well attested, even on this forum:

Of course, the tiny sample of one, plus the few hundred people in those threads, is negligible compared to the total production volume. If every single defective SN850 were accounted for, then dividing by a total volume of, say, millions gives a minuscule failure probability. But I don’t think that every single defective SN850 has its own post on one of these forums, and we certainly cannot ask WD what the true failure rate is. Meanwhile, the noisy web signal is there.

And, indeed, just because it’s easy to find reports of failing SN850s, it doesn’t mean that they are more prone to failure than other SSDs. For instance, it could mean that their market share dwarfs the others’ — which I doubt — or that there is a systemic tendency for owners of, say, Samsung drives not to be able to find their WiFi password and complain — which I’ll give them the benefit of the doubt on. More realistically, on this forum it’s because that SSD is sold by default with Framework and described as “tested with” by them. But the overall incidence of these issues is too high for my own comfort.

Finally, I’m clearly not in a position to say what exactly happened here, but I can make an informed guess. The SSD was happily working for a year and a half, ruling out a freak manufacturing defect. It consistently appears on the bus (of various computers) but refuses the NVMe handshake, ruling out electrical damage. Instant death suggests it’s not NAND wearing out, which happens gradually and which the controller is supposed to at least be able to report. Leaving us with… oh yeah, controller bugs. Which is corroborated by the relatively high number of firmware updates the SN850 has had compared to its peers (except maybe the 990 Pro and its recent read-only scare).

This brings me to the inherent risks of SSDs. Besides flash wear (and maybe electric damage), I think that the only truly inherent risk is the controller malfunctioning. But that’s a sliding scale, and it really depends on what the market is willing to tolerate. I am not willing to tolerate what I currently presume was a data-eating bug, in what is supposed to be a primary storage device.

Hence my initial warning.


Maybe not, but you may be able to find out the MTBF. One of my colleagues, who oversaw a bit barn (collecting data from satellites), worked out that the failure rate of the drives he had matched the manufacturer’s MTBF when he ran the figures through.

My workplace uses a lot of Dell Latitudes, and it’s honestly surprisingly common for them to ship with hardware issues such as a bad motherboard, a bad sound device or speakers, a bad webcam, etc. Moreover, the number of issues that crop up over the course of a year of usage is crazy.

I’ve come to accept that hardware issues are just part of any manufactured item, and I understand the importance of a good warranty (and support for that warranty) on everything, because it means a company values a product that lasts. Unfortunately, components fail in so many ways that failures are hard to attribute to anything in particular, but they do happen, and often it’s not user error.

Please open a support ticket so we can drill down on this a bit further.

@Matt_Hartley I did, in the meantime. But sadly, the SSD was not sourced from Framework (weird rules between my institution and me), so the ticket was closed as a matter of policy. I was hoping that the case could provide your engineers with useful insight.

Appreciate you sharing this with us. I will keep an eye on my own SN850, which has been humming right along thus far. If we spot a pattern, we’ll track it and pass it along for sure.

This happened to me today. Also an 11th gen Framework 13 from March 2022; mine has seen very heavy use, powered on pretty much continuously. The exact model of the SSD is WDS100T1X0E-00AFY0. I placed the SSD in a USB enclosure and attached it to another host, and it doesn’t look so good:

usb 3-3: new high-speed USB device number 7 using xhci_hcd
usb 3-3: New USB device found, idVendor=0bda, idProduct=9210, bcdDevice=20.01
usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 3-3: Product: RTL9210B-CG
usb 3-3: Manufacturer: Realtek
usb 3-3: SerialNumber: 012345678909
usb-storage 3-3:1.0: USB Mass Storage device detected
scsi host2: usb-storage 3-3:1.0
scsi 2:0:0:0: Direct-Access     Realtek  RTL9210          1.00 PQ: 0 ANSI: 6
sd 2:0:0:0: Attached scsi generic sg0 type 0
sd 2:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
sd 2:0:0:0: [sda] Sense Key : Illegal Request [current] 
sd 2:0:0:0: [sda] Add. Sense: Invalid command operation code
sd 2:0:0:0: [sda] 0 512-byte logical blocks: (0 B/0 B)
sd 2:0:0:0: [sda] 0-byte physical blocks
sd 2:0:0:0: [sda] Test WP failed, assume Write Enabled
sd 2:0:0:0: [sda] Asking for cache data failed
sd 2:0:0:0: [sda] Assuming drive cache: write through
sd 2:0:0:0: [sda] Attached SCSI disk
sd 2:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
sd 2:0:0:0: [sda] Sense Key : Illegal Request [current] 
sd 2:0:0:0: [sda] Add. Sense: Invalid command operation code

A bug in WD Dashboard is keeping me from launching it, so until they get back to me, I won’t know if it’s able to see the drive and maybe update its firmware.

You have my sincere sympathy.

Did yours die after going to sleep too? Mine says MDL: WDS100T1X0E-00AFY0, same as yours. And I didn’t mention that I also use my Framework heavily.

I didn’t put mine in an enclosure, but this is what the kernel says when it is in the M.2 slot:

nvme 0000:01:00.0: platform quirk: setting simple suspend
nvme nvme0: pci function 0000:01:00.0
nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0

Yes, I shut my Framework down for about 36 hours, and then it was dead at next startup.

So - take this with a grain of salt, but this post has got me looking into whether I have DEALLOCATE patched through correctly [nope, I forgot to pass my initramfs a kernel flag to do that when it mounted the encrypted root partition – thanks for making me think to double-check!].

My anecdote is that I had not one, but two SSDs fail on me in short order (~6 months from purchase), which I eventually tracked down to their firmwares having A Bad Day because I was using full-disk encryption without TRIM passthrough. As soon as I got TRIM going through correctly, both those drives sprang back to life, and they haven’t given me a single lick of trouble since. That was SATA, this is NVMe, apples, oranges, but it may still be worthwhile to make sure your filesystem supports DEALLOCATE (TRIM/unmap), and that any intervening layers (LVM, cryptsetup) are passing the appropriate command through.
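If you want to run the same check, a quick sketch (assuming util-linux’s lsblk and fstrim; the mount point is a placeholder):

# DISC-GRAN and DISC-MAX must be non-zero at every layer,
# including the dm-crypt mapping and any LVM volumes on top
lsblk --discard

# live test on a mounted filesystem; this fails loudly
# if any layer in between is swallowing the discards
sudo fstrim -v /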

As a second aside, the 1TB SN850X that I got (also not Framework-sourced) defaulted to 512-byte sectors. I had to use the nvme tool (per Switching your NVME ssd to 4k - Bjonnh.net) to manually convert my drive to 4k blocks before I did my OS install. Having the block size be 4k may or may not have any long-term benefits as far as wear goes, but given that flash is frequently garbage-collected in 4k (or larger) chunks, it makes me sleep a little better at night having mine in 4k mode. You can also use the nvme tool to dump SMART data and other stats/logs from the drive if it’s at least enumerating to a /dev/nvmeXnX device – perhaps you can glean more about the specific failure from there [derp… looks like you already did that!].
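For concreteness, the nvme-cli incantations go roughly like this (on WD drives the 4k format is often LBAF index 1, but verify against the id-ns output first; /dev/nvme0n1 is a placeholder):

# list supported LBA formats; the current one is marked "in use"
sudo nvme id-ns -H /dev/nvme0n1 | grep 'LBA Format'

# DESTRUCTIVE: reformat the namespace with the 4k LBA format
sudo nvme format /dev/nvme0n1 --lbaf=1

# SMART data and the error log, while the drive still enumerates
sudo nvme smart-log /dev/nvme0n1
sudo nvme error-log /dev/nvme0n1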

FWIW, it had discard configured end-to-end (you can make it stick in the LUKS2 header, too). And it was using 4k blocks.
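In case anyone wants to do the same, roughly (needs a recent cryptsetup; the mapping name and partition are placeholders):

# persist allow-discards in the LUKS2 header of an open mapping
sudo cryptsetup refresh --allow-discards --persistent cryptroot

# verify: "allow-discards" should now show up under Flags
sudo cryptsetup luksDump /dev/nvme0n1p2 | grep -i flags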

No SMART (or anything) now, but I do have about six months of daily SMART stats logged. Other than an unclean shutdown every 25 power cycles on average, nothing suspicious.
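(If anyone wants the same paper trail, mine is just a daily root cron job, along these lines; the device and log path are whatever fits your setup:)

# /etc/cron.d/nvme-smart: snapshot SMART data once a day
0 3 * * * root /usr/bin/nvme smart-log /dev/nvme0 >> /var/log/nvme-smart.log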

I got the dreaded failure today. Exact same make and model and batch.

1TB WD Black SN850 - WDS100T1X0E-00AFY0

Mine was ordered as part of my DIY components.

Is there a wider ticket open for these issues with Western Digital, for some form of RMA?


@Matt_Hartley I would like to throw my hat in the ring here as well. Same failure. Drive is dead, tested via a USB-to-M.2 adapter.

500GB WD Black SN850 - WDS500G1X0E-00AFY0

Looks like the same batch as the others. I also have the same question as @Adam_Sproul; it feels unacceptable for a drive to die after so little use. I have yet to have any SSD fail this suddenly on me.

Okay folks, now we need to identify common factors so we can see if this is a bad batch or something else. Please use the following template - please do not include tons of other details, we need to keep this spreadsheet-friendly:

How is the SN850 attached: Internally or externally/USB? Do NOT include non-SN850 examples as this is not the same.

Was LUKS in use: Yes/No

Died after sleep or died on reboot/cold boot: Answer here as after sleep or reboot/cold boot.

Was the drive purchased from us? If so, which date(s): Yes/No, date if applicable.

Which Linux distro and kernel used in this instance: Please provide your distro and kernel version here.


How is the SN850 attached: Internally.
Was LUKS in use: Yes.
Died after sleep or died on reboot/cold boot: Probably cold boot [1].
Was the drive purchased from us: No.
Which Linux distro and kernel: Arch, 6.5.9-arch2-1.

[1] — Unclear. The system entered s2idle, but was set up to wake up 60 min later, hibernate, and shut down. I found it powered off. I think that if the SSD died on resume from suspend, the kernel would either continue running, or panic, reboot after 120 seconds, and get stuck in UEFI. I think that only successfully resuming, hibernating, and powering off would leave it in the state I found it in, which is fully off.
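(For the record, the sequence I describe is systemd’s suspend-then-hibernate; the relevant bits of the setup are roughly:)

# /etc/systemd/sleep.conf: hibernate after 60 minutes of s2idle
[Sleep]
HibernateDelaySec=60min

# /etc/systemd/logind.conf: trigger it on lid close
[Login]
HandleLidSwitch=suspend-then-hibernate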

How is the SN850 attached: Internally
Was LUKS in use: Yes
Died after sleep or died on reboot/cold boot: Probably cold boot
Was the drive purchased from us? If so, which date(s): Yes, 2022-03-15
Which Linux distro and kernel used in this instance: Qubes OS 4.1.x, Linux kernel 5.10 series