[TRACKING] WD_BLACK SN850 sudden death

Alan_Pearce · November 19, 2023, 6:04pm

Maybe not, but you may be able to find out the MTBF. I know one of my colleagues who oversaw a bit barn (collecting data from satellites) worked out that the failure rate of drives he had matched the manufacturer MTBF when he worked the figures through.
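
For anyone who wants to run the same kind of check, here is a rough sketch of the arithmetic. It assumes a constant failure rate (the usual exponential model behind an MTBF figure). The MTBF value and fleet size below are made-up placeholders, not SN850 specifications.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that one drive fails within a year of continuous use,
    assuming a constant failure rate of 1 / MTBF."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures(drive_count: int, mtbf_hours: float, years: float = 1.0) -> float:
    """Expected number of drives (out of drive_count) that fail within `years`,
    with no replacement of failed units."""
    return drive_count * (1.0 - math.exp(-HOURS_PER_YEAR * years / mtbf_hours))


if __name__ == "__main__":
    mtbf = 1_000_000  # hypothetical MTBF in hours, for illustration only
    print(f"AFR: {annualized_failure_rate(mtbf):.3%}")
    print(f"Expected failures (500 drives, 3 years): {expected_failures(500, mtbf, 3):.1f}")
```

Comparing the expected count against the failures actually observed in a fleet is the kind of working-through described above; a handful of forum reports is far too small a sample to do the same here.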

resonantenergy · November 20, 2023, 5:26am

My workplace uses a lot of Dell Latitudes, and it’s honestly surprisingly common for them to ship with hardware issues such as a bad motherboard, a bad sound device or speakers, a bad webcam, etc. Moreover, the number of issues that crop up over the course of one year of usage is crazy.

I’ve come to accept that hardware issues are just part of any manufactured item, and I understand the importance of a good warranty (and support for that warranty) on everything, because it means a company values a product that lasts. Unfortunately, components fail in so many ways that a failure is hard to attribute to anything in particular, but it happens, and often it’s not user error.

Matt_Hartley · November 20, 2023, 8:22pm

dk505:

Looks like it’s my turn now.

I have an 11th gen Framework 13, received in March 2022. It had an SN850 (1TB) in the internal slot. The machine was fine for a year and a half.

The other day, I closed the lid to go for a walk. Came back 2hrs later, machine is off. Turn it on, and I get the dreaded “Default Boot Device Missing or Boot Failed” prompt. I cannot tell what exactly happened in the meantime, since the machine was configured to enter s2idle, and then hibernate 60min later — maybe it crashed in the process, or maybe it did hibernate. But after powering on, the SSD is gone.

More precisely, neither the BIOS can initialize it (so it doesn’t show up in the boot menu and it stops with the error above), nor can a Linux kernel booted from external storage. Linux can see that there is a device on the bus using lspci (and reports the model, but not the serial), but the nvme driver notes that it gave up on initialization, and the block device never shows up. This happens both in my Framework and when I transplant the SSD into a spare desktop I have on hand. The drive has apparently entered a state it cannot recover from.

Framework BIOS is 3.17. SSD firmware was up-to-date as of a few months ago (I cannot recall the exact version, but it was updated sometime in 2022, and there was never a newer firmware after that). The SSD was about 1% spent according to nvme smart-log, and had about 25TBW (out of 600). The machine is running Arch Linux, and the only setting that I can think of that could be connected to this is /sys/module/pcie_aspm/parameters/policy, which I used to set to powersupersave.
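
For readers who want to check which ASPM policy their own machine is using, a minimal sketch follows. It relies on the kernel listing all policies in that sysfs file with the active one in square brackets; the function name is just illustrative. Nothing in this thread establishes that the ASPM policy is related to the failure.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def parse_aspm_policy(text: str) -> Tuple[Optional[str], List[str]]:
    """Split the sysfs policy string into (active_policy, all_policies).

    The active policy is the token wrapped in square brackets, e.g.
    'default performance powersave [powersupersave]'.
    """
    active = None
    policies = []
    for token in text.split():
        match = re.fullmatch(r"\[(.+)\]", token)
        if match:
            active = match.group(1)
            policies.append(active)
        else:
            policies.append(token)
    return active, policies


if __name__ == "__main__":
    if ASPM_POLICY_PATH.exists():
        active, policies = parse_aspm_policy(ASPM_POLICY_PATH.read_text())
        print(f"active: {active}; available: {', '.join(policies)}")
    else:
        print("pcie_aspm policy file not found on this system")
```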

I found several threads, both here and on the alien site, mentioning similar symptoms, but I never found a resolution. I don’t really care about the SSD itself, but I very much do care about the data it apparently ate. I got a new SSD, emphatically not from WD, and I managed to cobble together most of my digital life from backups. But unless I did something totally wrong, and unless I manage to wake the drive up and recover the data, I’d like to warn any passers-by looking for advice on SSDs to avoid Western Digital. Given their tendency for sudden and complete loss of data, I’d say their SSDs are entirely unfit for purpose.

Please open a support ticket so we can drill down on this a bit further.

dk505 · November 21, 2023, 2:28am

@Matt_Hartley I did, in the meantime. But sadly, the SSD was not sourced from Framework (weird rules between my institution and me), so the ticket was closed as a matter of policy. I was hoping that the case could provide your engineers with useful insight.

Matt_Hartley · November 21, 2023, 7:27pm

Appreciate you sharing this with us, I will keep an eye out on my own SN850 which has been humming right along thus far. If we spot a pattern, we’ll track it and pass it along for sure.

jrenken · November 28, 2023, 1:25am

This happened to me today. Also an 11th gen Framework 13 from March 2022; mine has seen very heavy use, powered on pretty much continuously. The exact model of the SSD is WDS100T1X0E-00AFY0. I placed the SSD in a USB enclosure and attached it to another host, and it doesn’t look so good:

usb 3-3: new high-speed USB device number 7 using xhci_hcd
usb 3-3: New USB device found, idVendor=0bda, idProduct=9210, bcdDevice=20.01
usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 3-3: Product: RTL9210B-CG
usb 3-3: Manufacturer: Realtek
usb 3-3: SerialNumber: 012345678909
usb-storage 3-3:1.0: USB Mass Storage device detected
scsi host2: usb-storage 3-3:1.0
scsi 2:0:0:0: Direct-Access     Realtek  RTL9210          1.00 PQ: 0 ANSI: 6
sd 2:0:0:0: Attached scsi generic sg0 type 0
sd 2:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
sd 2:0:0:0: [sda] Sense Key : Illegal Request [current] 
sd 2:0:0:0: [sda] Add. Sense: Invalid command operation code
sd 2:0:0:0: [sda] 0 512-byte logical blocks: (0 B/0 B)
sd 2:0:0:0: [sda] 0-byte physical blocks
sd 2:0:0:0: [sda] Test WP failed, assume Write Enabled
sd 2:0:0:0: [sda] Asking for cache data failed
sd 2:0:0:0: [sda] Assuming drive cache: write through
sd 2:0:0:0: [sda] Attached SCSI disk
sd 2:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
sd 2:0:0:0: [sda] Sense Key : Illegal Request [current] 
sd 2:0:0:0: [sda] Add. Sense: Invalid command operation code

A bug in WD Dashboard is keeping me from launching it, so until they get back to me, I won’t know if it’s able to see the drive and maybe update its firmware.

dk505 · November 28, 2023, 2:18am

You have my sincere sympathy.

Did yours die after going to sleep too? Mine says MDL: WDS100T1X0E-00AFY0, same as yours. And I didn’t mention that I also use my Framework heavily.

I didn’t put mine in an enclosure, but this is what the kernel says when it is in the M.2 slot:

nvme 0000:01:00.0: platform quirk: setting simple suspend
nvme nvme0: pci function 0000:01:00.0
nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
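
To help read the CSTS value in that last line, here is a small sketch that decodes the low bits of the NVMe Controller Status register as laid out in the NVMe base specification. The function name is illustrative, and this only interprets the number; it does not talk to the drive.

```python
def decode_csts(csts: int) -> dict:
    """Decode the low bits of the NVMe Controller Status (CSTS) register."""
    return {
        "RDY": bool(csts & 0x01),    # bit 0: controller is ready
        "CFS": bool(csts & 0x02),    # bit 1: controller fatal status
        "SHST": (csts >> 2) & 0x3,   # bits 3:2: shutdown status (0 = normal operation)
        "NSSRO": bool(csts & 0x10),  # bit 4: NVM subsystem reset occurred
        "PP": bool(csts & 0x20),     # bit 5: processing paused
    }


if __name__ == "__main__":
    # Value taken from the kernel message quoted above.
    for field, value in decode_csts(0x0).items():
        print(f"{field}: {value}")
```

With CSTS=0x0, RDY is clear and CFS is clear as well: the controller never reported ready, but it did not flag a fatal error either. The register read as something other than all ones, which suggests the device still answers on the PCIe bus, consistent with it showing up in lspci.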

jrenken · November 29, 2023, 12:05am

Yes, I shut my Framework down for about 36 hours, and then it was dead at next startup.

Travis_Snoozy · December 5, 2023, 4:15am

So - take this with a grain of salt, but this post has got me looking into whether I have got DEALLOCATE patched through correctly [nope, I forgot to pass my initramfs a kernel flag to do that when it mounted the encrypted root partition – thanks for making me think to double check!]. My anecdote is that I had not one, but two SSDs fail on me in short order (~6 months from purchase), which I eventually tracked down to their firmwares having A Bad Day because I was using full-disk encryption without TRIM passthrough. As soon as I got TRIM going through correctly, both those drives sprang back to life, and haven’t given me a single lick of trouble since. That was SATA, this is NVMe, apples, oranges, but it may still be worthwhile to make sure your filesystem supports DEALLOCATE (/TRIM/unmap), and that if you’ve got any intervening layers (LVM, cryptsetup), that they’re also passing the appropriate command through.
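
One way to check whether discard reaches each layer is to look at `queue/discard_max_bytes` in sysfs for every block device, including the device-mapper ones; a value of 0 means that layer does not accept discards (this is what `lsblk --discard` reports as DISC-MAX). A minimal sketch, with illustrative function names:

```python
from pathlib import Path
from typing import Dict

SYSFS_BLOCK = Path("/sys/block")


def discard_report(sysfs_root: Path = SYSFS_BLOCK) -> Dict[str, bool]:
    """Map each block device name (e.g. 'nvme0n1', 'dm-0') to whether it
    advertises discard support (discard_max_bytes > 0)."""
    report = {}
    for dev in sorted(sysfs_root.iterdir()):
        attr = dev / "queue" / "discard_max_bytes"
        if attr.is_file():
            report[dev.name] = int(attr.read_text().strip()) > 0
    return report


if __name__ == "__main__":
    for name, ok in discard_report().items():
        print(f"{name}: {'discard supported' if ok else 'NO discard'}")
```

If the NVMe device itself shows support but a dm-crypt or LVM device stacked on it does not, the command is probably being dropped at that layer.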

As a second aside, the 1TB SN850X that I got (also not Framework sourced) defaulted to 512-byte sectors. I had to use the nvme tool (per “Switching your NVME ssd to 4k - Bjonnh.net”) to manually convert my drive to 4k blocks before I did my OS install. Having the block size be 4k may or may not have any long-term benefits as far as wear goes, but given that flash is frequently garbage collected in 4k (or larger) chunks, it makes me sleep a little bit better at night having mine in 4k mode. You can also use the nvme tool to dump SMART data and other stats/logs from the drive if it’s at least enumerating to a /dev/nvmeXnX device – perhaps you can glean more about the specific failure from there [derp… looks like you already did that!].
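
To see which sector size a drive is currently presenting, the kernel exposes it in sysfs. A minimal read-only sketch follows; the device name `nvme0n1` is an assumption, so substitute your own. Actually switching sizes means reformatting the namespace with the nvme tool, which erases it.

```python
from pathlib import Path
from typing import Dict

SYSFS_BLOCK = Path("/sys/block")


def block_sizes(device: str, sysfs_root: Path = SYSFS_BLOCK) -> Dict[str, int]:
    """Return the logical and physical block sizes (in bytes) the kernel
    reports for a block device such as 'nvme0n1'."""
    queue = sysfs_root / device / "queue"
    return {
        "logical": int((queue / "logical_block_size").read_text().strip()),
        "physical": int((queue / "physical_block_size").read_text().strip()),
    }


if __name__ == "__main__":
    device = "nvme0n1"  # assumed device name
    if (SYSFS_BLOCK / device).exists():
        sizes = block_sizes(device)
        print(f"{device}: logical {sizes['logical']} B, physical {sizes['physical']} B")
    else:
        print(f"{device} not found")
```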

dk505 · December 5, 2023, 8:18am

FWIW it had discard configured end-to-end (you can make it stick in the LUKS v2 header, too). And it was using 4k blocks.

No SMART (or anything) now, but I do have about 6mo of daily smart stats logged. Other than an unclean shutdown every 25 power cycles on average, nothing suspicious.
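
For anyone who wants to keep a similar daily log while their drive is still alive, here is a rough sketch of summarizing one SMART snapshot. It assumes the JSON produced by `nvme smart-log -o json` uses the field names `data_units_written`, `power_cycles`, `unsafe_shutdowns`, and `percent_used`; check your own output, since names can differ between nvme-cli versions. The sample numbers are invented to mirror the rough figures mentioned in this thread.

```python
import json

# NVMe reports data units in thousands of 512-byte blocks.
BYTES_PER_DATA_UNIT = 512 * 1000


def summarize_smart(log: dict) -> dict:
    """Reduce a smart-log snapshot to a few wear and shutdown indicators."""
    unsafe = log["unsafe_shutdowns"]
    return {
        "written_tb": log["data_units_written"] * BYTES_PER_DATA_UNIT / 1e12,
        "percent_used": log["percent_used"],
        # Average number of power cycles per unclean shutdown (None if there were none).
        "cycles_per_unsafe_shutdown": log["power_cycles"] / unsafe if unsafe else None,
    }


if __name__ == "__main__":
    # Invented sample; in practice, load the JSON text captured from the nvme tool.
    sample = json.loads(
        '{"data_units_written": 48828125, "percent_used": 1,'
        ' "power_cycles": 500, "unsafe_shutdowns": 20}'
    )
    print(summarize_smart(sample))
```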

Adam_Sproul · December 20, 2023, 10:51pm

I got the dreaded failure today. Exact same make and model and batch.

1TB WD Black SN850 - WDS100T1X0E-00AFY0

Mine was ordered as part of my DIY components.

Is there a wider ticket open with Western Digital for these issues, for some form of RMA?

Steve1 · February 9, 2024, 10:22pm

@Matt_Hartley I would like to throw my hat in the ring here as well. Same failure. Drive is dead, tested via usb to m.2 adapter.

500GB WD Black SN850 - WDS500G1X0E-00AFY0

Looks like the same batch as others. I also have the same question as @Adam_Sproul; it feels unacceptable for a drive to die after so little use. I have yet to have any SSD fail this suddenly on me.

Matt_Hartley · February 13, 2024, 6:21pm

Okay folks, now we need to identify common factors so we can see if this is a bad batch or something else. Please use the following template - please do not include tons of other details, we need to keep this spreadsheet friendly:

How is the SN850 attached: Internally or externally/USB? Do NOT include non-SN850 examples as this is not the same.

Was LUKS in use: Yes/No

Died after sleep or died on reboot/cold boot: Answer here as after sleep or reboot/cold boot.

Was the drive purchased from us? If so, which date(s): Yes/No, date if applicable.

Which Linux distro and kernel used in this instance: Please provide your distro and kernel version here.
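
As an illustration of why sticking to the template helps, here is a minimal sketch of turning a templated reply into a spreadsheet row. The column names and the prefix matching are assumptions made for this example, not a tool Framework is known to use, and the sample reply is hypothetical.

```python
import csv
import io

# Hypothetical column names, keyed by the start of each template label (lowercased).
LABEL_PREFIXES = {
    "how is the sn850 attached": "attached",
    "was luks in use": "luks",
    "died after": "died_after",
    "was the drive purchased from us": "purchased_from_framework",
    "which linux distro and kernel": "distro_kernel",
}
COLUMNS = list(LABEL_PREFIXES.values())


def parse_reply(text: str) -> dict:
    """Extract template answers from a reply; unanswered fields stay empty."""
    row = dict.fromkeys(COLUMNS, "")
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue
        for prefix, column in LABEL_PREFIXES.items():
            if label.strip().lower().startswith(prefix):
                row[column] = value.strip()
                break
    return row


def to_csv(rows: list) -> str:
    """Render parsed replies as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    reply = (
        "How is the SN850 attached: Internally\n"
        "Was LUKS in use: Yes\n"
        "Died after sleep or died on reboot/cold boot: after sleep\n"
        "Was the drive purchased from us? If so, which date(s): No\n"
        "Which Linux distro and kernel used in this instance: ExampleOS, 6.x\n"
    )
    print(to_csv([parse_reply(reply)]))
```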

dk505 · February 13, 2024, 6:56pm

How is the SN850 attached: Internally.
Was LUKS in use: Yes.
Died after sleep: Probably after cold boot [1].
Was the drive purchased from us: No.
Which Linux distro and kernel: Arch, 6.5.9.arch2-1.

[1] — Unclear. The system entered s2idle, but was set up to wake up 60 min later, hibernate, and shut down. I found it powered off. I think that if the SSD died on resume from suspend, the kernel would either continue running, or panic, reboot after 120 seconds, and get stuck in UEFI. I think that only successfully resuming, hibernating, and powering off would leave it in the state I found it in, which is fully off.

jrenken · February 13, 2024, 10:09pm

How is the SN850 attached: Internally
Was LUKS in use: Yes
Died after sleep or died on reboot/cold boot: Probably cold boot
Was the drive purchased from us? If so, which date(s): Yes, 2022-03-15
Which Linux distro and kernel used in this instance: Qubes OS 4.1.x, Linux kernel 5.10 series

Fabien_DESPREZ · May 21, 2024, 12:16pm

How is the SN850 attached: Internally
Was LUKS in use: No
Died after sleep or died on reboot/cold boot: After reboot
Was the drive purchased from us? If so, which date(s): No, 2022-04-08
Which Linux distro and kernel used in this instance: Ubuntu 22.04.4, kernel ??

George_Coss · June 1, 2024, 4:36am

I think I saw the exact same failure today. Same drive and fw gen. Running Ubuntu, and the drive died after a restart. Fwiw, I’ve been trying to update the FW BIOS over the last couple of days. I submitted a support request.

Michael_Roach · June 3, 2024, 10:12am

I just found this thread as I opened up my laptop to start work today and the drive was completely dead. I booted off a live USB and the drive doesn’t even show up in lsblk or nvme list. Nothing in dmesg. I put it in an M.2 USB-C enclosure and hooked it up to another machine and I don’t even see anything in dmesg, so this seems completely dead.

Attached: Internal M.2
LUKS: Is this the default full disk encryption with Fedora? If so, yes.
Died: After waking from sleep
Purchased from Framework in October 2021 as part of my DIY kit
Linux: Fedora 39, whatever the most recent kernel was for that. It was up to date.

WDS500G1XHE-00AFY0
DOM: 3-May-2021

Chris_Weaver · July 7, 2024, 9:13am

Had this happen to me yesterday as well. All very similar, except that it seems to have taken the storage controller on the mainboard with it (12th gen Intel i7-1280P). I can no longer boot to or properly see any internally attached M.2 drive. When booting from USB, everything seems to work fine. Has anyone else experienced this and managed to resolve it?

Support ticket is in. Loved the laptop otherwise so hopefully can get this resolved in a meaningful way.

How is the SN850 attached: Internally
Was LUKS in use: Yes
Died after sleep or died on reboot/cold boot: After sleep
Was the drive purchased from us? If so, which date(s): Yes, order date was 28 Jan 2022, 500GB WD Black SN850 - WDS500G1X0E-00AFY0
Which Linux distro and kernel used in this instance: Pop_os! 6.9.3-76060903-generic

Alan_Pearce · July 7, 2024, 6:06pm

I suspect you will find the m.2 module is OK, seeing you can’t use any m.2 module. I’d wait until FW service resolve the problem before trashing your m.2 module.