SSD Failed Twice, Maybe Temp Related
Which Linux distro are you using? Pop!_OS
Which release version? 24.04 LTS
(if rolling release without a release version, skip this question)
(If rolling release, last date updated?)
Which kernel are you using? 6.17.4-76061704-generic
Which BIOS version are you using? 4.02
Which Framework Laptop 16 model are you using? (AMD Ryzen™ 7040 Series) 7940HS
I bought the laptop in May, and the SSD failed in July. The SSD was the WD Black SN770 sold by Framework. Framework replaced it, and the new one failed again two months later. Failure mode both times was multiple bad sectors in the boot sector and a bad superblock detected by fsck and other tools.
StorageReview rates the MTBF as 1.75 million hours. So either something extremely unlikely occurred, or something in my environment is stressing the disk. My use case is mostly web browsing and email, with occasional software development, but not a lot of disk I/O intensive work or compiling/building, etc. So temperature was my main suspicion.
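To put a rough number on "extremely unlikely", here is a back-of-the-envelope sketch. It assumes a constant failure rate (exponential model), that each drive was powered on around the clock for two months (an upper bound on actual use), and that the two failures were independent. None of those assumptions come from WD; the names in the code are just for illustration. MTBF also says little about early-life failures, so treat the result as an order-of-magnitude estimate only.

```python
import math

MTBF_HOURS = 1.75e6      # MTBF figure quoted above
HOURS_PER_MONTH = 730    # approximation: 8760 hours per year / 12

def failure_probability(hours, mtbf_hours=MTBF_HOURS):
    """P(failure within `hours`) under a constant-failure-rate (exponential) model."""
    return 1.0 - math.exp(-hours / mtbf_hours)

# Assumption: the drive is powered on 24/7 for the whole two months.
p_one = failure_probability(2 * HOURS_PER_MONTH)

# Assumption: the two drives fail independently of each other.
p_two = p_one ** 2

print(f"one drive failing within two months: {p_one:.2e}")
print(f"two drives in a row:                 {p_two:.2e}")
```

Under these assumptions a single failure within two months comes out at roughly 0.08%, and two in a row at less than one in a million, which is why a shared cause looks more plausible than plain bad luck.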
After the second drive replacement, I started collecting temperature stats with smartd. I was seeing disk temps in the high 70s and low 80s (Celsius) consistently under load. Temps at idle in the 60s. I recalled that the thermal pad attached to the midplate is pretty narrow, so I replaced it with one that covers the full width of the SSD. That brought a slight improvement in the SSD temps, but I’m still seeing load temps around 77 Celsius on the SSD, and idle temps around 50.
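For anyone who wants to cross-check the smartd readings, below is a minimal sketch that reads the NVMe temperature straight from the kernel's hwmon interface. It assumes the kernel exposes the drive as a hwmon device named "nvme" and that `temp1_input` holds the composite temperature in millidegrees Celsius. The function name and the 85 C constant (the operating limit quoted in this thread) are illustrative, not part of any tool.

```python
from pathlib import Path

MAX_OPERATING_C = 85.0  # upper operating limit quoted in this thread for the SN770

def read_nvme_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_dir_name: temp_celsius} for every hwmon device named 'nvme'."""
    temps = {}
    root = Path(hwmon_root)
    if not root.is_dir():
        return temps
    for dev in sorted(root.iterdir()):
        name_file = dev / "name"
        temp_file = dev / "temp1_input"
        if not (name_file.is_file() and temp_file.is_file()):
            continue
        if name_file.read_text().strip() != "nvme":
            continue
        # hwmon reports temperatures in millidegrees Celsius
        temps[dev.name] = int(temp_file.read_text().strip()) / 1000.0
    return temps

if __name__ == "__main__":
    for dev, temp_c in read_nvme_temps().items():
        headroom = MAX_OPERATING_C - temp_c
        print(f"{dev}: {temp_c:.1f} C ({headroom:.1f} C below the "
              f"{MAX_OPERATING_C:.0f} C limit)")
```

Run in a loop or from a timer, this gives a second data source to compare against the smartd logs.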
Am I right to be concerned about these temps? Do I need to create more aggressive fan curves? I haven’t done anything with fan curves so far, but I’m monitoring fan performance. The fan doesn’t come on very often, and never runs at more than a few hundred RPM.
Those temps are within the SN770's rated operating temperature range of 0°C to 85°C (source: 250GB WD_BLACK SN770 NVMe™ SSD | Sandisk).
Thanks, but what exactly does operating temp range mean? Could there be longevity implications of operating in the upper part of the range consistently? I received an LLM answer for SSDs in general that said 70-80 indicates poor airflow, and 80+ could induce throttling. I didn’t look into the sources for that answer, but I can. Nevertheless, I’m still left wondering why I had two disk failures in two months. I don’t want to just assume that was bad luck.
Operating temperature is the range in which an SSD can safely run without significantly increased degradation. WD/Sandisk controllers run hotter than most brands. For SSDs in general the operating range is a bit lower (Crucial and Samsung SSDs, for instance, are typically rated 0-70°C), so I could see an AI model using those general figures. Under 85°C on a WD SSD is within spec and keeps the warranty intact. They wouldn’t publish that number if it significantly increased failures, because they would have to replace those drives for free.
There is another thread that has been “tracking” these dead Framework-sold drives for almost 2 years. I haven’t seen any root cause posted by Framework yet. Buy another brand of SSD if you care about your data.
I don’t mind buying another SSD, but I want to make sure there is not an underlying problem with the Framework. If the problems are isolated to the WD drives, that would be good news.
I have not been a huge fan of WD/Sandisk SSDs. On paper they seem great, but I have had some bad experiences with drive quality (i.e. dead on arrival) and premature failures, and it is not a brand I even consider when buying SSDs any more. I hesitate to say they are worse than other brands, only because my sample size is too limited to draw that kind of conclusion, but personally I don’t consider them for builds.
I think it is worth trying a different brand. If that still fails, I would assume it is the laptop, and you can contact Framework about sorting everything out.
The most common M.2 slot failures I have seen would show different symptoms from yours: either runaway heating from a short, or an unreliable solder connection that makes the disk randomly disconnect.