SSD Failed Twice, Maybe Temp Related

Which release version? 24.04 LTS
Which kernel are you using? 6.17.4-76061704-generic
Which BIOS version are you using? 4.02
Which Framework Laptop 16 model are you using? (AMD Ryzen™ 7040 Series) 7940HS
I bought the laptop in May, and the SSD failed in July. The SSD was the WD Black SN770 sold by Framework. Framework replaced it, and the new one failed again two months later. The failure mode both times was multiple bad sectors in the boot region and a bad superblock, reported by fsck and other tools.
StorageReview rates the MTBF as 1.75 million hours, so either something extremely unlikely occurred, or something in my environment is stressing the disk. My use case is mostly web browsing and email, with occasional software development, but not much I/O-intensive disk work like compiling or building. So temperature was my main suspect.
After the second drive replacement, I started collecting temperature stats with smartd. I was seeing disk temps in the high 70s and low 80s (Celsius) consistently under load. Temps at idle in the 60s. I recalled that the thermal pad attached to the midplate is pretty narrow, so I replaced it with one that covers the full width of the SSD. That brought a slight improvement in the SSD temps, but I’m still seeing load temps around 77 Celsius on the SSD, and idle temps around 50.
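(In case it's useful to anyone else: the smartd side of this is just a one-line config entry. A minimal sketch, assuming the Ubuntu smartmontools package; the thresholds are purely illustrative:)

# /etc/smartd.conf -- track NVMe temperature, report changes of 4 C or more,
# log a warning at 70 C and a critical message at 80 C
/dev/nvme0 -d nvme -a -W 4,70,80

Restart the smartd service after editing and the temperature readings show up in the journal.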
Am I right to be concerned about these temps? Do I need to set more aggressive fan curves? I haven't touched the fan curves so far, but I am monitoring the fans. They don't come on very often, and never run at more than a few hundred RPM.
Thanks, but what exactly does the operating temperature range mean? Could there be longevity implications of running consistently in the upper part of the range? An LLM answer about SSDs in general told me that 70–80 °C indicates poor airflow and that 80 °C and above can induce throttling. I didn't look into the sources for that answer, but I can. Either way, I'm still left wondering why I had two disk failures in two months; I don't want to just assume that was bad luck.
The operating temperature range is the range an SSD can safely run at without significantly increased degradation. WD/SanDisk use a controller that runs hotter than most brands, and the typical operating range for SSDs in general is a bit lower, so I could see an AI model quoting generic numbers; Crucial and Samsung SSDs, for instance, are typically rated for 0–70 °C. Under 85 °C on a WD SSD is within spec and keeps the warranty intact, and they wouldn't publish that number if it significantly increased failures, because the free replacements would be on them.
There is another thread that has been "tracking" these dead Framework-sold drives for almost two years, and I haven't seen any root cause posted by Framework yet. Buy another brand of SSD if you care about your data.
I don’t mind buying another SSD, but I want to make sure there is not an underlying problem with the Framework. If the problems are isolated to the WD drives, that would be good news.
I have not been a huge fan of WD/SanDisk SSDs. On paper they seem great, but I have had some bad experiences with drive quality (i.e. dead on arrival) and premature failures, and it is not a brand I even consider when buying SSDs any more. I hesitate to say they are worse than other brands, only because my sample size is too limited to draw that kind of conclusion, but personally I don't consider them for builds.
I think it is worth trying a different brand, and if that one also fails, I would assume it is the laptop and you can contact Framework about sorting everything out.
The most common M.2 slot failures I have seen show up with different symptoms than yours: either runaway heating from a short, or an unreliable solder connection that makes the disk randomly disconnect.
I wrote a Python app to track my SSD health, and I finished it just in time to see that a catastrophic failure may be imminent: it's racking up hard media errors. Clearly, something is wrong with my Framework 16. Three primary-SSD failures since purchasing it in May, with roughly a two-month MTBF.
I opened a support ticket, and Framework is currently looking at my log. If anyone would like to use the disk monitoring software I wrote, you can find it here. It's a command-line client with a rich UI that discovers all NVMe drives and shows current health and a histogram of temperature readings.
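(The app itself does more, but the core of it is just a thin wrapper around nvme-cli. A minimal sketch of the idea, assuming nvme-cli is installed and the script is run with sudo; the JSON key names are what my version of nvme-cli emits and may vary:)

#!/usr/bin/env python3
# Minimal sketch: poll SMART data for every NVMe controller via nvme-cli.
# Run with sudo; JSON key names may differ between nvme-cli versions.
import glob
import json
import subprocess

def smart_log(dev):
    # 'nvme smart-log -o json' reports the same fields as the text output shown later in this thread.
    out = subprocess.run(
        ["nvme", "smart-log", dev, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

for dev in sorted(glob.glob("/dev/nvme[0-9]")):
    log = smart_log(dev)
    temp_c = log.get("temperature", 273) - 273  # JSON output reports Kelvin
    print(f"{dev}: {temp_c} C, "
          f"media_errors={log.get('media_errors')}, "
          f"unsafe_shutdowns={log.get('unsafe_shutdowns')}, "
          f"critical_warning={log.get('critical_warning')}")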
That is a little toasty.
On the FW16 there are thermal pads near the SSD that help dissipate the heat. On a new FW16 they have a plastic film on them; you need to remove the plastic film for them to work.
Thanks, I did remove the plastic film after the first failure. After the second failure, I replaced the pad with a wider one (I noticed it only covered part of the width of the drive). Temperatures have been unproblematic since the first failure, generally below 60 Celsius. Still, there are numerous unsafe shutdowns and media errors in the SMART log, even though I have not done any hard shutdowns and there have not been any crashes or hangs.
I'll see what Framework says, but it seems to me there is likely an electrical problem. It's noteworthy that it only happens in the primary slot; the secondary disk has never had any problems.
These seem to get generated by the OS telling the drive to go into power-save mode when nothing is happening. There have been a number of threads about this and the way 'unsafe shutdown' counts grow exponentially in the logs.
That makes sense, because I have unsafe shutdown counts on the good disk as well. The real issue is the media_errors count, which appeared suddenly and kept increasing rapidly until I booted from the secondary and mounted the primary disk read-only.
The system is popping up a warning daily that a catastrophic SSD failure may be imminent. The kernel log also shows critical hardware errors with the drive. Still waiting to hear back from Framework after submitting the logs they requested.
Yeah, that's the command I'm running every 5 minutes as a service (see the sketch after the output below for one way to set that up) and then monitoring with a client app. It's what alerted me to the problem. Here is the current output. Media errors were at 85 when I first noticed them; by the time I rebooted to the secondary, they had climbed to 112. The disk has been mounted read-only for several days now, so the error count is stable.
sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0x4
temperature : 39 °C (312 K)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
endurance group critical warning summary: 0x4
Data Units Read : 5487928 (2.81 TB)
Data Units Written : 8771197 (4.49 TB)
host_read_commands : 251365795
host_write_commands : 55578575
controller_busy_time : 262
power_cycles : 50
power_on_hours : 160
unsafe_shutdowns : 12
media_errors : 112
num_err_log_entries : 112
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 52 °C (325 K)
Temperature Sensor 2 : 39 °C (312 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
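(For anyone wanting to replicate the every-5-minutes polling, one way to do it is a systemd timer roughly like the following; the unit names and log path are just illustrative:)

# /etc/systemd/system/nvme-smart.service
[Unit]
Description=Snapshot NVMe SMART data

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'nvme smart-log /dev/nvme0 -o json >> /var/log/nvme-smart.jsonl'

# /etc/systemd/system/nvme-smart.timer
[Unit]
Description=Snapshot NVMe SMART data every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target

Enable it with sudo systemctl enable --now nvme-smart.timer, and the samples accumulate in the log file for whatever client you point at it.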