NVMe SSD Monitoring Software

The SSD on my Framework Laptop 16 failed three times in six months. Framework replaced the disk each time, and after the third failure they also replaced the mainboard, which was the apparent cause of the failures. When I realized I had a recurring SSD issue, I developed an app for tracking NVMe SSD health. It’s a Python app for Linux, and it’s available here for anyone who’d like to use it.

It consists of two components:

  1. A systemd service that monitors SSD health using the nvme-cli Linux package and periodically writes the data to a log file.
  2. A command-line client that displays a summary of SMART health data and a histogram of disk temperatures. The client can also run headless, installed as a systemd service, monitoring the log files in the background and sending configurable email alerts.

There are several moving parts, but I think it will be straightforward to install. It discovers NVMe drives automatically, lets you sort and scope the temperature histogram, and lets you tab between separate display pages for each drive. There are configuration parameters for automatic log pruning and archiving, and you can de-bounce email alerts, receive periodic “healthy” notifications, and get a notification if the collector service stalls or fails.
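For anyone wondering what de-bouncing alerts means in practice, here is a minimal sketch of the idea. It is not the code from the app: the class name `AlertDebouncer`, the one-hour default, and the alert keys are all made up for illustration. The point is simply that a repeated alert for the same condition is suppressed until a quiet interval has passed.

```python
import time


class AlertDebouncer:
    """Suppress repeat alerts for the same key within a minimum interval.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, min_interval_s=3600.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock          # injectable so the logic can be tested
        self._last_sent = {}        # alert key -> time the last alert went out

    def should_send(self, key):
        """Return True if an alert for `key` may be sent now."""
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval_s:
            return False            # still inside the quiet period
        self._last_sent[key] = now
        return True
```

A monitoring loop would call `should_send("nvme0:media_errors")` before emailing, so a drive that keeps reporting the same problem produces one message per interval rather than one per poll.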

I had fun writing it, and hopefully some of you will find it useful. Here’s a screenshot of the command-line client. As the media_errors count shows, this was the beginning of the third (and hopefully final) disk failure. The health-score is a 0-100 roll-up score based on a simple algorithm of weighted SMART data values.
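To give a feel for what a weighted roll-up score can look like, here is a rough sketch. The weights, caps, and the function name `health_score` are invented for this example and are not the app's actual algorithm; the dictionary keys are assumed to match the field names in nvme-cli's JSON output, which may differ between versions.

```python
def health_score(smart):
    """Roll a few SMART values into a 0-100 score (100 = healthy).

    Illustrative weights only; every number here is an assumption.
    """
    penalty = 0.0
    # Wear: percent_used can exceed 100 on a worn drive, so cap it.
    penalty += 0.4 * min(smart.get("percent_used", 0), 100)
    # Spare blocks: lose points as available spare drops below 100%.
    penalty += 0.3 * (100 - smart.get("avail_spare", 100))
    # Media errors: one point each, capped at 30.
    penalty += min(smart.get("media_errors", 0), 30)
    # Any critical warning flag from the controller is a large deduction.
    if smart.get("critical_warning", 0):
        penalty += 30
    return max(0, round(100 - penalty))
```

With these made-up weights, a lightly worn drive that suddenly reports dozens of media errors drops well below 100 even though its wear and spare values still look fine.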

Hi,

I came to the community board today with a question that relates to this, so instead of creating a separate topic, I’ll leave my question here, whether for you or for someone else:

Note that I am not yet a Linux user, but am considering it.

Regarding the tendency of SSDs to just die with little or no warning (from what I’ve heard): is there software, sold by Framework or some other computer company, that is easy to install and use and that lets a person know pretty clearly when the SSD is about to die? By “easy” I mean easy for someone who is very non-technical and will never do anything with the terminal/command line. For background, I tried to download and run some sort of SMART app on my Apple MacBook Air the other day (I am just getting to know the Apple system after decades of using Windows), and I wasn’t really able to run it in a way that gave me much useful-seeming information about my drive.

@JLAZ

Although SSDs have SMART monitoring tools, those tools do not actually warn you about most of an SSD’s failure modes.
The only really sensible option is to back up your data so it does not matter when the SSD fails.
The one thing SMART tools can reliably tell you about an SSD is how close it is to its write-endurance limit, because each cell can only be written and erased a limited number of times.
It is not like the old HDDs, where the media would often degrade gradually, so SMART could warn you when things were about to fail.

Failure modes are not predictable, but I can tell you that the software I wrote enabled me to catch the third SSD failure before I had any noticeable data loss. I wrote the software because I knew the disk was going to fail again, and I wanted to collect the history of SMART data to see what was going on and demonstrate it to Framework. I got an alert that there were 85 media errors, which I knew from previous failures were likely physical failures of the SSD storage, i.e. not software-caused or software-recoverable. I immediately rebooted to my secondary disk, mounted the primary disk read-only, and created a disk image from it that I was able to restore to the new SSD when it arrived.

I say “noticeable” data loss because those 85 media errors represent real loss of whatever was stored in those locations. Either there was nothing stored there or what was stored there was not significant.

The software I wrote just provides continual collection of SMART data and various capabilities for viewing it and receiving alerts. The underlying Linux package is nvme-cli, and you can invoke it any time to get a real-time indicator of disk health. I imagine there are similar programs for Windows.
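As a rough illustration of invoking nvme-cli from Python (this is a sketch, not the code from my app), the snippet below runs `nvme smart-log` with JSON output and picks out a few fields. The device path `/dev/nvme0`, the helper names, and the exact JSON field names are assumptions; field names can vary between nvme-cli versions, and the command usually needs root.

```python
import json
import subprocess


def parse_smart_log(raw_json):
    """Pick a few health fields out of nvme-cli's JSON smart-log output.

    Field names are assumed; missing fields come back as None.
    """
    data = json.loads(raw_json)
    kelvin = data.get("temperature")  # assumed to be reported in Kelvin
    return {
        "temperature_c": None if kelvin is None else kelvin - 273,
        "percent_used": data.get("percent_used"),
        "available_spare": data.get("avail_spare"),
        "media_errors": data.get("media_errors"),
    }


def read_smart_log(device="/dev/nvme0"):
    """Run nvme-cli against one device and return the parsed fields."""
    result = subprocess.run(
        ["nvme", "smart-log", device, "--output-format=json"],
        capture_output=True, text=True, check=True,
    )
    return parse_smart_log(result.stdout)
```

From a shell, the equivalent one-off check is `sudo nvme smart-log /dev/nvme0`, with the device path adjusted for your system.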

One additional advantage of continuous monitoring is to see if there are environmental conditions that are stressing the disk. In my case, I wanted to see if there was a heat dissipation problem that was causing the repeated failures, and it turned out there was not.
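Once temperature readings are being logged, spotting a heat problem mostly comes down to bucketing them. Here is a small sketch of that idea, assuming the readings are already available as a list of Celsius values; the function names and the 5-degree bin width are arbitrary choices for the example, not taken from the app.

```python
from collections import Counter


def temperature_histogram(temps_c, bin_width=5):
    """Count readings per temperature bin, keyed by the bin's lower edge."""
    bins = Counter((int(t) // bin_width) * bin_width for t in temps_c)
    return dict(sorted(bins.items()))


def render(hist, bin_width=5):
    """Draw the histogram as plain text, one row per bin."""
    rows = []
    for low, count in hist.items():
        rows.append(f"{low}-{low + bin_width - 1} C | {'#' * count} {count}")
    return "\n".join(rows)
```

A drive that spends most of its time in the upper bins, or that shows a long tail of hot readings under load, would be a hint to look at cooling before blaming the disk.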

Thanks. I recently took in an old, retired, beat-up Windows laptop of mine for repair (I had already switched to an Apple laptop) because it seemed worth fixing a couple of things. The repair lady told me that both drives look like they are near their read/write limits, which makes sense based on the laptop’s age and use. So this will determine whether I can keep using the SSDs if I retrofit the computer with Linux, or whether I should have new SSDs put in.

Also, I was quite freaked out when I belatedly saw the videos about Apple soldering its SSDs to the board, so I guess once the drive on this computer goes, that will be that? I dunno. It just got me thinking about being more aware of whether the drive is near end-of-life. But when I went to use the SMART software, I wasn’t able to.