`smartctl` shows excessive power cycles and unsafe shutdowns

I searched about the condition and found this “shutdown notification


I think that if the SSD requires the machine to explicitly send a “notification”, and otherwise increments “unsafe shutdown” by 1, that is too sensitive.
If you, hypothetically, only power on the SSD by connecting VCC and GND, then power it off by removing power from VCC, the “unsafe shutdown” counter will still increase even if you did absolutely nothing in terms of data.
This could explain why something as mundane as adjusting the keyboard backlight causes it to increase: the backlight LEDs and the SSD’s VCC are likely powered by the same rail.

Even with the bug mentioned above, the counter shouldn’t add 400+ within a day. Something is definitely broken: either the SSD itself, or the mainboard powering the SSD.

Hey there,
I’m having a similar issue here on my FW 13 Ryzen 7 350, about 3 weeks old.

I have two drives with two OSes: one Kioxia G3 1TB internally for my Fedora 42 installation and a 250GB Expansion Card with Windows 11 installed.

I took a look today at CrystalDiskInfo in Windows and saw that my Kioxia has about 280 power-on hours (that tracks) and 4014 power cycles with 3819 unsafe shutdowns (that doesn’t track).

The Framework expansion card running Windows has about 80 cycles with 20 unsafe shutdowns, which may or may not be accurate. I sometimes leave it plugged in when booted into Fedora but never really mount it; I usually take it out though.

It also appears to me that the issue resides with my Fedora installation, but I can’t exactly see what’s causing the problem; I just figured I might chime in as another data point. I’d be glad to help troubleshoot this given guidance (I’m fairly well versed with using Linux).

I’m having two other issues with Fedora that may or may not be related to the drive disconnecting. One is that the device sometimes kernel panics when left idle (with the lid open, mind you; idk if it goes to sleep, since I’ve only caught it a few times with it left open after leaving home for a while).
The other is that sometimes, when waking from sleep, the Goodix fingerprint sensor no longer works or shows up in lsusb, requiring a full reboot for it to show up again.

I’ll be keeping an eye out for replies and potential help or ideas to try by myself.


Sorry to hear that, and thanks for sharing your case. I’m using btrfs and occasionally run btrfs scrub and take subvolume backups, so that if anything goes wrong I can recover more quickly.

I’ve opened a ticket and have been in contact with them for a while. We haven’t had any “ah-ha, that’s it” moment yet and are still running different experiments. I’m currently testing whether setting power-profiles-daemon to a different power mode helps (I used to use the KDE GUI to switch automatically to powersave on battery and balanced otherwise; as an experiment I’m now always on balanced), and I’m also keeping an eye on whether an ordinary poweroff/reboot affects the counters too.

Speaking for myself: dmesg has little to say on this issue, since every time I find smartctl reporting irregular values again, it is after I wake the laptop from s2idle, during which user space and the kernel inherently have less of a grasp of the whole picture because, well, s2idle. I guess one needs decent kernel driver and ACPI knowledge to debug such issues. One thing that is certain is that power cycles and unsafe shutdowns always seem to go hand in hand.

Someone from another community suggested patching the DSDT to forcibly enable S3/suspend-to-RAM, to see whether this is ultimately just another case of manufacturers not implementing s2idle well. I’m not sure it’s worth it and have not tried it, since it’s not supported by either AMD or Framework after all.
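
For anyone who wants to confirm which suspend mode their machine is actually using before going down the DSDT route, here is a minimal sketch. It assumes the usual /sys/power/mem_sleep format, where the active mode is the one shown in square brackets; the function name active_sleep_mode is made up for this example.

```shell
#!/bin/bash
# Print the sleep mode marked with [brackets] in mem_sleep-style text on stdin.
# Example inputs: "[s2idle]" or "s2idle [deep]".
active_sleep_mode() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Typical use on a running system:
#   active_sleep_mode < /sys/power/mem_sleep
```

If this prints s2idle and no other mode is listed in the file, the firmware is not advertising S3 at all.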

Same issue here, with btrfs on dm-crypt on LUKS on NVMe. Since rebooting into the latest 6.15.7 kernel I had not seen the issue (Edit: the last suspend/wake cycle jumped the count from 51179 to 52000! The issue persists with 6.15.7.). Fingers crossed it stays that way! I racked up an insane number of unsafe shutdowns over the last 2 months:

# smartctl -a /dev/nvme0n1 | grep "Unsafe Shutdowns"
Unsafe Shutdowns:                   51,176

I’ve made a bunch of cmdline tweaks, though:

root=/dev/mapper/vg-main rw rootflags=subvol=@ usbcore.quirks=27c6:609c:0x40 rd.luks.name=[uuid]=cryptroot rd.luks.options=[uuid]=tpm2-device=auto,tpm2-measure-pcr=yes,discard,tries=3 lsm=capability,landlock,lockdown,yama,apparmor,bpf mitigations=full nvme_core.default_ps_max_latency_us=25000 rtc_cmos.use_acpi_alarm=1 audit=1 iommu=pt

Notably the max latency prevents the drive from entering power state 5:

# nvme id-ctrl /dev/nvme0n1 | grep -A6 '^ps '
ps      0 : mp:4.60W operational enlat:0 exlat:0 rrt:0 rrl:0
            rwt:0 rwl:0 idle_power:0.2200W active_power:4.60W
            active_power_workload:80K 128KiB SW
            emergency power fail recovery time: -
            forced quiescence vault time: 10 (unit: 1 second)
            emergency power fail vault time: -
ps      1 : mp:3.00W operational enlat:0 exlat:0 rrt:0 rrl:0
            rwt:0 rwl:0 idle_power:0.2200W active_power:3.00W
            active_power_workload:80K 128KiB SW
            emergency power fail recovery time: -
            forced quiescence vault time: 10 (unit: 1 second)
            emergency power fail vault time: -
ps      2 : mp:2.50W operational enlat:0 exlat:0 rrt:0 rrl:0
            rwt:0 rwl:0 idle_power:0.2200W active_power:2.50W
            active_power_workload:80K 128KiB SW
            emergency power fail recovery time: -
            forced quiescence vault time: 10 (unit: 1 second)
            emergency power fail vault time: -
ps      3 : mp:0.0200W non-operational enlat:2000 exlat:3000 rrt:3 rrl:3
            rwt:3 rwl:3 idle_power:0.0200W active_power:-
            active_power_workload:-
            emergency power fail recovery time: -
            forced quiescence vault time: 10 (unit: 1 second)
            emergency power fail vault time: -
ps      4 : mp:0.0050W non-operational enlat:4000 exlat:12000 rrt:4 rrl:4
            rwt:4 rwl:4 idle_power:0.0050W active_power:-
            active_power_workload:-
            emergency power fail recovery time: -
            forced quiescence vault time: 10 (unit: 1 second)
            emergency power fail vault time: -
ps      5 : mp:0.0030W non-operational enlat:176000 exlat:25000 rrt:5 rrl:5
            rwt:5 rwl:5 idle_power:0.0030W active_power:-
            active_power_workload:-
            emergency power fail recovery time: -
            forced quiescence vault time: 10 (unit: 1 second)
            emergency power fail vault time: -
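
To make the table above easier to compare against nvme_core.default_ps_max_latency_us, here is a small sketch that condenses each power state to one line. It only reformats the `nvme id-ctrl` text; which figure the kernel actually compares against the limit (entry, exit, or their sum) is not established in this thread, so it prints all three. Note that PS5's exit latency is exactly 25000, the same as the limit on the command line above. The function name ps_latencies is invented for this example.

```shell
#!/bin/bash
# Summarise the "ps N :" lines of `nvme id-ctrl` output read from stdin.
# Prints: ps<N> <operational|non-operational> enlat=.. exlat=.. total=..
ps_latencies() {
    awk '$1 == "ps" {
        enlat = ""; exlat = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^enlat:/) enlat = substr($i, 7)   # strip "enlat:"
            if ($i ~ /^exlat:/) exlat = substr($i, 7)   # strip "exlat:"
        }
        oper = ($0 ~ /non-operational/) ? "non-operational" : "operational"
        printf "ps%s %s enlat=%s exlat=%s total=%d\n", $2, oper, enlat, exlat, enlat + exlat
    }'
}

# Typical use:
#   nvme id-ctrl /dev/nvme0n1 | ps_latencies
```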

I also made a monitor script that may help you:

#!/bin/bash

LOG_FILE="/var/log/suspend_monitor.log"

log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

get_nvme_power_info() {
    local nvme_power_dir="/sys/block/nvme0n1/power"

    if [[ -d "$nvme_power_dir" ]]; then
        local control=$(cat "$nvme_power_dir/control" 2>/dev/null || echo "unavailable")
        local status=$(cat "$nvme_power_dir/runtime_status" 2>/dev/null || echo "unavailable")
        local autosuspend=$(cat "$nvme_power_dir/autosuspend_delay_ms" 2>/dev/null || echo "unavailable")

        echo "Control: $control, Status: $status, Autosuspend: ${autosuspend}ms"
    else
        echo "NVMe power directory not found"
    fi
}

get_nvme_apst_info() {
    if command -v nvme >/dev/null 2>&1; then
        local apst_info=$(nvme get-feature -f 0x0c -H /dev/nvme0n1 2>/dev/null | grep -i "autonomous\|apst" || echo "APST info unavailable")
        echo "$apst_info"
    else
        echo "nvme-cli not available"
    fi
}

get_unsafe_shutdowns() {
    if command -v smartctl >/dev/null 2>&1; then
        local unsafe_count
        unsafe_count=$(smartctl -a /dev/nvme0n1 2>/dev/null | grep -i "unsafe shutdown" | grep -o '[0-9,]*' | tr -d ',')
        # tr always exits 0, so fall back on empty output rather than on exit status
        echo "${unsafe_count:-unavailable}"
    else
        echo "smartctl not available"
    fi
}

# Store initial unsafe shutdown count
INITIAL_UNSAFE=$(get_unsafe_shutdowns)

log_message "=== Suspend Monitor Started ==="
log_message "NVMe Power State: $(get_nvme_power_info)"
log_message "NVMe APST Info: $(get_nvme_apst_info)"
log_message "Initial Unsafe Shutdowns: $INITIAL_UNSAFE"

# Monitor suspend/resume events
journalctl -f -u systemd-suspend.service -u systemd-resume.service --since="1 minute ago" | while read -r line; do
    if [[ "$line" =~ (suspend|resume) ]]; then
        current_unsafe=$(get_unsafe_shutdowns)
        log_message "Event: $line"
        log_message "NVMe Power State: $(get_nvme_power_info)"
        log_message "Unsafe Shutdowns: $current_unsafe"

        # Alert if unsafe shutdowns increased
        # Compare only when both values are plain integers
        if [[ "$current_unsafe" =~ ^[0-9]+$ && "$INITIAL_UNSAFE" =~ ^[0-9]+$ ]]; then
            if (( current_unsafe > INITIAL_UNSAFE )); then
                log_message "⚠️  WARNING: Unsafe shutdowns increased from $INITIAL_UNSAFE to $current_unsafe!"
            fi
        fi
    fi
done
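
A different approach from following the journal, sketched here in case it is useful: systemd-sleep runs every executable in /usr/lib/systemd/system-sleep/ with two arguments, pre or post and the sleep action, so a hook there can record the counters right before and right after each suspend. The file name nvme-counters, the log path and the device path are assumptions for this example, and I have not checked how reliably smartctl answers that early after resume.

```shell
#!/bin/bash
# Hypothetical hook file: /usr/lib/systemd/system-sleep/nvme-counters
# systemd-sleep calls it with $1 = pre|post and $2 = the sleep action.

LOG_FILE="${LOG_FILE:-/var/log/suspend_monitor.log}"
DEVICE="${DEVICE:-/dev/nvme0n1}"

# Extract "Power Cycles" and "Unsafe Shutdowns" from smartctl text on stdin
# and print them on one line, with thousands separators removed.
extract_counters() {
    awk -F: '/^(Power Cycles|Unsafe Shutdowns):/ {
        gsub(/[ ,]/, "", $2); printf "%s%s=%s", sep, $1, $2; sep = " "
    } END { print "" }'
}

main() {
    local counters
    counters=$(smartctl -a "$DEVICE" 2>/dev/null | extract_counters)
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1 $2 ${counters:-counters unavailable}" >> "$LOG_FILE"
}

# Only act when invoked as a hook with both arguments.
if [[ $# -ge 2 ]]; then
    main "$1" "$2"
fi
```

The file has to be executable and owned by root. Each suspend then leaves a pre line and a post line in the log that can be compared directly.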

Folks, we are actively tracking this. However, unless you are very clear on specs, your data is not helping us better understand what you’re experiencing. Distro and kernel are great, but laptop model and specs are needed as well. Thanks.

Laptop, RAM and NVMe model please as well.


Laptop model, RAM and drive specs please.

Some of our platforms have SSD D3Cold support. This allows the mainboard to cut power to the SSD entirely, and it should be supported in Windows and Linux.

This may be what is causing the unsafe shutdown count to increase on your SSD.

You could try setting the kernel parameter
nvme.noacpi=1

to see if this stops the unsafe shutdown count from increasing, as a first debug step.
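
After rebooting, it is worth confirming that the parameter really reached the kernel before drawing conclusions. A minimal sketch, assuming the parameter is given as name=value on the kernel command line; the helper name cmdline_param_value is made up for this example.

```shell
#!/bin/bash
# Print the value of PARAM from a kernel command line string.
# Returns non-zero and prints nothing if PARAM=value is absent.
cmdline_param_value() {
    local param=$2 word
    local -a words
    # read -a splits on whitespace without glob-expanding words like [uuid]
    read -ra words <<< "$1"
    for word in "${words[@]}"; do
        if [[ $word == "$param="* ]]; then
            printf '%s\n' "${word#"$param="}"
            return 0
        fi
    done
    return 1
}

# Typical use on a running system:
#   cmdline_param_value "$(cat /proc/cmdline)" nvme.noacpi
```

As far as I know the nvme driver also exposes the setting read-only at /sys/module/nvme/parameters/noacpi, but treat that path as an assumption and check that it exists on your kernel.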


Which platforms support that, and would that flag cause issues on those that do not? For the 13" I have 11th and 12th gen Intel boards and an AMD 7640, in the 16" I have the AMD 7940, and in the 12" I have the Intel 13th gen i3. I typically boot off of an expansion card that I move between machines, so would not want to set a kernel parameter that caused issues with some platforms. Thanks!

  • AMD AI 340
  • SK hynix Platinum P41 1TB
  • Crucial CT2K16G56C46S5
  • NixOS 25.05

  • Framework 13 with AMD Ryzen 7 7840U
    • Firmware v3.09
  • 2x 16GB A-DATA DDR5 5600 MT/s AD5S560016G-B
  • WD_BLACK SN770 1TB
    • Firmware 731120WD
  • Arch Linux, currently running linux kernel 6.15.7

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,034,366 [529 GB]
Data Units Written:                 6,886,308 [3.52 TB]
Host Read Commands:                 11,899,287
Host Write Commands:                136,022,334
Controller Busy Time:               179
Power Cycles:                       2,814
Power On Hours:                     343
Unsafe Shutdowns:                   809
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Power Cycles increments whenever I suspend/resume; Unsafe Shutdowns increments about a third of the time, I guess? I don’t think I’ve fully rebooted this thing 800 times. Also, I’ve never seen signs of disk or fs corruption.

I can no longer edit this post, but my hardware/software stack is described below.

Hardware stack:

Laptop 13 (AMD Ryzen AI 300 Series)
GPU: c1:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Krackan [Radeon 840M / 860M Graphics] (rev c2)
Storage: bf:00.0 Non-Volatile memory controller: Sandisk Corp WD_BLACK SN7100 NVMe SSD (DRAM-less) (rev 01)
    nvme0n1: :
WiFi: c0:00.0 Network controller: MEDIATEK Corp. MT7925 (RZ717) Wi-Fi 7 160MHz
RAM: 32 GB DDR5 @ 5600 MT/s

Software:

Kernel version: 6.15.4-arch2-1
Desktop Environment: sway (wayland)
Distribution: Arch Linux
BIOS Version: 03.03

Since my last report, I have not shut down or restarted (uptime is 17 days), but “Unsafe Shutdowns” has increased from 809 to 821:

Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,034,437 [529 GB]
Data Units Written:                 6,947,973 [3.55 TB]
Host Read Commands:                 11,901,443
Host Write Commands:                137,301,470
Controller Busy Time:               181
Power Cycles:                       2,856
Power On Hours:                     347
Unsafe Shutdowns:                   821
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
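
Comparing the two reports by hand gives +42 Power Cycles (2,814 to 2,856) and +12 Unsafe Shutdowns (809 to 821). For anyone who saves `smartctl -a` output to files, here is a sketch that does the same subtraction; the function names and the idea of keeping snapshots as text files are assumptions for this example.

```shell
#!/bin/bash
# Print the numeric value of one SMART counter (e.g. "Unsafe Shutdowns")
# from a saved `smartctl -a` text file, with thousands separators removed.
smart_counter() {
    local field=$1 file=$2
    awk -F: -v f="$field" '$1 == f { gsub(/[ ,]/, "", $2); print $2; exit }' "$file"
}

# Print how much a counter grew between two saved snapshots.
# Both files must contain the field, otherwise the arithmetic fails.
counter_delta() {
    local field=$1 before=$2 after=$3
    echo $(( $(smart_counter "$field" "$after") - $(smart_counter "$field" "$before") ))
}

# Typical use:
#   smartctl -a /dev/nvme0n1 > before.txt   # ... suspend, resume ...
#   smartctl -a /dev/nvme0n1 > after.txt
#   counter_delta "Unsafe Shutdowns" before.txt after.txt
```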

Same thing happening here. I’ve had the laptop for a couple of months and smartctl shows 4000 power cycles and 3800 unsafe shutdowns.

Specs:
FW13 Ryzen 7 350
2x16 5600MHz Kingston Fury
1TB Kioxia Exceria G3 NVMe (contains Fedora 42 with latest updates)
250GB Framework expansion card (contains Windows 11 in To Go mode, works fine and doesn’t seem to cause any power cycle issues)

Ok, this is interesting: I removed Linux from my internal SSD and cloned over my Windows installation from the SSD expansion card. I took a screenshot of CrystalDiskInfo right after installing it about a month ago and, lo and behold, power cycle counts have increased by about 900 (I have NOT rebooted or slept this machine anywhere near 900 times) and Unsafe Shutdowns have gone from 0x1417 to 0x156A (339 in decimal, so ~11 a day, again nowhere near the number of power cycles I do myself). I have upgraded to the latest firmware and it made no difference.
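
For anyone else reading raw hex values out of CrystalDiskInfo, the subtraction can be checked with bash arithmetic, which accepts the 0x prefix directly. The helper name hex_delta is just for this example.

```shell
#!/bin/bash
# Difference between two raw counter values given in hex.
hex_delta() {
    echo $(( $2 - $1 ))
}

hex_delta 0x1417 0x156A   # 5482 - 5143, prints 339
```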

This sounds like a serious firmware issue to me, or Linux and Windows happen to be having a similar problem. Either way, I feel like this needs to be investigated further because I’m not entirely certain cycling the ssd or doing unsafe shutdowns on it is a good thing at all for the drive.


I have now confirmed this happens on at least one more identical Framework device with the same Kioxia SSD running Windows 11. It might just be normal, but I’m still not sure this should be happening.

I have the same increment in the counter too!

My Framework:

Manufacturer: Framework
Product Name: Laptop 13 (AMD Ryzen 7040Series)
Version: A7
BIOS Revision: 3.16
SKU Number: FRANDGCP07

My SSD:

Model Number: Samsung SSD 990 PRO 4TB
Firmware Version: 4B2QJXD7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538

Before the system suspends:
Power Cycles: 3,157
Power On Hours: 2,090
Unsafe Shutdowns: 2,405

After coming back from suspend:
Power Cycles: 3,159
Power On Hours: 2,090
Unsafe Shutdowns: 2,406

One suspend created 2 power cycles, and 1 unsafe shutdown.

Currently on Gentoo and Linux kernel 6.17.1

The kernel option nvme.noacpi=1 totally killed suspend. I could not get my system out of suspend: the power button LED stopped flashing, but the screen stayed black.

Same issue:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 19%
Data Units Read: 820,123 [419 GB]
Data Units Written: 1,603,756 [821 GB]
Host Read Commands: 8,158,333
Host Write Commands: 22,040,628
Controller Busy Time: 1
Power Cycles: 111,644
Power On Hours: 100
Unsafe Shutdowns: 111,081
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius

I have a 13 with the HX 370 and a Crucial P310. Same issue as the poster above with nvme.noacpi=1.