Hi folks, I have one of my specialists working to repro for Fedora and Arch here.
Other distros, we will recommend a log capture and a bug submission.
Thanks
Perhaps you, or your specialist, might invest some time to browse through the issues section of your “SoftwareFirmwareIssueTracker” repository on GitHub. There is a lot of valuable information there already, interesting to anybody truly interested in fixing the issue, I imagine. Please keep us updated, Matt. Thanks a bunch in advance.
There are essentially 3 different issues that result in random halts or random reboots. None of them leaves any logs except the S5_RESET_STATUS value.
They seem to affect both the FW13 AMD and FW16 AMD.
I have been told by AMD that these can only be resolved by a manufacturer, i.e. FW, talking directly with AMD. Users cannot help any further with these problems.
The main problem is reproducibility. Laptops can sometimes go for a whole month without a problem and other times it can happen several times a day.
At least the S5_RESET_STATUS proved it was a low level problem and not a device driver problem.
My theories as to the cause of these is one of:
Also note, this is not an FW-specific problem. Users of other manufacturers’ laptops have reported similar problems:
Asking for logs won’t help on this one because there are no logs, save for the one added by people patching their kernel.
Now you’re starting to get my drift.. Almost funny if it wasn’t so sad.
Anyhow, IMO at this point it strongly feels like it is either a case of “can not” or “will not”, no matter how detailed the reports.
There is plenty of information on these forums and again in the GitHub issues. It’s been known for months (6-8 months) that we (FW16 and more recently FW13 users) have been unable to get any logs (except the reset status), as there aren’t any after the issue occurs.
Howdy!
Someone else did link the Github issue tracker link in a related thread so I’m going over that now. This issue is going to the top of my priorities list and I’m going to first focus on replicating this issue. As everyone is noting, this is a particularly difficult issue to pin down because it leaves behind little actionable information to work off of.
I’m hopeful I’ll be able to make some faster progress on this for everyone. Given the nature of this issue and how much testing it may take to replicate, I just want to say I appreciate everyone’s patience in the meantime. I’m no stranger to these sorts of issues on various mobile platforms over the years, so I understand how frustrating it is.
@Jesse_Darnley
It is great that you are looking into this.
I have been a Linux kernel developer for over 10 years. You have picked a particularly difficult problem to solve.
My advice to you would be for FW to talk to AMD regarding the S5_RESET_STATUS and find out what kind of support you will need from the hardware and BIOS teams to track this one down.
At the same time, see if anyone in your office has seen any of the reported problems at all.
If the answer is none of them, it might be necessary to ask for the mainboard back from any of the users who have reported the problem, in case the problem is my item 4: “Faulty CPU, for this case, a simple replacement mainboard might fix it.”
Hello Jesse,
thanks a bunch for having a look into this. It appears to me that you, or anyone from Framework support for that matter, have for the first time indeed read about and understood the issue, and frankly, this is all I have ever asked for, admittedly in a continuously descending tone of voice.
Yes, as @James3 has pointed out, this is a pretty nasty one, and it might perhaps even be the case that this issue is out-of-scope for framework engineering, depending on what the root cause might turn out to be, e.g. AMD platform mgmt. issue rather than EC/firmware. But i am obviously out of my depth on this one.
The functionality of the patch that @James3 has kindly provided, which allows reading out the value of the “S5_RESET_STATUS” register at next boot, has been upstreamed into mainline Linux [1], iiuc, for the 6.16 release.
This means that FW support might ask users for their reading of this value in respect to “sudden reboot” or “freeze-then-reboot” issues, helping to pinpoint the root cause, once 6.16 hits the repositories of FW-supported distributions, of course.
Again, I’d like to thank you sincerely for taking the time to have a look into this, even if it means FW won’t be able to provide a fix in the near future, considering that this might be caused by quite a lot of different and complicated subsystems, unfortunately, perhaps even in conjunction.
But as @Will_Nilges has pointed out, it is very likely that this concerns not just Phoenix platforms but newer platforms as well, so it might be worth getting to the bottom of this one.
[1] Making sure you're not a bot!
P.S: This patch appears to be authored by the kernel developer Mario Limonciello from AMD, who is also quite active around these forums. As @James3 has already pointed out, this (S5_RESET_STATUS register) might perhaps be the most promising starting point for getting into this, although, i imagine that he (@Mario_Limonciello) is already getting quite a lot of requests and surely, his days must also be counting 24 hours, even though it might often not appear to be the case, you know:)
This is indeed from a 6.16.0-rc1 kernel (without said patch applied):
Jun 11 15:19:51 fw kernel: x86/amd: Previous system reset reason [0x00080800]: software wrote 0x6 to reset control register 0xCF9
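For anyone who wants to pull this value out of a saved kernel log before filing a report, here is a minimal Python sketch. The line format is assumed from the single 6.16.0-rc1 line quoted above, and `parse_reset_reason` is a made-up helper name, not part of any kernel tooling. It only extracts the raw value, the set bit positions and the kernel's own description; it does not try to interpret the bits.

```python
import re

# Pattern assumed from the one 6.16.0-rc1 log line quoted in this thread;
# other kernel versions may word the message differently.
RESET_RE = re.compile(
    r"x86/amd: Previous system reset reason \[(0x[0-9a-fA-F]+)\]: (.*)"
)


def parse_reset_reason(line):
    """Return (value, set_bits, description), or None if the line does not match."""
    m = RESET_RE.search(line)
    if m is None:
        return None
    value = int(m.group(1), 16)
    # List which bit positions are set, without assigning meaning to them.
    set_bits = [bit for bit in range(32) if value & (1 << bit)]
    return value, set_bits, m.group(2).strip()


sample = (
    "Jun 11 15:19:51 fw kernel: x86/amd: Previous system reset reason "
    "[0x00080800]: software wrote 0x6 to reset control register 0xCF9"
)
print(parse_reset_reason(sample))
```

Feeding it every line of a saved `journalctl -k` dump and keeping the non-`None` results should give a compact list to attach to a bug report.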
Hello,
It’s all good, like I said I understand the frustration, and I’m happy to get this moving forward in a positive manner. One of the biggest challenges we’ve had trying to address this previously was the inability to get these crashes to happen in a consistent manner on our own hardware.
I come bearing a small bit of good news. I’ve been able to find a way to force this crash to essentially happen on demand on our AI 300 mainboard. I noticed a couple of users in other threads mention the crash was more likely if their system changed charging state while running. Sure enough, if I plug in the system to AC power exactly as it enters suspend, that consistently crashes it every time.
I’m hoping we’re able to make good use of this and find a fix for the issue, and hopefully I’m able to deliver another update on this relatively soon.
Ok, thanks @Jesse_Darnley
I don’t want to bloat this thread up too much (I’ve done enough of that elsewhere), but I’ll add this piece because you mentioned
As an FW16 user (see the GitHub issues that @James3 mentioned earlier), in my experience the FTR/FTH (mostly FTH) occurs more frequently when I change from AC to DC, and then, regardless of whether I go back to AC or not, the system is less stable until I reboot (someone else on this forum mentioned this too). Before rebooting I put it back on AC. It’s not consistent, but it is more frequent.
I’m not saying to ignore others, but do please look at what @James3 and @Adrian_Joachim (there’s someone else but I forgot their name) have also said; they have good details here on this forum and in the GitHub issues.
Hi Jesse,
please, take your time and don’t rush..
I think it is important to differentiate possible issues, e.g. there appear to be occurrences for FW16 users where issues (FTH/FTR) arise while using the machine, some while connected to AC, some not.
FW16 users also appear to experience an FTH (freeze-then-halt) issue, where the machine apparently freezes, does not react to ACPI events anymore, and has to be hard reset via the power button.
Personally, I can only speak for my FW13, where the issue (FTR/sudden reset/no logs) solely happens while resuming from suspend. I have never been able to isolate the root cause or specify the circumstances of this happening, not to speak of reliably triggering the issue. FW support asked me to record the issue on video, which meant having to record my laptop on every wake from suspend until the issue occurred again. Take it from the following log how long this might have taken me..
Last months:
Feb 21 12:33:11 fw kernel: S5_RESET_STATUS = 0x08000800
Mar 02 15:45:20 fw kernel: S5_RESET_STATUS = 0x08000800
Apr 08 14:22:50 fw kernel: S5_RESET_STATUS = 0x08000800
Apr 30 21:30:00 fw kernel: S5_RESET_STATUS = 0x08000800
Going strong since the last BIOS update with 41 days since the last time the laptop reset itself:)
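To put numbers on how rare these resets are, the log lines above can be turned into gaps between events with a short Python sketch. The timestamps carry no year, so the `year` argument is an assumption supplied by the caller (2025 is used here purely for illustration), and `parse_s5_lines` and `gaps_in_days` are made-up helper names. The line format is assumed from the four lines quoted above.

```python
import re
from datetime import datetime

# Format assumed from the patched-kernel lines quoted above:
# "<Mon> <day> <HH:MM:SS> <host> kernel: S5_RESET_STATUS = 0x..."
S5_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*S5_RESET_STATUS = (0x[0-9a-fA-F]+)"
)


def parse_s5_lines(lines, year):
    """Return a list of (timestamp, register value) for matching lines."""
    events = []
    for line in lines:
        m = S5_RE.search(line)
        if m:
            # The year is not in the log; it is an assumption passed in by the caller.
            when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((when, int(m.group(2), 16)))
    return events


def gaps_in_days(events):
    """Whole days between consecutive resets, oldest first."""
    times = sorted(when for when, _ in events)
    return [(b - a).days for a, b in zip(times, times[1:])]


log = [
    "Feb 21 12:33:11 fw kernel: S5_RESET_STATUS = 0x08000800",
    "Mar 02 15:45:20 fw kernel: S5_RESET_STATUS = 0x08000800",
    "Apr 08 14:22:50 fw kernel: S5_RESET_STATUS = 0x08000800",
    "Apr 30 21:30:00 fw kernel: S5_RESET_STATUS = 0x08000800",
]
events = parse_s5_lines(log, year=2025)  # year assumed for illustration
print(gaps_in_days(events))
```

Gaps measured in weeks like these are why recording every wake from suspend on video is such an impractical way to catch the issue.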
But i surely can imagine these issues (FW13/FW16,7040/ai300) being related, considering that, imo, there is/are an issue(s) with System Management Mode or the Embedded Controller, seeing the kernel not being involved in this.
And thanks again for keeping us updated..
Keep in mind that Triple Fault is much more predictable, as it is caused by sleep-related … things.
But excellent compilation.
@sydney
If it randomly freezes and then hangs (FTH), it’s probably Sync Flood.
As you can see in the GitHub issues, I also “am experiencing” the problem. Except I have determined that mine is due to Triple Fault, so I just don’t sleep. Can’t say I have ever run into an issue otherwise.
Seems like Sync Flood is triggered by some bad power state switching (e.g., idle low power on desktop vs high power in gaming). Especially in low power state.
AFAIK there’s no relation between the 7040 and AI series. But yes, FW13 and FW16 are related. In fact the 7840U/7840HS/7940HS use identical silicon, just binned differently. However, I still expect the design to be different.