[RESPONDED] dGPU lockup from rear USB port

jared_kidd · April 2, 2024, 5:42am

I am having an issue with the dGPU locking up if I connect a display to the rear port after booting up. I run Arch Linux as my main OS but also have Fedora 39 installed for further testing. The same results occur in either OS.
If I boot up with nothing connected, and then connect a display to the rear port, dmesg does show a connection like so, but the additional monitors do not work:

[  142.939856] usb 1-2.4: new full-speed USB device number 11 using xhci_hcd
[  143.078727] usb 1-2.4: New USB device found, idVendor=32ac, idProduct=0002, bcdDevice= 0.00
[  143.078743] usb 1-2.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  143.078749] usb 1-2.4: Product: HDMI Expansion Card
[  143.078753] usb 1-2.4: Manufacturer: Framework
[  143.078757] usb 1-2.4: SerialNumber: 11AD1D00A49C4014081E0B00
[  143.138075] hid-generic 0003:32AC:0002.000C: hiddev101,hidraw11: USB HID v1.11 Device [Framework HDMI Expansion Card] on usb-0000:c4:00.3-2.4/input1

However, if I then try to launch any graphics applications, games, or even nvtop, they just hang. This also makes the system pretty unstable, and it will hang on poweroff/reboot. If I then try to power back on immediately, the laptop will not boot (which scared the crap out of me the first time). But if I wait at least 30 seconds or so while powered off, it will power back on properly.

If I power on and boot up with the rear display(s) connected, it functions just fine and I can even remove them and reconnect without issue.

Can anyone else confirm this is happening to them? Or do I possibly have a busted dGPU and need to contact support?

Daniel_I · April 3, 2024, 3:09pm

I’ll try testing your scenario in a bit… Just for confirmation, it looks like you are using an HDMI module, so that will be one difference between our tests as I only have DP modules.

In the meantime, I’ll share some other settings that I set while I was troubleshooting my own issue(s)… the thought being that if you could try them, our baselines would be closer together:

In GRUB, I added amdgpu.runpm=0 to prevent the dGPU from sleeping

GRUB_CMDLINE_LINUX_DEFAULT="udev.log_priority=3 amdgpu.runpm=0 sysrq_always_enabled=1"

I also added amdgpu to the modules section of /etc/mkinitcpio.conf (Arch-based) to pre/early load the kernel driver…
MODULES=(amdgpu)

Between these two settings, I think #1 could be illuminating… does your system respond the same when the dGPU is not sleeping?
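To confirm that the first setting actually took effect after a reboot, a quick check along these lines may help. This is only a sketch: the helper name `has_runpm_disabled` is made up for illustration, and it assumes the amdgpu module exposes its `runpm` parameter under /sys/module. On Arch, the GRUB and mkinitcpio changes usually only apply after regenerating both (`grub-mkconfig -o /boot/grub/grub.cfg` and `mkinitcpio -P`) and rebooting.

```shell
#!/bin/sh
# has_runpm_disabled CMDLINE (hypothetical helper): succeed if the given
# kernel command line contains the exact token amdgpu.runpm=0.
has_runpm_disabled() {
    case " $1 " in
        *" amdgpu.runpm=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the command line the running kernel was booted with.
if [ -r /proc/cmdline ]; then
    if has_runpm_disabled "$(cat /proc/cmdline)"; then
        echo "amdgpu.runpm=0 is on the kernel command line"
    else
        echo "amdgpu.runpm=0 is NOT on the kernel command line"
    fi
fi

# Assumption: the loaded amdgpu module exposes the parameter here.
if [ -r /sys/module/amdgpu/parameters/runpm ]; then
    echo "amdgpu runpm parameter: $(cat /sys/module/amdgpu/parameters/runpm)"
fi
```

If the first line reports that the parameter is missing, the GRUB config was most likely not regenerated.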

jared_kidd · April 4, 2024, 5:13am

I did some further testing and this prevents it from locking up if I plug into the rear port after booting up. Unfortunately, it also adds 6-10W of additional load on the battery while unplugged because the dGPU never goes to sleep. Unacceptable in my mind.

I wish Framework would comment on this issue as it is a pretty major bug.

Daniel_I · April 4, 2024, 1:50pm

hmm, well, I’m normally connected to AC (living room PC), but according to my conky that’s monitoring cat /sys/class/drm/card1/device/hwmon/hwmon*/power1_average | awk -v OFMT="%3.2f" '{print $1/1000000 " W"}' the watts bounce between 0-1 W and only increase when I launch a steam game using DRI_PRIME=1 %command%… so strange that you’d be seeing 6-8W… perhaps we are measuring differently?
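For anyone who wants to take the same reading outside of conky, the one-liner above can be unrolled roughly as follows. This is a sketch, not a definitive tool: `uw_to_watts` is a name invented here, the card1 path is the one from this post and may be card0 on other systems, and it assumes power1_average reports microwatts, which is what the division by 1000000 implies. Note that reading these files may itself wake the dGPU.

```shell
#!/bin/sh
# uw_to_watts MICROWATTS (hypothetical helper): print the value in watts
# with two decimals, matching the division used in the conky command.
uw_to_watts() {
    awk -v uw="$1" 'BEGIN { printf "%.2f W\n", uw / 1000000 }'
}

# Path taken from the post; adjust the card number for your system.
for f in /sys/class/drm/card1/device/hwmon/hwmon*/power1_average; do
    if [ -r "$f" ]; then
        uw_to_watts "$(cat "$f")"
    fi
done
```

Running it once with the dGPU port empty and once with a monitor attached would show whether the port makes a difference.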

Also, for the time being, I have stopped using the dGPU port entirely. It feels obvious from my thread (and yours) that using the dGPU port brings problems; and I’ll test suggestions/fixes as FW responds to my thread. So I’m curious if my dGPU watts are measuring lower for me because the dGPU port is empty?

There are some days when I feel I’ve paid to beta test a new product… I need more days where I can just enjoy using a fully functional FW16… If I can’t get to that point soon, I may have to consider the 30 day return policy.

jared_kidd · April 4, 2024, 4:46pm

Perhaps. I’m by no means doing any exhaustive testing on this. I didn’t save my results yesterday, so I just did it again to provide you some screenshots.
My simple method is boot up into KDE/Plasma on battery, close steam and bitwarden (two things I have in autostart) to get to a basic desktop environment with minimal extras running.

With no sleep (dGPU in “D0” mode):

Then with the “amdgpu.runpm=0” removed (dGPU in “D3cold” mode):

In this simple test, it is an 8 W difference. Nothing connected to the laptop beyond my mouse dongle and the ethernet adapter, screen brightness at 100%, and nothing else going on or open on the desktop. Kept it as close to the same as I could.

In any case, I use my laptop more as a laptop, so the deep sleep is more important to me.

Daniel_I · April 4, 2024, 5:04pm

I’m not really a powertop user, but the difference I see between the two images is the MediaTek Wireless device (which I assume is Wifi) at 9.85W versus 110mW.

amdgpu lists at 0mW and 200uW (0.2mW) between the two images.

So to my eye, the difference for the battery drain delta was the Wifi.

I have no doubt that the dGPU has a draw when it’s not sleeping, but if idle D0/D3hot is down in the sub 1W range… that’s likely a more acceptable delta compared to 0mW D3sleep.

What’s also confusing based on what I’ve read (see example quote below) is that simply probing the sysfs files for the dGPU also wakes it up… so if powertop is probing all the devices for details while it’s running, is the dGPU ever in D3sleep during the power monitoring?

jared_kidd · April 4, 2024, 6:37pm

I’m not here to try to prove the facts to you regarding the dGPU using more power when awake. On my laptop and in my use case, I’m just saying keeping it awake all the time is going to significantly reduce my time on battery and it is not an acceptable workaround for me.

Trying to argue that the wifi card is somehow using more power when the dGPU is awake doesn’t change the base problem.

powertop does not wake up the dGPU; that’s why I am using it to check. I just did another check while watching it with the dGPU in D3cold, powertop showing ~20w. Then I launched nvtop (which DOES wake it up) and saw powertop shoot up to ~27w with the dGPU in D0.

All these simple checks are enough proof for me. Does the dGPU somehow cause the wifi card to draw more power? Maybe, but then having the gpu sleep is still preferred so my wifi power usage drops back down as well.

Thank you for all the helpful suggestions though. They assisted in narrowing down the quirks of that rear usbc port. Hopefully Framework sees this and takes note to help make it more useful for us.

Daniel_I · April 6, 2024, 4:36pm

As a new FW16 laptop owner, I am no expert, but I am willing to explore and learn… especially in areas where I have a related issue… like the dGPU port not working as expected.

More times than not, when I chime in on forum threads I end up learning something too in the exchange, so I like to confirm what I see prior to proposing my ideas/theories. Any proposed theories of course end up being either debunked or confirmed through dialogue, leading to learning either way.

In the end, if our dialogue assisted in narrowing down the quirks of that rear usbc port, then I’ll consider that an advancement of learning. Not mine, but hey, this wasn’t my post.

For right now, I’m going to focus on the support ticket opened for my partner’s Batch 6 FW 16 Laptop (arrived a week or so after my Batch 5)… as hers arrived with a damaged/misshapen input module header. The input module plug will not stay connected, leading to a “connect your input module plug into its header” message on power up… making it a very expensive paperweight for her right now until (I’m guessing) she is deemed innocent of creating/causing the header issue.

jared_kidd · April 11, 2024, 4:10am

Through more testing I can simplify the issue as these three undesirable functions which depend on power state:

D3cold - connect - D0: results in dGPU lock.
– plugging a monitor into the rear dGPU usb-c port while the power state is “D3cold” will not detect the external monitor(s). Then, if the dGPU tries to go active (switch from D3cold to D0), it will lock up, and system fans will quickly ramp up to full blast, shooting hot air out the exhaust vents of the laptop until it is powered down. The only fix is to power off and let it sit for at least 20-30 seconds.

Force D0 - connect: results in external working, disconnect will allow D3cold again.
– Holding the dGPU in the D0 (on) state and then plugging into the rear port will work. However, if you make a mistake and it goes back into D3cold, and you reconnect, fans will ramp up to max, blasting heat out, and it will be in the same state as above.

connect - boot D0: results in external working, but D3cold is permanently blocked.
– booting up with the rear port connected to a monitor will also work, but the dGPU will never go to sleep again, even after the monitor is disconnected. It remains in D0 state until the next power cycle.

Obtaining power status is from watching the sysfs file:

/sys/class/drm/card1/device/power_state
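The check can be wrapped in a small script that reads the state before plugging anything into the rear port. This is a minimal sketch: the `connect_advice` helper and the `STATE_FILE` override are invented for illustration, and the messages only restate the three scenarios listed above, nothing more authoritative.

```shell
#!/bin/sh
# Path from this post; the dGPU may be a different card number elsewhere.
# STATE_FILE is an illustrative override, not a standard variable.
STATE_FILE=${STATE_FILE:-/sys/class/drm/card1/device/power_state}

# connect_advice STATE (hypothetical helper): print what the scenarios in
# this thread suggest for connecting a monitor in the given power state.
connect_advice() {
    case "$1" in
        D0)     echo "dGPU awake (D0): connecting worked in these tests" ;;
        D3cold) echo "dGPU asleep (D3cold): connecting led to a lockup in these tests" ;;
        *)      echo "state '$1' was not covered by these tests" ;;
    esac
}

if [ -r "$STATE_FILE" ]; then
    connect_advice "$(cat "$STATE_FILE")"
else
    echo "cannot read $STATE_FILE"
fi
```

To keep an eye on transitions while testing, `watch -n 1 cat /sys/class/drm/card1/device/power_state` does the same job interactively.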

None of these bugs are desirable in the slightest, and they make using the rear port problematic at best. How could all these issues have made it past testing? The fact the gpu gets hot and causes the fans to ramp to max is worrying. Is Framework not going to comment on this? Is my machine the only one that is affected by this issue?

Daniel_I · April 12, 2024, 8:35pm

thinking back to my initial dGPU port experience, I was using the laptop post OS install without any external monitor connected.

When I connected a DP module with a cable to my monitor, it stayed blank, so I did not experience any fan ramp up like your scenario #1… however, when I rebooted with the monitor connected to the dGPU, it did display on the desktop (but not at SDDM/login; I found a way to force xrandr early to fix that, but that is a red herring to your issue).

I might have triggered scenario #2 through my conky, and perhaps when I rebooted with the monitor connected, it might have woken up the dGPU… can’t be sure.

scenario #3 is how I’ve left my system… However, since I cannot get a game to run properly with an external monitor on the dGPU port (x11 and Wayland had different issues/symptoms), I’ve had to abandon the port for now and plug my external monitor into a side port (iGPU).

In short, and if my dGPU “issues” can be considered “baseline”, your fan ramp-up/lockups are definitely concerning; not what I’ve experienced.

Not sure if that points to the dGPU/port or the cable/module used. @Matt_Hartley is this something you could provide some feedback on?

jared_kidd · April 12, 2024, 9:00pm

This makes sense. And to clarify, the fan will not ramp up until I try to utilize the dGPU. At that point, it seems to switch out of D3cold and into D0, then freeze.

Yes, and since you are forcing the dgpu on with that kernel parameter, it avoids situation #1 and #2. This is probably the best route to take to work around the problem.

Not sure about this. I can say, in my testing, the dGPU port works beautifully as long as I avoid the D3cold situation while connecting the monitors. I have a total of 3x1440p@144hz monitors that all worked when connected to a 3-port dongle connected to that rear port. Played quite a bit of Helldivers 2 on the middle of the three this way without any problems. All functionality also works when connected to one of the side display-supporting ports as well.

Thank you for troubleshooting this as well, Daniel. Hopefully together we can get this situation sorted out. I just think not very many people have tested this port yet. Hoping it’s not only a problem with mine.

Silntknight · April 25, 2024, 4:50am

I’m also experiencing the same issue where the USB port on the dGPU module won’t provide display output when a monitor is connected after boot, but it will when plugged in at boot. I’ve also noticed two other things:

Peripherals that are plugged into my monitor (a keyboard and mouse, in my case) are passed through to my laptop even though the monitor doesn’t detect an input signal
If the rear USB display output is working (e.g., after booting with it connected) and I unplug then replug it after a few seconds, it also doesn’t provide video output. Curiously, when attempting to check the power state of the dGPU (as seen in Use of USB port on GPU Expansion Module - #22 by Daniel_I) with cat /sys/class/drm/card1/device/power_state, it remains in D0

I also ran a test where I:

Booted with the monitor connected to the rear USB port (video output working)
Disconnected the port after a little while (power state still D0)
Reconnected the monitor (no video output, but power state still D0)
Attempted to run nvtop

In my case, the system did not hang, but the external monitor suddenly came to life. I haven’t yet tried this without the monitor connected at boot, where the power state may be D3cold instead of D0. I’ll give that a try later.

Regardless, though, I agree that this behavior is pretty undesirable. The expectation would be that when a monitor is connected to the port, it will behave similarly to any other display-output-capable port.

EDIT: Forgot to add some system details

Ubuntu 22.04 LTS with Wayland

Silntknight · April 25, 2024, 5:16am

Well, this test has been somewhat interesting. I attempted to replicate @jared_kidd’s scenario #1, but I was unable to.

Attempt 1:

Boot laptop with no peripherals connected
A minute after logging in, connect monitor (not launching anything else)
Check device power state (reported as D0)

At first, the external monitor didn’t do anything except pass through the peripherals connected to it, but after ~15 seconds, it suddenly turned on and started displaying as expected. I thought I might have triggered the D0 power state by checking, so I tried again.

Attempt 2:

Boot laptop with no peripherals connected
A minute after logging in, connect monitor

It happened again where simply waiting ~15-20 seconds caused everything to work as expected.

I’m not entirely sure why my dGPU didn’t display D3cold at all, but perhaps it could be caused by:

Autolaunch of Steam or Discord (or Guake)
Enabling fractional scaling (and running at 125% with no external display connected and 150% with an external display connected)
3.03 BIOS update
Launching into “Balanced” power setting
Leaving the magnetic USB-C adapter in the rear USB port (though this really shouldn’t do anything since it’s just a pass-through adapter)

Regardless, aside from taking a bit too long before displaying on the external monitor, I’m not seeing any major issues, like system hanging or complete failure to output to the external display. I just wasn’t waiting as long during my prior attempts.
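To put a number on that delay instead of eyeballing it, the DRM connector status files can be polled. The following is a rough sketch with made-up helper names (`count_connected`, `wait_for_new_display`); it assumes connectors appear as /sys/class/drm/cardN-NAME/status containing either "connected" or "disconnected". Whether that status updates at all while the dGPU is asleep is part of what this thread is trying to find out.

```shell
#!/bin/sh
# count_connected DRM_DIR (hypothetical helper): print how many connectors
# under DRM_DIR currently report "connected".
count_connected() {
    n=0
    for status in "$1"/card*-*/status; do
        [ -r "$status" ] || continue
        # -x matches the whole line, so "disconnected" is not counted.
        if grep -qx connected "$status"; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# wait_for_new_display DRM_DIR BASELINE TIMEOUT_SECONDS (hypothetical):
# poll once per second until more than BASELINE connectors are connected.
wait_for_new_display() {
    elapsed=0
    while [ "$elapsed" -lt "$3" ]; do
        if [ "$(count_connected "$1")" -gt "$2" ]; then
            echo "new display after ${elapsed}s"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "no new display within ${3}s"
    return 1
}

# Record the baseline before plugging anything in.
echo "connected displays right now: $(count_connected /sys/class/drm)"
```

Usage would be to store the baseline with `baseline=$(count_connected /sys/class/drm)`, plug in the monitor, and then run `wait_for_new_display /sys/class/drm "$baseline" 30`.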

jared_kidd · April 25, 2024, 4:10pm

I sent a ticket to Framework last week and they have confirmed the issue I am experiencing on their end. So hopefully a fix is in the pipeline.

You aren’t mentioning it, so not sure if your dGPU is in D3cold in any of your tests. You will not be able to replicate the issue I am describing unless your dGPU is able to sleep (D3cold). The reason for testing after a fresh boot is so it is actually in D3cold. Otherwise, the bug keeps it in D0 once you have stuff connected to that rear port.

Silntknight · April 27, 2024, 11:12pm

@jared_kidd, I mentioned in that post (in the paragraph with all the bullet points) that the dGPU never seemed to enter D3cold.

I did try with a cold boot as well, but I think I’ve identified why my dGPU doesn’t enter D3cold. I noticed that my wireless earbud case doesn’t turn on its lights when I flip it open, and I have the same magnetic USB-C adapter on it as I do on the rear USB-C port on my Framework 16. When I removed the magnetic adapter, my earbud case lit up as expected, so I suspect USB-C ports may have plug detection that doesn’t rely on an upstream/downstream device.

I’ll try testing again without the magnetic adapter installed at boot. If the dGPU shows D3cold, then this should confirm that the reason mine wasn’t entering that state in the previous tests is due to the adapter being connected.

Silntknight · April 28, 2024, 6:04pm

I tested again and I am fully unable to replicate your results, but I did find something else on my end.

With nothing connected to the laptop (all expansion cards removed, no adapters in any ports), at boot, I get the following terminal output

user@host:~$ cat /sys/class/drm/card0/device/power_state 
D3cold
user@host:~$ cat /sys/class/drm/card1/device/power_state 
D0

I guess my system is defaulting to the dGPU? I’m not sure why card0 is in D3cold but card1 is in D0 after booting.

Having said that, in this configuration, the following steps produce a few unexpected results:

Connect external display to USB-C on dGPU – Nothing happens, even after ~30 seconds
Run nvtop – Screen scaling changes, recognizing the second monitor, but does not display output to it, though peripherals are passed through
Disconnect external display – Internal keyboard no longer functions, but trackpad works

I can’t get the system to hang, probably because I can’t even get card1 to be in D3cold at all, but the non-functional keyboard input is confusing. Even if I reconnect the external display with its attached keyboard, neither keyboard works.

jared_kidd · April 29, 2024, 12:57am

No, in this case your dGPU is card0, not card1. You are probably running Ubuntu (debian).

EDIT: looked up and see “Ubuntu 22.04 LTS with Wayland”. So yeah, yours are reversed.
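Since the card numbers can differ between installs, it may be safer to list every card with its PCI address and power state rather than assume card1 is the dGPU. A sketch, with invented helper names (`is_card_node`, `list_cards`); it assumes each card's `device` entry is a symlink to its PCI device, so the printed address can be matched against `lspci` output to tell the iGPU and dGPU apart.

```shell
#!/bin/sh
# is_card_node NAME (hypothetical helper): succeed for "card0", "card1", ...
# and fail for connector entries such as "card0-eDP-1".
is_card_node() {
    case "$1" in
        card*[!0-9]*) return 1 ;;   # something other than digits after "card"
        card[0-9]*)   return 0 ;;
        *)            return 1 ;;
    esac
}

# list_cards [DRM_DIR] (hypothetical helper): print one line per GPU card.
list_cards() {
    for path in "${1:-/sys/class/drm}"/card*; do
        name=${path##*/}
        is_card_node "$name" || continue
        dev="$path/device"
        # Last component of the resolved symlink is the PCI address.
        pci=$(basename "$(readlink -f "$dev")")
        id=$(cat "$dev/device" 2>/dev/null || echo "n/a")
        state=$(cat "$dev/power_state" 2>/dev/null || echo "n/a")
        echo "$name pci=$pci device_id=$id power_state=$state"
    done
}

list_cards
```

Whichever card the dGPU turns out to be is the one whose power_state should be watched in the tests above.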

Silntknight · April 29, 2024, 2:29am

So Arch and Ubuntu list the cards oppositely? I would not have expected that…

I think the root of the freezing (on your end) may be Arch-specific. I am seeing some issues, namely the dropout of keyboard input from any source, but not the system freeze so I’m thinking there’s an underlying issue but it manifests in an OS-specific way.

jared_kidd · April 29, 2024, 6:18am

My understanding is that it’s just Debian distros that are backwards. The issue is not Arch-specific at all. Read up: I posted that Framework (and I) have already verified the problem on Fedora.

Matt_Hartley · May 2, 2024, 11:11pm

Connecting dGPU to the external display using dGPU port on the back? If this is what you are looking to do, there is a bug where the external display remains black ONLY when connected to the dGPU USB-C slot (HDMI or DP expansion card).

I do have a workaround, which uses a quick instance of running nvtop then closing it, all hidden and behind the scenes.

Basically it does a little udev magic to scream “Look, new USB is attached! Better run nvtop for a few seconds then close nvtop” - I do this as timeout 2 nvtop.

echo -e '#!/bin/bash\n\necho "USB device connected. Running nvtop for 2 seconds."\ntimeout 2 nvtop\necho "nvtop run completed."\n' | sudo tee /usr/local/bin/external_video.sh > /dev/null && sudo chmod +x /usr/local/bin/external_video.sh && echo 'ACTION=="add", SUBSYSTEM=="usb", RUN+="/usr/local/bin/external_video.sh"' | sudo tee /etc/udev/rules.d/99-external_video.rules > /dev/null && sudo udevadm control --reload-rules && sudo udevadm trigger

Paste this in, reboot, attach HDMI/DP to USB-C adapter, display will come online post login and after 2 seconds of being logged in. Kludgy, yes. Why not call up IDs for the devices vs any USB device? Compatibility.
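For readers who prefer to see what the one-liner creates before pasting it, here is the same pair of files written into a scratch directory for review. This is a sketch only: the `write_workaround_files` helper and the staging step are additions for illustration, while the file contents and install paths are the ones from the command above.

```shell
#!/bin/sh
# write_workaround_files DEST (hypothetical helper): write the helper script
# and the udev rule from the one-liner into DEST so they can be inspected.
write_workaround_files() {
    dest=$1
    mkdir -p "$dest" || return 1

    # Script that udev runs: start nvtop briefly to wake the dGPU.
    cat > "$dest/external_video.sh" <<'EOF'
#!/bin/bash

echo "USB device connected. Running nvtop for 2 seconds."
timeout 2 nvtop
echo "nvtop run completed."
EOF
    chmod +x "$dest/external_video.sh"

    # Rule that fires the script whenever any USB device is added.
    cat > "$dest/99-external_video.rules" <<'EOF'
ACTION=="add", SUBSYSTEM=="usb", RUN+="/usr/local/bin/external_video.sh"
EOF
}

staging=$(mktemp -d)
write_workaround_files "$staging"
echo "Review the files in $staging, then install them with:"
echo "  sudo install -m 755 $staging/external_video.sh /usr/local/bin/external_video.sh"
echo "  sudo install -m 644 $staging/99-external_video.rules /etc/udev/rules.d/99-external_video.rules"
echo "  sudo udevadm control --reload-rules"
```

Removing the workaround later should just be a matter of deleting those two installed files and reloading the udev rules again.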

This has not been tested with gaming as I only tested this to activate the display.

Now the original issue appears to vary some, as there are issues with lockups. But this may be due to the device going into a power save state. This may address that issue as well.