Issues with getting a PCIe device detected using v4 BIOS on the 7940HS

Ever since upgrading to any v4.XX BIOS on my 7940HS, my GPU is no longer detected at all when using the OCuLink 8i board. I decided to open a separate post so as to stop filling up the threads of other projects. We previously talked in the MXM board thread.

To summarize, I upgraded to v4.03 again in order to do more testing. I then ran dmem 0xfed815a0 4 -MMIO in the EFI Shell, which only returned 0x0000E500. Then I rebooted into my Arch Linux install and ran the script from this post Framework 16 to MXM Gpu - V0.1 Prototype design - #246 by James3, and it returned 0x00A40000, so that should hopefully be correct? The GPU is still not detected within Windows 11, but I decided to check the lspci output:

01:00.0 VGA compatible controller: NVIDIA Corporation AD104 [GeForce RTX 4070] (rev a1)
01:00.1 Audio device: NVIDIA Corporation AD104 High Definition Audio Controller (rev a1)

And the GPU was literally there (it’s nowhere to be seen in Device Manager when I switch to Windows 11), but I couldn’t use nvidia-smi due to:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
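
As an aside, the same register can also be read from a booted Linux system rather than the EFI Shell. A minimal sketch, assuming the 0xfed815a0 address from the dmem step above and that /dev/mem access is permitted on the running kernel (the read_mmio32 helper name is just illustrative):

```python
# Sketch: read the 32-bit register at physical address 0xfed815a0
# (address taken from the EFI `dmem` step) through /dev/mem. Needs
# root and a kernel without strict /dev/mem restrictions; returns
# None when the read is not possible.
import mmap
import struct

PAGE = 0x1000  # assume 4 KiB pages

def read_mmio32(phys_addr: int):
    try:
        with open("/dev/mem", "rb") as f:
            # Map the page containing the register, read-only.
            mem = mmap.mmap(f.fileno(), PAGE, mmap.MAP_SHARED,
                            mmap.PROT_READ, offset=phys_addr & ~(PAGE - 1))
            (val,) = struct.unpack_from("<I", mem, phys_addr & (PAGE - 1))
            return val
    except OSError:
        return None

val = read_mmio32(0xFED815A0)
print(f"0x{val:08X}" if val is not None else "cannot read /dev/mem")
```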

I then decided to check dmesg:

[    4.479214] nvidia: loading out-of-tree module taints kernel.
[    4.479222] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    4.566784] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[    4.569421] nvidia 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[    4.571510] NVRM: The NVIDIA GPU 0000:01:00.0
               NVRM: (PCI ID: 10de:2786) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[    4.571566] nvidia 0000:01:00.0: probe with driver nvidia failed with error -1
[    4.572066] NVRM: The NVIDIA probe routine failed for 1 device(s).
[    4.572068] NVRM: None of the NVIDIA devices were initialized.
[    4.573231] nvidia-nvlink: Unregistered Nvlink Core, major device number 511

And then, a bit later:

[    4.989955] nvidia-nvlink: Nvlink Core is being initialized, major device number 508

[    4.990171] i2c i2c-20: Successfully instantiated SPD at 0x50
[    4.993170] i2c i2c-20: Successfully instantiated SPD at 0x51
[    4.993303] piix4_smbus 0000:00:14.0: Auxiliary SMBus Host Controller at 0xb20
[    4.993958] i2c i2c-22: Successfully instantiated SPD at 0x50
[    4.994451] i2c i2c-22: Successfully instantiated SPD at 0x51
[    4.994621] nvidia 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[    4.995048] pci_bus 0000:66: busn_res: [bus 66] is released
[    4.995524] NVRM: The NVIDIA GPU 0000:01:00.0
               NVRM: (PCI ID: 10de:2786) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[    4.995579] nvidia 0000:01:00.0: probe with driver nvidia failed with error -1
[    4.995694] pci_bus 0000:67: busn_res: [bus 67] is released
[    4.996315] NVRM: The NVIDIA probe routine failed for 1 device(s).
[    4.996318] NVRM: None of the NVIDIA devices were initialized.
[    4.996554] pci_bus 0000:68: busn_res: [bus 68] is released
[    4.996697] pci_bus 0000:69: busn_res: [bus 69] is released
[    4.996918] pci_bus 0000:6a: busn_res: [bus 6a] is released
[    4.997034] pci_bus 0000:65: busn_res: [bus 65-6a] is released
[    4.997213] nvidia-nvlink: Unregistered Nvlink Core, major device number 508

Then something regarding Intel?

[    5.061085] snd_hda_intel 0000:01:00.1: Unable to change power state from D0 to D0, device inaccessible
[    5.062675] cros-charge-control cros-charge-control.4.auto: Framework charge control detected, preventing load
[    5.064864] snd_hda_intel 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
[    5.065635] snd_hda_intel 0000:01:00.1: Disabling MSI
[    5.065644] snd_hda_intel 0000:01:00.1: Handle vga_switcheroo audio client
[    5.206657] hdaudio hdaudioC1D7: no AFG or MFG node found
[    5.207080] snd_hda_intel 0000:01:00.1: no codecs initialized
[    5.208033] snd_hda_intel 0000:01:00.1: GPU sound probed, but not operational: please add a quirk to driver_denylist

And then yet again:

[    5.312256] nvidia-nvlink: Nvlink Core is being initialized, major device number 508

[    5.312643] intel_rapl_common: Found RAPL domain package
[    5.313352] intel_rapl_common: Found RAPL domain core
[    5.313633] mt7921e 0000:03:00.0: WM Firmware Version: ____000000, Build Time: 20251118163234
[    5.313811] amd_atl: AMD Address Translation Library initialized
[    5.314524] input: HD-Audio Generic HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:08.1/0000:c3:00.1/sound/card2/input40
[    5.315600] nvidia 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[    5.316259] NVRM: The NVIDIA GPU 0000:01:00.0
               NVRM: (PCI ID: 10de:2786) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[    5.316302] nvidia 0000:01:00.0: probe with driver nvidia failed with error -1
[    5.317288] NVRM: The NVIDIA probe routine failed for 1 device(s).
[    5.317290] NVRM: None of the NVIDIA devices were initialized.
[    5.317840] nvidia-nvlink: Unregistered Nvlink Core, major device number 508
[    5.320166] snd_pci_ps 0000:c3:00.5: enabling device (0000 -> 0002)

It seems like it just cannot communicate with the GPU for some odd reason? But like I said, I have no issues with BIOS v3.07, so it feels like it might be something specific to v4.XX.

Ok, so some progress.
If lspci worked at some point, the PCIe lanes trained up and transferred the content that lspci returns.
I.e. “VGA compatible controller: NVIDIA Corporation AD104 [GeForce RTX 4070] (rev a1)”
So the PCIe lanes were up, but the dmesg log appears to show them disappearing again.
So, my guess here would be problems with signal quality.

Maybe try “sudo lspci -vv” and look for the LnkSta lines.
That will give an idea as to what speeds it is managing to link train at.
The PCIe lanes are PCIe Gen 4, but you can force them down to PCIe Gen 3 simply by unplugging the power adapter from the FW16 laptop.
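
For reference, the speed and width in that line come straight out of the LnkSta register bits. A quick sketch of the decoding (the 0x0084 example value is illustrative, not from this machine):

```python
# Decode a raw PCIe LnkSta register value (16 bits at capability
# offset 0x12): bits 3:0 hold the current link speed, bits 9:4 the
# negotiated link width. This mirrors what lspci -vv prints after
# "LnkSta:".
SPEEDS = {1: "2.5GT/s", 2: "5GT/s", 3: "8GT/s", 4: "16GT/s", 5: "32GT/s"}

def decode_lnksta(raw: int) -> str:
    speed = SPEEDS.get(raw & 0xF, "unknown")
    width = (raw >> 4) & 0x3F
    return f"Speed {speed}, Width x{width}"

# A Gen 4 x8 link, i.e. what a healthy link on this setup should show:
print(decode_lnksta(0x0084))  # Speed 16GT/s, Width x8
```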

Unfortunately it does not contain any relevant info about the link speed:

01:00.0 VGA compatible controller: NVIDIA Corporation AD104 [GeForce RTX 4070] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Gigabyte Technology Co., Ltd Device 40ee
	!!! Unknown header type 7f
	Interrupt: pin ? routed to IRQ 144
	IOMMU group: 14
	Region 0: Memory at 90000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at 7c00000000 (64-bit, prefetchable) [size=16G]
	Region 3: Memory at 8000000000 (64-bit, prefetchable) [size=32M]
	Region 5: I/O ports at a000 [size=128]
	Expansion ROM at 91080000 [disabled] [size=512K]
	Kernel modules: nouveau, nvidia_drm, nvidia

01:00.1 Audio device: NVIDIA Corporation AD104 High Definition Audio Controller (rev a1) (prog-if 00 [HDA compatible])
	Subsystem: Gigabyte Technology Co., Ltd Device 40ee
	!!! Unknown header type 7f
	Interrupt: pin ? routed to IRQ 47
	IOMMU group: 14
	Region 0: Memory at 91000000 (32-bit, non-prefetchable) [size=16K]
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

Unfortunately, even this does nothing. I did check that the BIOS setting to switch to Gen 3 on battery is enabled, but I still get the same output after unplugging and rebooting.

That is it failing to read the PCIe configuration space. So, again: problems with PCIe link training, or problems with PCIe link signal quality.
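
A config-space read that fails like that comes back as all-ones, which is why lspci reports header type 7f. You can confirm it by reading the vendor ID, the first word of config space; a sketch using the standard sysfs config file (the helper names are mine):

```python
# Check whether a PCI function still answers config-space reads. A
# device that has "fallen off the bus" returns all-ones, so its
# vendor ID reads back as 0xFFFF.
from pathlib import Path

def read_vendor_id(bdf: str) -> int:
    # First two bytes of config space, little-endian vendor ID.
    cfg = Path(f"/sys/bus/pci/devices/{bdf}/config").read_bytes()
    return int.from_bytes(cfg[0:2], "little")

def looks_dead(vendor_id: int) -> bool:
    return vendor_id == 0xFFFF

# A healthy RTX 4070 would return NVIDIA's vendor ID, 0x10DE:
print(looks_dead(0x10DE), looks_dead(0xFFFF))
```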

But why does this happen only with a different BIOS version? I’ve never had any issues with v3.07.

I don’t know regarding the BIOS.
You can try unbinding the driver and binding it again, to try to force a reset.
If I were diagnosing this, I would put the vector scope on and check the eye diagrams.
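
The unbind/remove/rescan cycle can be scripted through the standard sysfs interface. A sketch that is dry-run by default, so it only prints the equivalent shell commands (the BDF is the GPU address from the lspci output; run the real thing as root):

```python
# Attempt to force a reset of a PCI device: unbind its driver, remove
# the device node, then rescan the bus so the kernel re-enumerates it
# and reassigns BARs. Dry-run mode prints the equivalent shell
# commands instead of writing to sysfs.
from pathlib import Path

def pci_reset_cycle(bdf: str, dry_run: bool = True) -> None:
    dev = Path("/sys/bus/pci/devices") / bdf
    steps = []
    if (dev / "driver").exists():
        steps.append((dev / "driver" / "unbind", bdf))   # detach driver
    steps.append((dev / "remove", "1"))                  # drop device
    steps.append((Path("/sys/bus/pci/rescan"), "1"))     # re-enumerate
    for path, value in steps:
        if dry_run:
            print(f"echo {value} > {path}")
        else:
            path.write_text(value)  # needs root

pci_reset_cycle("0000:01:00.0")
```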

Here is lspci -vv on BIOS v3.07 when I run

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glmark2

to put a load on it, so that it kicks up to PCIe 4.0 x8 (I can also confirm this within CPU-X).

It is indeed going up from 2.5 GT/s to the 16 GT/s that PCIe 4.0 gives when I put a load on it. I don’t know what makes it not want to work with the newer BIOS.

nvidia-smi also gives me a working regular output:

Sat Feb 14 11:26:10 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   40C    P0             51W /  200W |      31MiB /  12282MiB |     94%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           66379      G   glmark2                                   8MiB |
+-----------------------------------------------------------------------------------------+

I would really appreciate help from the Framework team on this one, as it’s puzzling.

I tried an OCuLink cable that’s half the length of the one I usually use, and it gives the same result on the v4 BIOS. I still highly doubt it’s related to the board or cable, though, since I get the full 4.0 bandwidth on v3.07 with both cables. It seems like something just broke, or I’m doing something wrong, on the v4 BIOS.

I am unsure if this is a hardware or firmware issue at this point, since it’s just odd… How am I getting a full 4.0 link with one version while the link completely fails on another?

What’s even weirder is that a GPU on the other end is detected for a tester with a 7840HS, so it might perhaps be a 7940HS-specific issue? It did require some BIOS resets and battery disconnects (the GPU just wasn’t being detected, even though it was fully working on v3.07, so there’s still something funky going on where it just “forgets” that a PCIe device is connected), but they did get it running after multiple attempts. I attempted the same, but to no avail.

I finally got around to connecting a Pi Pico to the EC console so I could send the gpucfg console command over. And what I noticed between 3.07 and 4.03 is this:

3.07

[77363.511800 GPU Descriptor Valid]
[77363.512500 From EEPROM]
[77363.513100 Header: V:0.1 HW:0x8 SN:FRAOCULINKTERRAILS CRC:0xD0C39EA8]
[77363.514600 Len: 92 Dcrc32:0x116E0D9D]
[77363.515400 SN: FRAOCULINKTERRAILS]
[77363.516200 MMIO GPU_CONTROL=0x0]
[77363.516900 MMIO GPU_TYPE =0x4]
[77363.517600 Interposer]
[77363.519200 LEFT: 12, RIGHT 12 RAW 2134, 2134]
[77363.520200 GPIOS]
[77363.520700 GPIO0 0]
[77363.521400 GPIO1 0]
[77363.522000 GPIO2 1]
[77363.522700 GPIO3 0]
[77363.523300 S5_INT 1]
[77363.523900 ALERTn 1]
[77363.524800 EDPMUX 0]
[77363.525400 SSDMUX 1]
[77363.526100 VSYSEN 0]
[77363.526700 VADP_EN 0]
[77363.527300 FAN_EN 1]
[77363.528000 GPUPWR_EN 1]
[77363.528600 ECPWM_EN 0]
[77363.529300 ALW_EN 1]
[77363.530000 BAY DOOR Closed]
[77363.530600 5VALW_REQ 0x06]

4.03

[104.767600 GPU Descriptor Valid]
[104.768200 From EEPROM]
[104.768600 Header: V:0.1 HW:0x8 SN:FRAOCULINKTERRAILS CRC:0xD0C39EA8]
[104.769700 Len: 92 Dcrc32:0x116E0D9D]
[104.770400 SN: FRAOCULINKTERRAILS]
[104.771000 MMIO GPU_CONTROL=0x1]
[104.771600 MMIO GPU_TYPE =0x4]
[104.772400 Interposer]
[104.773900 LEFT: 12, RIGHT 12 RAW 2136, 2134]
[104.774900 GPIOS]
[104.775200 GPIO0 0]
[104.775700 GPIO1 0]
[104.776200 GPIO2 1]
[104.776700 GPIO3 0]
[104.777200 S5_INT 1]
[104.777700 ALERTn 1]
[104.778200 EDPMUX 0]
[104.778700 SSDMUX 1]
[104.779200 VSYSEN 0]
[104.779700 VADP_EN 0]
[104.780100 FAN_EN 1]
[104.780700 GPUPWR_EN 0]
[104.781100 ECPWM_EN 0]
[104.781600 ALW_EN 1]
[104.782100 BAY DOOR Closed]
[104.783100 5VALW_REQ 0x06]

It seems like GPUPWR_EN is 0 on 4.03, but that shouldn’t play a role. I’m more interested in GPU_CONTROL being 0x1 on 4.03 and 0x0 on 3.07.

It seems like bit 0 is SET_APU_MUX for that piece of memory, specifically host_get_memmap(EC_CUSTOMIZED_MEMMAP_GPU_CONTROL) within the EC code.
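
To make the comparison concrete, here is the bit test in question as a small illustrative decode; SET_APU_MUX as bit 0 is my reading of the EC source, and the other bits of GPU_CONTROL are unknown to me:

```python
# Illustrative decode of the GPU_CONTROL memmap byte reported by the
# gpucfg EC console command. Bit 0 is SET_APU_MUX (per my reading of
# the EC source); the meaning of the remaining bits is unknown.
SET_APU_MUX = 1 << 0

def describe_gpu_control(value: int) -> dict:
    return {"SET_APU_MUX": bool(value & SET_APU_MUX)}

print(describe_gpu_control(0x0))  # 3.07 reading: {'SET_APU_MUX': False}
print(describe_gpu_control(0x1))  # 4.03 reading: {'SET_APU_MUX': True}
```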

I tried changing this value from within the EC code and flashing my laptop with it, but unfortunately it didn’t change anything. I did also try using different PCIe bifurcation configurations in the firmware on my board: the board starts working immediately if I use 1x4 or 2x4. It seems to be specific to 1x8 configurations, and it’s driving me insane, since everything immediately starts working again when I revert to the 3.07 BIOS.

I did just try flashing the EC firmware from the 3.07 BIOS, but unfortunately I still get the same behavior. Looks like it’s something deeper in the BIOS itself and not the EC.

I wish I had another x8 device to test this with, but unfortunately all I have are x4 or less and those work without issues.

Sadly this cannot be done, as the driver is not even loaded. Attempting to load it with modprobe just throws a “no such device” error.

Guess this was enough debugging to determine that it’s pointless to upgrade past 3.07 for now.

Hi all, I do not know if this could be helpful, but I managed to get OCuLink 8i to work on both 3.07 and 4.03, and I just wanted to share the exact steps I did. What I did was to:

  • get it to work on 3.07 with the steps mentioned in the other thread;
  • then update to 4.03, where I was asked to also update the keyboard firmware, which I did NOT do;
  • then, after updating the BIOS, it did not work, so I updated the keyboard firmware;
  • then I tried again, and it did not work. I restarted about 3 times, kept trying the error 43 fixer, BIOS resets, and switching the battery off in the BIOS, back and forth. Nothing changed;
  • and then, after an additional restart and 15 seconds in Windows 11, it just worked.

So what I did was a bit messy, and I do not know why it worked; the only thing I did was reiterate the steps. I cannot imagine how this could help, but I wanted to share anyway. I have the 7840HS, and got it to work with Windows 11 Home and an RTX 3090.

The oddest thing to me is why it requires so many restarts and battery disconnects to get it to work, when nothing but the BIOS version changed.

Maybe if I kept resetting and rebooting it would start working for me as well, but I tried many times with no success.

I would be willing to send a board to someone who has a fast enough oscilloscope and the required know-how to check the signal integrity. The catch is that the person would also need to have the FW16, an OCuLink cable and the x16 adapter. And an oscilloscope of that kind probably means no hobbyist but someone actually working as an electrical engineer, as I see no way a hobbyist would willingly spend thousands or even tens of thousands on an oscilloscope that can view 16 GT/s signals.

Once you go above 8 GHz, things get considerably more difficult to test. Simply placing a test probe on the board modifies its behaviour.
Also, above 8 GHz, the cost of the test equipment jumps up considerably.
I used to have access to do these sorts of tests, but I moved teams and don’t have access any more.