Framework Desktop: amdgpu unable to locate a BIOS ROM

I just got my Framework Desktop (FD) and spent a few enjoyable days installing Gentoo Linux on it.

Then today, during a routine check of the boot messages, dmesg | grep amdgpu gave me this:

[ 5.194369] amdgpu: Virtual CRAT table created for CPU
[ 5.194378] amdgpu: Topology: Add CPU node
[ 5.194494] amdgpu 0000:d5:00.0: enabling device (0006 -> 0007)
[ 5.194552] amdgpu 0000:d5:00.0: amdgpu: initializing kernel modesetting (IP DISCOVERY 0x1002:0x1586 0xF111:0x000A 0xC1).
[ 5.194579] amdgpu 0000:d5:00.0: amdgpu: register mmio base: 0x90200000
[ 5.194580] amdgpu 0000:d5:00.0: amdgpu: register mmio size: 1048576
[ 5.196586] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 0 <common_v1_0_0> (soc21_common)
[ 5.196588] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 1 <gmc_v11_0_0> (gmc_v11_0)
[ 5.196589] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 2 <ih_v6_0_0> (ih_v6_1)
[ 5.196590] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 3 <psp_v13_0_0> (psp)
[ 5.196591] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 4 <smu_v14_0_0> (smu)
[ 5.196592] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 5 <dce_v1_0_0> (dm)
[ 5.196594] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 6 <gfx_v11_0_0> (gfx_v11_0)
[ 5.196595] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 7 <sdma_v6_0_0> (sdma_v6_0)
[ 5.196596] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 8 <vcn_v4_0_5> (vcn_v4_0_5)
[ 5.196597] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 9 <jpeg_v4_0_5> (jpeg_v4_0_5)
[ 5.196597] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 10 <mes_v11_0_0> (mes_v11_0)
[ 5.196598] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 11 <vpe_v6_1_0> (vpe_v6_1)
[ 5.196599] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 12 <isp_v4_1_1> (isp_ip)
[ 5.196620] amdgpu 0000:d5:00.0: amdgpu: ACPI VFCT table present but broken (too short #2),skipping
[ 5.198763] amdgpu 0000:d5:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
[ 5.198766] amdgpu 0000:d5:00.0: amdgpu: Unable to locate a BIOS ROM
[ 5.198766] amdgpu 0000:d5:00.0: amdgpu: Fatal error during GPU init
[ 5.198768] amdgpu 0000:d5:00.0: amdgpu: amdgpu: finishing device.
[ 5.198793] amdgpu 0000:d5:00.0: probe with driver amdgpu failed with error -22

Previously, the amdgpu module was loading its firmware successfully.

The last things I did were re-compiling the kernel a couple of times and changing Quiet Boot in the BIOS from Enabled to Disabled and back. Now I don’t know which of these led to this disaster.

The same thing happened on the first day of playing with it, but I reloaded the BIOS defaults and re-compiled a stock kernel with the config from SystemRescue, which fixed the issue and got the amdgpu module loading successfully.

So I strongly believe that a kernel configuration issue caused it, but this time doing the same thing again did not help.

Has anybody had this issue and solved it?

Something told the kernel a bad location for the firmware blobs. There are two core artifacts of a kernel build: the vmlinuz kernel binary and the initrd.img initial RAM filesystem, which contains the tools to configure hardware and mount filesystems. When you built the initramfs for your custom-built kernel, the steps you followed or the script you used failed to copy the firmware from /usr/lib/firmware (or similar, depending on your distribution), so it wasn’t available to your amdgpu driver at boot.

(You might check this: once the full filesystem is mounted, much later in the boot of your custom-built kernel, remove the amdgpu module with modprobe -r amdgpu and re-insert it with modprobe amdgpu; it will then use the firmware available on your mounted root filesystem.)

Kenny, thanks for your input.

The kernels on all my machines don’t have an initramfs and never had one. Everything they need to boot from disk is built in: the NVME driver and the BTRFS file system. The rest is in modules, so they can each load their required blobs from /lib/firmware.

When I boot SystemRescue and check dmesg | grep amdgpu everything is loaded fine, no errors, the GPU is initialized properly. So, no hardware, BIOS or GRUB issues and what remains is the kernel.

Obviously something is wrong with this kernel config, because the issue persists. The other modules load their blobs just fine: amdnpu, the Realtek network driver and the Mediatek wireless driver. Only amdgpu fails.

It still beats me why this happens.

I didn’t write ‘diff that duff kernel config against the SystemRescue one’ because the cpio stage for creating initramfs seems more obvious to me as a failure point. I guess you’ll have to check what’s different between those two kernel configs.
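A minimal sketch of that config comparison, assuming both configs are available as plain text files (the paths in the usage line are placeholders; on a running kernel built with CONFIG_IKCONFIG_PROC the config can be read from /proc/config.gz). The helper name config_diff is made up for illustration. The kernel source tree also ships scripts/diffconfig for the same job.

```shell
# Print CONFIG_ options that differ between a known-good and a failing
# kernel config. Lines are sorted first so that ordering differences
# between the two files do not show up as noise.
config_diff() {
    good_sorted=$(mktemp) || return 2
    bad_sorted=$(mktemp) || { rm -f "$good_sorted"; return 2; }
    # Keep "CONFIG_X=value" and "# CONFIG_X is not set" lines only.
    grep -E '^(CONFIG_|# CONFIG_)' "$1" | sort > "$good_sorted"
    grep -E '^(CONFIG_|# CONFIG_)' "$2" | sort > "$bad_sorted"
    diff "$good_sorted" "$bad_sorted"
    status=$?
    rm -f "$good_sorted" "$bad_sorted"
    # Like diff: 0 means identical, 1 means differences were printed.
    return "$status"
}

# Usage (placeholder paths):
#   config_diff systemrescue.config /usr/src/linux/.config | grep -i -E 'acpi|pci|drm'
```

Lines starting with "<" come from the first (good) file and lines starting with ">" from the second.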

I still want to know what happens if you remove and reload the amdgpu module once you’re sure there’s a filesystem with firmware available.

To get that “ACPI VFCT table present but broken (too short #2),skipping” error, look at (e.g. in Linus’ tree) drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c:428 in the Linux kernel: execution completes the while loop that picks through the table’s image entries without ever entering the block that starts at line 409, i.e. no entry passes all of those checks for the hardware:

393:	while (offset < tbl_size) {
...
409:		if (vhdr->ImageLength &&
				vhdr->PCIBus == adev->pdev->bus->number &&
				vhdr->PCIDevice == PCI_SLOT(adev->pdev->devfn) &&
				vhdr->PCIFunction == PCI_FUNC(adev->pdev->devfn) &&
				vhdr->VendorID == adev->pdev->vendor &&
				vhdr->DeviceID == adev->pdev->device) {
					...
			}
		}

428:	dev_info(adev->dev, "ACPI VFCT table present but broken (too short #2),skipping\n");
		return false;

Do you also purge ACPI or load a custom ACPI blob in your kernels? Note that the blob it’s looking at is pulled from ACPI tables at drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c:382 (using a function defined here, drivers/acpi/acpica/tbxface.c:297) and I’d guess that vhdr->ImageLength is zero.
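One way to see what the kernel was actually handed is to look at the raw table. Linux exposes ACPI tables by signature under /sys/firmware/acpi/tables/ (readable by root), and every ACPI table starts with a 4-byte signature followed by a 4-byte little-endian length. The sketch below only reads that length field; the function name acpi_table_length is made up for illustration, and it says nothing about whether the image entries inside the table match the GPU.

```shell
# Print the length field (bytes 4..7, little-endian) of an ACPI table dump.
acpi_table_length() {
    # od prints the four bytes as unsigned decimals; word splitting
    # turns them into the positional parameters $1..$4.
    set -- $(od -An -tu1 -j4 -N4 "$1")
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}

# Usage (as root):
#   acpi_table_length /sys/firmware/acpi/tables/VFCT
```

The result can be compared with the length printed on the "ACPI: VFCT" line early in dmesg. If the two agree, the table itself is visible to the kernel, which would point towards the per-entry checks rather than a missing table.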

I’m so glad I don’t build my own kernels any more – the kconfig system can be a minefield of things going against my assumptions.

When I unload the amdgpu module on a running kernel with /lib/firmware available, the screen goes black, I can’t do anything anymore and I have to reboot.

I also got some hints from the Gentoo forum, but no luck yet.

If I ever find a solution I’ll post it here… who knows, maybe somebody else will stumble on the same thing.

Please post to pastebin or similar the full dmesg from start of boot to this problem point.
One really needs all the context to make sense of this problem.

Full dmesg here.

Also, the full dmesg from SystemRescue, which successfully initializes the GPU, is here: https://paste.gentoo.zip/qs72FRg5

I’ve just got it and am starting to look at both, but I’m not sure what I’m looking for.

The difference between SystemRescue’s kernel and mine at the initialization part of amdgpu is that SystemRescue includes one line that my kernel doesn’t:

amdgpu 0000:c3:00.0: amdgpu: detected ip block number 5 <dce_v1_0_0> (dm)

What does that mean?
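For comparing the two logs line by line, a rough sketch is to strip the parts that always differ (timestamps and the PCI address) and diff what is left. The function name normalize_amdgpu_log and the file names in the usage lines are placeholders, not anything from the two pastes.

```shell
# Print the amdgpu lines of a saved dmesg file with the leading
# "[  4.352057]" timestamp removed and PCI addresses such as
# 0000:d5:00.0 replaced by the placeholder <pci>.
normalize_amdgpu_log() {
    grep 'amdgpu' "$1" |
        sed -E -e 's/^\[ *[0-9]+\.[0-9]+\] *//' \
               -e 's/[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]/<pci>/g'
}

# Usage (placeholder file names):
#   normalize_amdgpu_log dmesg-systemrescue.txt > good.txt
#   normalize_amdgpu_log dmesg-custom.txt > bad.txt
#   diff good.txt bad.txt
```

Note that the placeholder hides one difference that may matter: the SystemRescue line above shows the GPU at 0000:c3:00.0, while the failing log shows it at 0000:d5:00.0, so it is worth looking at the raw addresses as well.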

I guess there is some bug in the amdgpu driver.
The ACPI VFCT appears to be present, so the amdgpu driver should find it.

[    0.003333] ACPI: VFCT 0x0000000079FD0000 004484 (v01 INSYDE EDK2     00000001 ACPI 00040000)
[    0.003371] ACPI: Reserving VFCT table memory at [mem 0x79fd0000-0x79fd4483]

[    4.349898] amdgpu 0000:d5:00.0: amdgpu: ACPI VFCT table present but broken (too short #2),skipping
[    4.352055] amdgpu 0000:d5:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
[    4.352057] amdgpu 0000:d5:00.0: amdgpu: Unable to locate a BIOS ROM
[    4.352058] amdgpu 0000:d5:00.0: amdgpu: Fatal error during GPU init
[    4.352060] amdgpu 0000:d5:00.0: amdgpu: amdgpu: finishing device.

Try running the vbios tool from here:

You should see something like this (The below is from a FW16 7840HS):

hexdump -C vbios*bin |less

00000000  55 aa 21 00 00 00 00 00  00 00 00 00 00 00 00 00  |U.!.............|
00000010  00 00 00 00 00 00 00 00  c0 01 00 00 00 00 49 42  |..............IB|
00000020  4d da 00 00 00 00 00 00  00 00 00 00 00 00 00 04  |M...............|
00000030  20 37 36 31 32 39 35 35  32 30 00 00 00 00 00 00  | 761295520......|
00000040  00 00 00 00 00 00 00 00  94 01 00 00 00 00 00 00  |................|
00000050  30 34 2f 31 38 2f 32 35  2c 30 30 3a 35 34 3a 31  |04/18/25,00:54:1|
00000060  39 00 00 00 00 00 00 00  00 00 00 00 00 00 80 00  |9...............|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  31 31 33 2d 50 48 58 47  45 4e 45 52 49 43 2d 30  |113-PHXGENERIC-0|
00000090  30 31 00 50 48 4f 45 4e  49 58 00 50 43 49 5f 45  |01.PHOENIX.PCI_E|
000000a0  58 50 52 45 53 53 00 44  44 52 35 00 0d 0a 41 4d  |XPRESS.DDR5...AM|
000000b0  44 20 41 4d 44 5f 50 48  4f 45 4e 49 58 5f 47 45  |D AMD_PHOENIX_GE|
000000c0  4e 45 52 49 43 20 20 20  20 20 20 20 20 20 20 20  |NERIC           |
000000d0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
*
000000f0  20 20 20 20 20 20 20 20  20 20 0d 0a 00 0d 0a 20  |          ..... |
00000100  0d 0a 00 28 43 29 20 31  39 38 38 2d 32 30 32 32  |...(C) 1988-2022|
00000110  2c 20 41 64 76 61 6e 63  65 64 20 4d 69 63 72 6f  |, Advanced Micro|
00000120  20 44 65 76 69 63 65 73  2c 20 49 6e 63 2e 00 41  | Devices, Inc..A|
00000130  54 4f 4d 42 49 4f 53 42  4b 2d 41 4d 44 20 56 45  |TOMBIOSBK-AMD VE|
00000140  52 30 32 32 2e 30 31 32  2e 30 30 30 2e 30 32 39  |R022.012.000.029|
00000150  2e 30 30 30 30 30 31 00  50 48 4f 45 4e 49 58 2e  |.000001.PHOENIX.|
00000160  62 69 6e 20 00 30 30 30  30 30 30 30 30 00 30 30  |bin .00000000.00|
00000170  31 34 38 32 31 35 00 20  20 20 20 20 20 20 20 00  |148215.        .|
00000180  41 4d 44 5f 50 48 4f 45  4e 49 58 5f 47 45 4e 45  |AMD_PHOENIX_GENE|
00000190  52 49 43 00 2c 00 02 03  41 54 4f 4d 00 00 00 00  |RIC.,...ATOM....|
000001a0  58 01 e5 01 ac 00 00 00  00 00 00 00 02 10 02 10  |X...............|
000001b0  c0 01 d0 39 00 03 00 00  00 00 00 00 00 02 03 00  |...9............|
000001c0  50 43 49 52 02 10 bf 15  00 00 18 00 00 00 80 03  |PCIR............|
000001d0  21 00 0c 16 00 80 00 00  41 4d 44 20 41 54 4f 4d  |!.......AMD ATOM|
000001e0  42 49 4f 53 00 5f d6 b6  3d 00 00 00 00 00 00 00  |BIOS._..=.......|
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
... snipped

I will try that if the advice I get on the Gentoo forum leads nowhere. Extracting the VBIOS, and whatever comes after that, is somewhat beyond my expertise.

Anyway, my thinking is that two days ago I had a perfectly functional amdgpu module (with no errors) and I broke it doing whatever stupid configuration change I don’t remember.

That bug you mentioned should not be exposed on the same kernel source from one day to the next.

I did not mean for you to go any further than getting the BIOS and hexdumping it. The vbios.c just does a few extra validation checks, so it helps check that as well.
If the hexdump shows what looks like a sensible BIOS, it just means that the BIOS is fine and the problem is with the amdgpu driver.
If the BIOS hexdump looks wrong, you might need to fix that bit instead.

The issue is solved.

Anybody interested in this, please check the thread on the Gentoo forum.

Hey there, I’m glad you got a fix for this.

So these kernel boot params apparently caused the problem (see the Gentoo forum thread linked above):

pci=assign-busses,hpbussize=0x33,hpmemsize=4M,realloc pciehp.pciehp_poll_mode=1

Reading that, I am not exactly sure how removing them fixed the problem. I don’t understand the chain of reasoning that leads from changing those boot params to this problem going away.
I would be interested to hear what happens if one adds them back in. Does it break again or not?

I was thinking more along the lines of this being some sort of race condition that has only shown itself as a result of compiling all the modules in as “y” instead of using the initrd that most distros use, i.e. the user taking a less well-tested path.
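One possible reading, not verified here: pci=assign-busses asks the kernel to assign PCI bus numbers itself instead of keeping the firmware's numbering. The failing log shows the GPU at d5:00.0 while the SystemRescue log shows it at c3:00.0, and the VFCT check quoted earlier compares vhdr->PCIBus with adev->pdev->bus->number, so a renumbered bus would no longer match the table entry. A small sketch for checking whether a given command line carries these options; both function names are made up for illustration.

```shell
# Print the pci=... and pciehp.* parameters found in a kernel
# command line string, one per line.
pci_cmdline_params() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -E '^(pci=|pciehp\.)'
}

# Succeed if the pci= option list contains assign-busses.
has_assign_busses() {
    pci_cmdline_params "$1" | grep -E -q '^pci=(.*,)?assign-busses(,|$)'
}

# Usage on the running system:
#   pci_cmdline_params "$(cat /proc/cmdline)"
#   has_assign_busses "$(cat /proc/cmdline)" && echo "bus renumbering requested"
```

If the option is present, comparing the GPU's address in lspci with and without it would show whether the bus number really changes.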

Yes, I tried adding them back in.

amdgpu fails again.

And about using an initramfs: of course there are good reasons for using one, as explained in pietinger’s tutorials mentioned in the Gentoo forum thread. However, if you have a working kernel with a minimal number of modules, just enough to serve your machine, and you don’t use disk encryption, you can get along fine without one as long as the kernel has built in what it needs to boot: the disk drivers (NVMe, SATA etc.) and the file systems.

Distributions are made to serve the needs of all; a custom kernel can be tailored to yours alone.

For example, distribution kernels are built for a generic CPU and don’t take advantage of the instruction set of your machine. So why should I use one on a powerful new Strix Halo?