I just got my FD and played with it nicely for a few days installing Gentoo Linux.
Then today, during a routine check of the boot messages, dmesg | grep amdgpu gave me this:
[ 5.194369] amdgpu: Virtual CRAT table created for CPU
[ 5.194378] amdgpu: Topology: Add CPU node
[ 5.194494] amdgpu 0000:d5:00.0: enabling device (0006 -> 0007)
[ 5.194552] amdgpu 0000:d5:00.0: amdgpu: initializing kernel modesetting (IP DISCOVERY 0x1002:0x1586 0xF111:0x000A 0xC1).
[ 5.194579] amdgpu 0000:d5:00.0: amdgpu: register mmio base: 0x90200000
[ 5.194580] amdgpu 0000:d5:00.0: amdgpu: register mmio size: 1048576
[ 5.196586] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 0 <common_v1_0_0> (soc21_common)
[ 5.196588] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 1 <gmc_v11_0_0> (gmc_v11_0)
[ 5.196589] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 2 <ih_v6_0_0> (ih_v6_1)
[ 5.196590] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 3 <psp_v13_0_0> (psp)
[ 5.196591] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 4 <smu_v14_0_0> (smu)
[ 5.196592] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 5 <dce_v1_0_0> (dm)
[ 5.196594] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 6 <gfx_v11_0_0> (gfx_v11_0)
[ 5.196595] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 7 <sdma_v6_0_0> (sdma_v6_0)
[ 5.196596] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 8 <vcn_v4_0_5> (vcn_v4_0_5)
[ 5.196597] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 9 <jpeg_v4_0_5> (jpeg_v4_0_5)
[ 5.196597] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 10 <mes_v11_0_0> (mes_v11_0)
[ 5.196598] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 11 <vpe_v6_1_0> (vpe_v6_1)
[ 5.196599] amdgpu 0000:d5:00.0: amdgpu: detected ip block number 12 <isp_v4_1_1> (isp_ip)
[ 5.196620] amdgpu 0000:d5:00.0: amdgpu: ACPI VFCT table present but broken (too short #2),skipping
[ 5.198763] amdgpu 0000:d5:00.0: ROM [??? 0x00000000 flags 0x20000000]: can't assign; bogus alignment
[ 5.198766] amdgpu 0000:d5:00.0: amdgpu: Unable to locate a BIOS ROM
[ 5.198766] amdgpu 0000:d5:00.0: amdgpu: Fatal error during GPU init
[ 5.198768] amdgpu 0000:d5:00.0: amdgpu: amdgpu: finishing device.
[ 5.198793] amdgpu 0000:d5:00.0: probe with driver amdgpu failed with error -22
Previously, the amdgpu module was loading its firmware successfully.
The last things I did were recompiling the kernel a couple of times and switching Quiet Boot in the BIOS from Enabled to Disabled and back. Now I don’t know which of these led to this disaster.
The same thing happened on the first day of playing with it, but I reloaded the BIOS defaults and compiled a stock kernel with the config from SystemRescue, which fixed the issue and got the amdgpu module loading successfully.
So I strongly believe that a kernel configuration issue caused it, but this time doing the same thing again did not help.
Something told the kernel a bad location for the firmware blobs. There are two core artifacts of a kernel build: the vmlinuz kernel binary and the initrd.img initial RAM filesystem, which contains the tools to configure hardware and mount filesystems. When you built the initramfs for your custom kernel, the steps you followed or the script you used failed to copy the firmware from /usr/lib/firmware (or similar, depending on your distribution), so it wasn’t available to the amdgpu driver at startup.
(You can check this: once your custom kernel has booted and the full filesystem is mounted, remove the amdgpu module with modprobe -r amdgpu and re-insert it with modprobe amdgpu; it will then use the firmware available on your mounted root filesystem.)
The kernels on all my machines don’t have an initramfs and never had one. Everything they need to boot from disk is built in: the NVME driver and the BTRFS file system. The rest is in modules, so they can each load their required blobs from /lib/firmware.
When I boot SystemRescue and check dmesg | grep amdgpu everything is loaded fine, no errors, the GPU is initialized properly. So, no hardware, BIOS or GRUB issues and what remains is the kernel.
Obviously something is wrong with this kernel config, because the issue persists. The other modules load their blobs just fine: amdnpu, the Realtek network driver and the Mediatek wireless driver. Only amdgpu fails.
I didn’t write ‘diff that duff kernel config against the SystemRescue one’ because the cpio stage that creates the initramfs seemed the more obvious failure point to me. I guess you’ll have to check what’s different between those two kernel configs.
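If it helps, here is a rough Python sketch of that comparison. It assumes both configs are in the usual .config text format (CONFIG_X=value lines and "# CONFIG_X is not set" comments); the function names are made up for this post. The kernel tree also ships scripts/diffconfig, which does a similar job.

```python
import re

# Matches "CONFIG_FOO=value" and "# CONFIG_FOO is not set" lines of a kernel .config.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value}; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(good_text, bad_text, only=None):
    """Return sorted (option, good_value, bad_value) tuples that differ.

    Options missing from one file are treated as 'n'.
    `only` is an optional substring filter, e.g. "AMD" or "FW_LOADER".
    """
    good, bad = parse_config(good_text), parse_config(bad_text)
    rows = []
    for name in sorted(set(good) | set(bad)):
        if only and only not in name:
            continue
        g, b = good.get(name, "n"), bad.get(name, "n")
        if g != b:
            rows.append((name, g, b))
    return rows
```

Read the SystemRescue config and your own into two strings and pass them to diff_configs(), optionally with only="AMD" or only="FW" to narrow the output. If the running kernel was built with CONFIG_IKCONFIG_PROC, its config can be read from /proc/config.gz.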
I still want to know what happens if you remove and reload the amdgpu module once you’re sure there’s a filesystem with firmware available.
To get that ‘ACPI VFCT table present but broken (too short #2),skipping’ error, look at drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c:428 (e.g. in Linus’ tree). You can see that it completes the while loop picking through the IP blocks but bounces off the block above it, starting from line 409, because it fails to match one of those checks for the hardware.
When I unload the amdgpu module on a running kernel with /lib/firmware available, the screen goes black, I can’t do anything anymore, and I have to reboot.
I also got some hints from the Gentoo forum, but no luck yet.
If I ever find a solution I’ll post it here… who knows, maybe somebody else will stumble on the same thing.
Please post to pastebin or similar the full dmesg from start of boot to this problem point.
One really needs all the context to make sense of this problem.
The difference between SystemRescue’s kernel and mine at the initialization part of amdgpu is that SystemRescue includes one line that my kernel doesn’t:
amdgpu 0000:c3:00.0: amdgpu: detected ip block number 5 <dce_v1_0_0> (dm)
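A small Python sketch for making that comparison mechanical, assuming you have saved the dmesg output of both kernels as text. It only looks at the "detected ip block number" lines and ignores the PCI address, since that can differ between captures; the function names are hypothetical.

```python
import re

# Matches e.g. "detected ip block number 5 <dce_v1_0_0> (dm)" regardless of PCI address.
_IP_BLOCK = re.compile(r"detected ip block number (\d+) <([^>]+)>")


def ip_blocks(dmesg_text):
    """Return the ordered list of IP block names found in a dmesg capture."""
    return [m.group(2) for m in _IP_BLOCK.finditer(dmesg_text)]


def missing_blocks(reference_text, other_text):
    """IP blocks present in the reference log but absent from the other one."""
    other = set(ip_blocks(other_text))
    return [name for name in ip_blocks(reference_text) if name not in other]
```

Calling missing_blocks() with the SystemRescue log as the reference and the custom kernel's log as the other argument lists whatever the custom kernel did not detect.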
I will try that once I have exhausted the advice I get on the Gentoo forum, if that leads nowhere. Doing that, and whatever comes after extracting the VBIOS, is somewhat beyond my expertise.
Anyway, my thinking is that two days ago I had a perfectly functional amdgpu module (with no errors) and I broke it with some stupid configuration change that I don’t remember.
That bug you mentioned should not be exposed on the same kernel source from one day to the next.
I did not mean for you to go any further than getting the BIOS and hexdumping it. The vbios.c just does a few extra validation checks, so it helps to check that as well.
If the hexdump shows what looks like a sensible BIOS, it just means that the BIOS is fine and the problem is with the amdgpu driver.
If the BIOS hexdump looks wrong, you might need to fix that bit instead.
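For a quick sanity check on a dumped image before reading the hexdump by eye, here is a minimal Python sketch. The 0x55 0xAA signature is the standard PCI expansion ROM header. The ATOM check (a 16-bit header pointer at 0x48, with the magic four bytes into the header) is my reading of what amdgpu_bios.c validates, so treat that layout as an assumption. The function names are made up for illustration.

```python
def looks_like_option_rom(data: bytes) -> bool:
    """True if the image starts with the PCI expansion ROM signature 0x55 0xAA."""
    return len(data) >= 2 and data[0] == 0x55 and data[1] == 0xAA


def has_atom_magic(data: bytes) -> bool:
    """True if the ATOM magic sits where the header pointer says it should.

    Assumption: 16-bit little-endian header offset at 0x48, magic "ATOM"
    (or byte-swapped "MOTA") 4 bytes into that header.
    """
    if len(data) < 0x4A:
        return False
    header = int.from_bytes(data[0x48:0x4A], "little")
    magic = data[header + 4 : header + 8]
    return magic in (b"ATOM", b"MOTA")


def hexdump(data: bytes, length: int = 64) -> str:
    """Format the first `length` bytes as rows of 16 hex bytes with an offset column."""
    rows = []
    for off in range(0, min(length, len(data)), 16):
        chunk = data[off : off + 16]
        rows.append(f"{off:08x}  " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(rows)
```

Read the dump in binary mode and pass the bytes to these functions. An image that fails looks_like_option_rom() is almost certainly not a usable ROM; one that passes both checks at least has the right shape.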
Reading that, I am not exactly sure how that fixed the problem. I don’t understand the logical reasoning steps that get from those boot params changing to it fixing this problem.
I would be interested to hear what happens if one adds those back in. Does it break again or not?
I was thinking more along the lines of this being some sort of race condition that has only shown itself as a result of compiling all the modules in as “y” instead of using an initrd, as most distros do, i.e. the user is using a less well-tested method.
And about using an initramfs: of course there are good reasons for using one, as explained in pietinger’s tutorials mentioned in the Gentoo forum thread. However, if you have a working kernel with a minimal set of modules just to serve your machine and don’t use disk encryption, you can get along fine without one, provided the kernel has built in the elements it needs to boot: the disk drivers (NVMe, SATA etc.) and the file systems.
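As a sketch of that requirement, the following Python checks a .config for options that must be =y (not =m) to boot without an initramfs. The REQUIRED list is an assumption for an NVMe disk with a Btrfs root, as on my machines; adjust it to your own hardware. The function name is made up.

```python
def not_built_in(config_text, required):
    """Return the options from `required` that are not set to 'y' in a kernel .config."""
    built_in = set()
    for line in config_text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and value == "y":
            built_in.add(name)
    return [opt for opt in required if opt not in built_in]


# Assumed minimum for an NVMe + Btrfs root without an initramfs.
REQUIRED = ["CONFIG_BLK_DEV_NVME", "CONFIG_BTRFS_FS"]
```

An empty result means everything on the list is built in; anything returned is either a module or disabled and would not be available before the root filesystem is mounted.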
Distributions are made to serve the needs of all; a custom kernel can be tailored to yours alone.
For example, all distribution kernels are built for a generic CPU and don’t take advantage of your machine’s instruction set. So why should I use one with a powerful new Strix Halo?