I’m also trying to get the motherboard to work with dGPUs.
@Lincoln_Chen did you end up abandoning the use of the PCIe 4.0 x4 slot? I may have misunderstood, but it seemed like you were last using the NVMe M.2 slot(s) instead?
@Hrothmund how has that ADT-Link adapter worked for you?
In my experiments, I’m using the ADT-Link R23A-AMP (x4 to x16) and then a normal GPU riser, but the connection isn’t stable and keeps dropping to PCIe Gen 1.0. Forcing Linux to stick to Gen 1.0 solves the renegotiation issue, but I’m still seeing GPU usage (and power draw) spike for no reason.
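For anyone chasing similar downtraining, here’s a quick way to see what the link has actually negotiated versus what the slot/card is capable of (the `01:00.0` address is a placeholder — substitute your dGPU’s address from the first command):

```shell
# Find the dGPU's bus address (pattern matches common GPU class strings).
lspci | grep -iE 'vga|3d controller'

# Compare capability vs. current state for that device.
# LnkCap = what the link can do; LnkSta = what it negotiated right now
# (2.5GT/s = Gen1, 8GT/s = Gen3, 16GT/s = Gen4). If they differ, it downtrained.
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'

# Same info straight from sysfs (note the 0000: domain prefix):
cat /sys/bus/pci/devices/0000:01:00.0/current_link_speed
cat /sys/bus/pci/devices/0000:01:00.0/current_link_width
```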
@Lincoln_Chen Thanks! To clarify, was this the one that ended up working OK in the end? I’ve ordered a different one but if it doesn’t work I may try this one out of desperation.
I have been very busy the last couple of months, and I haven’t had a chance to finalize the design. I had most of it done, but I had to allocate my time elsewhere for a while. I didn’t even get a chance to assemble my custom Framework motherboard build until my Christmas break. I have ordered all of the components, though, and they are sitting in the manufacturer’s warehouse ready for production. I will have a lot more time for the next little while, so I should be able to finish it and test it out.
I never got it working properly, so I set it aside for other projects for the time being. I have a lot to learn on the software side anyway - even if it did work, I’m not sure if I’d know how to configure my llama docker container to offload some layers to the dGPU.
At this point I’m waiting for a firmware update that addresses issues with dGPUs, or for more people to report that they got such a setup working, before I revisit it.
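On the software side, for what it’s worth: with llama.cpp the layer offload is mostly just the `-ngl` / `--n-gpu-layers` flag, and in Docker you additionally pass the GPU through to the container. A rough sketch, assuming the upstream CUDA server image — the image tag, model path, and layer count here are examples, not something I’ve verified on this board:

```shell
# Sketch: llama.cpp's CUDA server image with a partial layer offload to the dGPU.
# --n-gpu-layers controls how many transformer layers land on the GPU;
# a large value like 999 effectively means "all of them".
# Requires the NVIDIA Container Toolkit for --gpus to work.
docker run --rm --gpus all \
  -v /opt/llm-models:/models \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/gemma-3-27b-it-q4_0.gguf \
  --n-gpu-layers 20 \
  --host 0.0.0.0 --port 8080
```

If the model doesn’t fully fit in VRAM, you’d tune the layer count down until it does; the rest stays on system RAM/CPU.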
After pre-ordering my FW Desktop, I read through this thread many times, and to be honest, I was worried about my homelab plan using a MAX 395 with an RTX 3070. I’m happy to report that I finally received my Framework Desktop mainboard, and with the help of a 10cm PCIe 4.0 x4 to x4 riser card (without supplemental power), my setup is working flawlessly. I’m running CachyOS and getting a PCIe 4.0 x4 link with resizable BAR enabled. I haven’t encountered any boot issues so far.
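In case it helps others confirm the same thing: one way to sanity-check resizable BAR from Linux is to look at the GPU’s memory regions in lspci — with ReBAR active, one of the BARs should roughly cover the card’s whole VRAM rather than the classic 256MB window. The `01:00.0` address below is a placeholder:

```shell
# With ReBAR working, expect one Region with a size near the card's VRAM
# (e.g. [size=8G] on an 8GB 3070) and a "Physical Resizable BAR" capability.
sudo lspci -vv -s 01:00.0 | grep -E 'Region|Resizable'
```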
I did some research into your question. Perhaps disaggregated prefill is an option. In vLLM there is an option to run two vLLM instances: one for the prefill phase (assigned to the dGPU; the Strix Halo is weaker at prompt processing than a dedicated graphics card or the DGX Spark), the other for the decode phase (assigned to the Strix Halo’s GPU, or even its CPU). The prefill phase is more compute-heavy and the decode phase is more memory-intensive. (see vllm/docs/features/disagg_prefill.md at main · vllm-project/vllm · GitHub ) The same technique is used by exo to combine various devices over the network. (see VPMTvC7faJE on YouTube; sorry, I can’t post more than two links)
The Framework Desktop only has a PCIe 4.0 x4 interface, so bandwidth is limited. This technique constantly sends the result of the prefill phase over to the decode phase, so I think the low PCIe bandwidth is less noticeable there. In vLLM it is possible to distribute the load over multiple GPUs, but due to the memory intensiveness of the decode phase and the PCIe bottleneck, that’s not really an option here. (correct me if I’m wrong)
A bonus is that vLLM normally does not allow using CUDA and ROCm at the same time. By splitting this into two vLLM instances, I’m hoping that becomes possible.
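For concreteness, this is roughly what the two-instance setup looks like in vLLM’s upstream disaggregated-prefill example (the model name is just an example, and the flags follow that example). One caveat: the `PyNcclConnector` used there is NCCL-based, and whether it can actually bridge a CUDA prefill instance and a ROCm decode instance is exactly the open question — treat this as a starting point, not a verified recipe:

```shell
# Prefill instance (dGPU, CUDA side) - acts as the KV-cache producer.
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8100 \
  --kv-transfer-config \
  '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'

# Decode instance (Strix Halo side) - consumes the transferred KV cache.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8200 \
  --kv-transfer-config \
  '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'
```

The upstream example then fronts both instances with a small proxy that routes each request through prefill first, then decode.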
Another option could be ik_llama.cpp ( GitHub - ikawrakow/ik_llama.cpp: llama.cpp fork with additional SOTA quants and improved performance ). But does it support two GPUs?
Update: I tried the 3090 again today, and it worked fine. Variables that have changed:
-I ordered a short cable from ADT-Link. The first one was 20cm, this one is 5cm. I had to bend the longer cable, and it’s pretty stiff, so that might have messed something up. I still had to bend the shorter one a bit - if I had to do it over again, I’d order a 3cm cable.
-When I removed the 3090 a few months ago, I ripped out the NVIDIA drivers and blacklisted nouveau. When I reinstalled the card today, Fedora seems to have auto-installed the latest NVIDIA drivers.
I’m still experimenting, but it seems to take bloody forever to load a model - several minutes. Not sure if that’s due to the x4 PCIe link or something else. Once it’s loaded, the performance is decent:
hrethric@delphi:~/llama.cpp$ ./build/bin/llama-bench -m /opt/llm-models/gemma-3-27b-it-q4_0.gguf -p 512 -n 128 -ngl 999
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
I realise this is unrelated, but someone in this thread was talking about making a custom heatsink. Looks like the guys at NFC might’ve made an adapter plate to allow fitting standard cooler mounts (see video here).