Just an update: I have gotten vLLM at least nominally running. It’s still not for the faint of heart, but it is at least possible now:
- This was only tested with TheRock nightly builds of ROCm. I’d suggest using the latest one: TheRock/RELEASES.md at main · ROCm/TheRock · GitHub
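For reference, a minimal sketch of pulling a nightly ROCm from TheRock’s pip index into a fresh venv. The `gfx1151` index is what I understand to be the Strix Halo target, but confirm the current index URL against RELEASES.md before using it:

```shell
# Sketch: install a TheRock nightly ROCm build via pip (check
# TheRock's RELEASES.md for the current index URL and targets).
python3 -m venv ~/venv-rocm
source ~/venv-rocm/bin/activate
# gfx1151 = Strix Halo; swap for your GPU family if different.
python -m pip install \
  --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ \
  "rocm[libraries,devel]"
```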
- TheRock’s PyTorch did not work for me. You can reference TheRock’s external build scripts: TheRock/external-builds/pytorch at main · ROCm/TheRock · GitHub, but I had to do a bunch of my own work here: strix-halo-testing/torch-therock at main · lhl/strix-halo-testing · GitHub. This is a script that works for me (WFM), but there are a lot of moving parts, so you will probably need to put in some elbow grease
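The general shape of that step, heavily hedged (the `build.sh` entry point below is illustrative; check the repo for the actual script name and the env vars it expects):

```shell
# Sketch: build your own PyTorch against a TheRock nightly ROCm.
# Repo is real; the build script name here is a placeholder for
# whatever entry point the repo actually provides.
git clone https://github.com/lhl/strix-halo-testing
cd strix-halo-testing/torch-therock
# Expect to edit paths and environment variables for your setup
# before this works; there are a lot of moving parts.
./build.sh
```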
- Then there’s building vLLM itself. Note that if you use TheRock’s version of PyTorch, vLLM segfaults immediately at the moment, so you can’t skip the previous step of building your own torch. Even then, I found that some models don’t run, but I didn’t extensively test what worked and what didn’t. These scripts are rougher, but they basically serve as documentation for how you would, in principle, get vLLM working: strix-halo-testing/vllm at main · lhl/strix-halo-testing · GitHub
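A rough sketch of the vLLM build itself, assuming the self-built ROCm PyTorch is already installed in the active venv (the `PYTORCH_ROCM_ARCH` value is my assumption for Strix Halo; see the strix-halo-testing/vllm scripts for the details that actually mattered for me):

```shell
# Sketch: build vLLM from source against an already-installed ROCm torch.
# --no-build-isolation makes the build use the torch in the current venv
# rather than pulling in a CUDA wheel during the build.
git clone https://github.com/vllm-project/vllm
cd vllm
# gfx1151 targets Strix Halo (assumption; verify for your hardware).
export PYTORCH_ROCM_ARCH=gfx1151
python -m pip install -e . --no-build-isolation
```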
I made a dedicated thread for discussing PyTorch and vLLM on the Framework Desktop (Strix Halo): PyTorch w/ Flash Attention + vLLM for Strix Halo