Optimal LLM Inference Configuration for Ryzen AI MAX+ 395 on Proxmox with GPU Passthrough — Seeking Community Input


Hi everyone,

We run a 2-node Proxmox cluster with Ryzen AI MAX+ 395 CPUs (124 GB RAM each) for self-hosted LLM inference. GPU passthrough (VFIO) to Ubuntu 24.04 VMs works, but we’re only getting 10.9 tokens/sec on Qwen 3.5 27B with the Vulkan backend. For comparison, AMD’s official guide achieves 9.45 t/s on a 1-trillion-parameter model across 4 bare-metal nodes, so we’re clearly leaving performance on the table. Looking for guidance on the optimal configuration path, especially around VRAM/GTT, ROCm in VMs, and whether Proxmox passthrough introduces overhead on UMA architectures.


Our Setup

Hardware:

  • 2x AMD Ryzen AI MAX+ 395 (16 cores / 32 threads, 124 GB RAM each); testing so far uses only one of the two systems, as described below

  • Integrated Radeon 8060S (GFX1151, RDNA 3.5)

Virtualization:

  • Proxmox VE 9.x cluster (2 nodes)

  • Node 1 (PVE1): KVM VM (VM 101) with GPU passthrough — runs Ollama, LiteLLM, Langfuse, Open WebUI, SearXNG

  • Node 2 (PVE2): LXC container — runs Ollama CPU-only (no GPU passthrough configured yet)

VM 101 Configuration:

  • Ubuntu 24.04 LTS

  • 24 vCPUs, 96 GB RAM

  • GPU passthrough via VFIO (3 PCI devices: display controller, 2x audio)

  • cpu: host, machine: q35, SeaBIOS

  • Each PCI device passed individually (hostpci0/1/2)
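
For reference, the same passthrough config expressed as qm commands (a sketch; the PCI addresses are placeholders for our display controller and two audio functions):

```bash
# Sketch: VM 101 settings as Proxmox qm commands (PCI addresses are placeholders).
qm set 101 --machine q35 --bios seabios --cpu host --cores 24 --memory 98304
qm set 101 --hostpci0 0000:c5:00.0,pcie=1   # display controller
qm set 101 --hostpci1 0000:c5:00.1,pcie=1   # audio function 1
qm set 101 --hostpci2 0000:c5:00.6,pcie=1   # audio function 2
```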

Current BIOS & Kernel:

  • BIOS VRAM: 2 GB (we know this needs to change; see the plan below)

  • Host kernel params: amd_iommu=on iommu=pt initcall_blacklist=sysfb_init amdgpu.gttsize=131072 ttm.pages_limit=33554432

  • VM kernel params: amd_iommu=off amdgpu.gttsize=98304 ttm.pages_limit=33554432
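
For reproducibility, this is how we apply these parameters (a sketch; our hosts boot via GRUB, while systemd-boot installs would use /etc/kernel/cmdline plus proxmox-boot-tool refresh instead):

```bash
# Host (Proxmox): in /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt initcall_blacklist=sysfb_init amdgpu.gttsize=131072 ttm.pages_limit=33554432"
# VM (Ubuntu 24.04): same file, with the VM parameter set listed above.
# Apply and verify:
sudo update-grub && sudo reboot
cat /proc/cmdline
```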

Current GPU detection inside VM:


```
amdgpu: VRAM: 2048M
amdgpu: GART: 512M
amdgpu: 48276M of GTT memory ready
```

Inference stack:

  • Ollama 0.18.2 with OLLAMA_VULKAN=1 (Vulkan backend)

  • HSA_OVERRIDE_GFX_VERSION=11.0.1 (GFX1151 workaround)

  • Flash Attention enabled in Ollama

  • Models: Qwen 3.5 27B, Qwen 2.5 Coder 32B, Qwen 2.5 72B
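
The same environment, as we would export it for a manual run (in practice it lives in a systemd drop-in for the ollama service; OLLAMA_FLASH_ATTENTION is Ollama’s standard flash-attention toggle):

```bash
export OLLAMA_VULKAN=1                 # Vulkan backend
export HSA_OVERRIDE_GFX_VERSION=11.0.1 # GFX1151 workaround
export OLLAMA_FLASH_ATTENTION=1        # flash attention
ollama serve
```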

Models served via:

  • LiteLLM gateway (OpenAI-compatible API, load balancing, per-user keys)

  • Open WebUI (chat frontend)

  • Various API clients and agents


Our Benchmark Results (Vulkan Baseline)

Model: qwen3.5:27b — Ollama 0.18.2, Vulkan backend, 48 GB GTT

| Test | Prompt Tokens | Completion Tokens | Prompt Eval (t/s) | Generation (t/s) | Total Time (s) |
|------|:---:|:---:|:---:|:---:|:---:|
| Short creative text | 26 | 1024 | 87.5 | 10.9 | 94.75 |
| Logic/reasoning puzzle | 80 | 1024 | 118.5 | 10.9 | 95.20 |
| Code generation | 47 | 1024 | 120.3 | 10.9 | 94.92 |
| Long input summarization | 243 | 1024 | 177.6 | 10.8 | 96.16 |
| Average | | | | 10.9 | 381 (total) |

For reference, AMD’s official benchmarks with llama.cpp + ROCm + rocWMMA Flash Attention on a 4-node cluster running Kimi K2.5 (1T parameters, Q2_K_XL) achieve:

  • 9.45 t/s generation (128 tokens, with FA)

  • 8.30 t/s at 8192 context length with FA (vs 3.46 without — 2.4x improvement)

  • 100.77 t/s prompt processing at batch=4096 with FA

Our 27B model is roughly 37x smaller than their 1T model by total parameter count, yet our generation speed is barely faster. This strongly suggests the Vulkan backend is the bottleneck.


What We’ve Learned (and Plan to Do)

After reading AMD’s resources, we’ve identified three key changes:

1. BIOS VRAM: 512 MB (not 64 GB)

AMD’s official cluster guide sets BIOS VRAM to just 512 MB and relies entirely on GTT/TTM kernel parameters for memory allocation (120 GB GTT on 128 GB systems). This was also confirmed in the [Framework community thread](https://community.frame.work/t/ryzen-ai-max-395-how-should-i-configure-the-gpu-vram-from-the-bios-settings/76753/8).

We originally planned to set BIOS VRAM to 64 GB based on earlier Ollama guidance. AMD’s approach seems better — more flexible, and doesn’t lock memory into a fixed VRAM pool.

2. ROCm 7.2 — Now Supports GFX1151

AMD’s ROCm for Ryzen install guide:

  • Ubuntu 24.04 with linux-oem-24.04c kernel (6.14+)

  • amdgpu-install --usecase=rocm --no-dkms

This would replace our Vulkan workaround and the HSA_OVERRIDE_GFX_VERSION hack.
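
Condensed, the install path looks like this (a sketch; the amdgpu-install bootstrap package itself comes from AMD’s repository, which we omit here):

```bash
sudo apt update && sudo apt install -y linux-oem-24.04c   # 6.14+ OEM kernel
sudo reboot
# after reboot, with AMD's amdgpu-install package in place:
sudo amdgpu-install --usecase=rocm --no-dkms   # userspace ROCm, keeps the in-tree amdgpu driver
sudo usermod -aG render,video $USER            # device-node access for compute
rocminfo | grep gfx                            # should report gfx1151
```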

3. llama.cpp + ROCm (HIP) Instead of Ollama

AMD’s guide uses llama.cpp compiled with -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON for rocWMMA-accelerated Flash Attention (build sketch after the feature list below). They also recommend the Lemonade SDK for pre-built gfx1151 binaries.

Key features not available in Ollama:

  • rocWMMA Flash Attention — 2.4x speedup at long contexts

  • --no-mmap — pins models in GTT pool, avoids memory-mapping overhead

  • Batch/ubatch tuning — up to 2x prompt processing improvement

  • RPC multi-node — distribute a single model across multiple machines
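
The build itself is short (a sketch; the two HIP flags are from AMD’s guide, while the gfx target list and repo URL are our assumptions):

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"
```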


Open Questions for the Community

We’d really appreciate input from anyone running Strix Halo / Ryzen AI MAX+ 395 for inference, especially in virtualized environments:

Q1: Does VFIO GPU passthrough add overhead to GTT memory on UMA?

AMD’s guide runs on bare-metal Framework Desktops. We run inside a Proxmox KVM VM with VFIO passthrough. On a UMA architecture where CPU and GPU share the same physical memory:

  • Does the VFIO/IOMMU translation layer add latency to GTT memory accesses?

  • Is there a measurable difference between bare-metal GTT and passthrough GTT?

  • Should we consider running inference on bare metal instead of in a VM?

For context, our VM gets 96 GB of the 124 GB system RAM. The GPU sees the memory through VFIO.

Q2: OEM kernel (6.14) inside a Proxmox VM — any issues?

ROCm for Ryzen requires linux-oem-24.04c (kernel 6.14-1018 or newer). Has anyone run this kernel inside a KVM VM with VFIO GPU passthrough? Concerns:

  • Are there known VFIO/IOMMU regressions in the OEM kernel?

  • Does the --no-dkms ROCm install path work correctly in a VM context?

  • Any interactions between the OEM kernel and Proxmox’s QEMU version?

Q3: GTT size limit in a 96 GB VM?

AMD uses amdgpu.gttsize=120000 on systems with 128 GB RAM. Our VM has 96 GB. We plan to set amdgpu.gttsize=92160 (90 GB, leaving 6 GB for OS/Docker/services).

  • Is 90 GB GTT in a 96 GB VM reasonable, or should we leave more headroom?

  • Does the GTT allocation actually reserve physical memory, or is it a virtual limit?

  • Has anyone experienced OOM issues with aggressive GTT settings in a VM?
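
For our own sanity check, the unit math behind these values (amdgpu.gttsize is in MiB, ttm.pages_limit in 4 KiB pages):

```bash
echo $((90 * 1024))                  # 92160    -> amdgpu.gttsize for a 90 GiB pool
echo $((90 * 1024 * 1024 / 4))       # 23592960 -> a matching ttm.pages_limit
echo $((33554432 * 4 / 1024 / 1024)) # 128 GiB  -> our current pages_limit, sized
                                     #             for bare-metal 128 GB, not a 96 GB VM
```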

Q4: Ollama ROCm vs llama.cpp HIP — does Ollama include rocWMMA FA?

We currently use Ollama because it provides convenient model management (pull, serve, multi-model). If we switch to ROCm:

  • Does Ollama’s ROCm backend leverage rocWMMA Flash Attention, or is that a llama.cpp build-time option only?

  • Has anyone benchmarked Ollama ROCm vs llama.cpp HIP on GFX1151 specifically?

  • If we need to switch to llama.cpp for performance, llama-server provides an OpenAI-compatible API — has anyone used it as a production Ollama replacement?
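
For context, this is roughly what that replacement would look like (a sketch; the model path is hypothetical, the flags are standard llama-server options):

```bash
# llama-server exposes an OpenAI-compatible API under /v1, so LiteLLM can
# treat it like any other OpenAI backend. -ngl 99 offloads all layers to
# the iGPU; --no-mmap pins weights in the GTT pool instead of memory-mapping.
./build/bin/llama-server \
  -m /models/qwen3.5-27b-q6_k.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 --no-mmap -c 8192
```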

Q5: Multi-node RPC through Proxmox — feasible?

Our second Proxmox node has the same hardware but currently runs CPU-only in an LXC container. If we set up GPU passthrough on PVE2 as well:

  • Can llama.cpp RPC work across two VMs on separate Proxmox hosts over a standard Ethernet link?

  • AMD uses 5 Gbps Ethernet for their 4-node cluster. Our nodes are on the same 192.168.21.0/24 subnet (1 Gbps). Is this a bottleneck for RPC inference?

  • What’s the minimum network bandwidth needed for RPC to be worthwhile vs. running separate Ollama instances with load balancing?
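
For reference, the RPC topology we have in mind (a sketch; IP, port, and model path are placeholders on our 192.168.21.0/24 subnet):

```bash
# On the PVE2 VM (worker):
./build/bin/rpc-server --host 0.0.0.0 --port 50052
# On the PVE1 VM (driver), splitting one model across both machines:
./build/bin/llama-server -m /models/qwen2.5-72b-q4_k_m.gguf \
  --rpc 192.168.21.12:50052 -ngl 99 --host 0.0.0.0 --port 8080
```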

Q6: BIOS VRAM 512 MB vs. larger — does it matter with passthrough?

On bare metal, 512 MB BIOS VRAM + large GTT is the validated path. But with VFIO passthrough:

  • Does the passed-through GPU see the BIOS VRAM allocation correctly?

  • Our current 2 GB BIOS VRAM shows up correctly in the VM (amdgpu: VRAM: 2048M). Will 512 MB also pass through properly?

  • Is there any scenario where a larger BIOS VRAM helps with passthrough specifically (e.g., GPU initialization, ROM loading)?

Q7: Second GPU passthrough (PVE2) — KVM vs LXC?

PVE2 currently runs an LXC container. For GPU passthrough:

  • Should we switch PVE2 to a KVM VM as well (like PVE1)?

  • Has anyone gotten Strix Halo iGPU passthrough working into an LXC container? (We’re aware LXC passthrough is more limited, but the overhead is lower.)

  • If we go KVM on both nodes, what’s the realistic performance cost of running Ollama/llama.cpp in a VM vs bare metal on this hardware?


What We’d Like to Achieve

  1. Maximize generation throughput on Qwen 3.5 27B (our primary model) — currently 10.9 t/s, hoping for 20+ t/s with ROCm + FA

  2. Utilize both GPUs — PVE2’s Radeon 8060S is currently idle

  3. Keep the LiteLLM/Open WebUI stack — we need the API gateway, multi-user auth, and audit logging, so whatever inference backend we choose must expose an OpenAI-compatible API

  4. Minimize maintenance complexity — we’re a small team, so simpler is better

Any experience, benchmarks, or advice from people running Strix Halo in virtualized inference setups would be incredibly helpful. Thanks!


Hardware: 2x AMD Ryzen AI MAX+ 395, 124 GB RAM each, Proxmox VE 9.x, Ubuntu 24.04 VMs (planned expansion to eight systems in total)

Current stack: Ollama 0.18.2 (Vulkan) + LiteLLM + Langfuse + Open WebUI

Models: Qwen 3.5 27B, Qwen 2.5 Coder 32B, Qwen 2.5 72B


Kimi K2.5 uses an MoE architecture: it has 1.1T total parameters but only 32B active ones.
Qwen3.5 27B is a dense model.

A rough comparison of inference compute cost at similar quants is therefore 32B (active) vs. 27B, i.e. you should expect Kimi K2.5 to be only ~20% slower on similar systems at similar quantization.

You didn’t mention which quantization of Qwen3.5 27B you used for this. If it was something like Q6, the weights are 3x bigger than the Q2 that Kimi K2.5 uses in your example. Read this as “3x slower”.

Hope this helps.
I am getting 9.45 t/s at Q6; that’s why I picked that quant as an example. :)
