AMD Strix Halo Llama.cpp Installation Guide for Fedora 42
This guide walks you through setting up LLM inference on AMD Ryzen AI Max “Strix Halo” integrated GPUs using Fedora 42.
Prerequisites
- Fresh Fedora 42 installation
- AMD Ryzen AI Max processor (Strix Halo)
- At least 128 GB RAM recommended for large models
- Internet connection
Step 1: Configure Kernel Parameters
These parameters enable unified memory and optimal GPU performance.
1.1 Add Kernel Parameters Using grubby
Fedora 42 uses grubby to manage kernel parameters. Run this single command:
sudo grubby --update-kernel=ALL --args='amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432'
What these parameters do:
- amd_iommu=off - Disables IOMMU for lower latency
- amdgpu.gttsize=131072 - Enables unified GPU/system memory (128 GiB)
- ttm.pages_limit=33554432 - Allows large pinned memory allocations (128 GiB)
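If you want to adapt these values to a different memory budget, note that amdgpu.gttsize is specified in MiB and ttm.pages_limit in 4 KiB pages. A quick shell arithmetic check using the values from the command above:
# gttsize in MiB -> GiB: 131072 / 1024 = 128
echo $((131072 / 1024))
# pages_limit in 4 KiB pages -> GiB: 33554432 * 4 / 1024 / 1024 = 128
echo $((33554432 * 4 / 1024 / 1024))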
1.2 Verify the Parameters Were Added
sudo grubby --info=ALL | grep args
You should see your added parameters in the output.
1.3 Reboot
sudo reboot
1.4 Verify Kernel Parameters (After Reboot)
cat /proc/cmdline
You should see your added parameters in the output.
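To check all three parameters at once, a small loop over /proc/cmdline works (just a convenience sketch):
# Report each expected parameter as OK or MISSING
for p in amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432; do
    grep -qF "$p" /proc/cmdline && echo "OK: $p" || echo "MISSING: $p"
done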
Step 2: Configure BIOS Settings
Before proceeding, configure your BIOS:
- Reboot and enter BIOS setup
- Find GPU memory allocation settings
- Set the dedicated GPU memory allocation to the minimum (512 MB), so the remaining RAM stays available to the GPU as unified (GTT) memory
- Save and exit
Step 3: Install Toolbx
Toolbx should be pre-installed on Fedora 42, but verify:
# Check if toolbx is installed
toolbox --version
# If not installed, install it
sudo dnf install -y toolbox
Step 4: Add User to GPU Groups
sudo usermod -aG video $USER
sudo usermod -aG render $USER
Log out and log back in for group changes to take effect.
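After logging back in, confirm the membership took effect:
# Should print both video and render
id -nG | tr ' ' '\n' | grep -E '^(video|render)$'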
Step 5: Choose and Create Your Toolbox
Select the backend that best fits your needs:
Option A: Vulkan RADV (Recommended for Most Users)
Most stable and compatible. Works with all models.
toolbox create llama-vulkan-radv \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
-- --device /dev/dri --group-add video --security-opt seccomp=unconfined
Option B: Vulkan AMDVLK (Fastest for Prompt Processing)
Fastest backend, but has a 2 GiB single-buffer limit, so some large models won't load.
toolbox create llama-vulkan-amdvlk \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-amdvlk \
-- --device /dev/dri --group-add video --security-opt seccomp=unconfined
Option C: ROCm 6.4.4 + ROCWMMA (Best ROCm Option)
Good for BF16 models with improved flash attention. May have occasional crashes.
toolbox create llama-rocm-6.4.4-rocwmma \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
-- --device /dev/dri --device /dev/kfd \
--group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined
Create Multiple Toolboxes (Optional)
You can create all toolboxes to test different backends:
# Vulkan RADV
toolbox create llama-vulkan-radv \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
-- --device /dev/dri --group-add video --security-opt seccomp=unconfined
# Vulkan AMDVLK
toolbox create llama-vulkan-amdvlk \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-amdvlk \
-- --device /dev/dri --group-add video --security-opt seccomp=unconfined
# ROCm 6.4.4 + ROCWMMA
toolbox create llama-rocm-6.4.4-rocwmma \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
-- --device /dev/dri --device /dev/kfd \
--group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined
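Whichever backend(s) you choose, confirm the containers were created before entering them:
# Lists toolbox containers and images; your new containers should appear here
toolbox list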
Step 6: Enter Your Toolbox
toolbox enter llama-vulkan-radv
Replace llama-vulkan-radv with your chosen toolbox name.
Your prompt should change to indicate you’re inside the toolbox.
Step 7: Verify GPU Access
Inside the toolbox, check if the GPU is accessible:
llama-cli --list-devices
You should see your AMD GPU listed.
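If llama-cli does not list the GPU, the vendor tools give a second opinion. Which one is available depends on the image (vulkaninfo in the Vulkan toolboxes, rocminfo in the ROCm toolbox), so treat these as optional checks:
# Vulkan toolboxes: show the detected Vulkan devices
vulkaninfo --summary | grep -i devicename
# ROCm toolbox: show the detected HSA agents
rocminfo | grep -i 'Marketing Name'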
Step 8: Download a Model
Create a models directory and download a GGUF model from HuggingFace:
# Create models directory
mkdir -p ~/models
# Install pip if not already installed
sudo dnf install -y python3-pip
# Install huggingface-cli with hf-transfer for faster downloads
pip install --user "huggingface_hub[hf_transfer]"
# Make sure ~/.local/bin is in your PATH
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Example: Download Qwen3 Coder 30B BF16
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00002-of-00002.gguf \
--local-dir ~/models/qwen3-coder-30B-A3B/
# Example: Download all files in the BF16 directory
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
--include "BF16/*" \
--local-dir ~/models/qwen3-coder-30B-A3B/
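Split GGUF models need all of their parts on disk; llama.cpp picks up the remaining shards automatically when you point it at part 00001 in the next step. A quick check that the download is complete:
# Both shards should be listed, and the total size should match the repo
ls -lh ~/models/qwen3-coder-30B-A3B/BF16/
du -sh ~/models/qwen3-coder-30B-A3B/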
Find more models: https://huggingface.co/models
Recommended: look for Unsloth quantizations at https://huggingface.co/unsloth
Step 9: Run Your First Model
llama-cli --no-mmap -ngl 999 -fa on \
-m ~/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf
Command explained:
- --no-mmap - Disables memory mapping
- -ngl 999 - Loads all layers to the GPU
- -fa on - Enables flash attention [on|off|auto]
- -m - Specifies the model path
Start an API server with llama-server:
llama-server --no-mmap -ngl 999 -fa on \
-c 131072 \
-m ~/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
--host 0.0.0.0 \
--port 12345 \
--api-key "secret"
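llama-server exposes an OpenAI-compatible HTTP API, so you can smoke-test it with curl from another terminal. Host, port, and API key below match the example above; the model name in the request body is just a placeholder, since the server answers with whatever model it has loaded:
# Send a single chat completion request to the running server
curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer secret" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'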
Start a chat server with llama-server:
llama-server --no-mmap -ngl 999 -fa on \
-c 131072 \
-m ~/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
--host 0.0.0.0 \
--port 8080
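This variant also serves llama.cpp's built-in web UI, so you can open http://localhost:8080 in a browser. From the shell, the health endpoint confirms the model has finished loading:
# Returns a small JSON status once the server is ready
curl http://localhost:8080/health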
Step 10: Check Memory Requirements
After downloading a model, check how much RAM it will use:
gguf-vram-estimator.py ~/models/your-model.gguf
Example output for Qwen3-235B Q3_K_M:
--- Model 'Qwen3-235B-A22B-Instruct-2507' ---
Max Context: 262,144 tokens
Model Size: 104.72 GiB (from file size)
Incl. Overhead: 2.00 GiB (for compute buffer, etc.)
--- Memory Footprint Estimation ---
Context Size | Context Memory | Est. Total VRAM
---------------------------------------------------
4,096 | 752.00 MiB | 107.46 GiB
8,192 | 1.47 GiB | 108.19 GiB
16,384 | 2.94 GiB | 109.66 GiB
32,768 | 5.88 GiB | 112.60 GiB
65,536 | 11.75 GiB | 118.47 GiB
131,072 | 23.50 GiB | 130.22 GiB ← Fits in 128 GB
262,144 | 47.00 GiB | 153.72 GiB ← Too large!
Reading the output:
- Model Size: Base model weight size
- Context Size: Number of tokens the model can process
- Context Memory: Additional RAM needed for that context length
- Est. Total VRAM: Total RAM required (Model + Context + Overhead)
For 128 GB systems: This Qwen3-235B model can handle up to ~130k token contexts. The maximum 262k context would require 154 GiB (too much).
You can also specify custom context sizes:
gguf-vram-estimator.py ~/models/your-model.gguf --contexts 4096 65536 131072
Updating Toolboxes
Download and use the refresh script to keep toolboxes up to date:
# Download the refresh script
curl -O https://raw.githubusercontent.com/kyuz0/amd-strix-halo-toolboxes/main/refresh-toolboxes.sh
chmod +x refresh-toolboxes.sh
# Refresh all toolboxes
./refresh-toolboxes.sh all
# Or refresh specific toolboxes
./refresh-toolboxes.sh llama-vulkan-radv
Performance Tips
Backend Selection Guide
Based on benchmarks:
- Fastest prompt processing: Vulkan AMDVLK, ROCm 6.4.4 (hipBLASLt)
- Fastest token generation: Vulkan RADV
- Best balanced: Vulkan AMDVLK
- Best for BF16 models: ROCm 6.4.4 + ROCWMMA
Memory Planning
- Q4_K quantization: ~4 bits per parameter
- BF16: ~16 bits per parameter
- Context overhead: Varies by context size (see VRAM estimator)
- System overhead: ~2 GiB additional
Example:
- 30B model in Q4_K ≈ 17 GiB
- 235B model in Q3_K ≈ 97 GiB
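These are rough figures; the effective bits per weight differs between quant variants (Q4_K_M, Q3_K_M, and so on), so treat any such estimate as an approximation. The arithmetic itself is simple, sketched here with awk:
# size in GiB = parameters * bits_per_weight / 8 / 2^30
awk 'BEGIN { printf "30B at ~4.8 bpw: %.0f GiB\n", 30e9 * 4.8 / 8 / 2^30 }'
awk 'BEGIN { printf "235B at ~3.5 bpw: %.0f GiB\n", 235e9 * 3.5 / 8 / 2^30 }'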
Troubleshooting
GPU Not Detected
# Check if devices exist
ls -l /dev/dri
ls -l /dev/kfd # For ROCm only
# Verify group membership
groups
Model Won’t Load
- Check VRAM requirements with estimator
- Try different quantization (Q4_K, Q3_K, etc.)
- Reduce context size with the -c parameter
- Switch to a different backend
Crashes with ROCm
- Try Vulkan RADV instead (more stable)
- Disable hipBLASLt: export ROCBLAS_USE_HIPBLASLT=0
- Use ROCm 6.4.4 instead of the 7 RC
Slow Performance
- Verify all GPU layers are loaded: -ngl 999
- Enable flash attention: -fa on
- Try a different backend (see benchmarks)
- Check kernel parameters are active
Next Steps
- Explore different models from HuggingFace
- Try different backends to find optimal performance
- Review benchmarks at AMD Strix Halo — Backend Benchmarks (Grid View)
- Join the community - Strix Halo Discord server
Additional Resources
- Project Repository: https://github.com/kyuz0/amd-strix-halo-toolboxes
- Interactive Benchmarks: AMD Strix Halo — Backend Benchmarks (Grid View)
- Strix Halo Homelab: https://strixhalo-homelab.d7.wtf/
- Hardware Database: https://strixhalo-homelab.d7.wtf/Hardware
- YouTube Channel from Donato Capitella (@kyuz0): https://www.youtube.com/@donatocapitella/videos
Quick Reference Commands
# Enter toolbox
toolbox enter llama-vulkan-radv
# List available devices
llama-cli --list-devices
# Run model with GPU acceleration
llama-cli --no-mmap -ngl 999 -fa on -m ~/models/your-model.gguf
# Estimate VRAM needs
gguf-vram-estimator.py ~/models/your-model.gguf --contexts 4096
# Download model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download model/name file.gguf --local-dir ~/models/
# Exit toolbox
exit
Installation complete! You’re now ready to run large language models locally on your AMD Strix Halo system.