AMD Strix Halo Llama.cpp Installation Guide for Fedora 42

This guide walks you through setting up LLM inference on AMD Ryzen AI Max “Strix Halo” integrated GPUs using Fedora 42.

Prerequisites

  • Fresh Fedora 42 installation
  • AMD Ryzen AI Max processor (Strix Halo)
  • At least 128 GB RAM recommended for large models
  • Internet connection

Step 1: Configure Kernel Parameters

These parameters enable unified memory and optimal GPU performance.

1.1 Add Kernel Parameters Using grubby

Fedora 42 uses grubby to manage kernel parameters. Run this single command:

sudo grubby --update-kernel=ALL --args='amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432'

What these parameters do:

  • amd_iommu=off - Disables IOMMU for lower latency
  • amdgpu.gttsize=131072 - Enables unified GPU/system memory (128 GiB)
  • ttm.pages_limit=33554432 - Allows large pinned memory allocations (128 GiB)
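
These two values scale linearly with the amount of RAM you want to expose to the GPU, so they are easy to recompute for other configurations (a quick sanity check, assuming 4 KiB pages and that gttsize is given in MiB):

# gttsize is in MiB, ttm.pages_limit counts 4 KiB pages
echo $(( 128 * 1024 ))            # amdgpu.gttsize for 128 GiB -> 131072
echo $(( 128 * 1024 * 1024 / 4 )) # ttm.pages_limit for 128 GiB -> 33554432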

1.2 Verify the Parameters Were Added

sudo grubby --info=ALL | grep args

You should see your added parameters in the output.

1.3 Reboot

sudo reboot

1.4 Verify Kernel Parameters (After Reboot)

cat /proc/cmdline

You should see your added parameters in the output.
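
If you prefer an explicit check, a small loop like this (optional) confirms that each parameter made it onto the running kernel's command line:

for p in amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432; do
  grep -q "$p" /proc/cmdline && echo "OK: $p" || echo "MISSING: $p"
done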

Step 2: Configure BIOS Settings

Before proceeding, configure your BIOS:

  1. Reboot and enter BIOS setup
  2. Find GPU memory allocation settings
  3. Set GPU Memory to 512 MB (minimum required)
  4. Save and exit

Step 3: Install Toolbx

Toolbx should be pre-installed on Fedora 42, but verify:

# Check if toolbx is installed
toolbox --version

# If not installed, install it
sudo dnf install -y toolbox

Step 4: Add User to GPU Groups

sudo usermod -aG video $USER
sudo usermod -aG render $USER

Log out and log back in for group changes to take effect.
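
To confirm the new groups are active in your session, a quick check:

id -nG | tr ' ' '\n' | grep -E '^(video|render)$'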

Step 5: Choose and Create Your Toolbox

Select the backend that best fits your needs:

Option A: Vulkan RADV (Recommended for Most Users)

Most stable and compatible. Works with all models.

toolbox create llama-vulkan-radv \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
  -- --device /dev/dri --group-add video --security-opt seccomp=unconfined

Option B: Vulkan AMDVLK (Fastest for Prompt Processing)

Fastest backend, but has 2 GiB single buffer limit (some large models won’t load).

toolbox create llama-vulkan-amdvlk \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-amdvlk \
  -- --device /dev/dri --group-add video --security-opt seccomp=unconfined

Option C: ROCm 6.4.4 + ROCWMMA (Best ROCm Option)

Good for BF16 models with improved flash attention. May have occasional crashes.

toolbox create llama-rocm-6.4.4-rocwmma \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined

Create Multiple Toolboxes (Optional)

You can create all toolboxes to test different backends:

# Vulkan RADV
toolbox create llama-vulkan-radv \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
  -- --device /dev/dri --group-add video --security-opt seccomp=unconfined

# Vulkan AMDVLK
toolbox create llama-vulkan-amdvlk \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-amdvlk \
  -- --device /dev/dri --group-add video --security-opt seccomp=unconfined

# ROCm 6.4.4 + ROCWMMA
toolbox create llama-rocm-6.4.4-rocwmma \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined

Step 6: Enter Your Toolbox

toolbox enter llama-vulkan-radv

Replace llama-vulkan-radv with your chosen toolbox name.

Your prompt should change to indicate you’re inside the toolbox.

Step 7: Verify GPU Access

Inside the toolbox, check if the GPU is accessible:

llama-cli --list-devices

You should see your AMD GPU listed.
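
If the backend's own diagnostic tools happen to be included in the image (availability varies per toolbox, so treat these as optional), they give a second opinion:

# Vulkan toolboxes
vulkaninfo --summary | grep -i devicename

# ROCm toolbox
rocminfo | grep -i gfx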

Step 8: Download a Model

Create a models directory and download a GGUF model from HuggingFace:

# Create models directory
mkdir -p ~/models

# Install pip if not already installed
sudo dnf install -y python3-pip

# Install huggingface-cli with hf-transfer for faster downloads
pip install --user "huggingface_hub[hf_transfer]"

# Make sure ~/.local/bin is in your PATH
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Example: Download Qwen3 Coder 30B BF16
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
  BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00002-of-00002.gguf \
  --local-dir ~/models/qwen3-coder-30B-A3B/

# Example: Download all files in the BF16 directory
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  --include "BF16/*" \
  --local-dir ~/models/qwen3-coder-30B-A3B/

Find more models: Models – Hugging Face

Recommended: Look for Unsloth quantizations - unsloth (Unsloth AI)

Step 9: Run Your First Model

llama-cli --no-mmap -ngl 999 -fa on \
  -m ~/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf

Command explained:

  • --no-mmap - Disables memory mapping
  • -ngl 999 - Loads all layers to GPU
  • -fa on - Enables flash attention [on|off|auto]
  • -m - Specifies model path

Start an API llama-server:

llama-server --no-mmap -ngl 999 -fa on \
  -c 131072 \
  -m ~/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
  --host 0.0.0.0 \
  --port 12345 \
  --api-key "secret"
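
Once the server is up, you can test it with a quick request against its OpenAI-compatible endpoint (using the port and API key from the example above):

curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer secret" \
  -d '{"messages":[{"role":"user","content":"Hello!"}]}'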

Start a chat llama-server:

llama-server --no-mmap -ngl 999 -fa on \
  -c 131072 \
  -m ~/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
  --host 0.0.0.0 \
  --port 8080

Step 10: Check Memory Requirements

After downloading a model, check how much RAM it will use:

gguf-vram-estimator.py ~/models/your-model.gguf

Example output for Qwen3-235B Q3_K_M:

--- Model 'Qwen3-235B-A22B-Instruct-2507' ---
Max Context: 262,144 tokens
Model Size: 104.72 GiB (from file size)
Incl. Overhead: 2.00 GiB (for compute buffer, etc.)

--- Memory Footprint Estimation ---
   Context Size |  Context Memory | Est. Total VRAM
---------------------------------------------------
          4,096 |      752.00 MiB |      107.46 GiB
          8,192 |        1.47 GiB |      108.19 GiB
         16,384 |        2.94 GiB |      109.66 GiB
         32,768 |        5.88 GiB |      112.60 GiB
         65,536 |       11.75 GiB |      118.47 GiB
        131,072 |       23.50 GiB |      130.22 GiB  ← Fits in 128 GB
        262,144 |       47.00 GiB |      153.72 GiB  ← Too large!

Reading the output:

  • Model Size: Base model weight size
  • Context Size: Number of tokens the model can process
  • Context Memory: Additional RAM needed for that context length
  • Est. Total VRAM: Total RAM required (Model + Context + Overhead)

For 128 GB systems: This Qwen3-235B model can handle up to ~130k token contexts. The maximum 262k context would require 154 GiB (too much).

You can also specify custom context sizes:

gguf-vram-estimator.py ~/models/your-model.gguf --contexts 4096 65536 131072

Updating Toolboxes

Download and use the refresh script to keep toolboxes up to date:

# Download the refresh script
curl -O https://raw.githubusercontent.com/kyuz0/amd-strix-halo-toolboxes/main/refresh-toolboxes.sh
chmod +x refresh-toolboxes.sh

# Refresh all toolboxes
./refresh-toolboxes.sh all

# Or refresh specific toolboxes
./refresh-toolboxes.sh llama-vulkan-radv

Performance Tips

Backend Selection Guide

Based on benchmarks:

  • Fastest prompt processing: Vulkan AMDVLK, ROCm 6.4.4 (hipBLASLt)
  • Fastest token generation: Vulkan RADV
  • Best balanced: Vulkan AMDVLK
  • Best for BF16 models: ROCm 6.4.4 + ROCWMMA

Memory Planning

  • Q4_K quantization: ~4 bits per parameter
  • BF16: ~16 bits per parameter
  • Context overhead: Varies by context size (see VRAM estimator)
  • System overhead: ~2 GiB additional

Example:

  • 30B model in Q4_K ≈ 17 GiB
  • 235B model in Q3_K ≈ 97 GiB
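
As a rough back-of-envelope check on the figures above (effective bits per parameter vary with the quantization mix, and real GGUF files are a bit larger because some tensors stay at higher precision):

# size_GiB ≈ parameters * effective_bits_per_param / 8 / 2^30
awk 'BEGIN { printf "%.1f GiB\n", 30e9  * 4.5 / 8 / 2^30 }'   # ~15.7 GiB for a 30B Q4_K-class model
awk 'BEGIN { printf "%.1f GiB\n", 235e9 * 3.5 / 8 / 2^30 }'   # ~95.8 GiB for a 235B Q3_K-class model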

Troubleshooting

GPU Not Detected

# Check if devices exist
ls -l /dev/dri
ls -l /dev/kfd  # For ROCm only

# Verify group membership
groups

Model Won’t Load

  1. Check VRAM requirements with estimator
  2. Try different quantization (Q4_K, Q3_K, etc.)
  3. Reduce context size with -c parameter
  4. Switch to a different backend

Crashes with ROCm

  • Try Vulkan RADV instead (more stable)
  • Disable hipBLASLt: export ROCBLAS_USE_HIPBLASLT=0
  • Use ROCm 6.4.4 instead of 7 RC

Slow Performance

  1. Verify all GPU layers loaded: -ngl 999
  2. Enable flash attention: -fa on
  3. Try different backend (see benchmarks)
  4. Check kernel parameters are active

Next Steps

  1. Explore different models from HuggingFace
  2. Try different backends to find optimal performance
  3. Review benchmarks at AMD Strix Halo — Backend Benchmarks (Grid View)
  4. Join the community - Strix Halo Discord server

Additional Resources

Quick Reference Commands

# Enter toolbox
toolbox enter llama-vulkan-radv

# List available devices
llama-cli --list-devices

# Run model with GPU acceleration
llama-cli --no-mmap -ngl 999 -fa on -m ~/models/your-model.gguf

# Estimate VRAM needs
gguf-vram-estimator.py ~/models/your-model.gguf --contexts 4096

# Download model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download model/name file.gguf --local-dir ~/models/

# Exit toolbox
exit

Installation complete! You’re now ready to run large language models locally on your AMD Strix Halo system.

changes:

  • added youtube channel from @kyuz0
  • wrong parameter for Flash Attention thanks @krom
  • Download Qwen3 Coder 30B BF16 command adaption
  • added examples for starting an API and Chat Server thanks @krom
13 Likes

Nice.

Some elements:

  • UMA is not really related to GTT. In fact, since kernel 6.10, AMD allows the driver on APUs to allocate VRAM in GTT. But it has always been possible to allocate buffers in RAM that the GPU can use, and on an APU this carries no penalty compared to a VRAM allocation. (On old kernels, changing the GTT size did not work.)
    llama.cpp can allocate buffers in RAM; how this is done has changed at some point (I don't know when).

So with the correct environment you do not need to change the GTT size; simply add an env var:
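
(The variable is not spelled out in this post; a later reply in this thread uses the llama.cpp unified-memory switch below, which is presumably what is meant:)

# let the ROCm/HIP build of llama.cpp allocate from unified (system) memory
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON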

With that you do not need to change the boot parameters.

(There is more we can do to make things simpler, but I need more time :wink: )

Note: https://src.fedoraproject.org/rpms/rocblas shows ROCm 6.4.4 in testing, so maybe we can even get a native Fedora 43 build :crossed_fingers:

3 Likes

Thank you very much for creating this guide. It was easy to follow and works fine.
For the -fa I had to use -fa on instead and for running a server:

llama-server --no-mmap -ngl 999 -fa on \
  -c 131072 \
  -m <gpt oss 120b model here> \
  --host 0.0.0.0 \
  --port 12345 \
  --api-key "secret"

Also works like a charm.
Tried with “Option A: Vulkan RADV” - Speed is good, around 384 tokens/s for prompt (5685 tokens) and 44.5 tokens/sec for response (4958 tokens).

2 Likes

Is disabling IOMMU still needed/recommended (which requires custom boot params)?

Some report a small gain in performance (<5%?) at the cost of deactivating certain features.

If you know what consequences it could have, you can try it; otherwise I would not recommend it.
In any case, it is not necessary for things to work. So:

  • needed? => NO
  • recommended? => don’t know :wink:
2 Likes

Thank you for testing this guide! I added your valid points to the guide :wink:

To help reduce ROCm crashes, I have made some more progress in this thread:

Specifically, with:
amdgpu.cwsr_enable=0
or a file in /etc/modprobe.d:
options amdgpu cwsr_enable=0

It seems to fix the ROCm GPU crash problems observed in ROCm 6.x and 7.x.
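
For the modprobe.d route, something along these lines should persist the option (a sketch; the file name is arbitrary, and since amdgpu is usually loaded from the initramfs on Fedora, regenerating it is likely needed for the setting to apply at boot):

echo "options amdgpu cwsr_enable=0" | sudo tee /etc/modprobe.d/amdgpu-cwsr.conf
sudo dracut --force   # rebuild the initramfs so the option is picked up early
sudo reboot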

1 Like

Also valid points, but that only comes with newer kernel versions.
The easy path is what I have summarized above.

example:

I also added the removal command for later upcoming changes :wink:
Thanks for the insights!

PS: At least that was my idea … I can't edit my original post anymore.
Here is the command if you have followed this guide:

sudo grubby --update-kernel=ALL --remove-args='amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432'

For beginners (and without kyuz0's update), you can:
skip point 1.1 (Add Kernel Parameters Using grubby)

and, in the ROCm container, start llama.cpp with:

GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON llama-server --no-mmap -ngl 999 -fa on \
  -c 131072 \
  -m ~/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
  --host 0.0.0.0 \
  --port 8080

=> this should work without the GTT change, even on an old kernel.

Note: I'll try to create "simple" images, but for now I'm having a hard time figuring out what we really need to do for WMMA on rocm-6.4.4 … needs more testing :wink:

Very cool, thanks for putting this together! For some reason when I get to installing python3-pip in the toolbox I get the following

$ sudo dnf install -y python3-pip
sudo: unable to open /etc/sudoers: No such file or directory
sudo: error initializing audit plugin sudoers_audit

I’m trying to figure out why toolbox isn’t seeing the /etc/sudoers file but I’m at a loss. Any ideas?

It’s been a minute since I’ve used a toolbox so forgive me if I’m wrong here but have you tried without sudo? Toolboxes are container-based sandboxes so even an unprivileged user should be able to install software inside of one.

1 Like

These are not real “toolbox” images, so it may not work as expected. But as suggested, you can try without sudo.

Yup, I tried without sudo but it seems to require superuser privileges:

$ dnf install -y python3-pip
The requested operation requires superuser privileges. Please log in as a user with elevated rights, or use the "--assumeno" or "--downloadonly" options to run the command without modifying the system state.

Thanks so much for the effort to help. I’ll keep looking for ways to get through step 8. I welcome any insight if someone figures it out.

1 Like

Sorry for the inconvenience!
The commands in step 8 are outside the toolbox. :confused:

1 Like

Ah, thanks so much! Don’t know why that didn’t occur to me! :slight_smile:

1 Like

Oops, I missed that.

Otherwise, for those who don’t like modifying their OS too much for step 8, I prefer to create a virtual environment:

# create virtual env:
python3 -m venv huggingface
source huggingface/bin/activate
python -m pip install --upgrade pip
python -m pip install huggingface_hub  # provides the hf CLI

# download model:
mkdir -p ~/models
hf download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  --include "BF16/*" \
  --local-dir ~/models/qwen3-coder-30B-A3B/
2 Likes

Hi. Thanks for the guide! I got lost in the forest on my first try but got Qwen3 Q8 XL running on my 64GB desktop on the second. I wrote down my steps (some of which are unique to my setup) and want to share here:

  1. Install from USB boot loader, select in BIOS. See Framework website for Fedora 43 install ( Fedora 43 Installation on the Framework Desktop - Framework Guides )
  2. From ( linux-docs/framework-desktop/Fedora-all.md at main · FrameworkComputer/linux-docs · GitHub ) : $ sudo dnf upgrade (then reboot)
  3. Install llama.cpp on Fedora ( AMD Strix Halo Llama.cpp Installation Guide for Fedora 42 )
    1. $ sudo grubby --update-kernel=ALL --args='amd_iommu=off amdgpu.gttsize=49152 ttm.pages_limit=12288000' (for 64 GB RAM; search for the ttm.pages_limit calculation)
      1. Verify: $ sudo grubby --info=ALL | grep args
      2. Reboot
      3. Verify after reboot: $ cat /proc/cmdline
    2. The BIOS setting for allocated iGPU should be default, 512MB (0.5 GB) minimum
    3. Check if toolbox installed: $ toolbox --version
    4. Add user to GPU groups:
      1. $ sudo usermod -aG video $USER
      2. $ sudo usermod -aG render $USER
    5. Choose and create a toolbox:
      1. Create some boxed backends, ie:
        1. $ toolbox create llama-rocm-6.4.4-rocwmma \ --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \ -- --device /dev/dri --device /dev/kfd \ --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined
        2. $ toolbox create llama-vulkan-radv \ --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \ -- --device /dev/dri --group-add video --security-opt seccomp=unconfined
    6. Enter the toolbox: $ toolbox enter llama-rocm-6.4.4-rocwmma
      1. Inside toolbox, verify: $ llama-cli --list-devices
      2. ‘exit’ the toolbox
    7. Download a model:
      1. Create a models dir: $ mkdir -p ~/Development/ai/models
      2. Install pip: $ sudo dnf install -y python3-pip
      3. Install huggingface-cli: $ pip install --user "huggingface_hub[hf_transfer]"
      4. Make sure ~/.local/bin is in your PATH:
        1. $ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
        2. $ source ~/.bashrc
      5. Actually download the model, ie
        1. $ HF_HUB_ENABLE_HF_TRANSFER=0 huggingface-cli download unsloth/Qwen3-30B-A3B-GGUF \ Qwen3-30B-A3B-UD-Q8_K_XL.gguf \ --local-dir Development/ai/models/qwen3-30B-A3B-Q8_K_XL/
    8. Run the model
      1. $ toolbox enter llama-rocm-6.4.4-rocwmma
      2. $ llama-cli --no-mmap -ngl 999 \ -m ~/Development/ai/models/qwen3-30B-A3B-Q8_K_XL/Qwen3-30B-A3B-UD-Q8_K_XL.gguf
      3. ‘exit’ toolbox when done
    9. Should you want to return memory allocations to their defaults (like to play games or use other memory intensive apps?):
      1. $ sudo grubby --update-kernel=ALL --remove-args='amd_iommu=off amdgpu.gttsize ttm.pages_limit'
      2. Then reboot. To go back to using ai models, run step (3.1) again. Can go back-n-forth

Hey there! Just a heads up: besides updating the ttm parameters, we should also build llama.cpp with the AMDGPU_TARGETS="gfx1151" argument. I've created a GitHub repo for building llama.cpp specifically for Strix Halo. Feel free to use, fork, or contribute to it if you'd like! :smiley: GitHub - Lychee-Technology/llama-cpp-for-strix-halo: This repository builds llama.cpp for Strix Halo devices.
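
A minimal sketch of such a build, assuming a working ROCm install and llama.cpp's standard HIP build flags (the linked repo may do it differently):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"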

Not sure what's going on with your kernel parameters, but if you want to utilize all the memory, use the options amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432. You can shave a couple of gigs off if you want to play it safe, but I personally go headless when running large models, so those are the numbers I use.