AMD Strix Halo Llama.cpp Installation Guide for Fedora 42
This guide walks you through setting up LLM inference on AMD Ryzen AI Max “Strix Halo” integrated GPUs using Fedora 42.
Prerequisites
- Fresh Fedora 42 installation
- AMD Ryzen AI Max processor (Strix Halo)
- At least 128 GB RAM recommended for large models
- Internet connection
Step 1: Configure Kernel Parameters
These parameters enable unified memory and optimal GPU performance.
1.1 Add Kernel Parameters Using grubby
Fedora 42 uses grubby to manage kernel parameters. Run this single command:
sudo grubby --update-kernel=ALL --args='amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432'
What these parameters do:
- amd_iommu=off - Disables IOMMU for lower latency
- amdgpu.gttsize=131072 - Enables unified GPU/system memory (128 GiB)
- ttm.pages_limit=33554432 - Allows large pinned memory allocations (128 GiB)
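If you want to adapt these values to a different memory budget, note that amdgpu.gttsize is specified in MiB and ttm.pages_limit in 4 KiB pages. A quick shell arithmetic check using the values from the command above:
# gttsize in MiB -> GiB: 131072 / 1024 = 128
echo $((131072 / 1024))
# pages_limit in 4 KiB pages -> GiB: 33554432 * 4 / 1024 / 1024 = 128
echo $((33554432 * 4 / 1024 / 1024))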
1.2 Verify the Parameters Were Added
sudo grubby --info=ALL | grep args
You should see your added parameters in the output.
1.3 Reboot
sudo reboot
1.4 Verify Kernel Parameters (After Reboot)
cat /proc/cmdline
You should see your added parameters in the output.
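To check all three parameters at once, a small loop over /proc/cmdline works (just a convenience sketch):
# Report each expected parameter as OK or MISSING
for p in amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432; do
    grep -qF "$p" /proc/cmdline && echo "OK: $p" || echo "MISSING: $p"
done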
Step 2: Configure BIOS Settings
Before proceeding, configure your BIOS:
- Reboot and enter BIOS setup
- Find GPU memory allocation settings
- Set the dedicated GPU memory allocation to the minimum (512 MB), so the remaining RAM stays available to the GPU as unified (GTT) memory
- Save and exit
Step 3: Install Toolbx
Toolbx should be pre-installed on Fedora 42, but verify:
# Check if toolbx is installed
toolbox --version
# If not installed, install it
sudo dnf install -y toolbox
Step 4: Add User to GPU Groups
sudo usermod -aG video $USER
sudo usermod -aG render $USER
Log out and log back in for group changes to take effect.
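After logging back in, confirm the membership took effect:
# Should print both video and render
id -nG | tr ' ' '\n' | grep -E '^(video|render)$'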
Step 5: Choose and Create Your Toolbox
Select the backend that best fits your needs:
Option A: Vulkan RADV (Recommended for Most Users)
Most stable and compatible. Works with all models.
toolbox create llama-vulkan-radv \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
-- --device /dev/dri --group-add video --security-opt seccomp=unconfined
Option B: Vulkan AMDVLK (Fastest for Prompt Processing)
Fastest backend, but has a 2 GiB single-buffer limit, so some large models won't load.
toolbox create llama-vulkan-amdvlk \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-amdvlk \
-- --device /dev/dri --group-add video --security-opt seccomp=unconfined
Option C: ROCm 6.4.4 + ROCWMMA (Best ROCm Option)
Good for BF16 models with improved flash attention. May have occasional crashes.
toolbox create llama-rocm-6.4.4-rocwmma \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
-- --device /dev/dri --device /dev/kfd \
--group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined
Create Multiple Toolboxes (Optional)
You can create all toolboxes to test different backends:
# Vulkan RADV
toolbox create llama-vulkan-radv \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
-- --device /dev/dri --group-add video --security-opt seccomp=unconfined
# Vulkan AMDVLK
toolbox create llama-vulkan-amdvlk \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-amdvlk \
-- --device /dev/dri --group-add video --security-opt seccomp=unconfined
# ROCm 6.4.4 + ROCWMMA
toolbox create llama-rocm-6.4.4-rocwmma \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
-- --device /dev/dri --device /dev/kfd \
--group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined
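Whichever backend(s) you choose, confirm the containers were created before entering them:
# Lists toolbox containers and images; your new containers should appear here
toolbox list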
Step 6: Enter Your Toolbox
toolbox enter llama-vulkan-radv
Replace llama-vulkan-radv with your chosen toolbox name.
Your prompt should change to indicate you’re inside the toolbox.
Step 7: Verify GPU Access
Inside the toolbox, check if the GPU is accessible:
llama-cli --list-devices
You should see your AMD GPU listed.
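If llama-cli does not list the GPU, the vendor tools give a second opinion. Which one is available depends on the image (vulkaninfo in the Vulkan toolboxes, rocminfo in the ROCm toolbox), so treat these as optional checks:
# Vulkan toolboxes: show the detected Vulkan devices
vulkaninfo --summary | grep -i devicename
# ROCm toolbox: show the detected HSA agents
rocminfo | grep -i 'Marketing Name'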
Step 8: Download a Model
Create a models directory and download a GGUF model from HuggingFace:
# Create models directory
mkdir -p ~/models
# Install pip if not already installed
sudo dnf install -y python3-pip
# Install huggingface-cli with hf-transfer for faster downloads
pip install --user "huggingface_hub[hf_transfer]"
# Make sure ~/.local/bin is in your PATH
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Example: Download Qwen3 Coder 30B BF16
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00002-of-00002.gguf \
--local-dir ~/models/qwen3-coder-30B-A3B/
# Example: Download all files in the BF16 directory
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
--include "BF16/*" \
--local-dir ~/models/qwen3-coder-30B-A3B/
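Split GGUF models need all of their parts on disk; llama.cpp picks up the remaining shards automatically when you point it at part 00001 in the next step. A quick check that the download is complete:
# Both shards should be listed, and the total size should match the repo
ls -lh ~/models/qwen3-coder-30B-A3B/BF16/
du -sh ~/models/qwen3-coder-30B-A3B/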
Find more models: https://huggingface.co/models
Recommended: look for Unsloth quantizations at https://huggingface.co/unsloth
Step 9: Run Your First Model
llama-cli --no-mmap -ngl 999 -fa on \
-m ~/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf
Command explained:
- --no-mmap - Disables memory mapping
- -ngl 999 - Loads all layers to the GPU
- -fa on - Enables flash attention [on|off|auto]
- -m - Specifies the model path
Start an API server with llama-server:
llama-server --no-mmap -ngl 999 -fa on \
-c 131072 \
-m ~/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
--host 0.0.0.0 \
--port 12345 \
--api-key "secret"
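llama-server exposes an OpenAI-compatible HTTP API, so you can smoke-test it with curl from another terminal. Host, port, and API key below match the example above; the model name in the request body is just a placeholder, since the server answers with whatever model it has loaded:
# Send a single chat completion request to the running server
curl http://localhost:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer secret" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'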
Start a chat server with llama-server:
llama-server --no-mmap -ngl 999 -fa on \
-c 131072 \
-m ~/models/qwen3-coder-30B-A3B/BF16/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
--host 0.0.0.0 \
--port 8080
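This variant also serves llama.cpp's built-in web UI, so you can open http://localhost:8080 in a browser. From the shell, the health endpoint confirms the model has finished loading:
# Returns a small JSON status once the server is ready
curl http://localhost:8080/health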
Step 10: Check Memory Requirements
After downloading a model, check how much RAM it will use:
gguf-vram-estimator.py ~/models/your-model.gguf
Example output for Qwen3-235B Q3_K_M:
--- Model 'Qwen3-235B-A22B-Instruct-2507' ---
Max Context: 262,144 tokens
Model Size: 104.72 GiB (from file size)
Incl. Overhead: 2.00 GiB (for compute buffer, etc.)
--- Memory Footprint Estimation ---
Context Size | Context Memory | Est. Total VRAM
---------------------------------------------------
4,096 | 752.00 MiB | 107.46 GiB
8,192 | 1.47 GiB | 108.19 GiB
16,384 | 2.94 GiB | 109.66 GiB
32,768 | 5.88 GiB | 112.60 GiB
65,536 | 11.75 GiB | 118.47 GiB
131,072 | 23.50 GiB | 130.22 GiB ← Fits in 128 GB
262,144 | 47.00 GiB | 153.72 GiB ← Too large!
Reading the output:
- Model Size: Base model weight size
- Context Size: Number of tokens the model can process
- Context Memory: Additional RAM needed for that context length
- Est. Total VRAM: Total RAM required (Model + Context + Overhead)
For 128 GB systems: This Qwen3-235B model can handle up to ~130k token contexts. The maximum 262k context would require 154 GiB (too much).
You can also specify custom context sizes:
gguf-vram-estimator.py ~/models/your-model.gguf --contexts 4096 65536 131072
Updating Toolboxes
Download and use the refresh script to keep toolboxes up to date:
# Download the refresh script
curl -O https://raw.githubusercontent.com/kyuz0/amd-strix-halo-toolboxes/main/refresh-toolboxes.sh
chmod +x refresh-toolboxes.sh
# Refresh all toolboxes
./refresh-toolboxes.sh all
# Or refresh specific toolboxes
./refresh-toolboxes.sh llama-vulkan-radv
Performance Tips
Backend Selection Guide
Based on benchmarks:
- Fastest prompt processing: Vulkan AMDVLK, ROCm 6.4.4 (hipBLASLt)
- Fastest token generation: Vulkan RADV
- Best balanced: Vulkan AMDVLK
- Best for BF16 models: ROCm 6.4.4 + ROCWMMA
Memory Planning
- Q4_K quantization: ~4 bits per parameter
- BF16: ~16 bits per parameter
- Context overhead: Varies by context size (see VRAM estimator)
- System overhead: ~2 GiB additional
Example:
- 30B model in Q4_K ≈ 17 GiB
- 235B model in Q3_K ≈ 97 GiB
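These are rough figures; the effective bits per weight differs between quant variants (Q4_K_M, Q3_K_M, and so on), so treat any such estimate as an approximation. The arithmetic itself is simple, sketched here with awk:
# size in GiB = parameters * bits_per_weight / 8 / 2^30
awk 'BEGIN { printf "30B at ~4.8 bpw: %.0f GiB\n", 30e9 * 4.8 / 8 / 2^30 }'
awk 'BEGIN { printf "235B at ~3.5 bpw: %.0f GiB\n", 235e9 * 3.5 / 8 / 2^30 }'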
Troubleshooting
GPU Not Detected
# Check if devices exist
ls -l /dev/dri
ls -l /dev/kfd # For ROCm only
# Verify group membership
groups
Model Won’t Load
- Check VRAM requirements with estimator
- Try different quantization (Q4_K, Q3_K, etc.)
- Reduce context size with the -c parameter
- Switch to a different backend
Crashes with ROCm
- Try Vulkan RADV instead (more stable)
- Disable hipBLASLt: export ROCBLAS_USE_HIPBLASLT=0
- Use ROCm 6.4.4 instead of the 7 RC
Slow Performance
- Verify all GPU layers are loaded: -ngl 999
- Enable flash attention: -fa on
- Try a different backend (see benchmarks)
- Check kernel parameters are active
Next Steps
- Explore different models from HuggingFace
- Try different backends to find optimal performance
- Review benchmarks at AMD Strix Halo — Backend Benchmarks (Grid View)
- Join the community - Strix Halo Discord server
Additional Resources
- Project Repository: https://github.com/kyuz0/amd-strix-halo-toolboxes
- Interactive Benchmarks: AMD Strix Halo — Backend Benchmarks (Grid View)
- Strix Halo Homelab: https://strixhalo-homelab.d7.wtf/
- Hardware Database: https://strixhalo-homelab.d7.wtf/Hardware
- YouTube Channel from Donato Capitella (@kyuz0): https://www.youtube.com/@donatocapitella/videos
Quick Reference Commands
# Enter toolbox
toolbox enter llama-vulkan-radv
# List available devices
llama-cli --list-devices
# Run model with GPU acceleration
llama-cli --no-mmap -ngl 999 -fa on -m ~/models/your-model.gguf
# Estimate VRAM needs
gguf-vram-estimator.py ~/models/your-model.gguf --contexts 4096
# Download model
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download model/name file.gguf --local-dir ~/models/
# Exit toolbox
exit
Installation complete! You’re now ready to run large language models locally on your AMD Strix Halo system.