Hi,
I would like to buy the Framework Laptop 13 with Ryzen™ AI 300 Series. I plan to install either Ubuntu or Fedora, and another important consideration is running an LLM locally (at most a 14B model). Do you have any benchmarks for various models like Gemma 3, the Llama series, or DeepSeek running on the Ryzen AI 300 series, so I can pick the right processor and RAM? What is the recommended software stack for running these LLMs locally on Linux?
Where exactly are we stuck? Is it the Linux kernel, the AMD driver, or a lack of support in inference engines like llama.cpp? Does ROCm support NPUs and iGPUs, or a hybrid model? Why did Framework choose Ryzen AI series processors over the Intel Core Ultra series?
Quite surprised to learn that AMD is favouring only Windows with its AI series processors, using tools like Gaia. I hope they expand support to GNU/Linux systems very soon. They should have done it the other way around, since most of the universities, schools, and enterprises that I know or work with use GNU/Linux for running servers and business-as-usual applications.
The Linux kernel now includes the amdxdna driver. However, popular inference engines like llama.cpp currently lack integration with it. Ideally, these applications and libraries should begin leveraging amdxdna, either directly or through appropriate abstractions.
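To check whether the NPU is even visible on your system, something like this should work (a rough sketch; device node naming can vary by kernel version):

# check that the amdxdna kernel module is loaded
lsmod | grep amdxdna
# the NPU is exposed through the accel subsystem, typically as /dev/accel/accel0
ls -l /dev/accel/
# kernel messages from the driver, if any
sudo dmesg | grep -i xdna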
I don’t know about ROCm.
AMD SoCs are years ahead of the Intel equivalents. I don’t know about Framework’s choice.
To help strengthen Linux support for XDNA, consider upvoting the relevant GitHub issues. This will signal demand and could encourage AMD to prioritize support more quickly.
NPU on Linux: the upcoming Ryzen AI 1.5.0 release is supposed to have stronger LLMs-on-Linux support, which may enable us to support NPU-only execution in Lemonade on Linux.
I thought about starting a new thread but figured this is relevant here… has anyone successfully gotten llama.cpp or similar to work with the Framework 13 on Linux?
Using an AI 5 340 board in DIY form with Fedora 42.
I’ve followed a number of guides; llama.cpp builds, but it consistently gives the same errors and will not load models:
load_backend: failed to find ggml_backend_init in ~/opt/llama.cpp-vulkan/build/bin/libggml-rpc.so
load_backend: failed to find ggml_backend_init in ~/opt/llama.cpp-vulkan/build/bin/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in ~/opt/llama.cpp-vulkan/build/bin/libggml-cpu.so
The compilation doesn’t error. Building with -DGGML_BACKEND_DL=ON does cause errors, so I have not tested that path.
None of the public help I’ve found actually points to a method to fix this.
I know this seems like a llama.cpp-specific GitHub issue, but it is heavily tied to the hardware configuration, so I’m asking the community.
I’ve followed the desktop guides from @lhl and @Lars_Urban with contributions by @kyuz0 (though I’d rather not use an arbitrary container, and the toolbox tool was not found in the repo anyway).
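For reference, the Vulkan build itself was roughly the standard recipe (a sketch of what I ran; the Fedora package names and the checkout path are just my setup and may differ for you):

# build dependencies (Fedora 42 package names; may vary)
sudo dnf install -y git cmake gcc-c++ vulkan-headers vulkan-loader-devel glslc glslang
# clone and build llama.cpp with the Vulkan backend
git clone https://github.com/ggml-org/llama.cpp ~/opt/llama.cpp-vulkan
cmake -S ~/opt/llama.cpp-vulkan -B ~/opt/llama.cpp-vulkan/build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build ~/opt/llama.cpp-vulkan/build --config Release -j"$(nproc)"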
~/opt/llama.cpp-vulkan/build/bin/llama-cli --list-devices
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 840M Graphics (RADV GFX1152) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (AMD Radeon 840M Graphics (RADV GFX1152))
register_backend: registered backend RPC (0 devices)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen AI 5 340 w/ Radeon 840M)
load_backend: failed to find ggml_backend_init in ~/opt/llama.cpp-vulkan/build/bin/libggml-rpc.so
load_backend: failed to find ggml_backend_init in ~/opt/llama.cpp-vulkan/build/bin/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in ~/opt/llama.cpp-vulkan/build/bin/libggml-cpu.so
Available devices:
Vulkan0: AMD Radeon 840M Graphics (RADV GFX1152) (65877 MiB, 65707 MiB free)
~/opt/llama.cpp-vulkan/build/bin/llama-cli -m /home/<username>/opt/llm_models/models/mistral_models/7B-Instruct-v0.3/model.q8_0.gguf -ngl 99
A bunch of output follows that includes the same three errors above and a failure-to-load-model message:
load_tensors: loading model tensors, this can take a while... (mmap = true)
llama_model_load: error loading model: missing tensor 'token_embd.weight'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '~/opt/llm_models/models/mistral_models/7B-Instruct-v0.3/model.q8_0.gguf', try reducing --n-gpu-layers if you're running out of VRAM
main: error: unable to load model
You will generally get better performance with ROCm than Vulkan. The catch is you need either ROCm 6.4.4 from the 6.4.x series or ROCm 7.0.2 from the 7.0.x series.
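If you go that route, the HIP build of llama.cpp is roughly as follows (flag names are per llama.cpp's current build docs; gfx1152 is what your Vulkan log reports for the 840M, but double-check it against what your ROCm install actually ships kernels for):

# build llama.cpp against ROCm/HIP instead of Vulkan
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S llama.cpp -B llama.cpp/build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1152 -DCMAKE_BUILD_TYPE=Release
cmake --build llama.cpp/build-rocm --config Release -j"$(nproc)"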
ehsanj’s post points to LMStudio; I swear I looked at it before and identified some issue. Guess I should look at it again. [edit: it annoys me that it’s only an AppImage, and they don’t seem to publish any verification method to confirm the download]
I’d made notes on ROCm but had tried Vulkan. I’d seen contradictory, use-case-specific evidence on ROCm vs. Vulkan performance.
Okay, further information: LMStudio did work, but to get the AppImage to run I had to extract it first. Running it directly gave the error “dlopen(): error loading libfuse.so.2”, but Fedora says fuse2 and fuse3 are installed, and a package search did not find libfuse.
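For anyone who hits the same thing, the two workarounds I know of are below; my best guess is that on Fedora libfuse.so.2 comes from the fuse-libs package rather than anything literally named libfuse, and extraction sidesteps FUSE entirely (the AppImage filename is whatever you downloaded):

# option 1: install the fuse2 compatibility library (Fedora package name, as far as I can tell)
sudo dnf install -y fuse-libs
# option 2: unpack the AppImage and run the extracted entry point directly (no FUSE needed)
./LM-Studio-*.AppImage --appimage-extract
./squashfs-root/AppRun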
3.91 tok/sec using Mistral Small 2509, but I’m not sure whether it was running on the GPU or CPU.
Models are stored in ~/.lmstudio/models/lmstudio-community.
If you symbolically link existing model folders into that path, LMStudio will auto-add them.
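For example (the source path is just where I happen to keep models):

# expose an existing GGUF folder to LMStudio without copying it
mkdir -p ~/.lmstudio/models/lmstudio-community
ln -s ~/opt/llm_models/models/mistral_models/7B-Instruct-v0.3 ~/.lmstudio/models/lmstudio-community/7B-Instruct-v0.3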
I learned that my existing GGUF exports from Hugging Face safetensors using llama.cpp are bad; maybe the error I got before is a false positive, or maybe it prevents correct GGUF creation with llama.cpp.
Confirmed: llama-cli works with good GGUF models;
the prior command variant, python llama.cpp/convert-hf-to-gguf.py ./phi3 --outfile output_file.gguf --outtype q8_0, was suspect,
since ~/opt/llama.cpp-vulkan/build/bin/llama-cli -m /home/<username>/.lmstudio/models/lmstudio-community/Magistral-Small-2509-GGUF/Magistral-Small-2509-Q4_K_M.gguf -ngl 99 works.
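If I redo the conversion, it would look roughly like this (the script has been renamed to convert_hf_to_gguf.py in newer llama.cpp checkouts; the ./phi3 folder and output name are just my example):

# regenerate the GGUF from the original Hugging Face safetensors
python ~/opt/llama.cpp-vulkan/convert_hf_to_gguf.py ./phi3 --outfile phi3.q8_0.gguf --outtype q8_0
# sanity-check that llama-cli can actually load the result
~/opt/llama.cpp-vulkan/build/bin/llama-cli -m phi3.q8_0.gguf -ngl 99 -p "hello"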
This made me chuckle. My fans spin up to between the high 3000s and low 5000s RPM for the duration of a query when using gpt-oss-20b (MXFP4), and I get ~23 tok/sec. When using a Q4_K_M model (e.g., mistral-7b-instruct-v0.3), it drops to ~16 tok/sec.
You can try btop (CPU+GPU) and/or nvtop (GPU-only) to confirm GPU usage. If you have enough memory to offload your model to the GPU, that will improve performance significantly.
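Both are in the Fedora repos, so something like this is enough to watch utilisation while a prompt is running:

# install the monitors, then watch GPU usage during a query
sudo dnf install -y btop nvtop
nvtop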
As a side note, I use Gear Lever to manage AppImages. It makes the overall experience of working with them much nicer.
The model comparison table below shows that with Vulkan I’m getting roughly 70% of your throughput, though I don’t know whether you have the higher-end GPU or not. I asked the same dumb question over and over again (“you are an intelligent AI, how many licks does it take to get to the center of a sucker”) to compare the models. Obviously it recycled the answer and got cheeky.
| model | time to first token (s) | tokens/sec | memory (GB) | parameters (B) |
|---|---|---|---|---|
| gpt OSS 20B | 12.44 | 18.57 | 16.5 | 20 |
| mistral 7B Instruct v0.3 | 22.6 | 11.42 | 8.9 | 7 |
| mistral/magistral small 2509 | 85.2 | 3.65 | 20.6 | 24 |
| mistral/devstral small 2507 | 143.16 | 3.26 | 20.9 | 24 |
| microsoft/phi-4-reasoning-plus | 70.75 | 5.67 | 15.8 | 15 |
Fan use is much higher for the Mistral 24B models and the Phi 4 model. Phi 4 is odd: I dumped the markdown comparison table in and asked for the best model, and it only took 7.25 s to first token, but throughput dropped to 3.86 tokens/s. With very limited empirical testing, the other models seem more consistent.
Thanks for the help, I’ll eventually get around to testing with ROCm also.
Those numbers are pretty solid, actually. I got the Ryzen AI HX 370 with 96 GB of memory and 2 TB of storage. I also allocated the minimum VRAM (0.5 GB) in the BIOS and rely on UMA to allow the iGPU to use system memory as needed.
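If you want to confirm how much memory the iGPU can actually reach with that 0.5 GB carve-out, the amdgpu driver exposes both pools in sysfs (generic amdgpu paths, nothing Framework-specific):

# BIOS VRAM carve-out, in bytes
cat /sys/class/drm/card*/device/mem_info_vram_total
# GTT, i.e. the system RAM the iGPU can map on demand, in bytes
cat /sys/class/drm/card*/device/mem_info_gtt_total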