I managed to get Ollama running in Docker on our Framework Desktop, running Ubuntu 25.10 … and since I ran into quite a few problems, here is a short summary; maybe it helps others.
We want to run different models in Ollama on our Framework Desktop. For this, the machine runs Ollama, Open WebUI, and a few other services in Docker containers. We also want to use the GPU for this.
Have you tried llama.cpp? If not, I strongly recommend looking into it. After all, Ollama was built on top of it. They do have their own inference engine now for newer models, but llama.cpp is more lightweight, has a very active development community, and generally works better and gives you more control over parameters.
One thing ollama does better is dynamically managing loaded models, but you can achieve most of that with llama-swap.
No, I haven't tried llama.cpp … for us the Framework Desktop is more of a playground device for multiple people experimenting with very different AI ideas and seeing how they are handled by different models and hardware. So the fact that Ollama lets us very easily switch models, add new ones, and check out different model sizes and their effects … is more important than tweaking performance. And for that we have a platform here with a nice setup and very good performance; nothing like a setup with multiple dedicated GPUs, but that wasn't the goal for us.
I am certainly interested in trying llama.cpp, but for now Ollama seems like the better match for our needs.
Regarding speeds, I am unsure how to answer, since speed varies greatly depending on model, model size, and prompt … we used our own little “benchmark” prompt of “write a python function to extract email addresses from a string” for the initial setup … with the latest Llama 3 model we got ~20 T/s on the CPU and ~40 T/s now on the GPU, but we have seen much higher T/s values with small Gemma models and much lower ones with other models. I don't know how to give a general answer on speed, but again, for us it is currently more important to be able to experiment than to squeeze the absolute best performance out of the hardware.
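As an aside, the function that benchmark prompt asks for can be sketched in a few lines (a minimal regex-based version; real-world email validation is messier than this):

```python
import re

def extract_emails(text: str) -> list[str]:
    """Return all email addresses found in the given string."""
    # simple pattern: local part, '@', domain with at least one dot
    pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    return re.findall(pattern, text)

print(extract_emails("contact alice@example.com or bob@test.org"))
# -> ['alice@example.com', 'bob@test.org']
```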
Ollama for experimentation is fine, just beware that it obscures a lot of important parameters, especially context size. If you are new to LLMs, you may not realize that Ollama's default context is 4096 tokens, which is OK for simple chat but not adequate for anything else, like RAG, agentic flows, etc. So you either need to set it via request parameters (num_ctx, but only if the client supports it) or “clone” a model (really, just a modelfile) and set it there.
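For example, the per-request route is a field in the JSON body POSTed to Ollama's /api/generate endpoint; a sketch, where model name and context size are just examples:

```json
{
  "model": "llama3",
  "prompt": "write a python function to extract email addresses from a string",
  "stream": false,
  "options": {
    "num_ctx": 16384
  }
}
```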
Llama 3 models are too old now; I recommend more modern MoE models like Qwen3-30B, GLM 4.5 Air, and GPT-OSS 20B and 120B (the latter is the ideal model for the Framework Desktop). MoE models give you the quality of bigger dense models but with much faster inference speed due to the small number of active parameters.
This is only useful if you have a single model or use case. Generally, you want as much context as you can fit into VRAM, up to the max supported by the model, and that will vary depending on the model you use. Or you can use a combination of both: set up a sane default (I’d recommend at least 16K) and then create model “clones” with the max feasible context size for each model. “Clones” are in quotation marks because they won’t take extra disk space: Ollama uses an approach similar to Docker's image layers, so only the config-file layer differs for the “clone”.
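The “clone” route can look like this (a sketch; the base model name and context size are just examples):

```
# Modelfile: base model and context size are examples
FROM llama3
PARAMETER num_ctx 16384
```

Then `ollama create llama3-16k -f Modelfile` registers the clone, and `ollama show --modelfile llama3` prints the base model's modelfile if you want to start from that instead.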
Or you could spend some extra time once to set up llama-swap + llama.cpp and have fine-grained control over model parameters and better performance.
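For reference, a llama-swap setup is driven by a small YAML config mapping model names to llama-server commands (a rough sketch from memory; the path, port macro usage, and flags are placeholders to adapt to your setup):

```yaml
models:
  "qwen3-30b":
    # llama-swap substitutes ${PORT} with the port it proxies to
    cmd: "llama-server --port ${PORT} -m /models/Qwen3-30B-A3B-Q4_K_M.gguf -c 16384"
    # unload the model after 300 s of inactivity
    ttl: 300
```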
Sure … I do a local build of the Ollama image because I added and changed some logging for us, but otherwise this should pretty much be it with rjmalagon's provided image:
services:
  ollama:
    image: ghcr.io/rjmalagon/ollama-linux-amd-apu:optm-latest
    restart: unless-stopped
    environment:
      OLLAMA_FLASH_ATTENTION: "true"
      OLLAMA_DEBUG: "0"
      # setting context length to 8192 for longer data
      OLLAMA_CONTEXT_LENGTH: "8192"
    volumes:
      - ollama_storage:/root/.ollama
    devices:
      - "/dev/kfd:/dev/kfd"
      - "/dev/dri:/dev/dri"
    group_add:
      - video
    ports:
      - "11434:11434"

volumes:
  # top-level declaration for the named volume used above
  ollama_storage:
Thanks so much. This worked for me! Models are loading on my GPU now! I tried building directly from your fork, but it did not give me the same results. Is there stuff that is not yet published to the fork?
Very good point, I should update my original message … I actually switched to a branch (since rjmalagon did as well) that follows the current Ollama main branch more closely.
I think my issue building (as I did try his apu_optimizer branch) might have been that I did not set the FLAVOR? Thanks again. This is basically what I have now too!
Did you try adding the env var GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON?
With llama.cpp it allows you to allocate in RAM instead of VRAM and to use the entire memory with the ROCm/HIP backend. If Ollama has not disabled it and uses a fairly recent version, it could work. This avoids having to configure the GTT.
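In the compose file above, that would be one extra line in the environment block (an untested assumption on my side that the bundled llama.cpp build still honors the variable):

```yaml
    environment:
      OLLAMA_FLASH_ATTENTION: "true"
      OLLAMA_CONTEXT_LENGTH: "8192"
      # let the ROCm/HIP backend also allocate from system RAM (GTT)
      GGML_CUDA_ENABLE_UNIFIED_MEMORY: "ON"
```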