Running ollama in docker on our Framework Desktop using the GPU

I managed to get ollama running in Docker on our FW Desktop, running Ubuntu 25.10 … and since I had quite some trouble, here is a short summary; maybe it helps others.

We want to run different models in ollama on our Framework Desktop. For this we are running ollama, Open WebUI, and other services in Docker containers. We also want to use the GPU for this.

The default ollama ROCm Docker container did not recognize the GPU/APU at all for me. After some searching I came across GitHub - rjmalagon/ollama-linux-amd-apu, which at least recognized the APU, but the actual execution of models still happened on the CPU.
After quite some fiddling and debugging in the ollama library loading, I found that (at least for ROCm 6.4.4, which I was using) a dependency (libroctx64) was missing. I forked the great work of rjmalagon and created GitHub - phueper/ollama-linux-amd-apu with the changes I needed. I plan to create a PR for rjmalagon after some more testing, but at least for our setup this is currently working great.

Running it pretty much follows the existing documentation, but the added library was needed for the GPU model runner to work.
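If you want to check your own setup for the same problem, here is a small sketch (Python just for convenience; `find_library` consults the dynamic linker cache much like `ldconfig -p` would). On a box without ROCm it will simply report the library as missing:

```python
from ctypes.util import find_library

# Look up libroctx64 in the dynamic linker cache. If ROCm is installed
# but this comes back empty, the GPU runner can silently fall back to
# the CPU, which is what I observed.
lib = find_library("roctx64")
print(lib if lib else "libroctx64 not found")
```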

I will maybe add my docker-compose.yaml setup here as well once it is more stable (and once I have removed all the DEBUG switches I needed to find this problem) :slight_smile:

I hope this is helpful for others. If you have any questions, I can try to answer them :slight_smile:

Update: I switched to a new branch that more closely follows the ollama main branch.


What speeds are you getting with Ollama?

Have you tried llama.cpp? If not, I strongly recommend looking into it. After all, Ollama was built on top of it. They do have their own inference engine now for newer models, but llama.cpp is more lightweight, has a very active development community, and generally works better and gives you more control over parameters.

One thing ollama does better is dynamically managing loaded models, but you can achieve most of that with llama-swap.


No, I haven't tried llama.cpp. For us the Framework Desktop is more of a playground device for multiple people experimenting with very different AI ideas and seeing how they are handled by different models and hardware. So the fact that ollama allows us to very easily switch models, add new ones, and check out different model sizes and their effects is more important than tweaking the performance. And for that we have a platform here with a nice setup and very good performance; nothing like a setup with multiple dedicated GPUs, but that wasn't the goal for us.

I am certainly interested in trying llama.cpp, but for now ollama seems like the better match for our needs.

Regarding speeds, I am unsure how to answer, since speed varies greatly depending on the model, model size, and prompt. We used our own little “benchmark” prompt of “write a python function to extract email addresses from a string” for the initial setup. With the latest llama3 model we got ~20 T/s on the CPU and ~40 T/s now on the GPU, but we have seen much higher T/s values with small gemma models and much lower ones with other models. I don't know how to answer the speed question in general, but again, for us it is currently more important to be able to experiment than to get the absolute best performance out of the hardware.
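For anyone who wants to turn their own runs into comparable numbers: tokens per second is just generated tokens divided by generation time, and Ollama reports both with every response. A small sketch (the example numbers are made up, just in the ballpark of our llama3 runs):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from the stats Ollama returns with each
    /api/generate response: eval_count (tokens generated) and
    eval_duration (nanoseconds spent generating)."""
    return eval_count / (eval_duration_ns / 1e9)

# Made-up numbers roughly matching what we saw:
print(tokens_per_second(400, 10_000_000_000))  # 400 tokens in 10 s -> 40.0
print(tokens_per_second(200, 10_000_000_000))  # 200 tokens in 10 s -> 20.0
```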

Ollama for experimentation is fine; just beware that it obscures a lot of important parameters, especially context size. If you are new to LLMs, you may not realize that Ollama's default context is 4096 tokens, which is OK for simple chat but not adequate for anything else, like RAG, agentic flows, etc. So you either need to set it via request parameters (num_ctx, but only if the client supports it), or “clone” a model (really, just a modelfile) and set it there.
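For clients that let you pass raw request options, setting the context per request looks roughly like this (a sketch of a JSON body for Ollama's /api/generate endpoint; the model name and prompt are just examples):

```python
import json

def build_generate_request(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Build a JSON body for a POST to Ollama's /api/generate with a
    per-request context size via options.num_ctx."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    })

body = build_generate_request("llama3", "summarize this document", num_ctx=16384)
print(body)
```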

Llama 3 models are too old now; I recommend more modern MoE models like Qwen3-30B, GLM 4.5 Air, and GPT-OSS 20B and 120B (the latter is the ideal model for the Framework Desktop). MoE models will give you the quality of bigger dense models but with much faster inference speed due to the small number of active parameters.


Or I can set it via an environment variable to whatever I think I need:

OLLAMA_CONTEXT_LENGTH=8192

Thanks, will give those a try :+1:

This is only useful if you have a single model or use case. Generally, you want as much context as you can fit into VRAM up to the max supported by the model, and that would vary depending on the model you use. Or you can use a combination of both. Set up a sane default (I’d recommend at least 16K) and then create model “clones” with max feasible context size for each model. “Clones” are in quotation marks because they won’t take extra disk space - Ollama uses an approach similar to Docker, so only the config file layer will be different for the “clone”.
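As a concrete sketch of such a “clone” (the model name and context size here are just examples):

```
# Modelfile for a clone with a larger context window
FROM llama3
PARAMETER num_ctx 16384
```

Then `ollama create llama3-16k -f Modelfile` registers the clone; only the new config layer is stored, and the weights are shared with the original model.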

Or you could spend some extra time once to set up llama-swap + llama.cpp and have fine-grained control over model parameters and better performance. :slight_smile:


Can we please see the Docker Compose?

I got this running and it seems to recognize the GPU, but it does not load the models on the GPU; everything is still running on the CPU.


Sure … I do a local build of the ollama image because I added and changed some logging for us, but this should pretty much be it with rjmalagon's provided image:

services:
  ollama:
    image: ghcr.io/rjmalagon/ollama-linux-amd-apu:optm-latest
    restart: unless-stopped
    environment:
      OLLAMA_FLASH_ATTENTION: "true"
      OLLAMA_DEBUG: 0
      # setting Context length to 8192 for longer data
      OLLAMA_CONTEXT_LENGTH: 8192
    volumes:
      - ollama_storage:/root/.ollama
    devices:
      - "/dev/kfd:/dev/kfd"
      - "/dev/dri:/dev/dri"
    group_add:
      - video
    ports:
      - "11434:11434"


Thanks so much. This worked for me! Models are loading on my GPU now! :smiley: I tried building directly from your fork, but it did not give me the same results. Is there stuff that is not yet published to the fork?

Also, I did not add cap_add / security_opt / ipc options.

Are those needed for some reason or was that just something you were doing for debugging purposes?

Yeah, writing the message here I actually wondered whether those were needed; I guess they are just leftovers from debugging :slight_smile:

I removed them from my message with the compose file.


Very good point, I should update my original message. I actually switched to a branch (since rjmalagon did as well) that more closely follows the current ollama main branch.

I'll update my post.


Aha! THANK YOU!

I just like to build myself, and I was reading the diff in an effort to follow along. I’m something of a coder myself :slight_smile:

FWIW, this is my docker compose for building. It should work directly from my branch:

services:
  ollama:
    build:
      context: ./ollama-linux-amd-apu
      args:
        FLAVOR: rocm
    restart: unless-stopped
    environment:
      OLLAMA_FLASH_ATTENTION: "true"
      OLLAMA_DEBUG: 0
      # setting Context length to 8192 for longer data
      OLLAMA_CONTEXT_LENGTH: 8192
    volumes:
      - ./files/ollama/storage:/root/.ollama
    devices:
      - "/dev/kfd:/dev/kfd"
      - "/dev/dri:/dev/dri"
    group_add:
      - video
    ports:
      - "11434:11434"

I think my issue building (as I did try his apu_optimizer branch) might have been that I did not set the FLAVOR? Thanks again. This is basically what I have now too!


How do you have your memory configured in firmware?

Haven't changed anything (yet), so IIRC it allows up to 50% of the memory for the GPU.

Did you try adding the env var

GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON

With llama.cpp this allows you to allocate in RAM instead of VRAM and to use the entire memory with the ROCm/HIP backend. If Ollama has not disabled it and uses a fairly recent version, it could work. This avoids having to configure the GTT size.
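If someone wants to try this with the compose setups from earlier in the thread, it would just be one more entry in the environment block (untested on my side, a sketch):

```yaml
    environment:
      OLLAMA_FLASH_ATTENTION: "true"
      OLLAMA_CONTEXT_LENGTH: 8192
      # experimental: let the HIP backend allocate in unified/system memory
      GGML_CUDA_ENABLE_UNIFIED_MEMORY: "ON"
```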