I managed to get Ollama running in Docker on our Framework Desktop, running Ubuntu 25.10 … and since I ran into quite a few problems, here is a short summary; maybe it helps others.
We want to run different models in Ollama on our Framework Desktop. For this, the machine runs Ollama, Open WebUI, and a few other services in Docker containers. We also want to use the GPU for this.
Have you tried llama.cpp? If not, I strongly recommend looking into it. After all, Ollama was built on top of it. They do have their own inference engine now for newer models, but llama.cpp is more lightweight, has a very active development community, and generally works better and gives you more control over parameters.
One thing ollama does better is dynamically managing loaded models, but you can achieve most of that with llama-swap.
No, I haven't tried llama.cpp … for us the Framework Desktop is more of a playground device for multiple people experimenting with very different AI ideas and seeing how they are handled by different models and hardware. So the fact that Ollama lets us very easily switch models, add new ones, and check out different model sizes and their effects … is more important than tweaking performance. And for that we have a platform here with a nice setup and very good performance; nothing like a setup with multiple dedicated GPUs, but that wasn't the goal for us.
I am certainly interested in trying llama.cpp, but for now Ollama seems like the better match for our needs.
Regarding speeds, I am unsure how to answer, since speed varies greatly depending on model, model size, and prompt … we used our own little “benchmark” prompt of “write a python function to extract email addresses from a string” for the initial setup … with the latest Llama 3 model we got ~20 T/s on the CPU and ~40 T/s now on the GPU, but we have seen much higher T/s values with small Gemma models and much lower ones with other models. I don't know how to give a general answer on speed, but again, for us it is currently more important to be able to experiment than to squeeze the absolute best performance out of the hardware.
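As an aside, the function that benchmark prompt asks for can be sketched in a few lines (a minimal regex-based version; real-world email validation is messier than this):

```python
import re

def extract_emails(text: str) -> list[str]:
    """Return all email addresses found in the given string."""
    # simple pattern: local part, '@', domain with at least one dot
    pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    return re.findall(pattern, text)

print(extract_emails("contact alice@example.com or bob@test.org"))
# -> ['alice@example.com', 'bob@test.org']
```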
Ollama for experimentation is fine, just beware that it obscures a lot of important parameters, especially context size. If you are new to LLMs, you may not realize that Ollama's default context is 4096 tokens, which is OK for simple chat but not adequate for anything else, like RAG, agentic flows, etc. So you either need to set it via request parameters (num_ctx, but only if the client supports it) or “clone” a model (really, just a modelfile) and set it there.
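For example, the per-request route is a field in the JSON body POSTed to Ollama's /api/generate endpoint; a sketch, where model name and context size are just examples:

```json
{
  "model": "llama3",
  "prompt": "write a python function to extract email addresses from a string",
  "stream": false,
  "options": {
    "num_ctx": 16384
  }
}
```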
Llama 3 models are too old now; I recommend more modern MoE models like Qwen3-30B, GLM 4.5 Air, and GPT-OSS 20B and 120B (the latter is the ideal model for the Framework Desktop). MoE models give you the quality of bigger dense models but with much faster inference speed due to the small number of active parameters.
This is only useful if you have a single model or use case. Generally, you want as much context as you can fit into VRAM, up to the max supported by the model, and that will vary depending on the model you use. Or you can use a combination of both: set up a sane default (I’d recommend at least 16K) and then create model “clones” with the max feasible context size for each model. “Clones” are in quotation marks because they won’t take extra disk space: Ollama uses an approach similar to Docker's image layers, so only the config-file layer differs for the “clone”.
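The “clone” route can look like this (a sketch; the base model name and context size are just examples):

```
# Modelfile: base model and context size are examples
FROM llama3
PARAMETER num_ctx 16384
```

Then `ollama create llama3-16k -f Modelfile` registers the clone, and `ollama show --modelfile llama3` prints the base model's modelfile if you want to start from that instead.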
Or you could spend some extra time once to set up llama-swap + llama.cpp and have fine-grained control over model parameters and better performance.
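For reference, a llama-swap setup is driven by a small YAML config mapping model names to llama-server commands (a rough sketch from memory; the path, port macro usage, and flags are placeholders to adapt to your setup):

```yaml
models:
  "qwen3-30b":
    # llama-swap substitutes ${PORT} with the port it proxies to
    cmd: "llama-server --port ${PORT} -m /models/Qwen3-30B-A3B-Q4_K_M.gguf -c 16384"
    # unload the model after 300 s of inactivity
    ttl: 300
```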
Sure … I do a local build of the Ollama image because I added and changed some logging for us, but otherwise this should pretty much be it with rjmalagon's provided image:
services:
  ollama:
    image: ghcr.io/rjmalagon/ollama-linux-amd-apu:optm-latest
    restart: unless-stopped
    environment:
      OLLAMA_FLASH_ATTENTION: "true"
      OLLAMA_DEBUG: "0"
      # setting context length to 8192 for longer data
      OLLAMA_CONTEXT_LENGTH: "8192"
    volumes:
      - ollama_storage:/root/.ollama
    devices:
      - "/dev/kfd:/dev/kfd"
      - "/dev/dri:/dev/dri"
    group_add:
      - video
    ports:
      - "11434:11434"

volumes:
  # top-level declaration for the named volume used above
  ollama_storage:
Thanks so much. This worked for me! Models are loading on my GPU now! I tried building directly from your fork, but it did not give me the same results. Is there stuff that is not yet published to the fork?
Very good point, I should update my original message … I actually switched to a branch (since rjmalagon did as well) that follows the current Ollama main branch more closely.
I think my issue building (as I did try his apu_optimizer branch) might have been that I did not set the FLAVOR? Thanks again. This is basically what I have now too!
Did you try adding the env var GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON?
With llama.cpp it allows you to allocate in RAM instead of VRAM and to use the entire memory with the ROCm/HIP backend. If Ollama has not disabled it and uses a fairly recent version, it could work. This avoids having to configure the GTT.
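In the compose file above, that would be one extra line in the environment block (an untested assumption on my side that the bundled llama.cpp build still honors the variable):

```yaml
    environment:
      OLLAMA_FLASH_ATTENTION: "true"
      OLLAMA_CONTEXT_LENGTH: "8192"
      # let the ROCm/HIP backend also allocate from system RAM (GTT)
      GGML_CUDA_ENABLE_UNIFIED_MEMORY: "ON"
```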