This is a no-nonsense guide to getting an LLM server running quickly on a Strix Halo system (like the Framework Desktop). I got frustrated that most of the tutorials out there were over-complicated, required Docker containers for everything, or were video-only, so here you go.
This tutorial is for folks new to the scene or those who just want to get running quickly. I figure a sizable share of folks just want an LLM server without fiddling with a thousand different knobs. Don’t let this dissuade you from experimenting or trying to get better performance.
Avoiding ROCm For The Time Being
As of writing this guide, ROCm (AMD’s answer to NVIDIA’s CUDA) is unstable and a royal pain in the rear to get configured correctly. It crashes, has problems with many model backends, and is just not worth my own time on the Strix Halo. It might be great down the road, but it’s currently awful and somehow more frustrating than NVIDIA’s Linux drivers.
There are some great folks in the open source community doing great work providing toolboxes and patches to alleviate some of the headaches, and all props to them. But Vulkan currently just works, and its performance is usually the same as or better than ROCm on the Strix Halo chipsets.
Picking an OS
Point blank, I’d recommend Ubuntu 25.10 as of writing. If there is a newer version of Ubuntu when you read this, pick that one. Three things worth mentioning:
- You want a distro that has newer and regular Linux kernel updates
- The Strix Halo / Ryzen AI Max 395 does not have mature support right now and will not run as efficiently or as stably as it will with newer kernels.
- Ubuntu has one of the largest user and developer bases which means generally timely bugfixes/improvements and wide software compatibility.
- If you don’t like Ubuntu, no hate - different folks have different preferences and that is fine.
- I do not recommend enabling full-disk encryption, as it incurs a performance hit when loading models.
- Enable it if you need to, but for most use cases you won’t need it.
There are endless guides on how to install Linux distros, so find one if you are unfamiliar. Unless you are dual booting, it’s pretty straightforward.
Update Your Software
This is necessary and should be done at least once a month. Open the terminal and run the following to update your repository information and then update your software. This may take anywhere from a minute to longer depending on your internet speeds. You will need to provide your administrator password.
sudo apt update
sudo apt upgrade -y
Update the BIOS
Updating the BIOS allows for performance increases from AMD firmware updates as well as new features like higher USB charging limits on the Framework Desktop. You should generally update the BIOS anytime an update is available, for the security and performance fixes. That said, do your research before upgrading, as certain versions may have issues.
While I have not seen any problems personally, others have found that v3.04 has long boot times and other goofiness. The second command in this section will tell you which BIOS version would be installed if you choose to go through with it on the third command.
This may take a few minutes and will require a reboot. If an update is applied, your machine may appear belly up for a couple of minutes after the firmware is flashed – this is normal. Do not unplug or touch anything; allow it to do its thing.
sudo fwupdmgr refresh --force
sudo fwupdmgr get-updates
sudo fwupdmgr update
- It will ask to reboot; reply “y” and let it go for several minutes.
Setting the GPU Memory Allocation
After the reboot, we need to set the memory allocated to the GPU. There are a few different ways to do this; the one I chose is through the BIOS.
- Reboot the computer and mash [F2] until the BIOS comes up.
- Go to Setup Utility >> Advanced.
- Set the iGPU Memory Configuration to “Custom”.
- Set the iGPU Memory Size to “96 GB”.
- If you are only running smaller models or have a system with less RAM, pick something that makes sense.
- Go to Exit >> Exit Saving Changes and select “Yes”.
In the future, automatic memory allocation may be better supported and this won’t be necessary, but for now, this is the easiest way. The options may also change with future BIOS updates.
Installing Basic Tools
After the machine comes back from the BIOS update step, open a terminal and install curl so you can pull most of the AI tools and run scripts. libfuse2 is optional, but will allow you to run various GUIs common in AI tooling. nvtop lets you monitor your GPU usage to make sure that you are actually running on the GPU and not the CPU in later steps.
sudo apt install curl nvtop libfuse2 -y
Installing Ollama Server
Install Ollama using their instructions at https://ollama.com/download. At the time of writing, open a terminal and run the following. You will be prompted for your administrator password.
curl -fsSL https://ollama.com/install.sh | sh
Unless Ollama changes much between writing this guide and when you find it, this will install Ollama as well as make and enable a background service.
Testing the Basic Install
We will need to make some changes to the service in a moment, but first we need to test it. I recommend pulling down IBM’s Granite 4 Tiny and running it because it is small and agile.
Open a terminal and run:
ollama pull granite4:tiny-h
ollama run granite4:tiny-h
Give it a basic prompt (like “Tell me a short story.”) and make sure it spits out something that matches your prompt. When you are done, type /bye to exit the chat.
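Besides the interactive CLI, Ollama exposes an HTTP API on port 11434, which is handy for scripted smoke tests. Below is a minimal Python sketch against the /api/generate endpoint using only the standard library; the model name matches the one pulled above, and the helper names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default Ollama listen address

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send one prompt to the local Ollama server and return the reply text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (requires the running server and model from the steps above):
# print(generate("granite4:tiny-h", "Tell me a short story."))
```

If this returns text, your server works end to end, not just from the CLI.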
Enabling GPU Support (Vulkan), Network Support, and Flash Attention
Now we need to enable Vulkan, allow the Ollama server to talk to your local network, and enable flash attention. Flash attention is optional, but recommended. In a terminal, enter the following:
sudo systemctl edit ollama
You’ll get a nano editor with the service’s settings. Add the following lines after the “### Anything between here and the comment below<…>” comment to set environment variables.
- The OLLAMA_HOST line will allow anything on your local network (the 0.0.0.0 part) to find and talk to the server. If you are working purely local to your machine, don’t add this.
- The OLLAMA_VULKAN line tells Ollama to use Vulkan rather than ROCm, which has been problematic.
- The OLLAMA_FLASH_ATTENTION line will enable flash attention, which is ideal for larger contexts.
- The OLLAMA_CONTEXT_LENGTH line allows you to set the context window size. This line is optional and will depend on your needs.
- Note that not all models will support large contexts. Change this value up or down to suit your needs. Be aware that large contexts may also degrade performance.
- The default if you don’t add this line is 4096 tokens which is not many. 32k is a good number for general coding. I typically set mine to 50k or more if the model can handle it.
### Editing /etc/systemd/system/ollama.service.d/override.conf
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_VULKAN=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_CONTEXT_LENGTH=32768"
### Edits below this comment will be discarded
Press [CONTROL] + [o], then [ENTER] to save, and then [CONTROL] + [x] to exit.
Either reboot the machine or run the following to reload and restart the service. If you make changes to the service configuration, you will need to do this after each change.
sudo systemctl daemon-reload
sudo systemctl restart ollama
Check to make sure the service is still running. If not, this should give you information about what happened. You are looking for something near the top that says “Active: active (running)”. Press [q] when done looking at the status.
sudo systemctl status ollama
Next we check to make sure that the GPU is being utilized. In a terminal window, run:
nvtop
Look at memory utilization near the top. On mine, I have GPU[ 1%] MEM[ 0.483Gi/96.000Gi].
In a different terminal window, start an Ollama chat:
ollama run granite4:tiny-h
Looking at the nvtop window, your memory usage should have just jumped to something like 4.967Gi. If you give it a prompt, you should be able to watch your GPU utilization jump while the prompt is processed as well.
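One side note on the OLLAMA_CONTEXT_LENGTH setting: it sets a server-wide default, but Ollama’s API also accepts a per-request override through the options field (num_ctx). A small Python sketch of building such a request payload (the helper name is mine):

```python
def build_chat_request(model: str, messages: list, num_ctx: int = 32768) -> dict:
    """Payload for Ollama's /api/chat endpoint with an explicit context size."""
    return {
        "model": model,
        "messages": messages,  # e.g. [{"role": "user", "content": "Hi"}]
        "stream": False,
        # "options" overrides server defaults for this request only;
        # num_ctx is Ollama's name for the context window size.
        "options": {"num_ctx": num_ctx},
    }
```

POST this as JSON to http://your-server:11434/api/chat. Remember that the model must actually support the context size you request.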
Installing a GUI (AnythingLLM)
This is optional. If you want something similar to ChatGPT’s prompt window with some great features, AnythingLLM is great. Not only does it give you a GUI to manage conversations, but it lets you play with things like RAG databases without much trouble. Like with the distros, pick something you like - AnythingLLM is just easy to install and use.
To install, visit https://docs.anythingllm.com/installation-desktop/linux#install-using-the-installer-script and follow the instructions. At the time of writing, open a terminal and run:
curl -fsSL https://cdn.anythingllm.com/latest/installer.sh -o installer.sh
chmod +x installer.sh
./installer.sh
- When it asks you if you want to create an AppArmor profile, answer “y” (yes).
- You can start it from the terminal if you’d like by answering “y” to the next question.
To open AnythingLLM, go to your app launcher, find AnythingLLM, and click it. To configure:
- Click the “Get Started” button when it launches.
- You’ll get a page that says “LLM Preference”.
- Scroll down and select “Ollama”.
- If you are on the same machine, it should automatically find the Ollama server.
- If you are on a different machine, go into the advanced settings, and add the IP address and port number for the machine you hosted Ollama on.
- Click the right arrow button twice and skip the survey. Come up with a workspace name and click the right arrow.
- From there, click the workspace name on the left to open a prompt window.
- Click the brain icon just under where you’d type a message to select your model, type a prompt, and off you go!
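If AnythingLLM runs on a different machine than Ollama, it helps to confirm the endpoint is reachable before blaming the GUI. Here is a quick sketch using only the Python standard library and Ollama’s /api/version endpoint (the function names are mine):

```python
import json
import urllib.error
import urllib.request

def ollama_base_url(host: str, port: int = 11434) -> str:
    """Build the base URL AnythingLLM needs for a remote Ollama server."""
    return f"http://{host}:{port}"

def check_ollama(base_url: str) -> bool:
    """Return True if an Ollama server answers at base_url/api/version."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as r:
            return "version" in json.load(r)
    except (urllib.error.URLError, TimeoutError, ValueError):
        return False

# Example:
# print(check_ollama(ollama_base_url("192.168.1.50")))
```

If this prints False, check the OLLAMA_HOST setting from earlier, the port number, and any firewall between the two machines.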
Other Recommendations
From here, you are basically done. I would recommend a few things to smooth your experience.
- Make the IP address for your machine static so that it is the same on your network every time you want to use it.
- This can be done either on your router (e.g., a DHCP reservation) or by setting a static IP in your OS’s settings.
- It is a good idea to use something other than port 11434 for your Ollama server.
- Pick a random high-numbered port below 65535, set it in the OLLAMA_HOST line from earlier, and remember to use it when setting up other tools to work with Ollama.
- It is wise to install clamav and configure it to update and run nightly.
- Update your software and bios regularly.
- It’s really easy to forget to maintain computers which makes you a REALLY easy cyber target. Set an appointment on your phone calendar to update your machine at least once a month.
- Go back into your BIOS and disable USB boot after you are done with an OS install.
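For the non-default-port suggestion above, here is a small illustrative Python sketch that picks a random free high port; the avoid list is just an example, not exhaustive:

```python
import random
import socket

# Defaults worth avoiding on a home LLM box (illustrative, not exhaustive).
AVOID = {11434, 8080, 8000, 3000}

def pick_ollama_port() -> int:
    """Pick a random high port that is currently free and not on the avoid list."""
    while True:
        port = random.randint(20000, 65000)
        if port in AVOID:
            continue
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            # connect_ex returns nonzero (refused) when nothing is listening.
            if s.connect_ex(("127.0.0.1", port)) != 0:
                return port
```

Whatever port you land on, put it in the OLLAMA_HOST line (e.g., 0.0.0.0:PORT) and in every client that talks to the server.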
Model Safety Reminder
One last bit since I’m assuming this guide will cater to folks new to AI and LLM models – be very, very wary of models you find online that aren’t from a reputable source, especially if using them as agents with access to files or the command line. Just because they said they were tuned and are magically delicious doesn’t mean that they are what you are expecting. It’s entirely possible and in some cases probable that the models were tweaked to intentionally provide incorrect or biased output (outside of hallucinations) and potentially poisoned in ways that could damage your data, your machine, or your network. It sucks that not everything is trustworthy, but that’s reality.
Treat models like software. You don’t (or shouldn’t…) install random software from some random website off the internet - don’t just run random models. There is no way to scan models with an antivirus to make sure they aren’t evil. Be smart and be safe, especially if the computer is on a corporate network.
Edits
- Added a warning about BIOS issues and made some grammar updates on 25 Dec 2025.
- Added setting for context window size on 26 Dec 2025.