This is a no-nonsense guide to getting an LLM server running quickly on a Strix Halo system (like the Framework Desktop). I got frustrated that most of the tutorials out there were over-complicated, required Docker containers for everything, or were video-only, so here you go.
This tutorial is for folks new to the scene or those who just want to get running quickly. I figure a sizable share of folks just want an LLM server without fiddling with a thousand different knobs. Don’t let this dissuade you from experimenting or trying to get better performance.
Avoiding ROCm For The Time Being
As of writing this guide, ROCm (AMD’s answer to NVIDIA’s CUDA) is unstable and a royal pain in the rear to get configured correctly. It crashes, has problems with many model backends, and is just not worth my own time on the Strix Halo. It might be great down the road, but it’s currently awful and somehow more frustrating than NVIDIA’s Linux drivers.
There are some great folks in the open source community doing great work providing toolboxes and patches to alleviate some of the headaches, and all props to them. But Vulkan currently just works, and its performance is usually the same as or better than ROCm on the Strix Halo chipsets.
Picking an OS
Point blank, I’d recommend Ubuntu 25.10 as of writing. If there is a newer version of Ubuntu when you read this, pick that one. Three things worth mentioning:
- You want a distro that has newer and regular Linux kernel updates
- The Strix Halo / Ryzen AI Max 395 does not have mature support right now and will not run as efficiently or as stably as it will with newer kernels.
- Ubuntu has one of the largest user and developer bases which means generally timely bugfixes/improvements and wide software compatibility.
- If you don’t like Ubuntu, no hate - different folks have different preferences and that is fine.
- I do not recommend enabling full-disk encryption, as it incurs a performance hit when loading models.
- Enable it if you need to, but for most use cases you won’t need it.
There are endless guides on how to install Linux distros, so find one if you are unfamiliar. Unless you are dual booting, it’s pretty straightforward.
Update Your Software
This is necessary and should be done at least once a month. Open the terminal and run the following to update your repository information and then update your software. This may take anywhere from a minute to longer depending on your internet speeds. You will need to provide your administrator password.
sudo apt update
sudo apt upgrade -y
Update the BIOS
Updating the BIOS allows for performance increases from AMD firmware updates as well as new features like higher USB charging limits on the Framework Desktop. You should generally update the BIOS anytime an update is available, for the security and performance fixes. That said, do your research before upgrading, as certain versions may have issues.
While I have not seen any problems personally, others have found that v3.04 has long boot times and other goofiness. The second command in this section will tell you which BIOS version would be installed if you choose to go through with it on the third command.
This may take a few minutes and will require a reboot. If an update is applied, your machine may appear belly up for a couple of minutes after the firmware is flashed – this is normal. Do not unplug or touch anything; allow it to do its thing.
sudo fwupdmgr refresh --force
sudo fwupdmgr get-updates
sudo fwupdmgr update
- It will ask to reboot; reply “y” and let it go for several minutes.
Setting the GPU Memory Allocation
After the reboot, we need to set the memory allocated to the GPU. There are a few different ways to do this; the one I chose is through the BIOS.
- Reboot the computer and mash [F2] until the BIOS comes up.
- Go to Setup Utility >> Advanced.
- Set the iGPU Memory Configuration to “Custom”.
- Set the iGPU Memory Size to “96 GB”.
- If you are only running smaller models or have a system with less RAM, pick something that makes sense.
- Go to Exit >> Exit Saving Changes and select “Yes”.
In the future, automatic memory allocation may be better supported and this won’t be necessary, but for now, this is the easiest way. The options may also change with future BIOS updates.
Installing Basic Tools
After the machine comes back from the BIOS update step, open a terminal and install curl so you can pull most of the AI tools and run scripts. libfuse2 is optional, but will allow you to run various GUIs common in AI tooling. nvtop lets you monitor your GPU usage to make sure that you are actually running on the GPU and not the CPU in later steps.
sudo apt install curl nvtop libfuse2 -y
Installing Ollama Server
Install Ollama using their instructions at https://ollama.com/download. At the time of writing, open a terminal and run the following. You will be prompted for your administrator password.
curl -fsSL https://ollama.com/install.sh | sh
Unless Ollama changes much between writing this guide and when you find it, this will install Ollama as well as make and enable a background service.
Testing the Basic Install
We will need to make some changes to the service in a moment, but first we need to test it. I recommend pulling down IBM’s Granite 4 Tiny and running it because it is small and agile.
Open a terminal and run:
ollama pull granite4:tiny-h
ollama run granite4:tiny-h
Give it a basic prompt (like “Tell me a short story.”) and make sure it spits out something that matches your prompt. When you are done, type /bye to exit the chat.
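Besides the interactive CLI, Ollama exposes an HTTP API on port 11434, which is handy for scripted smoke tests. Below is a minimal Python sketch against the /api/generate endpoint using only the standard library; the model name matches the one pulled above, and the helper names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default Ollama listen address

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send one prompt to the local Ollama server and return the reply text."""
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (requires the running server and model from the steps above):
# print(generate("granite4:tiny-h", "Tell me a short story."))
```

If this returns text, your server works end to end, not just from the CLI.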
Enabling GPU Support (Vulkan), Network Support, and Flash Attention
Now we need to enable Vulkan, allow the Ollama server to talk to your local network, and enable flash attention. Flash attention is optional, but recommended. In a terminal, enter the following:
sudo systemctl edit ollama
You’ll get a nano editor with the service’s settings. Add the following lines after the “### Anything between here and the comment below<…>” comment to set environment variables.
- The OLLAMA_HOST line will allow anything on your local network (the 0.0.0.0 part) to find and talk to the server. If you are working purely local to your machine, don’t add this.
- The OLLAMA_VULKAN line tells Ollama to use Vulkan rather than ROCm, which has been problematic.
- The OLLAMA_FLASH_ATTENTION line will enable flash attention, which is ideal for larger contexts.
- The OLLAMA_CONTEXT_LENGTH line allows you to set the context window size. This line is optional and will depend on your needs.
- Note that not all models will support large contexts. Change this value up or down to suit your needs. Be aware that large contexts may also degrade performance.
- The default if you don’t add this line is 4096 tokens which is not many. 32k is a good number for general coding. I typically set mine to 50k or more if the model can handle it.
### Editing /etc/systemd/system/ollama.service.d/override.conf
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_VULKAN=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_CONTEXT_LENGTH=32768"
### Edits below this comment will be discarded
Press [CONTROL] + [o], then [ENTER] to save, and then [CONTROL] + [x] to exit.
Either reboot the machine or run the following to reload and restart the service. If you make changes to the service configuration, you will need to do this after each change.
sudo systemctl daemon-reload
sudo systemctl restart ollama
Check to make sure the service is still running. If not, this should give you information about what happened. You are looking for something near the top that says “Active: active (running)”. Press [q] when done looking at the status.
sudo systemctl status ollama
Next we check to make sure that the GPU is being utilized. In a terminal window, run:
nvtop
Look at memory utilization near the top. On mine, I have GPU[ 1%] MEM[ 0.483Gi/96.000Gi].
In a different terminal window, start an Ollama chat:
ollama run granite4:tiny-h
Looking at the nvtop window, your memory usage should have just jumped to something like 4.967Gi. If you give it a prompt, you should be able to watch your GPU utilization jump while the prompt is processed as well.
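One side note on the OLLAMA_CONTEXT_LENGTH setting: it sets a server-wide default, but Ollama’s API also accepts a per-request override through the options field (num_ctx). A small Python sketch of building such a request payload (the helper name is mine):

```python
def build_chat_request(model: str, messages: list, num_ctx: int = 32768) -> dict:
    """Payload for Ollama's /api/chat endpoint with an explicit context size."""
    return {
        "model": model,
        "messages": messages,  # e.g. [{"role": "user", "content": "Hi"}]
        "stream": False,
        # "options" overrides server defaults for this request only;
        # num_ctx is Ollama's name for the context window size.
        "options": {"num_ctx": num_ctx},
    }
```

POST this as JSON to http://your-server:11434/api/chat. Remember that the model must actually support the context size you request.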
Installing a GUI (AnythingLLM)
This is optional. If you want something similar to ChatGPT’s prompt window with some great features, AnythingLLM is great. Not only does it give you a GUI to manage conversations, but it lets you play with things like RAG databases without much trouble. Like with the distros, pick something you like - AnythingLLM is just easy to install and use.
To install, visit https://docs.anythingllm.com/installation-desktop/linux#install-using-the-installer-script and follow the instructions. At the time of writing, open a terminal and run:
curl -fsSL https://cdn.anythingllm.com/latest/installer.sh -o installer.sh
chmod +x installer.sh
./installer.sh
- When it asks you if you want to create an AppArmor profile, answer “y” (yes).
- You can start it from the terminal if you’d like by answering “y” to the next question.
To open AnythingLLM, go to your app launcher, find AnythingLLM, and click it. To configure:
- Click the “Get Started” button when it launches.
- You’ll get a page that says “LLM Preference”.
- Scroll down and select “Ollama”.
- If you are on the same machine, it should automatically find the Ollama server.
- If you are on a different machine, go into the advanced settings, and add the IP address and port number for the machine you hosted Ollama on.
- Click the right arrow button twice and skip the survey. Come up with a workspace name and click the right arrow.
- From there, click the workspace name on the left to open a prompt window.
- Click the brain icon just under where you’d type a message to select your model, type a prompt, and off you go!
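If AnythingLLM runs on a different machine than Ollama, it helps to confirm the endpoint is reachable before blaming the GUI. Here is a quick sketch using only the Python standard library and Ollama’s /api/version endpoint (the function names are mine):

```python
import json
import urllib.error
import urllib.request

def ollama_base_url(host: str, port: int = 11434) -> str:
    """Build the base URL AnythingLLM needs for a remote Ollama server."""
    return f"http://{host}:{port}"

def check_ollama(base_url: str) -> bool:
    """Return True if an Ollama server answers at base_url/api/version."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as r:
            return "version" in json.load(r)
    except (urllib.error.URLError, TimeoutError, ValueError):
        return False

# Example:
# print(check_ollama(ollama_base_url("192.168.1.50")))
```

If this prints False, check the OLLAMA_HOST setting from earlier, the port number, and any firewall between the two machines.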
Other Recommendations
From here, you are basically done. I would recommend a few things to smooth your experience.
- Make the IP address for your machine static so that it is the same on your network every time you want to use it.
- This can be done either on your router (e.g., a DHCP reservation) or by setting a static IP in your OS’s settings.
- It is a good idea to use something other than port 11434 for your Ollama server.
- Pick a random high-numbered port below 65535, set it in the OLLAMA_HOST line from earlier, and remember to use it when setting up other tools to work with Ollama.
- It is wise to install clamav and configure it to update and run nightly.
- Update your software and bios regularly.
- It’s really easy to forget to maintain computers which makes you a REALLY easy cyber target. Set an appointment on your phone calendar to update your machine at least once a month.
- Go back into your BIOS and disable USB boot after you are done with an OS install.
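For the non-default-port suggestion above, here is a small illustrative Python sketch that picks a random free high port; the avoid list is just an example, not exhaustive:

```python
import random
import socket

# Defaults worth avoiding on a home LLM box (illustrative, not exhaustive).
AVOID = {11434, 8080, 8000, 3000}

def pick_ollama_port() -> int:
    """Pick a random high port that is currently free and not on the avoid list."""
    while True:
        port = random.randint(20000, 65000)
        if port in AVOID:
            continue
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            # connect_ex returns nonzero (refused) when nothing is listening.
            if s.connect_ex(("127.0.0.1", port)) != 0:
                return port
```

Whatever port you land on, put it in the OLLAMA_HOST line (e.g., 0.0.0.0:PORT) and in every client that talks to the server.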
Model Safety Reminder
One last bit since I’m assuming this guide will cater to folks new to AI and LLM models – be very, very wary of models you find online that aren’t from a reputable source, especially if using them as agents with access to files or the command line. Just because they said they were tuned and are magically delicious doesn’t mean that they are what you are expecting. It’s entirely possible and in some cases probable that the models were tweaked to intentionally provide incorrect or biased output (outside of hallucinations) and potentially poisoned in ways that could damage your data, your machine, or your network. It sucks that not everything is trustworthy, but that’s reality.
Treat models like software. You don’t (or shouldn’t…) install random software from some random website off the internet - don’t just run random models. There is no way to scan models with an antivirus to make sure they aren’t evil. Be smart and be safe, especially if the computer is on a corporate network.
Edits
- Added a warning about BIOS issues and made some grammar updates on 25 Dec 2025.
- Added setting for context window size on 26 Dec 2025.