Using a Framework Desktop for local AI


We know a lot of you are exploring Framework Desktop to crunch machine learning and local AI inference workloads right on your desk. This is a topic that goes deep, and we’re going to build out a series of guides and videos to help you get started. In this first one, we’ll go through the basics of getting started in the easiest way possible with local Large Language Models (LLMs) on Windows. In future guides, we’ll go deeper into code generation, image/video generation, running models on Linux, and clustering multiple Framework Desktops to handle massive models.

Why build a local AI PC?

First, you may be wondering why you’d even want to run AI locally, given how many cloud-based services and applications there are. One of the main reasons is the privacy you get from being able to keep all of your data local. Beyond that, you also have deeper control over model selection and modification, including being able to download, modify, and run uncensored models. If you’re running AI constantly and heavily, you may also save money running locally rather than paying for cloud time. Finally, because you’re running AI locally, you can also run it fully offline, making it useful off grid or as a backup source of knowledge when infrastructure is down.

Traditionally, one of the big challenges with running AI inference locally is being able to run large models. Larger models in general have a deeper set of knowledge to draw from, but require substantial amounts of memory. Consumer graphics cards have plenty of compute and memory bandwidth to crunch AI inference, but are typically limited in memory capacity to 8, 16, or 24GB. The Ryzen AI Max in Framework Desktop has configurations with up to 128GB of memory, allowing much larger models. We’ll get deeper into model selection and tradeoffs later in this guide.

Getting started with the basics of local LLMs

There are a huge number of applications and toolkits available for running AI models locally. The simplest application we’ve found to get started with for the text and code generation AI use case is LM Studio. It’s built on top of llama.cpp, which is an extremely powerful and extensible open source inference library that we’ll go into in a future guide. LM Studio packages it up in a user-friendly application that runs on both Windows and Linux. With this guide, we’ll focus on Windows, but the same settings will work on Linux. Note that as of June 2025, inference runs about 20% faster on Fedora 42 than on Windows 11.

Once you download, install, and open LM Studio, you’ll be presented with a startup screen that asks you if you’d like to get started with your first LLM. LM Studio is good at keeping up to date with the latest models, so it’s usually reasonable to start by downloading their recommended one, and then clicking Start New Chat.

If you’ve installed your Framework Desktop Driver Bundle, LM Studio should already detect your GPU AI acceleration capabilities and enable the relevant runtime for it. Before loading and running the model, you should make sure that LM Studio is fully offloading the model onto the GPU. Click on the “Select a model to load” dropdown at the top, toggle “Manually choose model load parameters”, and click on the arrow next to the model you’d like to configure.

Slide the GPU Offload slider to the maximum number, toggle the “Remember settings” selection, and click “Load Model”. You can then type into the chat box and start chatting with the LLM locally! Note that you can also attach text or PDF files for analysis, and for “vision” models, images too.

Selecting AI models to run locally

Where running AI locally really gets interesting is in the breadth of models that are available. LM Studio has a convenient feature to search for and download models from Hugging Face, which is a large community around AI models and data. If you click on the magnifying glass icon in the left sidebar, that takes you to the Discover tab. The default list is a set of recommended models from the LM Studio team that are usually excellent choices, but you can also search beyond that. Let’s pick a few example models that are optimized for different tasks.

First, let’s pick Mistral Small 3.2, which is a 24B open-weights model from MistralAI. 24B indicates that it’s a model that contains 24 billion parameters. The larger a model is (the more parameters), in general, the smarter it can be. However, a larger parameter count means both that it needs more memory to load and that it will run slower, since each token of text generation needs to be processed through the entire set of parameters. Getting to 10 tokens per second (tok/s) or higher of output speed is a good target, since it means the model will be generating text at least as quickly as you can read it. One way to increase speed is through quantization, which represents a model in a smaller number of bits per parameter while slightly reducing accuracy. In general, you can run models at Q6 (6-bit quantization) without noticeable degradation. To download the Q6_K version of Mistral Small 3.2, select it from the dropdown and download it.
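To make the memory side of that tradeoff concrete, here is a minimal sketch that estimates the size of a model’s weights from its parameter count and quantization level. The bits-per-weight figures are assumptions based on the usual llama.cpp block layouts (they include the per-block scales), and real GGUF files mix quantization types across tensors, so actual download sizes will differ somewhat.

```python
# Approximate bits stored per parameter for a few formats.
# These values are assumptions for illustration, not figures from this guide.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
}


def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # A 24B model such as Mistral Small 3.2 at different precisions.
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"24B at {name}: {weights_size_gb(24, bits):.1f} GB")
```

Under these assumptions a 24B model drops from 48 GB at 16 bits to a little under 20 GB at Q6_K, which is why quantization matters so much for both fitting a model in memory and generating tokens quickly.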

Going back to the chat tab, you can then unload any previously loaded model and load Mistral Small 3.2, making sure to adjust the settings to use full GPU Offload. A Framework Desktop will currently run this model at just around 10 tok/s (12 tok/s on Linux). We expect that number to climb over time as AMD, llama.cpp, and LM Studio continue to mature the AI inference stack for performance, but in the meantime, in the next section we’ll go over ways to optimize performance.

Beyond LM Studio’s staff picks like Mistral Small 3.2, you can browse around Hugging Face for models to download and run. The community-created leaderboards can be good places to find new models to use, like this code generation leaderboard or this uncensored one. When you find which model you’d like, you can download it directly from LM Studio by searching for it, selecting the right quantization, and downloading. Models in the 20-30B parameter count should run at “real time” speed at Q6, but you can also download even larger models like 70B ones if you are ok with waiting longer for responses to be generated.

Advanced model selection and performance optimization

What if you want to run even bigger models while still keeping speed high? One path to run faster local inference is through using Mixture of Experts (MoE) models. These are models which have a larger number of total parameters, but a smaller number that are active on any specific token. We’ll start with Qwen3 30B A3B to test this. This is an open-weights MoE model from Alibaba Cloud that has 30B total parameters with 3.3B active.

You’ll notice when running this that it is a reasoning model, which means it has a thinking phase where it breaks down and thinks through your prompt step by step before answering. It’s especially helpful to have an MoE model for this, since the thinking phase could otherwise be slow. Framework Desktop is especially well suited for MoE models, since you can configure it with a large amount of memory, and the smaller active parameter count means it can run faster. A Framework Desktop can run this model at around 40 tok/s (48 tok/s on Linux).

If you really want to push this to the limit, the single strongest model you can run on Linux on a 128GB Framework Desktop today (June 2025) is likely Llama 4 Scout 17B 16E, which is a 109B parameter model with 17B active parameters! At Q6 on LM Studio on Fedora 42, this runs at >14 tok/s! That is a massive model running at real time interaction speeds.
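A rough way to see why active parameter count drives speed: if every generated token has to read all active weights from memory once, memory bandwidth sets a ceiling on tokens per second. The sketch below is a back-of-envelope estimate only. It assumes 256GB/s of bandwidth, about 6.56 bits per weight for Q6_K (an assumption, not a figure from this guide), and ignores compute time, context cache reads, and any other overhead, so real speeds will land below it.

```python
MEMORY_BANDWIDTH_GB_S = 256.0   # Framework Desktop memory bandwidth
Q6_K_BITS_PER_WEIGHT = 6.5625   # assumed average for Q6_K, including block scales


def tok_per_s_ceiling(active_params_billion: float,
                      bits_per_weight: float = Q6_K_BITS_PER_WEIGHT,
                      bandwidth_gb_s: float = MEMORY_BANDWIDTH_GB_S) -> float:
    """Upper bound on tokens/s if each token reads all active weights once."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    models = {
        "Mistral Small 3.2 (24B dense)": 24.0,
        "Qwen3 30B A3B (3.3B active)": 3.3,
        "Llama 4 Scout (17B active)": 17.0,
    }
    for name, active_b in models.items():
        print(f"{name}: ceiling of about {tok_per_s_ceiling(active_b):.0f} tok/s")
```

The ceilings come out to roughly 13, 95, and 18 tok/s. The measured numbers in this guide sit below each of them, and the gap is widest for the small-active-parameter MoE model, where overheads other than weight reads take up a larger share of each token.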

When loading a model, you can also toggle “Show advanced settings” to go deeper into optimization. Two settings you may find yourself adjusting are Context Length and Flash Attention. Context Length is effectively the attention span of the model, so having longer length helps a lot for both conversation/roleplay and code generation use cases. Increasing context length can substantially increase memory usage, but enabling Flash Attention helps mitigate that.
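To get a feel for how context length affects memory, the sketch below estimates the size of the key/value cache that a transformer keeps for every token in the context. The layer count, head count, and head size are hypothetical values chosen for illustration, not the dimensions of any model named in this guide, and the estimate does not model the effect of Flash Attention.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_length: int, bytes_per_value: int = 2) -> float:
    """Approximate key/value cache size in GB (10^9 bytes).

    The leading 2 accounts for one key and one value vector per layer,
    per KV head, per token. bytes_per_value=2 assumes 16-bit cache entries.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_length * bytes_per_value)
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    layers, kv_heads, dim = 40, 8, 128
    for ctx in (4096, 32768, 131072):
        print(f"context {ctx}: {kv_cache_gb(layers, kv_heads, dim, ctx):.2f} GB")
```

The cache grows linearly with context length, so with these assumed dimensions a 32x longer context needs 32x more cache memory on top of the model weights.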

Configuring a Framework Desktop as a local AI PC

With up to 128GB of memory, 256GB/s of memory bandwidth, and a big integrated GPU on Ryzen AI Max, Framework Desktop is a great fit for running AI locally. With AMD’s Variable Graphics Memory functionality, up to 112GB of that memory is addressable by the GPU! In AMD Software: Adrenalin Edition, you can set Dedicated Graphics Memory as high as 96GB, and up to half of the remaining System Memory will also be used by the GPU.
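The 112GB figure follows from the two numbers in that paragraph. As a small sketch, assuming the GPU can use exactly half of whatever system memory is left after the dedicated allocation (the guide says “up to half”, so treat this as an upper bound):

```python
def gpu_addressable_gb(total_gb: float, dedicated_gb: float) -> float:
    """Dedicated graphics memory plus half of the remaining system memory."""
    remaining_system_gb = total_gb - dedicated_gb
    return dedicated_gb + remaining_system_gb / 2


if __name__ == "__main__":
    # 128GB machine with Dedicated Graphics Memory set to 96GB.
    print(gpu_addressable_gb(128, 96))
```

With 96GB dedicated on a 128GB machine, the remaining 32GB contributes another 16GB, for 112GB in total.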

We have 32GB, 64GB, and 128GB configurations of Framework Desktop. All three have the same 256GB/s of memory bandwidth, which will typically be the performance bottleneck for LLM inference. The 64GB and 128GB have slightly larger integrated GPUs (40CU instead of 32CU), which matters more for AI workloads like image generation but less for text generation. This means that overall, you should select your configuration primarily based on how large a model you want to be able to run. As noted before, 20-30B parameter models are in a sweet spot that enables both real time interaction and strong capability. Those fit well on 64GB and 128GB configurations, leaving room for higher context lengths or multitasking. To run 70B-class and larger models, you’ll need the 128GB configuration.

Aside from that, when configuring your Framework Desktop DIY Edition, you’ll want to make sure you have enough storage space for all of the models you’ll be downloading, so 1TB or more is helpful. Note that there are two NVMe storage slots, so you can max out at up to 2x 8TB.

For OS selection, both Windows and Linux work well with applications like LM Studio, but if you want to go deeper into using ROCm or PyTorch, you may find the development environment in a recent Linux distro like Fedora 42 to be smoother. As noted earlier, inference on Fedora 42 is also currently about 20% faster than on Windows, though we expect speed on both to continue to improve as AMD drives optimizations throughout the stack.

That’s it for this first intro guide. We’ll be continuing the series with additional guides around more local AI use cases.


Probably a mistake as you meant Desktop here.

The post is nice and useful as an AI guide for any computer really, thanks for that!


Oops :see_no_evil_monkey: thanks for the catch - fixed!


Just FYI to anyone reading this: LM Studio pulls models (and quantized models) from Hugging Face (a website for sharing models like these) and promotes the use of models that were uploaded on the lmstudio-community account.

However the models on the lmstudio-community account often don’t use the latest and greatest quantization techniques, so you can often find models from other sources with better quality at the same size.

Major techniques that lmstudio-community doesn’t use are:

  • Importance Matrix (imatrix): An importance matrix (imatrix) contains measurements of how important each parameter in the model is, which can be used in the quantization process to improve quality. For example, in Q6_K each block of 16 parameters (and also each superblock of 16 blocks) has a scaling factor that all parameters in the block are multiplied by. Knowing how important each parameter is enables a more accurate calculation of the optimal scaling factor. This can optionally be used with most quants (Q8 doesn't support it, codebook quants require it IIRC, and all other quants can optionally use it).

    The imatrix is computed by having the model process a large chunk of text and measuring how important each parameter is for the model's understanding of the text, so the text being used can impact quality (although it seems variation isn't huge between different sources).
  • Codebook Quants: These use a codebook (a list of the most common/important combinations of 4 or 8 parameters), and then for every 4 or 8 parameters the model contains an index in the codebook to look at. At the same file size this improves quality but hurts generation speed compared to older methods like Q2_K and Q3_K. These are 1-bit, 2-bit, and 3-bit class quants with names starting with "IQ".
  • Non-linear Quants: These use a polynomial function to quantize and de-quantize. This achieves similar quality to codebook quants but with faster speeds. In main llama.cpp these are 4-bit class quants with names starting with "IQ" (IQ4_XS and IQ4_NL; IQ4_NL has some optimizations for ARM).

    The developer responsible for the non-linear quants also created 2-bit, 3-bit, 5-bit, and 6-bit class non-linear quantization methods, but due to some dispute he stopped contributing to main llama.cpp before contributing those quants. They are available in his fork (ik_llama.cpp), but programs like LM Studio are built on main llama.cpp and can't use the fork.
  • Unsloth Dynamic: The company unsloth has a method (which they call Unsloth Dynamic) to determine how sensitive each part of the model is to quantization and then quantize the most sensitive parts less than the rest of the model. Note that not all quantized models from unsloth use this technique, only the ones containing "UD" in the file name.
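
As a rough illustration of the imatrix idea from the list above, here is a toy sketch of block-wise quantization where per-parameter importance influences the choice of the block's scaling factor. It is not the actual Q6_K layout or the llama.cpp algorithm: the symmetric 6-bit range, the simple search over candidate scales, and the importance values are all simplifying assumptions.

```python
import random

BLOCK_SIZE = 16   # block size quoted above for Q6_K
MAX_LEVEL = 31    # simplified symmetric 6-bit range: integers in [-31, 31]


def absmax_scale(block):
    """Plain scale choice: map the largest magnitude onto the top level."""
    peak = max(abs(x) for x in block)
    return peak / MAX_LEVEL if peak > 0 else 1.0


def quantize(block, scale):
    """Round each value to the nearest multiple of scale, clamped to 6 bits."""
    return [max(-MAX_LEVEL, min(MAX_LEVEL, round(x / scale))) for x in block]


def dequantize(levels, scale):
    """Recover approximate values by multiplying each level by the scale."""
    return [level * scale for level in levels]


def weighted_error(block, scale, importance):
    """Squared reconstruction error, weighted by per-parameter importance."""
    restored = dequantize(quantize(block, scale), scale)
    return sum(w * (x - r) ** 2 for x, r, w in zip(block, restored, importance))


def importance_aware_scale(block, importance, steps=64):
    """Try scales around the absmax scale and keep the lowest weighted error."""
    base = absmax_scale(block)
    candidates = [base] + [base * (0.7 + 0.4 * i / (steps - 1)) for i in range(steps)]
    return min(candidates, key=lambda s: weighted_error(block, s, importance))


if __name__ == "__main__":
    rng = random.Random(0)
    block = [rng.gauss(0.0, 1.0) for _ in range(BLOCK_SIZE)]
    # Hypothetical importance values: the first four parameters matter most.
    importance = [10.0] * 4 + [1.0] * (BLOCK_SIZE - 4)

    plain = absmax_scale(block)
    tuned = importance_aware_scale(block, importance)
    print("absmax scale error:", weighted_error(block, plain, importance))
    print("tuned scale error: ", weighted_error(block, tuned, importance))
```

Because the plain absmax scale is always one of the candidates, the tuned scale can never do worse on the weighted error. That is the intuition behind using importance measurements: the same number of bits gets spent where errors hurt the most.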

Based on my limited testing I recommend selecting models from the accounts unsloth, bartowski, or mradermacher (in that order).

Gemma 3 (Google’s open model) is a bit of an exception because Google provided an official 4-bit quantized variant that has had additional training (Quantization Aware Training or QAT) performed to minimize quality loss from quantization. So for that downloading from lmstudio-community is fine as they use what Google already provided.


This makes me want one even more


A similarly newbie-friendly overview for other types of “AI” would be appreciated, like image/video generation or coding assistance. Or are media use cases better served by a dedicated GPU for now? Most information seems to be geared towards pure chatbots.


Thanks a lot @catastrophic and @Kyle_Reis ! Very useful and clear information you are providing!
I dug into local LLMs recently for my FW13, and your posts really help me understand which models I should use. Also really great to read about the influence of the different Desktop variants on inference. I am happy that I chose decent specs and am looking forward to shipping :slight_smile:

Did you also try running Ollama on the Desktop? How’s the current support for ROCm? It used to be somewhat hacky with the AI Max processors a few weeks back.

ROCm 6.4.1 onward has basic (rocBLAS) support for gfx1151, but for all the kernels and best performance you should use one of the gfx1151 nightlies available for download here: Releases · ROCm/TheRock · GitHub

For regular llama.cpp-based inference, I’ve just posted the testing I’ve done and I believe that you’re mostly fine/better off with Vulkan generally (HIP sometimes does much better for prefill, but for token generation, Vulkan is almost always faster atm).
