We know a lot of you are exploring Framework Desktop to crunch machine learning and local AI inference workloads right on your desk. This is a topic that goes deep, and we’re going to build out a series of guides and videos to help you get started. In this first one, we’ll cover the easiest way to get started with local Large Language Models (LLMs) on Windows. In future guides, we’ll go deeper into code generation, image/video generation, running models on Linux, and clustering multiple Framework Desktops to handle massive models.
Why build a local AI PC?
First, you may be wondering why you’d even want to run AI locally, given how many cloud-based services and applications there are. One of the main reasons is privacy: all of your data stays on your own machine. Beyond that, you also have deeper control over model selection and modification, including being able to download, modify, and run uncensored models. If you’re running AI constantly and heavily, you may also save money running locally rather than paying for cloud time. Finally, a local setup can run fully offline, making it useful off-grid or as a backup source of knowledge when infrastructure is down.
Traditionally, one of the big challenges with local AI inference has been fitting large models into memory. Larger models generally have a deeper set of knowledge to draw from, but they require substantial amounts of memory. Consumer graphics cards have plenty of compute and memory bandwidth to crunch AI inference, but are typically limited to 8, 16, or 24GB of memory. The Ryzen AI Max in Framework Desktop has configurations with up to 128GB of memory, allowing much larger models. We’ll get deeper into model selection and tradeoffs later in this guide.
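As a rough rule of thumb, an unquantized (FP16) model needs about two bytes of memory per parameter, before counting context and runtime overhead. Here’s a minimal sketch of that arithmetic, just to show why parameter count and memory capacity are so tightly linked:

```python
# Back-of-the-envelope: FP16 weights take ~2 bytes per parameter.
# Real memory use is higher once you add context (KV cache) and overhead.
def fp16_weight_gb(params_billions: float) -> float:
    return params_billions * 1e9 * 2 / 1e9  # params * bytes per param, in GB

for size in (8, 24, 70, 109):
    print(f"{size}B parameters at FP16: ~{fp16_weight_gb(size):.0f} GB of weights")

# 8B  -> ~16 GB: barely fits on a 16GB card
# 24B -> ~48 GB: already beyond typical consumer GPUs
# 70B -> ~140 GB: needs quantization even on a 128GB machine
```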
Getting started with the basics of local LLMs
There are a huge number of applications and toolkits available for running AI models locally. The simplest application we’ve found for getting started with text and code generation is LM Studio. It’s built on top of llama.cpp, an extremely powerful and extensible open source inference library that we’ll go into in a future guide. LM Studio packages it up in a user-friendly application that runs on both Windows and Linux. In this guide, we’ll focus on Windows, but the same settings work on Linux. Note that as of June 2025, inference runs about 20% faster on Fedora 42 than on Windows 11.
Once you download, install, and open LM Studio, you’ll be presented with a startup screen that asks you if you’d like to get started with your first LLM. LM Studio is good at keeping up to date with the latest models, so it’s usually reasonable to start by downloading their recommended one, and then clicking Start New Chat.
If you’ve installed your Framework Desktop Driver Bundle, LM Studio should already detect your GPU AI acceleration capabilities and enable the relevant runtime for it. Before loading and running the model, you should make sure that LM Studio is fully offloading the model onto the GPU. Click on the “Select a model to load” dropdown at the top, toggle “Manually choose model load parameters”, and click on the arrow next to the model you’d like to configure.
Slide the GPU Offload slider to the maximum value, toggle the “Remember settings” selection, and click “Load Model”. You can then type into the chat box and start chatting with the LLM locally! Note that you can also attach text or PDF files for analysis, and, for “vision” models, images too.
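If you’d rather script against the model than use the chat window, LM Studio can also expose a local OpenAI-compatible server (enabled from its Developer tab, listening on localhost port 1234 by default). Here’s a minimal sketch using the openai Python package; the model name is a placeholder, so substitute whatever identifier LM Studio shows for the model you have loaded:

```python
# pip install openai
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; no real API key is needed.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder; use the identifier LM Studio lists for your loaded model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me three tips for running LLMs locally."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```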
Selecting AI models to run locally
Where running AI locally really gets interesting is in the breadth of models that are available. LM Studio has a convenient feature to search for and download models from Hugging Face, a large community hub for AI models and datasets. Clicking the magnifying glass icon in the left sidebar takes you to the Discover tab. The default list is a set of recommended models from the LM Studio team that are usually excellent choices, but you can also search beyond that. Let’s pick a few example models that are optimized for different tasks.
First, let’s pick Mistral Small 3.2, a 24B open-weights model from Mistral AI. 24B indicates that the model contains 24 billion parameters. In general, the more parameters a model has, the more capable it can be. However, a larger parameter count means both that the model needs more memory to load and that it will run slower, since each generated token has to be processed through the entire set of parameters. Getting to 10 tokens per second (tok/s) or higher of output speed is a good target, since it means the model generates text at least as quickly as you can read it. One way to increase speed is quantization, which stores each parameter in fewer bits at a small cost in accuracy. In general, you can run models at Q6 (6-bit quantization) without noticeable degradation. To get the Q6_K version of Mistral Small 3.2, select it from the quantization dropdown and download it.
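To get a feel for what quantization buys you, here’s a rough size estimate for a 24B model at a few common quantization levels. The bits-per-weight figures are approximate averages (GGUF formats mix block scales and metadata), so treat the results as ballpark numbers rather than exact file sizes:

```python
# Approximate average bits per weight for a few common GGUF quantization levels.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0}

def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for quant, bits in BITS_PER_WEIGHT.items():
    print(f"24B at {quant}: ~{approx_size_gb(24, bits):.0f} GB")

# Q6_K lands around 20 GB, which is why a 24B model fits comfortably in
# GPU-addressable memory on a Framework Desktop but not on a 24GB graphics card.
```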
Going back to the chat tab, you can then unload any previously loaded model and load Mistral Small 3.2, making sure to adjust the settings to use full GPU Offload. A Framework Desktop currently runs this model at around 10 tok/s (12 tok/s on Linux). We expect that number to climb over time as AMD, llama.cpp, and LM Studio continue to mature the AI inference stack for performance, but in the meantime, we’ll go over ways to optimize performance in the next section.
Beyond LM Studio’s staff picks like Mistral Small 3.2, you can browse around Hugging Face for models to download and run. The community-created leaderboards can be good places to find new models to use, like this code generation leaderboard or this uncensored one. When you find a model you’d like, you can download it directly from LM Studio by searching for it, selecting the right quantization, and downloading. Models in the 20-30B parameter range should run at “real time” speed at Q6, but you can also download even larger models like 70B ones if you’re okay with waiting longer for responses to be generated.
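If you prefer to browse programmatically, the huggingface_hub Python package can list the files in a model repository, which is a quick way to see which quantizations an uploader has published. The repository name below is only an example pattern, not a real repo; substitute the GGUF repo you actually find on Hugging Face:

```python
# pip install huggingface_hub
from huggingface_hub import list_repo_files

# Example repo id only -- substitute the GGUF repo you find on Hugging Face.
repo_id = "some-uploader/Some-Model-24B-GGUF"

gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
for filename in sorted(gguf_files):
    print(filename)  # filenames typically end in the quant level, e.g. Q4_K_M, Q6_K, Q8_0
```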
Advanced model selection and performance optimization
What if you want to run even bigger models while still keeping speed high? One path to faster local inference is using Mixture of Experts (MoE) models. These are models with a large total parameter count, but only a small subset of parameters active for any given token. We’ll start with Qwen3 30B A3B to test this. This is an open-weights MoE model from Alibaba Cloud with 30B total parameters, of which 3.3B are active per token.
You’ll notice when running this that it is a reasoning model, which means it has a thinking phase where it breaks down and thinks through your prompt step by step before answering. An MoE model is especially helpful here, since the thinking phase could otherwise be slow. Framework Desktop is well suited for MoE models: you can configure it with a large amount of memory to hold the full parameter set, and the smaller active parameter count means it runs faster. A Framework Desktop can run this model at around 40 tok/s (48 tok/s on Linux).
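A simple way to see why MoE helps: at these sizes, token generation is mostly limited by how fast weights can be streamed from memory, so a rough upper bound on speed is memory bandwidth divided by the bytes of active weights read per token. The sketch below assumes ~6.6 bits per weight (roughly Q6_K) for both models; it’s only an optimistic ceiling that ignores KV-cache reads, routing overhead, and compute, but it shows why the MoE model is several times faster even though its total parameter count is higher:

```python
BANDWIDTH_GB_S = 256  # Framework Desktop memory bandwidth

def rough_tok_per_s(active_params_billions: float, bits_per_weight: float = 6.6) -> float:
    # Bytes of weight data that must be streamed for each generated token.
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return BANDWIDTH_GB_S * 1e9 / bytes_per_token

# Dense 24B model: every parameter is touched for every token.
print(f"24B dense (Mistral Small 3.2): ~{rough_tok_per_s(24):.0f} tok/s ceiling")

# MoE model: only ~3.3B parameters are active per token.
print(f"30B MoE, 3.3B active (Qwen3):  ~{rough_tok_per_s(3.3):.0f} tok/s ceiling")
```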
If you really want to push this to the limit, the single strongest model you can run on Linux on a 128GB Framework Desktop today (June 2025) is likely Llama 4 Scout 17B 16E, which is a 109B parameter model with 17B active parameters! At Q6 on LM Studio on Fedora 42, this runs at >14 tok/s! That is a massive model running at real-time interaction speeds.
When loading a model, you can also toggle “Show advanced settings” to go deeper into optimization. Two settings you may find yourself adjusting are Context Length and Flash Attention. Context Length is effectively the attention span of the model, so a longer context helps a lot for both conversation/roleplay and code generation use cases. Increasing context length can substantially increase memory usage, but enabling Flash Attention helps mitigate that.
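To see why context length drives memory use, here’s a rough estimate of the KV cache, which stores one key and one value vector per layer, per KV head, for every token in the context. The layer count, KV-head count, and head dimension below are hypothetical example values rather than the specs of any particular model; Flash Attention doesn’t shrink this cache itself, but it avoids materializing the full attention matrix, which is the other buffer that balloons at long context lengths:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    # 2x for keys and values, per layer, per KV head, per token (FP16 by default).
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / 1e9

# Hypothetical mid-size model: 40 layers, 8 KV heads (grouped-query attention),
# head dimension 128, FP16 cache.
for context in (4_096, 32_768, 131_072):
    print(f"{context:>7} tokens of context: ~{kv_cache_gb(40, 8, 128, context):.1f} GB of KV cache")
```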
Configuring a Framework Desktop as a local AI PC
With up to 128GB of memory, 256GB/s of memory bandwidth, and a big integrated GPU on Ryzen AI Max, Framework Desktop is a great fit for running AI locally. With AMD’s Variable Graphics Memory functionality, up to 112GB of this is addressable by the GPU! In AMD Adrenalin, you can set Dedicated Graphics Memory as high as 96GB, and up to half of the remaining system memory can also be used, which is how a 128GB system reaches 112GB (96GB dedicated plus half of the remaining 32GB).
We have 32GB, 64GB, and 128GB configurations of Framework Desktop. All three have the same 256GB/s of memory bandwidth, which is typically the performance bottleneck for LLM inference. The 64GB and 128GB configurations have slightly larger integrated GPUs (40CU instead of 32CU), which matters more for AI workloads like image generation and less for text generation. This means you should select your configuration primarily based on how large a model you want to run. As noted before, 20-30B parameter models are in a sweet spot that enables both real-time interaction and strong capability. Those fit well on 64GB and 128GB configurations, leaving room for higher context lengths or multitasking. To run 70B-class and larger models, you’ll need the 128GB configuration.
Aside from that, when configuring your Framework Desktop DIY Edition, you’ll want to make sure you have enough storage space for all of the models you’ll be downloading, so 1TB or more is helpful. Note that there are two NVMe storage slots, so you can max out at up to 2x 8TB.
For OS selection, both Windows and Linux work well with applications like LM Studio, but if you want to go deeper into using ROCm or PyTorch, you may find the development environment in a recent Linux distro like Fedora 42 to be smoother. As noted earlier, inference on Fedora 42 is also currently about 20% faster than on Windows, though we expect speed on both to continue to improve as AMD drives optimizations throughout the stack.
That’s it for this first intro guide. We’ll be continuing the series with additional guides covering more local AI use cases.