Hi! From the first moment, I have been in love with Framework; the concept is just perfect and should be an example for the industry, so it's clear to me that my next laptop will be a FW. The question is when, so I have a couple of questions in case anyone can answer them.
For people who already have this machine with this configuration (or something similar): how has running local LLMs been? I'm not very involved in the LLM world, but I'd like to use them for privacy, for coding, and to speed up some tasks.
For performance reasons I might be interested in using an eGPU to speed up LLM use. The FW13 ports are Thunderbolt 4, which from what I've read has a bandwidth limitation, but I understand that even without using all of the eGPU's power the results should still improve considerably. Has anyone tried it?
I love the idea of using the external storage expansion cards to boot different OSes; my plan is to have Windows 11 plus different Linux distros to play with. However, I've read that it isn't fully recommended, even though Framework says "Both are fast enough to run apps and even boot an operating system from", while Support (on Reddit) does not recommend it. Has anyone experienced performance problems? Would you recommend it?
The upgradability that FW offers is incredible: the idea of knowing that if, in two years, they release a new board with Thunderbolt 5 or OCuLink ports, or even a coreboot-compatible board, I can simply buy the new board and sell the old one on the marketplace to reduce waste; as they say, "one man's trash is another's treasure". My question is whether any of you have sold old hardware back to Framework for resale on their marketplace, and whether they gave you a good price. In the end this makes upgrades cheaper and lets you sell safely instead of dealing directly with a third party.
If anyone with this setup is using Arch Linux, how is it going so far? Any problems?
If anyone can answer any of these questions, I would appreciate it! Thanks in advance.
I love these questions! I can't answer most of them since you asked for community input, but I can confirm that Framework does not have a buy-back policy; however, you can sell your old parts or the full laptop in the community marketplace.
Yes, this is the first thread I found talking about this problem. I did quote the Reddit post, but I removed it here because the forum limits each post to a maximum of two links (I don't think it makes much sense to limit it to only two links). Thank you anyway.
I'd still love to have your answers if you're willing; plus, you have the Framework Team role, so people can decide whether to trust your answer or take it with a grain of salt.
I'm also checking whether it's possible to install and boot an OS from the expansion cards. My understanding is that Windows installation on expansion cards is not supported, but you should be able to try it with Linux. Speed will be higher and power consumption lower when running the OS from the internal storage drive, but you can install Linux on a Storage Expansion Card. Let us know if you run into any issues if you do!
Just to clarify: it is definitely possible to install Windows on an external Storage Expansion Card. My question was more about performance/issues, and from what some users have told me, they haven't had any problems. How to install Windows 10 on an expansion card - #11 by Cheese
Interestingly, inference performance stays the same [when using a build of ROCm patched to support the APU], about 7.1-7.8 tok/s for deepseek-r1:14b q4 on this hardware (AMD HX 370, LPDDR5X 7500), when I run all of these variants:
Stock Ollama CPU, num_threads=20 (it’s a 12-core + SMT CPU)
llama.cpp with Vulkan
llama.cpp with AVX512
ollama-for-amd with ROCm
So, most likely it's a memory bandwidth limitation. The CPU is free to do other things if the GPU is used, of course.
So, I would not advise using the AMD Ryzen AI 300 series to learn about AI.
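As a rough sanity check on the bandwidth explanation: decode speed of a dense model is bounded by roughly memory bandwidth divided by the size of the weights, since every generated token streams essentially all of the weights through memory once. A minimal sketch, where the model size and bus figures are assumptions rather than measurements:

```python
# Back-of-envelope check of the "memory bandwidth limit" explanation.
# Each generated token of a dense model streams (roughly) all of the weights
# through memory once, so decode rate ≈ effective_bandwidth / weight_bytes.
# The numbers below are assumptions for illustration, not measurements.

weights_gb = 9.0      # approx. size of a 14B q4 GGUF download (assumption)
measured_tok_s = 7.5  # middle of the 7.1-7.8 tok/s range reported above

implied_bw_gbs = measured_tok_s * weights_gb  # ~68 GB/s actually achieved
theoretical_bw_gbs = 7.5e9 * 16 / 1e9         # LPDDR5X-7500 on a 128-bit bus: ~120 GB/s (assumption)

print(f"implied effective bandwidth: {implied_bw_gbs:.0f} GB/s")
print(f"theoretical peak bandwidth:  {theoretical_bw_gbs:.0f} GB/s")
# Hitting ~68 of ~120 GB/s is a plausible fraction of peak, which is
# consistent with every backend converging on the same tok/s.
```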
However, on my own laptop I have been able to use Ollama 0.8.6 for Windows to run the gemma3 4-billion-parameter (4b) LLM with the open-webui Docker container as a web frontend on Docker Desktop on Windows 11, and I get a response rate of 16 tokens/second. Gemma herself claims that is on the low end of what to expect from a 4090. So, if you just want to play with some LLMs and not use PyTorch directly, this is fine.
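For anyone who wants to reproduce that number, here is a minimal sketch that queries Ollama's local HTTP API and computes tokens/second from the fields it returns (this assumes Ollama is serving on its default port 11434 and gemma3:4b is already pulled):

```python
# Minimal sketch: measure Ollama's generation rate through its local HTTP API.
# Assumes Ollama is listening on its default port 11434 and gemma3:4b is pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:4b",
        "prompt": "What's a reasonable token rate for a 4B model on consumer hardware?",
        "stream": False,
    },
    timeout=600,
).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tokens/second")
```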
The response to this prompt was generated in about 1 minute and 16 seconds.
Prompt:
What’s a reasonable response token rate for running the latest Gemma model, with 4 billion parameters, on consumer hardware?
Response:
Okay, let’s break down a reasonable response token rate for running the latest Gemma 4B model on consumer hardware. This is a complex topic with a lot of variability, but here’s a realistic assessment, considering current performance as of late 2023/early 2024:
1. Key Factors Affecting Token Rate:
Hardware: This is the biggest factor.
GPU: A high-end NVIDIA GeForce RTX 4090 (24GB VRAM) is currently the sweet spot for running 4B models reasonably well on consumer hardware. An RTX 3090 (24GB) will also work, but might be slower. Lower-end GPUs (RTX 3060 or lower) will be extremely limited.
CPU: A modern CPU (e.g., Ryzen 7 or Intel i7) is needed to handle data transfer and processing.
RAM: 16GB is a minimum, 32GB is highly recommended. The model needs space for computations and temporary storage.
Storage: An SSD is crucial for fast loading of the model.
Quantization: This is the technique that dramatically impacts performance. Gemma 4B supports quantization, allowing you to represent the model’s parameters with fewer bits (e.g., 8-bit or 4-bit).
4-bit Quantization (GPTQ/GGML): This is the most impactful technique for consumer hardware. It significantly reduces memory usage and speeds up inference.
8-bit Quantization: Still a performance improvement over full-precision, but less dramatic than 4-bit.
Inference Framework: The software you’re using to run the model matters.
llama.cpp/GGML: This is a popular, optimized framework for running LLMs on CPUs and GPUs, particularly well-suited for 4-bit quantization. Generally delivers the best performance on consumer hardware.
vLLM: A more sophisticated framework focused on efficient throughput. Might provide slightly better speeds with more optimization, but can be more complex to set up.
Transformers (Hugging Face): Can be used, but often less optimized than llama.cpp/vLLM for this model size.
Prompt Length & Model Complexity: Longer prompts and more complex tasks will naturally take longer to generate responses.
Decoding Strategy: The method used to generate the next token (e.g., greedy decoding, beam search) can affect speed.
2. Realistic Token Rate Estimates (as of Late 2023/Early 2024):
Here’s a breakdown based on the above factors:
Optimistic (RTX 4090, 4-bit Quantization, llama.cpp): 15-25 tokens per second. This is achievable with shorter prompts and relatively simple tasks. You could see higher rates with careful prompt engineering.
More Typical (RTX 4090, 4-bit Quantization, llama.cpp): 8-15 tokens per second. This is a more realistic expectation for many use cases.
RTX 3090 (24GB), 4-bit Quantization, llama.cpp: Likely to be slightly slower than the RTX 4090, maybe 6-12 tokens per second.
RTX 3060 (12GB), 4-bit Quantization, llama.cpp: This will be significantly slower, potentially only 3-7 tokens per second, and may require aggressive prompt truncation to avoid out-of-memory errors.
3. Important Notes & Caveats:
These are estimates. Actual token rates will vary significantly based on the specific workload.
Latency: Don’t just focus on token rate. Latency (the time it takes to receive a single token) is also important.
Out-of-Memory Errors: 4B models can still be memory-intensive. Careful prompt engineering (truncating prompts) is often needed to avoid errors.
Ongoing Optimization: Inference frameworks are constantly being optimized. Expect future versions to offer improved token rates.
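To make the llama.cpp + 4-bit quantization option described above concrete, here is a minimal llama-cpp-python sketch; the model path is a placeholder, and n_gpu_layers only matters if the package was built with a GPU backend:

```python
# Rough sketch of running a 4-bit quantized GGUF with llama-cpp-python.
# The model path is a placeholder; tune n_ctx / n_gpu_layers for your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/gemma-4b-q4_k_m.gguf",  # hypothetical local file
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers if a GPU backend (Vulkan/ROCm/CUDA) was built in
)

out = llm("Explain memory-bandwidth-bound inference in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```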
So I guess it's just a matter of time before AMD supports the gfx1150 (890M) and programs like LM Studio can make use of the iGPU instead of the CPU, right? This won't significantly improve performance because of the bandwidth limitation, but it will free up the CPU for other tasks.
To be honest, I'm just interested in playing with LLMs and, if possible, replacing ChatGPT for my daily use for privacy reasons, so from the prompt you shared the quality looks pretty decent and viable.
Since you have the model with the configuration I was considering, maybe you can answer this: how are fan noise and battery life at low and moderate workloads on Linux?
Unfortunately, AMD has given no indication that they are going to officially add ROCm support for gfx1150/890M.
I'm using Windows as my OS, with Windows Subsystem for Linux (WSL) when I'm doing Linux tasks, so I can't comment on Linux use, but the system stays silent during low workloads. During the LLM prompt the fans were loud, but nothing unbearable. I haven't had the system long enough to get a full picture of the battery life, but it's more than enough to get through at least a few hours of casual use.
Well, as you say, it's a shame that AMD is releasing a series with the 'AI' branding while not supporting key features such as using the iGPU to run LLMs.
Thank you very much for taking the time to answer in such detail! It has helped me a lot and I’m sure it will help other users as well.
That’s somewhat disappointing (not the FW Team’s fault). I was rather thinking that an AMD AI chip would be fully supported by ROCm. I, too, ordered one hoping to run offline models taking advantage of the AI chip and iGPU.