I’m considering an upgrade from the 7640u. I use LM Studio with the Vulkan runtime and I’m getting around 99.7 T/s in question (prompt) and 8.3 T/s in answer (generation) with qwen2.5-7b-instruct.
For reference, my 7900XTX desktop gets around 440.4 T/s in question and 86.1 T/s in answer on the same model.
I’m looking at an upgrade for my FW13. I was looking at the various AI boards, but they have the same DDR5-5600 as my board, and they list an 860M GPU. I’ve been looking around for benchmarks but haven’t found anything.
Does somebody know how LM Studio performs on those boards?
I know the 7640u doesn’t support ROCm acceleration, but that’s fine, I use Vulkan acceleration. More worryingly, the 7640u is supposed to have an NPU, but that’s dead silicon: it’s not supported nor used by anything. I searched for information but found nothing. Does the NPU on the AI 300 series have drivers, and is it used by applications?
Honestly, if inference performance is what’s driving you, I’d probably skip the upgrade. You’re still going to be heavily limited by memory bandwidth, especially on larger models: the FW13 is still on SODIMMs, which means you’re stuck with about 90 GB/s of memory bandwidth even if you get much more compute in the 370.
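For what it’s worth, the ~90 GB/s figure falls straight out of the DIMM spec (transfer rate × bus width). A minimal sketch of that arithmetic, assuming dual-channel DDR5-5600 with a 64-bit bus per channel:

```python
# Theoretical peak bandwidth for dual-channel DDR5-5600.
# Assumes a 64-bit (8-byte) bus per channel; sustained real-world
# throughput will be noticeably lower than this peak.
transfers_per_sec = 5600e6   # DDR5-5600 = 5600 MT/s
bytes_per_transfer = 8       # 64 bits per channel
channels = 2

bandwidth_gbs = transfers_per_sec * bytes_per_transfer * channels / 1e9
print(f"{bandwidth_gbs:.1f} GB/s")   # ~89.6 GB/s theoretical peak
```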
I think AMD published the XDNA driver for the NPU on the Linux side in 6.14. I’m not aware of anything that can use it yet, or of what things are like on the Windows side.
They probably all got the same NPU, just as they all got the same hardware encoders/decoders, memory controller and so on, but only AI performance using the NPU would be the same; stuff using the CPU/GPU would be quite different.
I can’t find any actual published benchmarks, which is odd, but I think some people actually have their machines now, so maybe someone would be willing to run your model in LM Studio and tell you what they’re getting.
But again, I’d expect a big uplift in prompt eval performance and very little in response (generation) performance, because of the memory bandwidth limitation.
Exactly. The 7900XTX has a memory throughput of 960 GB/s, while dual-channel DDR5-5600 SODIMMs manage only around 90 GB/s. This is no different on the newer FW13, so the output token performance will be about the same as on your 7640u.
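To put rough numbers on that: for a dense model, every generated token has to stream (more or less) the full set of weights through memory, so tokens/s can’t exceed bandwidth ÷ weight size. A sketch assuming a 7B model at Q4 is around 4.5 GB (ballpark, not a measured figure):

```python
# Upper bound on generation speed: each output token needs roughly one
# full pass over the weights, so tok/s <= bandwidth / weight_bytes.
model_bytes = 4.5e9   # assumed size of a 7B model at Q4 quantization

for name, bw_gbs in [("7900XTX", 960), ("DDR5-5600 SODIMM", 90)]:
    ceiling = bw_gbs * 1e9 / model_bytes
    print(f"{name}: <= {ceiling:.0f} tok/s")

# Ceilings of ~213 and ~20 tok/s; the observed 86 and 8.3 tok/s sit well
# below both, but scale with bandwidth in roughly the same ratio.
```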
It does seem they have the same NPU, meaning the NPU takes up a larger share of the die on the lower-tier AI 300 chips. It also means that if the NPU did accelerate applications, the lower tiers would have very similar performance. That has interesting implications for local VS Code assistants.
It seemed weird to me too. It makes me think the driver support isn’t there yet.
Yup, I’m going to skip this upgrade. Hopefully the next upgrade is going to use a dual channel CAMM module!
Unfortunately that’d be a minimal upgrade as well, performance-wise. Dual-channel LPCAMM2 is about 120 GB/s. I wonder if in the future one could get a combination of soldered quad-channel RAM (~276 GB/s) for graphics/LLMs and an LPCAMM2 slot on the same board.
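All the bandwidth figures in this thread come from the same transfer-rate × bus-width arithmetic. A quick sketch comparing the configurations being discussed (theoretical peaks; the LPCAMM2 and 256-bit entries are assumptions about other/future boards, not measurements):

```python
# Theoretical peak bandwidth = transfer rate (MT/s) * bus width (bytes).
configs = [
    ("DDR5-5600 SODIMM, dual channel (128-bit)",    5600, 128),
    ("LPCAMM2-7500, dual channel (128-bit)",        7500, 128),
    ("Soldered LPDDR5X-8000, 256-bit quad channel", 8000, 256),
]

for name, mt_s, bus_bits in configs:
    gbs = mt_s * 1e6 * (bus_bits / 8) / 1e9
    print(f"{name}: {gbs:.0f} GB/s")

# ~90, 120 and 256 GB/s respectively
```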
What about being able to run larger models? The new AI series of mainboards can share more RAM with the GPU, so although there isn’t a speed increase, running larger models should work.
You can run the models on the CPU and use all the memory you have; there won’t be much of a difference (if any). The bottleneck is memory bandwidth, not compute.
With my 7640u I can already use lots of RAM at very nearly the same performance. I can do a 30,000-token context window on 7B models, and 15,000 tokens on 14B models, with OK inference speed. The iGPU has 4GB dedicated via a BIOS toggle, but it’s all on the same bus, and as far as I can tell there isn’t a larger penalty on larger models. There seems to be a limit at using 1/2 of the RAM, but I’m not clear on that. It’s usable for 7B and 14B models that are competent enough to use on the go. Bigger models are too slow anyway, because of memory bandwidth, for me to use on the go. E.g. yesterday I asked the model how to format matplotlib charts and it did it with no issues.
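As a rough sanity check on why a ~30,000-token context fits comfortably next to a 7B model: the KV cache for a GQA model is small per token. A sketch assuming a Qwen2.5-7B-style configuration (28 layers, 4 KV heads, head dim 128, fp16 cache); treat those numbers as assumptions:

```python
# Approximate KV-cache footprint for a GQA model with an fp16 cache.
# Assumed Qwen2.5-7B-style shape: 28 layers, 4 KV heads, head_dim 128.
layers, kv_heads, head_dim = 28, 4, 128
bytes_per_elem = 2            # fp16
context_tokens = 30_000

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
total_gb = per_token * context_tokens / 1e9
print(f"{per_token / 1024:.0f} KiB/token, ~{total_gb:.1f} GB total")
# ~56 KiB/token, ~1.7 GB at 30k tokens -- small next to ~4.5 GB of Q4 weights
```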
The Framework Desktop uses another chip, the AI MAX 395. It has double the memory controllers (256-bit, four channels) at a faster 8000 MT/s with soldered RAM. It’s basically purpose-built as an inference machine, and there are people using it that way (https://www.youtube.com/watch?v=S264zdYB-rw).
But it’s not in laptop form for now. For a desktop, I already have a GPU, and that shreds the 395, though I have not tested with deep RAM spillage on larger models. It’s possible that the 395 might beat my 7900XTX on a very narrow range of models that fit in RAM but spill enough from VRAM, I guess 70B high quant to 200B low quant, perhaps? Too niche for me to buy.
If there were a 395 laptop board, I would seriously consider it, but I can understand why Framework couldn’t do it this time around. I’ll wait for the next generation of chips for an upgrade. Hopefully by then there will be superior LLM architectures, like better MoE models that are less horribly bandwidth-limited and more compute-limited. I also need useful drivers for the NPU, if anything because the NPU is vastly more efficient at tensor operations, and that should translate to better battery life under inference workloads.
Yes, I hope the same or similar becomes available for the FW16… But what I meant was that more RAM is shared with the GPU on the AI mainboards, so you would be able to load larger models into GPU RAM, which is much faster than CPU-only inference.
Basically all iGPUs share RAM with the CPU; that’s not a new feature of these chips. I can already run very large models on my 7840U iGPU (96 GB of RAM, 64 GB set as GTT). The problem is that performance is dismal because of memory bandwidth.
I get about 2.5 tokens per second running a 32B parameter model with 128k context using ollama with the ROCm backend, and that’s bottlenecked by the ~90 GB/s memory bandwidth.
I don’t know if I’d call it minimal: 5.6 GT/s SODIMM to 7.5 GT/s LPCAMM2 is a 33% increase. That’s significant. It’s still not enough to run really large models, though, that’s true. Even Strix Halo at 256 GB/s probably can’t run 70B Q4 models over 4.5 tps, although there are basically no benchmarks of that yet. Still, if I could get another 30% of memory bandwidth, I’d upgrade right now. It’s a real shame AMD doesn’t allow the 385 to cTDP down to 30 W. I’d love that in a FW13.
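Those two numbers are consistent with the bandwidth ceiling, for what it’s worth. A sketch assuming ~18 GB for a 32B Q4 model and ~40 GB for a 70B Q4 model (ballpark GGUF sizes, not measured):

```python
# tok/s ceiling = bandwidth / bytes of weights streamed per token (dense model).
cases = [
    ("32B Q4 at 90 GB/s (SODIMM)",      90,  18e9),
    ("70B Q4 at 256 GB/s (Strix Halo)", 256, 40e9),
]

for name, bw_gbs, model_bytes in cases:
    print(f"{name}: <= {bw_gbs * 1e9 / model_bytes:.1f} tok/s")

# Ceilings of 5.0 and 6.4 tok/s; the observed 2.5 tok/s and the estimated
# ~4.5 tok/s both sit plausibly below them.
```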
Fair, I meant in the context of - even with lpcamm2 in the future we’d still be far from the performance that one can get today with m4 pro (275GB/s), not to mention m4 max (410GB/s).
I thought I would do some testing, because I just don’t know. I have a FW16 with 96 GB of RAM, and I have 8 GB allocated to the GPU, which is the most I can allocate.
I’m testing with msty.app and LM Studio. msty.app doesn’t use the iGPU, and LM Studio does.
1st test - using gemma3 with this prompt: Please type the procedure for calculating the square root of a positive integer by hand.
For LM Studio I got 6.00 tok/sec, 1247 tokens, 1.20s to first token.
For msty.app I got 17.08 tok/sec, 1186 tokens, 2.36s to first token.
Gemma is too large to be run in GPU memory only; it was offloading 30/48 layers.
Gemma 3 1B is next.
LM Studio GPU: 61.44 tok/sec, 678 tokens, 0.18s to first token
LM Studio CPU: 43.29 tok/sec, 817 tokens, 0.21s to first token
msty.app CPU: 44.15 tok/sec, 983 tokens, 0.90s to first token
A slight increase for GPU inference?
completely unscientific test, ymmv
What I use for local inference: an M2 Pro Mac mini with 32 GB. Same prompt with gemma12b:
prompt_tokens/s: 88.48
Are you measuring prompt tokens or eval tokens? (Prompt tokens are absolutely going to be faster on GPU, but output tokens will be limited by memory bandwidth.)
@Fn1 just the output, and I don’t know which it is. The Mac mini figure is prompt tokens… but the others are just the output from the generation. Like I said, completely unscientific. Sorry.
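If anyone wants to separate the two numbers without relying on what each app reports, one rough way is to time two runs with llama-cpp-python (assuming it’s installed and pointed at a local GGUF; the model path below is a placeholder): a long prompt with max_tokens=1 is dominated by prompt processing, and a short prompt with many output tokens is dominated by generation.

```python
# Rough separation of prompt-processing speed from generation speed.
# Assumes llama-cpp-python is installed; the GGUF path is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(model_path="qwen2.5-7b-instruct-q4_k_m.gguf",
            n_ctx=8192, n_gpu_layers=-1)

# Prompt-heavy run: long input, a single output token.
long_prompt = "word " * 2000
t0 = time.perf_counter()
out = llm(long_prompt, max_tokens=1)
pp = out["usage"]["prompt_tokens"] / (time.perf_counter() - t0)
print(f"prompt processing: {pp:.1f} tok/s")

# Generation-heavy run: short input, many output tokens.
t0 = time.perf_counter()
out = llm("Explain how a hash map works.", max_tokens=256)
tg = out["usage"]["completion_tokens"] / (time.perf_counter() - t0)
print(f"generation: {tg:.1f} tok/s")
```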
That’s the most you can allocate statically; dynamically it can grab as much as the kernel lets it. On Linux that’s half the RAM by default, but it’s configurable.
It’s the same in Linux: it’s a BIOS setting. I’ve tried Arch, PopOS, and NixOS. The RAM speed is a constant. The GPU can run inference faster, but not dramatically so; it’s limited by the RAM speed. But being able to load the entire model into VRAM (even virtual) helps when running larger models.