I’m considering an upgrade from the 7640u. I use LM Studio with the Vulkan runtime and I’m getting around 99.7 T/s in question (prompt) and 8.3 T/s in answer (generation) with qwen2.5-7b-instruct.
For reference, my 7900XTX desktop gets around 440.4 T/s in question and 86.1 T/s in answer on the same model.
I’m looking at an upgrade for my FW13. I was looking at the various AI boards, but they have the same DDR5-5600 as my board, and they list an 860M GPU. I’ve been looking around for benchmarks but haven’t found anything.
Does somebody know how LM Studio performs on those boards?
I know the 7640u doesn’t support ROCm acceleration, but that’s fine, I use Vulkan acceleration. More worryingly, the 7640u is supposed to have an NPU, but that’s dead silicon: it’s not supported nor used by anything. I searched for information but found nothing. Does the NPU on the AI 300 series have drivers, and is it used by applications?
Honestly, if inference performance is what’s driving you, I’d probably skip the upgrade. You’re still going to be heavily limited by memory bandwidth, especially on larger models: the FW13 is still on SODIMMs, which means you’re stuck with about 90 GB/s of memory bandwidth even if you get much more compute in the 370.
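For what it’s worth, the ~90 GB/s figure falls straight out of the DIMM spec (transfer rate × bus width). A minimal sketch of that arithmetic, assuming dual-channel DDR5-5600 with a 64-bit bus per channel:

```python
# Theoretical peak bandwidth for dual-channel DDR5-5600.
# Assumes a 64-bit (8-byte) bus per channel; sustained real-world
# throughput will be noticeably lower than this peak.
transfers_per_sec = 5600e6   # DDR5-5600 = 5600 MT/s
bytes_per_transfer = 8       # 64 bits per channel
channels = 2

bandwidth_gbs = transfers_per_sec * bytes_per_transfer * channels / 1e9
print(f"{bandwidth_gbs:.1f} GB/s")   # ~89.6 GB/s theoretical peak
```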
I think AMD published the XDNA driver for the NPU on the Linux side in 6.14. I’m not aware of anything that can use it yet, or of what things are like on the Windows side.
They probably all got the same NPU, just as they all got the same hardware encoders/decoders, memory controller and so on, but only AI performance using the NPU would be the same; stuff using the CPU/GPU would be quite different.
I can’t find any actual published benchmarks, which is odd, but I think some people actually have their machines now, so maybe someone would be willing to run your model in LM Studio and tell you what they’re getting.
But again, I’d expect a big uplift in prompt eval performance and very little in response (generation) performance, because of the memory bandwidth limitation.
Exactly. The 7900XTX has a memory throughput of 960 GB/s, while dual-channel DDR5-5600 SODIMMs manage only around 90 GB/s. This is no different on the newer FW13, so the output token performance will be about the same as on your 7640u.
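To put rough numbers on that: for a dense model, every generated token has to stream (more or less) the full set of weights through memory, so tokens/s can’t exceed bandwidth ÷ weight size. A sketch assuming a 7B model at Q4 is around 4.5 GB (ballpark, not a measured figure):

```python
# Upper bound on generation speed: each output token needs roughly one
# full pass over the weights, so tok/s <= bandwidth / weight_bytes.
model_bytes = 4.5e9   # assumed size of a 7B model at Q4 quantization

for name, bw_gbs in [("7900XTX", 960), ("DDR5-5600 SODIMM", 90)]:
    ceiling = bw_gbs * 1e9 / model_bytes
    print(f"{name}: <= {ceiling:.0f} tok/s")

# Ceilings of ~213 and ~20 tok/s; the observed 86 and 8.3 tok/s sit well
# below both, but scale with bandwidth in roughly the same ratio.
```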
It does seem they have the same NPU, meaning the NPU takes up a larger share of the die on the lower-tier AI 300 chips. It also means that if the NPU did accelerate applications, the lower tiers would have very similar performance. That has interesting implications for local VS Code assistants.
It seemed weird to me too. It makes me think the driver support isn’t there yet.
Yup, I’m going to skip this upgrade. Hopefully the next upgrade is going to use a dual channel CAMM module!
Unfortunately that’d be a minimal upgrade as well, performance-wise. Dual-channel LPCAMM2 is about 120 GB/s. I wonder if in the future one could get a combination of soldered quad-channel RAM (~276 GB/s) for graphics/LLMs and an LPCAMM2 slot on the same board.
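All the bandwidth figures in this thread come from the same transfer-rate × bus-width arithmetic. A quick sketch comparing the configurations being discussed (theoretical peaks; the LPCAMM2 and 256-bit entries are assumptions about other/future boards, not measurements):

```python
# Theoretical peak bandwidth = transfer rate (MT/s) * bus width (bytes).
configs = [
    ("DDR5-5600 SODIMM, dual channel (128-bit)",    5600, 128),
    ("LPCAMM2-7500, dual channel (128-bit)",        7500, 128),
    ("Soldered LPDDR5X-8000, 256-bit quad channel", 8000, 256),
]

for name, mt_s, bus_bits in configs:
    gbs = mt_s * 1e6 * (bus_bits / 8) / 1e9
    print(f"{name}: {gbs:.0f} GB/s")

# ~90, 120 and 256 GB/s respectively
```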
What about being able to run larger models? The new AI series of mainboards can share more RAM with the GPU, so although there isn’t a speed increase, running larger models should work.
You can run the models on the CPU and use all the memory you have; there won’t be much of a difference (if any). The bottleneck is memory bandwidth, not compute.
With my 7640u I can already use lots of RAM at very nearly the same performance. I can do a 30,000-token context window on 7B models, and 15,000 tokens on 14B models, with OK inference speed. The iGPU has 4GB dedicated via a BIOS toggle, but it’s all on the same bus, and as far as I can tell there isn’t a larger penalty on larger models. There seems to be a limit at using 1/2 of the RAM, but I’m not clear on that. It’s usable for 7B and 14B models that are competent enough to use on the go. Bigger models are too slow anyway, because of memory bandwidth, for me to use on the go. E.g. yesterday I asked the model how to format matplotlib charts and it did it with no issues.
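As a rough sanity check on why a ~30,000-token context fits comfortably next to a 7B model: the KV cache for a GQA model is small per token. A sketch assuming a Qwen2.5-7B-style configuration (28 layers, 4 KV heads, head dim 128, fp16 cache); treat those numbers as assumptions:

```python
# Approximate KV-cache footprint for a GQA model with an fp16 cache.
# Assumed Qwen2.5-7B-style shape: 28 layers, 4 KV heads, head_dim 128.
layers, kv_heads, head_dim = 28, 4, 128
bytes_per_elem = 2            # fp16
context_tokens = 30_000

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
total_gb = per_token * context_tokens / 1e9
print(f"{per_token / 1024:.0f} KiB/token, ~{total_gb:.1f} GB total")
# ~56 KiB/token, ~1.7 GB at 30k tokens -- small next to ~4.5 GB of Q4 weights
```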
The Framework Desktop uses another chip, the AI MAX 395. It has double the memory controllers (256-bit, four channels) at a faster 8000 MT/s with soldered RAM. It’s basically purpose-built as an inference machine, and there are people using it that way (https://www.youtube.com/watch?v=S264zdYB-rw).
But it’s not in laptop form for now. For a desktop, I already have a GPU, and that shreds the 395, though I have not tested with deep RAM spillage on larger models. It’s possible that the 395 might beat my 7900XTX on a very narrow range of models that fit in RAM but spill enough from VRAM, I guess 70B high quant to 200B low quant, perhaps? Too niche for me to buy.
If there were a 395 laptop board, I would seriously consider it, but I can understand why Framework couldn’t do it this time around. I’ll wait for the next generation of chips for an upgrade. Hopefully by then there will be superior LLM architectures, like better MoE models that are less horribly bandwidth-limited and more compute-limited. I also need useful drivers for the NPU, if anything because the NPU is vastly more efficient at tensor operations, and that should translate to better battery life under inference workloads.
Yes, I hope the same or similar becomes available for the FW16… But what I meant was that more RAM is shared with the GPU on the AI mainboards, so you would be able to load larger models into GPU RAM, which is much faster than CPU-only inference.
Basically all iGPUs share RAM with the CPU; that’s not a new feature of these chips. I can already run very large models on my 7840U iGPU (96 GB of RAM, 64 GB set as GTT). The problem is that performance is dismal because of memory bandwidth.
I get about 2.5 tokens per second running a 32B parameter model with 128k context using ollama with the ROCm backend, and that’s bottlenecked by the ~90 GB/s memory bandwidth.
I don’t know if I’d call it minimal: 5.6 GT/s SODIMM to 7.5 GT/s LPCAMM2 is a 33% increase. That’s significant. It’s still not enough to run really large models, though, that’s true. Even Strix Halo at 256 GB/s probably can’t run 70B Q4 models over 4.5 tps, although there are basically no benchmarks of that yet. Still, if I could get another 30% of memory bandwidth, I’d upgrade right now. It’s a real shame AMD doesn’t allow the 385 to cTDP down to 30 W. I’d love that in a FW13.
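Those two numbers are consistent with the bandwidth ceiling, for what it’s worth. A sketch assuming ~18 GB for a 32B Q4 model and ~40 GB for a 70B Q4 model (ballpark GGUF sizes, not measured):

```python
# tok/s ceiling = bandwidth / bytes of weights streamed per token (dense model).
cases = [
    ("32B Q4 at 90 GB/s (SODIMM)",      90,  18e9),
    ("70B Q4 at 256 GB/s (Strix Halo)", 256, 40e9),
]

for name, bw_gbs, model_bytes in cases:
    print(f"{name}: <= {bw_gbs * 1e9 / model_bytes:.1f} tok/s")

# Ceilings of 5.0 and 6.4 tok/s; the observed 2.5 tok/s and the estimated
# ~4.5 tok/s both sit plausibly below them.
```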
Fair, I meant in the context of - even with lpcamm2 in the future we’d still be far from the performance that one can get today with m4 pro (275GB/s), not to mention m4 max (410GB/s).
I thought I would do some testing, because I just don’t know. I have a FW16 with 96 GB of RAM, and I have 8 GB allocated to the GPU, which is the most I can allocate.
I’m testing with msty.app and LM Studio. msty.app doesn’t use the iGPU, and LM Studio does.
1st test - using gemma3 with this prompt: Please type the procedure for calculating the square root of a positive integer by hand.
For LM Studio I got 6.00 tok/sec, 1247 tokens, 1.20s to first token.
For msty.app I got 17.08 tok/sec, 1186 tokens, 2.36s to first token.
Gemma is too large to be run in GPU memory only; it was offloading 30/48 layers.
Gemma 3 1B is next.
LM Studio GPU: 61.44 tok/sec, 678 tokens, 0.18s to first token
LM Studio CPU: 43.29 tok/sec, 817 tokens, 0.21s to first token
msty.app CPU: 44.15 tok/sec, 983 tokens, 0.90s to first token
A slight increase for GPU inference?
completely unscientific test, ymmv
What I use for local inference: an M2 Pro Mac mini with 32 GB. Same prompt with gemma12b:
prompt_tokens/s: 88.48
Are you measuring prompt tokens or eval tokens? (Prompt tokens are absolutely going to be faster on GPU, but output tokens will be limited by memory bandwidth.)
@Fn1 just the output, and I don’t know which it is. The Mac mini figure is prompt tokens… but the others are just the output from the generation. Like I said, completely unscientific. Sorry.
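If anyone wants to separate the two numbers without relying on what each app reports, one rough way is to time two runs with llama-cpp-python (assuming it’s installed and pointed at a local GGUF; the model path below is a placeholder): a long prompt with max_tokens=1 is dominated by prompt processing, and a short prompt with many output tokens is dominated by generation.

```python
# Rough separation of prompt-processing speed from generation speed.
# Assumes llama-cpp-python is installed; the GGUF path is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(model_path="qwen2.5-7b-instruct-q4_k_m.gguf",
            n_ctx=8192, n_gpu_layers=-1)

# Prompt-heavy run: long input, a single output token.
long_prompt = "word " * 2000
t0 = time.perf_counter()
out = llm(long_prompt, max_tokens=1)
pp = out["usage"]["prompt_tokens"] / (time.perf_counter() - t0)
print(f"prompt processing: {pp:.1f} tok/s")

# Generation-heavy run: short input, many output tokens.
t0 = time.perf_counter()
out = llm("Explain how a hash map works.", max_tokens=256)
tg = out["usage"]["completion_tokens"] / (time.perf_counter() - t0)
print(f"generation: {tg:.1f} tok/s")
```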
That’s the most you can allocate statically; dynamically it can grab as much as the kernel lets it. On Linux that’s half the RAM by default, but it’s configurable.
It’s the same in Linux: it’s a BIOS setting. I’ve tried Arch, PopOS, and NixOS. The RAM speed is a constant. The GPU can run inference faster, but not dramatically so; it’s limited by the RAM speed. But being able to load the entire model into VRAM (even virtual) helps when running larger models.