Confused: Flux 2 Dev - 130GB Overload

I’m very new to Framework, and I’m also new to APUs/iGPUs, so I’d appreciate some straightforward advice.

I’ve got a Framework Desktop Max+ 395 with 128GB RAM on the way.

Specs:

  • 16 cores / 32 threads

  • 3.0GHz base clock

  • Up to 5.1GHz boost

  • 64MB L3 cache

  • Radeon 8060S integrated graphics

  • 128GB LPDDR5x-8000

  • Wi-Fi 7 and 5Gbit Ethernet

I already own a separate machine with a 9950X3D, 2 x RTX 3090 Founders Editions, and 96GB RAM. It’s a very strong machine, but the obvious limitation is VRAM, with 24GB per card.

One of the reasons I was torn between buying something like a 48GB workstation GPU and going for the Framework Desktop instead was because I use ComfyUI a lot, and more and more models are now appearing that simply do not fit into 24GB VRAM.

What’s confusing me is this: when I try to run something like Flux 2 Dev in ComfyUI, it seems to use huge amounts of memory — around 130GB — and then stops. I was under the impression this machine would be able to handle larger models because of the shared memory setup, so I’m struggling to understand why it’s eating all available memory and still not completing.

I’m not especially technical, so I’d really appreciate replies in plain English rather than anything too deep into Python, coding, or command-line fixes.

Has anyone got any advice on what I should realistically expect from this machine, and whether I’m misunderstanding how the memory works?

At the moment I’m feeling rather deflated.

Flux 2 Dev is a 64GB safetensors file, so the weights themselves should fit in memory. I suspect the KV cache is what’s pushing your machine over 130GB. Compressing it, or limiting it to an 8-bit quant, may help.
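As a rough back-of-the-envelope sketch (the parameter count and overhead figures below are illustrative assumptions, not measured Flux 2 Dev numbers), this is why a 64GB checkpoint can still blow past 128GB at 16-bit precision, while an 8-bit quant leaves headroom:

```python
# Illustrative memory arithmetic -- not measured Flux 2 Dev figures.
def weight_footprint_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: 1B params at 8 bits is roughly 1 GB."""
    return n_params_billion * bits_per_weight / 8

# A 64 GB bf16 (16-bit) checkpoint implies roughly 32B parameters.
params_b = 32
bf16_gb = weight_footprint_gb(params_b, 16)  # ~64 GB of weights
fp8_gb = weight_footprint_gb(params_b, 8)    # ~32 GB of weights

print(f"bf16 weights: ~{bf16_gb:.0f} GB, fp8 weights: ~{fp8_gb:.0f} GB")
# On top of the weights come activations, caches, the text encoder, VAE,
# and the OS itself -- easily tens of GB more. At 16-bit that overhead can
# push the total past 128 GB of unified memory; quantizing the weights to
# 8-bit roughly halves their footprint and brings the total back in range.
```

The exact overhead depends on resolution, batch size, and how ComfyUI loads the components, but the general shape of the problem is the same: shared memory gives you a big pool, not an unlimited one.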

I assume your machine is running Windows, which takes up additional memory as well?

In any case, have you considered this path?


Hi Jason,

Thanks for your reply — that makes sense around the runtime overhead and how memory usage can go well beyond the raw model size. I’ll have a closer look at how things are being loaded and whether limiting or compressing the cache has any impact.

Just to clarify my setup — I’m running Linux across both machines (Kubuntu on my main rig and Ubuntu on the Framework), so there’s no Windows overhead in play here.

I’ve actually got two separate machines by design. The main system started as a dual 3090 setup, and I’ve now added a PNY NVIDIA RTX Pro 5000 Blackwell (48GB), so I’ve got a bit more headroom on the CUDA side.

Alongside that, I’ve picked up the Framework/Strix Halo machine to explore unified memory. That’s less about replacing CUDA and more about seeing how far I can push larger models that don’t comfortably fit within even 48GB VRAM.

So effectively I’m running a stable CUDA setup alongside a higher-ceiling experimental system.

The toolbox approach you shared looks interesting — it seems like a more structured way of handling the ROCm and memory side of things, so I’ll take a proper look rather than trying to brute-force it in a standard setup.

Appreciate you pointing me in that direction.

Best,
Dunc