Tuning the Framework: Shared memory being blocked?

Setup: Framework Desktop, Ryzen™ AI Max+ 395, 64 GB.
(X)ubuntu 24.04.3, kernel 6.14.0-37-generic.
Running LM Studio 0.3.36-1 (x64).
Nothing else running other than the CLI.

dmesg | grep "amdgpu.*memory" :
[ 5.129919] [drm] amdgpu: 512M of VRAM memory ready
[ 5.129920] [drm] amdgpu: 31789M of GTT memory ready.
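The same totals can also be read from sysfs at runtime. A minimal sketch, assuming the amdgpu device shows up as card0 (the card index may differ on other setups):

```shell
# Convert the byte counters amdgpu exposes in sysfs to MiB.
to_mib() { awk -v b="$1" 'BEGIN { printf "%d", b / (1024 * 1024) }'; }

node=/sys/class/drm/card0/device   # assumption: amdgpu is card0
if [ -r "$node/mem_info_gtt_total" ]; then
  echo "VRAM total: $(to_mib "$(cat "$node/mem_info_vram_total")")M"
  echo "GTT total:  $(to_mib "$(cat "$node/mem_info_gtt_total")")M"
else
  echo "amdgpu sysfs counters not found under $node"
fi
```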

I had a problem with shared memory on the Framework, and while I found a workaround,
I don’t understand why it worked, or whether there is a better solution.

When trying to load models where model + context was above a certain size,
the model would fail to load with this error:
Failed to load model
Failed to initialize the context: vk::Queue::submit: ErrorDeviceLost

Example:
gemma-3-27b-it-qat-q4_0-gguf loads with a context length of 32k or below,
but 34k or above gave the error above.
Other models gave a similar result, but at different context lengths.

Note that it wasn’t the program’s warning,
“Model loading was stopped due to insufficient system resources.”,
but an actual failure that disabled Vulkan and effectively required
restarting the system for the device to become un-lost.

Watching htop as the models loaded, in all cases
the load failure happened when memory usage rose to about 34.9 GB (out of 62.1 GB).
As the system itself was using ~2.1 GB, about 20 GB was “missing”.

So, using the debug method of trying everything non-fatal once,
I changed a run parameter from the default:
Offload KV Cache to GPU Memory: on
to
Offload KV Cache to GPU Memory: off

I was then able to raise the context and load model + context up to 55.6 GB/62.1 GB,
meaning no real missing memory, allowing much larger context windows.
And if the total size was too large, I now got the “insufficient system resources” warning instead.

So maybe the KV cache was locking the shared GPU memory,
limiting the model + context window to the remaining memory?
But the numbers don’t seem to add up correctly for that (GTT memory was 31 GB).
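For what it’s worth, a back-of-envelope check with the numbers quoted above (assuming htop is reporting GiB, and assuming GTT + VRAM is the relevant pool) puts the failure point closer to the limit than it first appears:

```shell
# Back-of-envelope only, using values from this thread; integer shell
# arithmetic in tenths of GiB to avoid floating point.
gtt_mib=31789                                      # dmesg: GTT memory ready
vram_mib=512                                       # dmesg: VRAM memory ready
fail_tenths=349                                    # htop at failure: 34.9 GiB
base_tenths=21                                     # idle system usage: 2.1 GiB
model_tenths=$(( fail_tenths - base_tenths ))      # model+context at failure
gpu_tenths=$(( (gtt_mib + vram_mib) * 10 / 1024 )) # GPU-visible pool
echo "model+context at failure: $model_tenths tenths of GiB (32.8 GiB)"
echo "GTT + VRAM pool:          $gpu_tenths tenths of GiB (~31.5 GiB)"
```

By this reading the two differ by only about 1.3 GiB, so the default GTT cap (which TTM normally sets to roughly half of system RAM) may be the limiter rather than memory being locked. Treat this as a guess, not a diagnosis.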

Also, Linux is supposed to allocate GTT memory dynamically, so it isn’t removed from the CPU’s memory pool until it is actually allocated.

So my question is:
why was my model + context memory usage limited to 34.9 GB
until I turned Offload KV Cache to GPU Memory to “off”?

Grateful for any insights into how shared memory on the Framework works.

This is all a bit too in-depth for me to really comprehend. However, I also could not get LM Studio to use more than 50% of my unified memory via the GUI settings, no matter which Linux kernel arguments I set for GTT usage.

I had no such problems with llama.cpp, however, without changing any system settings. It used the entire GTT, up to 61 GB out of 62 GB.
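For anyone curious what the llama.cpp equivalent looks like: a hypothetical invocation of roughly this shape (the model path and sizes are placeholders; `-ngl`, `-c`, and `--no-kv-offload` are real llama.cpp flags, the last one being the analogue of LM Studio’s KV-cache toggle):

```shell
# Sketch only: path and sizes are placeholders, adjust for your setup.
#   -ngl 99          offload all layers to the GPU (Vulkan build)
#   -c 32768         context length
#   --no-kv-offload  keep the KV cache in host memory
cmd='llama-server -m ./gemma-3-27b-it-qat-q4_0.gguf -ngl 99 -c 32768 --no-kv-offload'
echo "$cmd"
```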


Apologies it was so detailed; I was trying to give enough information that someone who knew what they were doing would be able to explain it.

The key points were:
1. I could not load a model whose total memory usage was greater than 34.9 GB out of 62.1 GB.

2. If I tried, I got a hard error:
Failed to load model
Failed to initialize the context: vk::Queue::submit: ErrorDeviceLost

3. If I changed
Offload KV Cache to GPU Memory from on → off,
this constraint disappeared, and I was able to load models up to 55.6 GB/62.1 GB.
However, t/s went down by around 25%.

I saw a post suggesting this is possibly a bug in LM Studio that causes models to be loaded only into GTT “memory”. Your experience of having no problem with llama.cpp reinforces this.
If this is the bug, working around it by just unloading the KV cache from GPU memory would help.

I intend to test further over the next week by increasing GTT to 56 GB and seeing what effect it has.
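In case it helps anyone following along, the usual way people report raising the GTT cap is via kernel boot arguments. A sketch for /etc/default/grub, with the caveat that which parameter is honored varies by kernel version, so verify against your kernel’s documentation first (57344 MiB = 56 GiB; ttm.pages_limit counts 4 KiB pages, and 56 GiB is 14680064 of them):

```shell
# /etc/default/grub (sketch only; parameter support varies by kernel)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.gtt_size=57344 ttm.pages_limit=14680064"
# then: sudo update-grub && reboot
```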

Thanks for making it more succinct.

I think I see my misunderstanding. So you did not increase GTT to more than 50%. Yes, then llama.cpp would also fail.

My problem was that LM Studio was failing even after expanding GTT to an amount that should be more than enough for the loaded model plus context. llama.cpp, on the other hand, could use the full GTT defined by the kernel arguments without any problems.

Interesting that it also failed when you increased GTT.
Did it fail with
vk::Queue::submit: ErrorDeviceLost
or
“Model loading was stopped due to insufficient system resources.”
?

I plan to get back to this in a few days and test expanding GTT,
testing with
Offload KV Cache to GPU Memory both “off” and “on”.

I get an approximate idea of the memory usage at which it fails
by watching htop as the model loads.
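A possibly more direct alternative to htop: a sketch that reads amdgpu’s own GTT counter from sysfs (assuming the device is card0; it prints “n/a” if the counter isn’t there):

```shell
# Read the amdgpu GTT "used" counter in GiB (assumption: device is card0).
gtt_used_gib() {
  f=/sys/class/drm/card0/device/mem_info_gtt_used
  [ -r "$f" ] || { echo "n/a"; return; }
  awk '{ printf "%.1f", $1 / (1024 ^ 3) }' "$f"
}

# Poll once a second while the model loads, e.g.:
#   while sleep 1; do echo "GTT used: $(gtt_used_gib) GiB"; done
echo "GTT used: $(gtt_used_gib) GiB"
```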

One issue is that I’m not ready to go “bare metal” and use llama.cpp directly as you are doing.
As I have just started, I appreciate the assistance LM Studio gives,
both in finding and downloading models and in setting model parameters.

Ollama is not an alternative; from
strixhalo.wiki/AI/AI_Capabilities_Overview

“Ollama does not support Vulkan or AMD GPUs in general very well in general.
For this and other reasons, it is not recommended.”