Setup: Framework desktop, Ryzen™ AI Max+ 395 - 64GB.
(X)ubuntu 24.04.3, kernel 6.14.0-37-generic.
Running LM Studio (LM-Studio-0.3.36-1-x64).
Nothing else running other than the CLI.
dmesg | grep "amdgpu.*memory" :
[ 5.129919] [drm] amdgpu: 512M of VRAM memory ready
[ 5.129920] [drm] amdgpu: 31789M of GTT memory ready.
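Besides dmesg, amdgpu also exposes these pool sizes through sysfs, which is handy for cross-checking. A minimal sketch (the `card0` path is an assumption and may differ on your system):

```python
from pathlib import Path

# Read amdgpu's memory pool counters from sysfs (raw values are in bytes).
# card0 is an assumption; adjust if your GPU enumerates differently.
def read_pools(card="card0"):
    base = Path(f"/sys/class/drm/{card}/device")
    names = ("mem_info_vram_total", "mem_info_gtt_total", "mem_info_gtt_used")
    return {n: int((base / n).read_text()) // (1024 * 1024)
            for n in names if (base / n).is_file()}

if __name__ == "__main__":
    for name, mib in read_pools().items():
        print(f"{name}: {mib} MiB")
```

On this machine the totals should line up with the dmesg lines above (512M VRAM, 31789M GTT).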
I had a problem with shared memory on the Framework, and while I found a workaround,
I don't understand why it worked, or whether there's a better solution.
When trying to load models where model + context was above a certain size,
the model would fail to load with the error:
Failed to load model
Failed to initialize the context: vk::Queue::submit: ErrorDeviceLost
example:
gemma-3-27b-it-qat-q4_0-gguf loads with a context length of 32k or below,
but 34k or above gave the error above.
Testing other models gave similar results, each failing at a different context length.
Note that this wasn't the program's warning
“Model loading was stopped due to insufficient system resources.”
but an actual failure that disabled Vulkan and effectively required
restarting the system for the device to become un-lost.
Watching htop as the models loaded, in all cases
the load failure happened when memory usage rose to about 34.9 GB (out of 62.1 GB).
Since the system itself was using ~2.1 GB, about 20 GB was “missing”.
So, using the debug method of trying every non-fatal change once,
I changed a run parameter from the default:
Offload KV Cache to GPU memory: on
to
Offload KV Cache to GPU memory: off
I was then able to raise the context and load model + context up to 55.6 GB/62.1 GB,
meaning no real missing memory, allowing much larger context windows. And if the total size was too large, I now got the “insufficient system resources” warning instead.
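For what it's worth, a quick back-of-envelope pass over the numbers above (values copied from htop and dmesg; htop's GB and dmesg's MiB are mixed here, so treat everything as approximate):

```python
# Rough arithmetic on the observed figures.
baseline  = 2.1    # htop: usage before loading a model, GB
fail_at   = 34.9   # usage at which the Vulkan device was lost (offload on)
ok_max    = 55.6   # max reachable with KV cache offload turned off

missing = ok_max - fail_at             # capacity recovered by the workaround
model_ctx_at_fail = fail_at - baseline # model + context when the device was lost

# dmesg pools, MiB converted to GiB
gtt_plus_vram = (31789 + 512) / 1024

print(f"recovered capacity:    {missing:.1f} GB")            # ~20.7
print(f"model+context at fail: {model_ctx_at_fail:.1f} GB")  # ~32.8
print(f"GTT+VRAM pools:        {gtt_plus_vram:.1f} GiB")     # ~31.5
```

So model + context at the failure point lands within a couple of GB of the combined GTT + VRAM pools, which is suggestive but, as discussed below, not an exact match.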
So maybe the KV cache was locking the shared GPU memory,
limiting the model + context window to the remaining memory?
But the numbers don't seem to add up for that (GTT memory was 31 GB).
Also, Linux is supposed to allocate GTT memory dynamically, so it isn't removed from the CPU's memory pool until it is actually allocated.
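One way to sanity-check that dynamic-allocation behaviour is to sample amdgpu's GTT-used counter while a model loads. A sketch, again assuming the `card0` sysfs path:

```python
import time
from pathlib import Path

# Sample amdgpu's GTT-used counter while a model loads. If GTT really is
# allocated on demand, the value should climb as the model loads rather
# than jump straight to a reserved size. card0 is an assumption.
def sample_gtt_used(samples=5, interval=1.0, card="card0"):
    path = Path(f"/sys/class/drm/{card}/device/mem_info_gtt_used")
    readings = []
    for _ in range(samples):
        if path.is_file():
            mib = int(path.read_text()) // (1024 * 1024)
            readings.append(mib)
            print(f"{mib} MiB GTT in use")
        time.sleep(interval)
    return readings

if __name__ == "__main__":
    sample_gtt_used()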
So my question is:
why was my model + context memory usage limited to 34.9 GB
until I turned “Offload KV Cache to GPU memory” off?
Grateful for any insights into how shared memory on the Framework works.