Responsible LLM Use

(For context I have an i5-1334U Framework 12 w/single-channel 48GB RAM. I was not originally purchasing the system with the intent of running AI locally, nor do I have the budget for a system like that.)

I’m trying to find a way to use LLMs responsibly. Of course I could just use online services in extreme moderation, but I’ve found that LLM help accelerates much of my work greatly without reducing quality or my own knowledge- I usually use it for tedious things or personalized explanations or to write code I don’t have the time to get familiar with yet. I’ve tried running ollama locally but the only model that’s smart enough for my use case and fits in RAM (qwen3-vl:32b) is extremely slow to run and I often find myself waiting an hour or more per response, which negates the benefit of using an LLM in the first place. But I also don’t want to put more demand on online services and give away my convos to big corpos + inflate their usage numbers. What do I do?

Trying to run an LLM on a Framework 12 is going to get the results most people would expect: it has neither the memory capacity nor the processing power for it.

Purchase a machine (or better yet build one) that can handle those requests separate from the Framework 12. Run the model there and remote into the machine.

If that is not affordable, then subscribing to a service is the only viable option if you genuinely depend on an LLM. Then, once more money is saved up, purchase the hardware for a self-hosted setup.

Sorry for the painful truth, but your current hardware is far too weak for larger local LLMs like qwen3-vl:32b. The single-channel RAM makes it even worse.

Even on the Framework Desktop (board), the qwen3-vl:32b model only generates about 10 tokens per second.

Prompt processing needs compute power, and token generation needs RAM bandwidth.
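As a rough back-of-envelope check: on CPU, each generated token has to stream (roughly) every model weight from RAM once, so memory bandwidth caps the speed. The ~18 GB model size and ~30 GB/s single-channel bandwidth below are assumptions for illustration, not measured numbers:

```python
# Bandwidth-bound ceiling on CPU token generation: each token requires
# reading approximately the whole set of model weights from RAM.
def max_tokens_per_second(model_size_gb: float, ram_bandwidth_gbs: float) -> float:
    return ram_bandwidth_gbs / model_size_gb

# Assumed: ~18 GB for a Q4-quantized 32B model, ~30 GB/s usable
# single-channel bandwidth. Real-world throughput lands below this.
print(round(max_tokens_per_second(18, 30), 1))
```

A ceiling under 2 tok/s on single-channel RAM is consistent with the slow responses described above; dual-channel roughly doubles the ceiling, which is why it matters so much here.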

The problem is that this model is a dense model. You could try the Qwen3 (VL) 30B-A3B model in one of its flavors (Instruct, Thinking, Coder); because it is an MoE (mixture-of-experts) model, it will be much faster.
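To put a number on "much faster": in the bandwidth-bound regime, per-token speed scales roughly with how many parameters are actually read per token. The counts below come from the model names (32B dense vs. ~3B active for A3B); the ratio is only a rough ceiling:

```python
# Why MoE generates faster on the same hardware: only the active
# experts' parameters are read from RAM per token, not the full model.
def moe_speedup(dense_params_b: float, active_params_b: float) -> float:
    # Bandwidth-bound estimate: speedup ~ dense / active parameter traffic.
    return dense_params_b / active_params_b

# 32B dense vs. 30B-A3B (~3B active per token): roughly 10x less memory
# traffic per token. Note the full 30B must still fit in RAM; only the
# per-token read shrinks.
print(round(moe_speedup(32, 3), 1))
```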

Additionally, you could try llama.cpp instead of Ollama, because this might also give more performance, BUT don't expect something like a 2x or 3x increase.

Regarding online models, you could try the Mistral AI models, where the provider is located in France (Europe). In the end, you have to trust the vendor.

Update: I’m running the Q4_K_M quant of Qwen3-VL:32b on llama.cpp w/16k context at about 0.8 tok/s. About as smart as the free version of ChatGPT, whatever the latest model for that may be, with double the context. Good enough for me. Better than straining the water and power supplies of people living near datacenters.

I have 48GB of RAM and usually have ~32GB free at any time on Win11. During inference I have only about 2-5GB free, but I can still comfortably use the computer if I set llama.cpp’s process priority to Low.
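For reference, a minimal Windows launch along those lines might look like this. The model filename is a placeholder; `start /low` applies the Low priority mentioned above, and `-c 16384` is llama.cpp's context-size flag (a VL model may additionally need its mmproj file, per llama.cpp's multimodal docs):

```shell
REM Launch llama.cpp's server at Low priority so the desktop stays
REM responsive while the model generates (model path is a placeholder).
start "" /low llama-server.exe -m Qwen3-VL-32B-Q4_K_M.gguf -c 16384
```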

LM Studio should run 7B to 14B models at usable speeds (~3 tokens/s) on that hardware.

I use it on the go on my FW13 with a 7640U during my commute all the time. I use it to add documentation to Python functions, generate examples for code libraries, and brainstorm D&D campaign ideas, like generating dozens of NPCs thematic to a village for the campaign.

Don’t expect capabilities like image generation, or the higher-quality code and text generation of larger models.
