Let’s not get too caught up in hardware selection. If you can justify the expense of 128GB worth of Nvidia GPUs and the rig or cluster needed to run them, knock yourself out - it will certainly outperform a Strix Halo or DGX Spark platform. The main appeal of these platforms is they allow you to run medium-sized (30-100b) models locally for a fraction of the cost of an equivalently-sized set of GPU cards, and without the orders-of-magnitude slowdown that comes with running entirely on CPU.
Likewise, let’s not concern ourselves with whether to run models in the cloud or locally. There are many arguments for running locally which have nothing to do with ethics or performance: my work, for example, cannot by contract be used on cloud servers, so use of cloud-based AI is out for anything but broad brainstorming. Of course, an interest in the underlying technology, a habit of tinkering, and a drive towards simplicity are additional factors.
So we’re here and we’re talking about Framework desktops and the models which can be coerced to run on them (which seems to cap out at the MiniMax M2 series - I have not yet found a way to run them without resorting to egregiously low quantizations, or a confined context size).
Now, I know some pedant is going to come along and correct this no matter what I write, so let’s just dive right in. First up, number of parameters. Personal observation indicates that any model with at least 10b parameters contains the “copy of the internet” that makes it perform like a search engine. Reasoning requires additional parameters: the “better”/stronger the reasoning, the more parameters required, with reasoning starting to be usable at around 30b (try the various Olmo and QWQ sizes to observe the improvement). Tool use also requires additional parameters, with more parameters translating to more accurate tool selection and invocation, and again the 30b-sized models seem to be the practical minimum. There are model variants like Mixture of Experts (MoE) and “dense” models that can do more with less, but for simplicity let’s just say 30b is a good minimum model size, and that anything larger than that should provide more reliable tool use and more cogent reasoning (not necessarily more knowledge, which is what is often assumed).
Next, context window. In your puzzle experiment, you encountered two things: a context window limit, and the context compaction/compression meant to compensate for that limit. The latter is important to know about, as every chat with an AI will eventually bring it into play. Briefly, there is a smaller model that takes the context window (the chat so far, all uploaded documents, etc) and condenses it to a summary text which will fit in the model’s context window. Depending on how this subsystem works, the conversation could degrade over time, as the model creates a summary of a summary of a summary, or the model discards one-time responses that are key to the problem at hand but which statistically appear unimportant (for example, because the user finds them so obvious they do not discuss them further).
The context window itself brings us back to the amount of GPU memory required. A 256K context window can require 25GB of GPU memory per instance - again, this is from observation, and it depends on both the quantization of the model and that of the Ollama cache (or equivalent in other software). I found this surprising, as it implies that 256 thousand tokens take 25 billion bytes of memory, or roughly 100KB per token, which just seems insane. So there is clearly more to context windows than I am aware of, but it’s a lot, and it has to be in the GPU memory just like the parameters. This is often expressed as, “you can have a large model or a large context window, but you cannot have both”, though with a unified-memory architecture like Strix or Spark you often can.
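For anyone who wants to check the per-token arithmetic, here is a quick Python sketch. The 25GB and 256K figures are the observed ones from above; reading "256K" as 256 × 1024 tokens and "GB" as 10^9 bytes is an assumption on my part, and the function name is just for illustration.

```python
# Back-of-envelope: how much GPU memory does each token of context cost?
# Inputs are the observed figures quoted above, not measured constants.

def bytes_per_token(cache_bytes: float, context_tokens: int) -> float:
    """Average memory consumed per token of context."""
    return cache_bytes / context_tokens

observed_cache_bytes = 25e9    # ~25GB observed for the context window
context_tokens = 256 * 1024    # assumption: "256K" means 262,144 tokens

per_token = bytes_per_token(observed_cache_bytes, context_tokens)
print(f"{per_token / 1000:.0f} KB per token")
```

Under those assumptions it comes out a little under 100KB per token, so the rough figure holds either way you count the "K".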
So, to take the long winding road back to your question: the size of the model before quantization determines the quality of the results it produces, and included in that size are the capabilities required to produce those results (reasoning, tool use, etc). Quantization will reduce that quality, but the general view seems to be that degradation only becomes noticeable at Q3 and lower - so it is better to run a Q4-quantized large model than a full-sized small model. The context window determines how much raw data you can give the model to work on. The choice of discrete GPU cards versus unified-memory architectures does not impact either: discrete cards will perform the computation faster, unified memory will allow larger models and context windows to be used.
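To put rough numbers on the "Q4 large model versus full-sized small model" trade-off, here is a minimal Python sketch of the weight memory alone. It assumes nominal bit widths (4 bits for Q4, 16 bits for an unquantized model); real quantized files run somewhat larger because they also store scaling data, and the 100b/30b pairing is a hypothetical example, not a benchmark.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
# Nominal bit widths only; real quantized files run somewhat larger.

def weight_bytes(params_billions: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to hold the model weights."""
    return params_billions * 1e9 * bits_per_weight / 8

# Hypothetical pairing: a 100b model at 4 bits versus a 30b model at 16 bits.
large_q4 = weight_bytes(100, 4)
small_fp16 = weight_bytes(30, 16)

print(f"100b @ 4-bit : {large_q4 / 1e9:.0f} GB")
print(f" 30b @ 16-bit: {small_fp16 / 1e9:.0f} GB")
```

By this estimate the much larger model at 4 bits needs less memory for its weights than the small one at 16 bits, which is why the trade tends to favor the quantized large model. Context memory comes on top of this in both cases.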
My personal view on the matter is that AI is THE slowest way to solve a problem computationally. If you want a result computed quickly, use standard software. AI models save you time by reducing the actual work you need to do, not by being amazingly fast.
EDIT: While working on getting MiniMax Q3 working, I asked it about context windows, because attempting to search for implementation details on Google just got me four pages of the same AI-generated cursory overview.
Some highlights:
- When you set num_ctx=131072, Ollama allocates memory for KV-caches across all possible layers, not just what’s needed for your current session
- The KV-cache grows proportionally to: (layers × context_length × hidden_dim)
…hence the wildly different memory requirements for the same context size across different models.
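As a rough illustration of that proportionality, here is a Python sketch of a KV-cache estimate. The factor of 2 (one key and one value vector per layer per token) and the bytes-per-element term are my additions to the quoted formula. I have also named the width `kv_dim` rather than `hidden_dim`, on the assumption that models using grouped-query attention cache something narrower than the full hidden dimension. Both model shapes are made up for the example and do not describe any real model.

```python
# Sketch of a KV-cache size estimate, following the proportionality above.
# The factor of 2 (keys and values) and bytes_per_element are my additions;
# the model shapes below are made up for illustration, not any real model.

def kv_cache_bytes(layers: int, context_length: int, kv_dim: int,
                   bytes_per_element: float = 2.0) -> float:
    """Estimate KV-cache size: one key and one value vector per layer per token."""
    return 2 * layers * context_length * kv_dim * bytes_per_element

context = 131072  # num_ctx from the quote above

# Two hypothetical models with the same context size but different shapes.
narrow = kv_cache_bytes(layers=32, context_length=context, kv_dim=1024)
wide = kv_cache_bytes(layers=64, context_length=context, kv_dim=4096)

print(f"narrow: {narrow / 2**30:.0f} GiB")
print(f"wide  : {wide / 2**30:.0f} GiB")
```

Same context length, very different totals, purely from layer count and cached width. Passing `bytes_per_element=1.0` approximates an 8-bit cache, which halves the estimate.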