I have some questions about that:
- What did you use as the LLM server that can do that?
- The KV cache is split across all layers, so doing this needs many exchanges between RAM ↔ VRAM. With only a PCIe x4 link that is really slow (8 GB/s vs 256 GB/s).
- Do you have any benchmarks of that config on other platforms?
- Using an NVIDIA dGPU means mixing CUDA and HIP at runtime. What do you use for that?
- What speedup did you expect?
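
To put rough numbers on the bandwidth concern in my second question, here is a back-of-the-envelope sketch. It only uses the two figures I quoted (8 for the PCIe x4 link, 256 for local memory, taken in the same unit). The payload size is a made-up value for illustration, and the model ignores latency, overlap, and protocol overhead, so treat it as an upper bound on how well the link could do.

```python
def transfer_seconds(payload: float, bandwidth_per_s: float) -> float:
    """Time to move `payload` at a sustained bandwidth, ignoring latency.

    Both arguments must use the same size unit (e.g. GB and GB/s).
    """
    if bandwidth_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload / bandwidth_per_s


# The two bandwidth figures quoted in the question above (same unit for both).
PCIE_X4_BANDWIDTH = 8.0
LOCAL_MEMORY_BANDWIDTH = 256.0

# Hypothetical amount of KV data that has to cross the link per step.
# This is an assumption for illustration, not a measured value.
KV_PAYLOAD = 2.0

over_pcie = transfer_seconds(KV_PAYLOAD, PCIE_X4_BANDWIDTH)
in_local_memory = transfer_seconds(KV_PAYLOAD, LOCAL_MEMORY_BANDWIDTH)
slowdown = over_pcie / in_local_memory

print(f"over PCIe x4:    {over_pcie:.4f} s")
print(f"in local memory: {in_local_memory:.4f} s")
print(f"slowdown factor: {slowdown:.0f}x")
```

Whatever the payload size, the ratio stays the same: any KV data that has to cross the link costs about 32 times more than reading it locally, which is why I am asking about the expected speedup.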