Request: verify dGPU support

I have some questions about that:

  • Which LLM server did you use that can do that?
  • The KV cache is split across all layers, so doing this needs many RAM ↔ VRAM exchanges; over only a PCIe x4 link that is really slow (8 GB/s vs 256 GB/s).
  • Do you have any benchmarks of that configuration on another platform?
  • Using an NVIDIA dGPU means mixing CUDA and HIP at runtime; what do you use for that?
  • What speedup do you expect?
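
To put rough numbers on the bandwidth point, here is a small Python sketch. The two bandwidths are the figures quoted above, taken as GB/s; the 4 GB KV cache size is a made-up value purely for illustration, and the model ignores latency and protocol overhead.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move size_gb gigabytes over a link with the given bandwidth."""
    return size_gb / bandwidth_gb_per_s


PCIE_X4_GB_PER_S = 8.0     # figure quoted above for the PCIe x4 link
FAST_PATH_GB_PER_S = 256.0 # figure quoted above for the faster path

kv_cache_gb = 4.0  # hypothetical KV cache size, not a measured value

pcie_time = transfer_seconds(kv_cache_gb, PCIE_X4_GB_PER_S)
fast_time = transfer_seconds(kv_cache_gb, FAST_PATH_GB_PER_S)
slowdown = pcie_time / fast_time

print(f"PCIe x4:   {pcie_time:.4f} s per full KV transfer")
print(f"Fast path: {fast_time:.4f} s per full KV transfer")
print(f"Slowdown:  {slowdown:.0f}x")
```

Whatever size is plugged in, the ratio stays the same, so any step that has to pull the KV cache across the PCIe link pays that factor.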