Framework 16 and Deep Learning

In the official demos I’ve seen, they always run it with the additional power adapter, which has to be at least 25W, so you can’t really use your phone charger for it; you’d need a dedicated one. Running it from your laptop is likely to put additional strain on the battery.

It will definitely put additional strain on the battery, as it’s 20W more draw!

Question: Do you have to run it locally? GitLab offers free, unlimited use of an NVIDIA T4.

I found this out myself the other night after poking around a bit. I came here to see if there are any Framework 16 users running PyTorch.

I spent a fair bit of time fighting the native CUDA-ness of PyTorch, so I thought I would post the code I got working here.

First, to select the device with the most VRAM, falling back to CPU (device -1):

import os

# Tell ROCm to treat the GPU as gfx1100 (11.0.0); must be set before torch is imported
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"
import torch

# Pick the visible device with the most VRAM, falling back to CPU (-1)
cuda_dev = -1
if torch.cuda.device_count() > 0:
    print([torch.cuda.get_device_properties(x).name for x in range(torch.cuda.device_count())])
    vram = [torch.cuda.get_device_properties(x).total_memory for x in range(torch.cuda.device_count())]
    cuda_dev = vram.index(max(vram))
torch.cuda.set_device(cuda_dev)  # no-op when cuda_dev is negative
torch.cuda.empty_cache()
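
A small matmul on the selected device is a cheap way to confirm the gfx override works at the kernel level, not just for device enumeration (a minimal sketch, assuming a GPU was found):

# Quick kernel-level sanity check; only meaningful if cuda_dev >= 0
if cuda_dev >= 0:
    x = torch.randn(1024, 1024, device=f"cuda:{cuda_dev}")
    y = x @ x
    torch.cuda.synchronize()      # force the kernel to actually run
    print(y.mean().item())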

To force use of device 0 (the Radeon 7700):

putenv("ROCR_VISIBLE_DEVICES", "0")
import torch # the ENV var should precede torch import

Some debugging code to make sure everything is sane:

print("CUDA device   :" + torch.cuda.get_device_properties(cuda_dev).name)
print("VRAM total    : %d" % torch.cuda.get_device_properties(cuda_dev).total_memory)
print("     allocated: %d" % torch.cuda.memory_allocated(cuda_dev))

This works for most models when using device_map="auto":

model_name = "TheBloke/CodeLlama-7B-Python-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name) 

from rocm-smi:

VRAM%  GPU%
58%   100%
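
To actually exercise the model, a plain generate() call works; the prompt and token budget below are only placeholders:

prompt = "def fibonacci(n):"   # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))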

For larger models like Stable Diffusion, which will not fit entirely on the GPU, you will get a HIP out-of-memory error:

torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 1.62 GiB. GPU 
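
Since it surfaces as a normal Python exception, you can catch it and fall back instead of crashing; run_pipeline() below is just a hypothetical stand-in for whatever call triggers the allocation:

try:
    result = run_pipeline()          # hypothetical stand-in for the failing forward pass
except torch.cuda.OutOfMemoryError:  # the exception class shown above
    torch.cuda.empty_cache()         # drop cached blocks before retrying with offloading
    result = None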

You can try to use accelerate, but it fails on models such as Stable Diffusion that do not have named parameters. But enable_model_cpu_offload() works pretty well:

import torch
from diffusers import StableDiffusion3Pipeline

# model_dir points at a local SD3 snapshot; text_encoder_3/tokenizer_3=None drops the big T5 encoder
pipe = StableDiffusion3Pipeline.from_pretrained(model_dir, torch_dtype=torch.float16, local_files_only=True, text_encoder_3=None, tokenizer_3=None, low_cpu_mem_usage=False, device_map=None).to("cuda")
pipe.enable_model_cpu_offload()

Again, from rocm-smi:

VRAM%  GPU%
84%   100%
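
For completeness, generating an image from the offloaded pipeline is just the normal diffusers call; the prompt and step count here are only examples:

image = pipe("a watercolor of a laptop on a desk", num_inference_steps=28).images[0]
image.save("out.png")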

Getting back to accelerate, here is the config file it generated when I instructed it to use GPUs 0 and 1 (the Radeon 7700 and the Ryzen integrated GPU, respectively):

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false  
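
For context, that file is what accelerate launch picks up; a script run that way wraps its training objects with Accelerator in the usual way. Here is a toy sketch, with the tiny linear model and random data there only to make it self-contained:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # reads the saved config when started via accelerate launch

# Toy model and data, just to show the wiring
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=8)

model, optimizer, data = accelerator.prepare(model, optimizer, data)
for x, y in data:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()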

The problem with accelerate is that unless you can run infer_auto_device_map(model), it doesn’t split the model across multiple GPUs; it just falls back to the CPU. And when the model doesn’t have named parameters, infer_auto_device_map() throws an error.
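
For reference, this is roughly how infer_auto_device_map() is meant to be used on an ordinary transformers checkpoint (a sketch following the accelerate docs; model_name and the memory budgets are placeholders), and it is exactly the step that blows up when a model has no named parameters:

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():  # builds the model on the meta device, no weights loaded
    empty_model = AutoModelForCausalLM.from_config(config)

# Plan the split across the 7700 (device 0) and the iGPU (device 1)
device_map = infer_auto_device_map(empty_model, max_memory={0: "8GiB", 1: "3GiB"}, dtype=torch.float16)
print(device_map)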

Well, that’s where my Framework 16 is at in terms of PyTorch usage. Hope this is useful.

EDIT: After looking into the accelerate docs, I found a reliable way to load onto the 7700, then the integrated GPU, and finally to memory/disk, using just diffusers/transformers:

# Budget 8 GiB on the 7700 (device 0) and 3 GiB on the iGPU (device 1); the rest spills to CPU RAM/disk
memspec = {0: "8GiB", 1: "3GiB"}
model = AutoModelForCausalLM.from_pretrained(model_name, local_files_only=True, device_map="auto", max_memory=memspec, torch_dtype=torch.float16, offload_folder="/tmp/accelerate-weights")
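
To see where everything actually landed, transformers records the resolved placement on the model as hf_device_map:

print(model.hf_device_map)   # maps each module to 0, 1, "cpu", or "disk"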

It automagically works just as you’d like it to. I set device 1, the iGPU, to 1GB less than I have allocated to it in the BIOS (4GB), as I had observed that my system uses a little under 1GB of memory on the iGPU during normal use. Just for fun, I tried setting the value for the iGPU to "4GiB" and got the following error:

RuntimeError: CUDA error: HIPBLAS_STATUS_ALLOC_FAILED when calling hipblasCreate(handle)

So that’s what happens when you reach too far.
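
One way to pick those budgets less blindly is to ask PyTorch how much VRAM is actually free on each device before building memspec; a sketch, with the 512 MiB of headroom being nothing more than a guess:

headroom = 512 * 1024**2                         # leave some slack for the desktop/compositor
memspec = {}
for dev in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(dev)   # free/total VRAM in bytes
    memspec[dev] = f"{max(free - headroom, 0) // 1024**2}MiB"
print(memspec)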
