
Wrong device when using device="cpu" with torch.device #503

Open
chrisdesa opened this issue Jul 24, 2024 · 3 comments
Labels
wontfix This will not be worked on

Comments

@chrisdesa

System Info


  • transformers version: 4.43.1
  • Platform: Linux-5.15.0-112-generic-x86_64-with-glibc2.35
  • Python version: 3.12.2
  • Huggingface_hub version: 0.23.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0.dev20240513+cu121 (True)
  • GPU type: NVIDIA H100 80GB HBM3

Information

  • The official example scripts
  • My own modified scripts

Reproduction

import torch
import safetensors.torch

with torch.device('cuda:0'):
    print(safetensors.torch.load_file('filename.safetensors', device='cpu')['key'].device)

Expected behavior

The expected behavior is for the device='cpu' argument to override the torch default and load the tensors on the CPU, but the actual behavior is that the tensor is loaded onto the 'cuda:0' GPU. The device='cpu' argument seems to be interpreted as "whatever the default device is right now" rather than as the actual CPU.
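
For reference, a minimal probe (assuming standard PyTorch semantics for the torch.device context manager, not safetensors internals) showing the ambient default that any allocation without an explicit device picks up:

import torch

with torch.device("cuda:0"):
    print(torch.empty(0).device)                # cuda:0 -- the ambient default wins when no device is passed
    print(torch.empty(0, device="cpu").device)  # cpu    -- an explicit device is still honored

This would be consistent with device='cpu' not being forwarded all the way down to the underlying allocations.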

@Narsil
Collaborator

Narsil commented Jul 31, 2024

I understand the concern; however, I do not think this should be fixed.

with torch.device(x) changes the default allocation device, which interferes with a lot of safetensors' internal logic (it doesn't make it wrong, it makes it highly inefficient, since you're first allocating on the device before moving back to the CPU, which is about the slowest implementation possible).
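
As a rough illustration of the round trip this causes (a sketch of the effect, not safetensors' actual internals): an allocation that picks up the ambient default lands on the GPU first, and honoring device='cpu' then costs an extra device-to-host copy.

import torch

with torch.device("cuda:0"):
    staged = torch.empty(1024, 1024)  # ambient default -> allocated on cuda:0 first
    on_cpu = staged.to("cpu")         # extra device-to-host copy just to honor device='cpu'

# Versus allocating directly where the caller asked:
direct = torch.empty(1024, 1024, device="cpu")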

The behavior could be changed by being extremely defensive about global modifiers like this context manager. However, as a general rule I'd avoid relying on torch internals, since they tend to change a lot (this context manager didn't exist when this library was created).

Also, in the future there might be ways to skip the CPU allocation entirely, meaning the internals would have to be rewritten again, introducing potential breaking changes (which we try very hard to avoid in this library).

Given all these considerations, I'm marking this as wontfix, and instead encourage you and other users not to mix contradictory device locations between a context manager and the device argument.

So:

Either

with torch.device("cuda:0"):
    # Fix the context manager, that make everything much faster.
    with torch.device("cpu"):
      weights = load_file(filename)

Or

# Remove the context manager.
weights = load_file(filename, device="cpu")

Just to note that

with torch.device("cuda:0"):
   weights = load_file(filename)

is a perfectly valid expression of intent and currently works transparently (contrary to what would happen if we were defensive about the context manager).

@Narsil Narsil added the wontfix This will not be worked on label Jul 31, 2024
@stale stale bot removed the wontfix This will not be worked on label Jul 31, 2024
@Narsil Narsil added the wontfix This will not be worked on label Aug 8, 2024
@sdake

sdake commented Sep 2, 2024

I agree with you about broken designs (context managers have all kinds of problems, and we are inventing high-performance compute here, where cycles matter), so it's better to say no.

How viable would a deprecation warning be? If a context manager is detected -> warn.
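
A minimal sketch of what such a check could look like (a hypothetical helper, not part of the safetensors API): probe the ambient default device with a tiny allocation, which picks up any active with torch.device(...) context, and warn when it conflicts with the device the caller asked for.

import warnings

import torch

def warn_if_ambient_device_conflicts(requested: str = "cpu") -> None:
    # Hypothetical check: the empty allocation picks up any active
    # `with torch.device(...)` context, revealing the ambient default.
    ambient = torch.empty(0).device
    if ambient != torch.device(requested):
        warnings.warn(
            f"device='{requested}' requested while the ambient default device is "
            f"{ambient}; tensors may end up on {ambient} instead.",
            stacklevel=2,
        )

with torch.device("cuda:0"):
    warn_if_ambient_device_conflicts("cpu")  # would emit the warning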

@sdake

sdake commented Sep 2, 2024

@chrisdesa May I offer an interpretation of what is going on here? I believe your assertion about what is happening isn't quite right.

Here are the steps:

  1. load_file is set to load to cpu.
  2. Because of legacy PyTorch behaviors, the CPU copies the model from the mmap() region into a newly allocated malloc() buffer.
  3. The context manager then executes a Tensor.to() with the desired GPU target.
  4. This causes the CPU to copy the tensor to the GPU using a host-to-device copy.

One of the problems with this flow is that the CPU is entirely responsible for pushing data around. Another problem is that the CPU executes the memcpy() operation, which is very expensive. CUDA is a client-server system (the client is the CUDA API; the server is the kernel driver, firmware, and parts of the chip control plane). It is possible to offload the memory copy from the CPU to various DMA engines. As an example, it would be trivial to execute a host-to-device copy asynchronously in a CUDA stream (a stream is a client/server instance). The cudaMemcpyAsync() API asks the GPU hardware (using DMA) to copy the tensors to the GPU.
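
A rough PyTorch-level sketch of that idea (assuming the tensors are first staged in pinned host memory; the pinning step is itself a host copy): pinned buffers plus non_blocking copies on a dedicated stream let the copy engine (cudaMemcpyAsync under the hood) move the data while the CPU is free to do other work.

import torch
from safetensors.torch import load_file

weights = load_file("filename.safetensors", device="cpu")  # placeholder filename

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    gpu_weights = {
        name: t.pin_memory().to("cuda:0", non_blocking=True)  # async H2D via the DMA copy engine
        for name, t in weights.items()
    }
copy_stream.synchronize()  # wait for the transfers to finish before using the weights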

There are more advanced techniques that could involve no host memory buffer copies at all (why copy, when you could map instead?). For more detail, see: on-demand paged loading of models in vllm

Also, @Narsil, if you are the maintainer of safetensors, I am a super fan. Very nicely done. Now if we could only eject the 6 other serialization approaches we have in vllm ;-)
