
Wrong device when using device="cpu" with torch.device #503

Open
chrisdesa opened this issue Jul 24, 2024 · 3 comments
Labels
wontfix This will not be worked on

Comments

@chrisdesa

System Info


  • transformers version: 4.43.1
  • Platform: Linux-5.15.0-112-generic-x86_64-with-glibc2.35
  • Python version: 3.12.2
  • Huggingface_hub version: 0.23.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0.dev20240513+cu121 (True)
  • GPU type: NVIDIA H100 80GB HBM3

Information

  • The official example scripts
  • My own modified scripts

Reproduction

import torch
import safetensors.torch

with torch.device('cuda:0'):
    print(safetensors.torch.load_file('filename.safetensors', device='cpu')['key'].device)

Expected behavior

The expected behavior is for the device='cpu' argument to override the torch default and load the tensors on the CPU, but the actual behavior is that the tensor is loaded onto the 'cuda:0' GPU. The device='cpu' argument seems to be interpreted as "whatever the default device is right now" rather than as the actual CPU.
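
For reference, a minimal probe (assuming standard PyTorch semantics for the torch.device context manager, not safetensors internals) showing the ambient default that any allocation without an explicit device picks up:

import torch

with torch.device("cuda:0"):
    print(torch.empty(0).device)                # cuda:0 -- the ambient default wins when no device is passed
    print(torch.empty(0, device="cpu").device)  # cpu    -- an explicit device is still honored

This would be consistent with device='cpu' not being forwarded all the way down to the underlying allocations.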

@Narsil
Collaborator

Narsil commented Jul 31, 2024

I understand the concern; however, I do not think this should be fixed.

with torch.device(x) changes the default allocation device, which interferes with a lot of safetensors' internal logic (it doesn't make it wrong, it makes it highly inefficient, since you're first allocating on the device before moving back to the CPU, which is about the slowest implementation possible).
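
As a rough illustration of the round trip this causes (a sketch of the effect, not safetensors' actual internals): an allocation that picks up the ambient default lands on the GPU first, and honoring device='cpu' then costs an extra device-to-host copy.

import torch

with torch.device("cuda:0"):
    staged = torch.empty(1024, 1024)  # ambient default -> allocated on cuda:0 first
    on_cpu = staged.to("cpu")         # extra device-to-host copy just to honor device='cpu'

# Versus allocating directly where the caller asked:
direct = torch.empty(1024, 1024, device="cpu")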

The behavior could be changed by being extremely defensive about global modifiers like this context manager. However, as a general rule I'd avoid relying on torch internals, since they tend to change a lot (this context manager didn't exist when this library was created).

Also, in the future there might be ways to skip the CPU allocation entirely, meaning the internals would have to be rewritten again, introducing potential breaking changes (which we try very hard to avoid in this library).

Given all these considerations, I'm marking this as wontfix, and instead encourage you and other users not to mix contradictory device locations between a context manager and the device argument.

So:

Either

with torch.device("cuda:0"):
    # Fix the context manager, that make everything much faster.
    with torch.device("cpu"):
      weights = load_file(filename)

Or

# Remove the context manager.
weights = load_file(filename, device="cpu")

Just to note that

with torch.device("cuda:0"):
   weights = load_file(filename)

is a perfectly valid expression of intent and currently works transparently (contrary to what would happen if we were defensive about the context manager).

@Narsil Narsil added the wontfix This will not be worked on label Jul 31, 2024
@stale stale bot removed the wontfix This will not be worked on label Jul 31, 2024
@Narsil Narsil added the wontfix This will not be worked on label Aug 8, 2024
@sdake

sdake commented Sep 2, 2024

I agree with you about broken designs (context managers have all kinds of problems, and we are inventing high-performance compute here, where cycles matter), so it's better to say no.

How viable would a deprecation warning be? If a context manager is detected -> warn.
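
A minimal sketch of what such a check could look like (a hypothetical helper, not part of the safetensors API): probe the ambient default device with a tiny allocation, which picks up any active with torch.device(...) context, and warn when it conflicts with the device the caller asked for.

import warnings

import torch

def warn_if_ambient_device_conflicts(requested: str = "cpu") -> None:
    # Hypothetical check: the empty allocation picks up any active
    # `with torch.device(...)` context, revealing the ambient default.
    ambient = torch.empty(0).device
    if ambient != torch.device(requested):
        warnings.warn(
            f"device='{requested}' requested while the ambient default device is "
            f"{ambient}; tensors may end up on {ambient} instead.",
            stacklevel=2,
        )

with torch.device("cuda:0"):
    warn_if_ambient_device_conflicts("cpu")  # would emit the warning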

@sdake

sdake commented Sep 2, 2024

@chrisdesa May I offer an interpretation of what is going on here? I believe your assertion about what is happening isn't quite right.

Here are the steps:

  1. load_file is set to load to cpu.
  2. Because of legacy PyTorch behaviors, the CPU copies the model from the mmap() region into a newly allocated malloc() buffer.
  3. The context manager then executes a Tensor.to() with the desired GPU target.
  4. This causes the CPU to copy the tensor to the GPU using a host-to-device copy.

One of the problems with this flow is that the CPU is entirely responsible for pushing data around. Another problem is that the CPU executes the memcpy() operation, which is very expensive. CUDA is a client-server system (the client is the CUDA API; the server is the kernel driver, firmware, and parts of the chip control plane). It is possible to offload the memory copy from the CPU to various DMA engines. As an example, it would be trivial to execute a host-to-device copy asynchronously in a CUDA stream (a stream is a client/server instance). The cudaMemcpyAsync() API asks the GPU hardware (using DMA) to copy the tensors to the GPU.
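
A rough PyTorch-level sketch of that idea (assuming the tensors are first staged in pinned host memory; the pinning step is itself a host copy): pinned buffers plus non_blocking copies on a dedicated stream let the copy engine (cudaMemcpyAsync under the hood) move the data while the CPU is free to do other work.

import torch
from safetensors.torch import load_file

weights = load_file("filename.safetensors", device="cpu")  # placeholder filename

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    gpu_weights = {
        name: t.pin_memory().to("cuda:0", non_blocking=True)  # async H2D via the DMA copy engine
        for name, t in weights.items()
    }
copy_stream.synchronize()  # wait for the transfers to finish before using the weights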

There are more advanced techniques that could involve no host memory buffer copies at all (why copy, when you could map instead?). For more detail, see: on-demand paged loading of models in vllm

Also, @Narsil, if you are the maintainer of safetensors, I am a super fan. Very nicely done. Now if we could only eject the 6 other serialization approaches we have in vllm ;-)
