
get_max_memory() returns allocated memory for XPU instead of total device memory #2929

Closed
dvrogozh opened this issue Jul 12, 2024 · 5 comments · Fixed by #3275
Labels
wip Work in progress

Comments


dvrogozh commented Jul 12, 2024

Here:

max_memory[i] = torch.xpu.max_memory_allocated(i)

XPU is queried for the peak allocated memory, while other devices, for example NPU, are queried for the free device memory:

max_memory[i] = torch.npu.mem_get_info(i)[0]

This seems to be a bug. However, I believe that mem_get_info() is not currently supported by the XPU backend in PyTorch (as of pytorch/pytorch@3477ee3) and needs to be requested.

I would also like to note that pytorch/pytorch#129919 will provide an implementation of torch.xpu.max_memory_allocated(). For me, on an idle device it returned 512 bytes, which caused an issue running HF models with pipeline(device_map="auto"): the model was dispatched to CPU instead of XPU with this printout (see huggingface/transformers#31922 for details):

/home/gta/git/huggingface/accelerate/src/accelerate/utils/modeling.py:1399: UserWarning: Current model requires 4096 bytes of buffer for offloaded layers, which seems does not fit any GPU's remaining memory. If you are experiencing a OOM later, please consider using offload_buffers=True.
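The failure mode behind that warning can be illustrated with plain arithmetic. This is a hypothetical sketch, not Accelerate's actual dispatch code: if the per-device memory reported to the dispatcher comes from max_memory_allocated() (about 512 bytes on an idle device) instead of the free memory from mem_get_info(), no GPU appears able to hold even the 4096-byte buffer, so the model falls back to CPU. The function name and numbers below are illustrative only.

```python
# Toy stand-in for the device-selection logic described in the report.
# Numbers mirror the thread: an idle XPU where max_memory_allocated()
# returns 512 bytes, while the device actually has gigabytes free.

REQUIRED_BUFFER = 4096  # bytes needed for offloaded layers (from the warning)

def pick_device(reported_device_memory: int, required: int = REQUIRED_BUFFER) -> str:
    """Use the GPU only if the reported memory can hold the required
    buffer; otherwise fall back to CPU, as the dispatcher did here."""
    return "xpu" if reported_device_memory >= required else "cpu"

# Buggy path: memory reported via max_memory_allocated() on an idle device.
print(pick_device(512))            # -> "cpu"  (model wrongly offloaded)

# Fixed path: memory reported via mem_get_info()[0], i.e. free bytes.
print(pick_device(8 * 1024**3))    # -> "xpu"
```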

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5 @sywangyi @yao-matrix
CC: @muellerzr @SunMarc

@dvrogozh

However, I believe that mem_get_info() is not currently supported by XPU backend in pytorch and needs to be requested.

Filed request in pytorch/pytorch#130599


SunMarc commented Jul 19, 2024

Indeed, thanks for the report! Keep us updated when this is fixed @dvrogozh! cc @abhilash1910


abhilash1910 commented Jul 19, 2024

Thanks @SunMarc for the ping. I believe that when an XPU is present, it should report the 0th device's memory parameters, but I think this may be due to this commit (this was seen before): 30cb7ec
@faaany could you take a look at this?
I agree with @dvrogozh that the mem_get_info() API is needed.
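For reference, on CUDA the mem_get_info(device) call the thread is asking for returns a (free, total) tuple of bytes, whereas max_memory_allocated() returns the peak bytes allocated by the caching allocator. The sketch below uses a fake backend (so it runs without any XPU hardware or the new PyTorch API) purely to illustrate why the two numbers are not interchangeable; all names and sizes here are hypothetical.

```python
# Illustrate the difference between a (free, total) memory query and a
# peak-allocated counter, using a stand-in object instead of torch.xpu.
from types import SimpleNamespace

TOTAL = 16 * 1024**3   # hypothetical 16 GiB device
ALLOCATED_PEAK = 512   # what an idle device may report as "max allocated"

fake_xpu = SimpleNamespace(
    # mem_get_info-style query: (free bytes, total bytes)
    mem_get_info=lambda i=0: (TOTAL - ALLOCATED_PEAK, TOTAL),
    # max_memory_allocated-style counter: peak allocator usage only
    max_memory_allocated=lambda i=0: ALLOCATED_PEAK,
)

free, total = fake_xpu.mem_get_info(0)
print(free)                               # bytes actually available
print(fake_xpu.max_memory_allocated(0))   # 512 -- useless as a capacity estimate
```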


faaany commented Jul 29, 2024

Hi @abhilash1910, the issue mentioned by @dvrogozh is a known issue and is not related to my commit.

@muellerzr muellerzr added the wip Work in progress label Aug 21, 2024
dvrogozh added a commit to dvrogozh/accelerate that referenced this issue Dec 6, 2024
The torch.xpu.mem_get_info API is available starting from PyTorch 2.6 (and
in nightly 2.6.0.dev20241206+xpu or later). To work properly, this method
requires PyTorch built with a SYCL runtime that supports the API to query
device memory stats; if it is not available, an exception will be raised.

Requires: pytorch/pytorch#141230
Fixes: huggingface#2929
Fixes: huggingface/transformers#31922
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
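Since the commit message above notes that the API only exists from PyTorch 2.6 and can raise when the SYCL runtime lacks memory-stat support, a caller needs to guard for both cases. The following is a hedged sketch of that guard pattern, not the actual Accelerate patch; it takes the backend module as a parameter (standing in for torch.xpu) so it runs without XPU hardware, and the helper name is invented for illustration.

```python
# Guard pattern for an optionally-available, possibly-raising backend API.
from types import SimpleNamespace

def xpu_total_free_memory(xpu_module, device: int = 0):
    """Return free bytes for `device`, or None when the API is unavailable.
    `xpu_module` stands in for torch.xpu so the sketch runs anywhere."""
    if not hasattr(xpu_module, "mem_get_info"):
        return None                      # older PyTorch: API not present
    try:
        return xpu_module.mem_get_info(device)[0]  # (free, total) -> free
    except RuntimeError:
        return None                      # SYCL runtime without memory stats

# Demo with stand-in backends:
new_backend = SimpleNamespace(mem_get_info=lambda i=0: (1024, 2048))
old_backend = SimpleNamespace()          # no mem_get_info attribute at all
print(xpu_total_free_memory(new_backend))  # 1024
print(xpu_total_free_memory(old_backend))  # None
```

Returning None (rather than letting the exception escape) lets a caller fall back to another memory estimate or to CPU dispatch.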

dvrogozh commented Dec 6, 2024

The torch.xpu.mem_get_info() API landed in PyTorch this week (through pytorch/pytorch#141230), making it into the upcoming PyTorch 2.6 release. Here is a corresponding fix on the Accelerate side which addresses the issue:

For IPEX, the API became available earlier, and Accelerate was already adjusted to cover this case in 4b4c036.

dvrogozh added a commit to dvrogozh/accelerate that referenced this issue Dec 9, 2024