[BUG] PT: display_if_exist is blocking #3991

Closed
njzjz opened this issue Jul 17, 2024 · 0 comments · Fixed by #3992

njzjz commented Jul 17, 2024

Bug summary

The profiler shows that cudaStreamSynchronize happens in display_if_exist.

[Image: profiler screenshot]

In display_if_exist, find_property is expected to be a float. However, it is actually tensor(1., device='cuda:0'), a float32 tensor on the GPU, so calling bool() on it copies the value to the host and forces the synchronization.

@staticmethod
def display_if_exist(loss: torch.Tensor, find_property: float) -> torch.Tensor:
    """Display NaN if labeled property is not found.

    Parameters
    ----------
    loss : torch.Tensor
        the loss tensor
    find_property : float
        whether the property is found
    """
    return loss if bool(find_property) else torch.nan
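
To illustrate the problem, here is a minimal sketch of one way to avoid the host-side `bool()` when the flag is a tensor. It is not the fix adopted in the linked pull request, which instead makes the `find_` values floats when the data is loaded. The function name `select_loss_on_device` is hypothetical, and unlike the original it returns a NaN tensor rather than the Python float `torch.nan`.

```python
import torch


def select_loss_on_device(loss: torch.Tensor, find_property: torch.Tensor) -> torch.Tensor:
    """Return `loss` if the flag is non-zero, else NaN, without calling bool().

    Hypothetical helper for illustration only. torch.where runs as an
    ordinary device kernel, so the flag is never copied to the host.
    """
    nan = torch.full_like(loss, float("nan"))
    # `.to(loss.device)` is a no-op when the flag already lives on that device.
    return torch.where(find_property.to(loss.device) > 0, loss, nan)


# Falls back to the CPU when no GPU is present, so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loss = torch.tensor(2.5, device=device)
found = select_loss_on_device(loss, torch.tensor(1.0, device=device))
missing = select_loss_on_device(loss, torch.tensor(0.0, device=device))
```

Keeping the flag as a plain Python float from the start, as the eventual fix does, is simpler because `bool()` on a Python float never touches the device.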

DeePMD-kit Version

0c0878e

Backend and its version

PyTorch 2.3.1

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Use examples/water/se_atten_compressible to debug.

Steps to Reproduce

cd examples/water/se_atten_compressible
dp --pt train input.json

Further Information, Files, and Links

No response

njzjz added the bug label Jul 17, 2024
njzjz changed the title from "[BUG] display_if_exist is blocking" to "[BUG] PT: display_if_exist is blocking" Jul 17, 2024
njzjz linked a pull request Jul 18, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Jul 18, 2024
make 'find_' to be float in get data, fix #3991 .

On my device, the profiler indicates that `cudaStreamSynchronize` takes
negligible time, resulting in minimal speedup.

## Summary by CodeRabbit

- **New Features**
  - Enhanced data loading by adding a `collate_fn` parameter for more flexible data collation.
  - Improved data filtering by excluding keys containing "find_" in addition to existing filters.

njzjz closed this as completed Jul 18, 2024
mtaillefumier pushed a commit to mtaillefumier/deepmd-kit that referenced this issue Sep 18, 2024
make 'find_' to be float in get data, fix deepmodeling#3991 .
