Implement distributed unfold operation #1419

Conversation

@FOsterfeld (Member) commented Apr 2, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • documentation updated where needed

Description

Add the function unfold to the available manipulations. For a DNDarray a, unfold(a, dimension, size, step) behaves like torch.Tensor.unfold.

Example:

>>> x = ht.arange(1., 8)
>>> x
DNDarray([1., 2., 3., 4., 5., 6., 7.], dtype=ht.float32, device=cpu:0, split=None)
>>> ht.unfold(x, 0, 2, 1)
DNDarray([[1., 2.],
          [2., 3.],
          [3., 4.],
          [4., 5.],
          [5., 6.],
          [6., 7.]], dtype=ht.float32, device=cpu:0, split=None)
>>> ht.unfold(x, 0, 2, 2)
DNDarray([[1., 2.],
          [3., 4.],
          [5., 6.]], dtype=ht.float32, device=cpu:0, split=None)
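
For the distributed case, a minimal usage sketch. This is a hypothetical example: the split=0 input and the resulting global shape are assumptions for illustration, and the split of the returned array depends on the implementation.

# Hypothetical sketch: unfold on a DNDarray distributed along dimension 0.
# Run with several MPI processes, e.g. mpirun -n 2 python unfold_example.py
import heat as ht

x = ht.arange(1.0, 8, split=0)   # same values as above, but chunked across processes
y = ht.unfold(x, 0, 2, 1)        # sliding windows of size 2 with step 1
print(y.shape)                   # expected global shape: (6, 2), matching the example above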

Issue/s resolved: #1400

Changes proposed:

Type of change

  • New feature (non-breaking change which adds functionality)

Memory requirements

Performance

Does this change modify the behaviour of other functions? If so, which?

no

github-actions bot commented Apr 2, 2024

Thank you for the PR!
codecov bot commented Apr 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.07%. Comparing base (ef97474) to head (c03db4c).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1419      +/-   ##
==========================================
+ Coverage   92.04%   92.07%   +0.02%     
==========================================
  Files          83       83              
  Lines       12113    12144      +31     
==========================================
+ Hits        11150    11181      +31     
  Misses        963      963              
Flag Coverage Δ
unit 92.07% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

@mrfh92 (Collaborator) commented Apr 3, 2024

The tests on the CUDA runner seem to hang at test_manipulations.py with 5 MPI processes.
This also happens locally on my machine, so there seems to be an error in unfold that results in hanging (most likely an MPI deadlock?).
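
For illustration only: a hang like this typically comes from a blocking exchange that only some ranks reach. A minimal, self-contained sketch of that pattern, not the code from this PR:

# Hypothetical sketch of a classic MPI deadlock pattern, not code from this PR:
# a blocking recv whose matching send is skipped on some ranks hangs the whole job.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

send_needed = rank % 2 == 0   # stand-in for a data-dependent condition

if rank < nprocs - 1:
    halo = comm.recv(source=rank + 1, tag=0)   # every rank expects a halo from the right ...
if rank > 0 and send_needed:                   # ... but the send is guarded locally,
    comm.send([rank], dest=rank - 1, tag=0)    # so the unmatched recv blocks forever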


@mrfh92 (Collaborator) commented Apr 15, 2024

On the Terrabyte cluster, using 8 processes on 2 nodes with 4 GPUs each, I get the following error:

ERROR: test_unfold (heat.core.tests.test_manipulations.TestManipulations)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dss/dsshome1/03/di93zek/heat/heat/core/tests/test_manipulations.py", line 3775, in test_unfold
    ht.unfold(x, 0, min_chunk_size, min_chunk_size + 1)  # no fully local unfolds on some nodes
  File "/dss/dsshome1/03/di93zek/heat/heat/core/manipulations.py", line 4272, in unfold
    ret_larray = torch.cat((unfold_loc, unfold_halo), dimension)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)

----------------------------------------------------------------------
Ran 32 tests in 26.574s

On CPU, everything seems to work (at least in test_manipulations.py).
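
For reference, the usual remedy for this kind of mismatch is to move the received halo onto the device of the local chunk before concatenating. A minimal sketch using the variable names from the traceback; the actual fix in the PR may differ:

import torch

def cat_with_halo(unfold_loc: torch.Tensor, unfold_halo: torch.Tensor, dimension: int) -> torch.Tensor:
    # Halo data received over MPI may arrive as a CPU tensor even when the local
    # chunk lives on the GPU; align devices before concatenating.
    unfold_halo = unfold_halo.to(unfold_loc.device)
    return torch.cat((unfold_loc, unfold_halo), dimension)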

@FOsterfeld (Member, Author) commented

@mrfh92 I have now added the error for the case that size=1. I could also verify that the synchronization errors that caused data corruption no longer occur, so this PR should be ready for merging.
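
As an illustration of the size=1 guard mentioned above, a hypothetical sketch; the exact condition and error message in the merged code may differ:

# Hypothetical sketch of the argument check described above, not the exact code from the PR.
def _check_unfold_args(size: int, step: int) -> None:
    if size <= 1:
        raise ValueError(f"size must be greater than 1, got {size}")
    if step < 1:
        raise ValueError(f"step must be at least 1, got {step}")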

Undid my stupid change before that belongs to another issue

@mrfh92 (Collaborator) commented Aug 13, 2024

@FOsterfeld from my point of view this now looks fine.
@ClaudiaComito do you agree?

mrfh92 previously approved these changes Aug 13, 2024

@mrfh92 (Collaborator) left a comment

Looks fine from my point of view. @FOsterfeld Thanks 👍

heat/core/dndarray.py (review comments resolved, outdated)
@ClaudiaComito (Contributor) left a comment

@FOsterfeld @mrfh92 this looks great, I only found some (presumably) dead code that can be removed, otherwise I think it can be merged. Thanks a lot!

heat/core/tests/test_manipulations.py (review comments resolved, outdated)
heat/core/dndarray.py (review comments resolved, outdated)

@ClaudiaComito (Contributor) left a comment

Great job @FOsterfeld!

@ClaudiaComito added the labels merge queue and enhancement (New feature or request) on Aug 19, 2024

@ClaudiaComito changed the title from "Features/1400 implement unfold operation similar to torch tensor unfold" to "Implement distributed unfold operation" on Aug 19, 2024

@mtar merged commit 2ecf597 into main on Aug 19, 2024 (9 checks passed).
@mtar deleted the features/1400-Implement_unfold-operation_similar_to_torch_Tensor_unfold branch on August 19, 2024 at 09:50.
Labels: merge queue, enhancement (New feature or request)
Projects: None yet

Successfully merging this pull request may close these issues:
Implement unfold-operation similar to torch.Tensor.unfold (#1400)

4 participants