
Fix edge-case contiguity mismatch for Allgatherv #1058

Merged: 27 commits merged into release/1.2.x from bugs/#1057-Allgatherv-contiguity-mismatch on Jan 19, 2023

Conversation

@ClaudiaComito (Contributor) commented on Dec 12, 2022

Description

The function communication.mpi_type_and_elements_of() calculates the MPI datatype and the number of elements that will be sent/received in an Allgatherv call. How the number of elements is calculated depends on whether the input object (most likely a torch.Tensor) is contiguous in memory.

In some edge cases, e.g. when the local shard has a singleton split dimension, torch.Tensor.is_contiguous() may return True on one process while it returns False on others. This results in a mismatch of the send/recv element counts among processes in Allgatherv and in a deadlock (see #1057), as in the sketch below.
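
A minimal illustration with plain torch (hypothetical shards, not code from this PR): torch treats size-1 dimensions as satisfying any stride, so a transposed shard with a singleton split dimension reports contiguous while a larger shard of the same array does not.

import torch

# Hypothetical local shards of one array, split along the last dimension and then
# transposed; rank 1 happens to hold only a single slice along the split dimension.
shard_rank0 = torch.arange(8.0).reshape(2, 4).T  # local shape (4, 2)
shard_rank1 = torch.arange(4.0).reshape(1, 4).T  # local shape (4, 1), singleton split dim

print(shard_rank0.is_contiguous())  # False: strides do not match a contiguous layout
print(shard_rank1.is_contiguous())  # True: size-1 dimensions satisfy any stride
# Element counts derived from these mismatched flags lead to the Allgatherv deadlock in #1057.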

Issue/s resolved: #1057

Changes proposed:

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Memory requirements

NA

Performance

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Title of PR is suitable for corresponding CHANGELOG entry

Does this change modify the behaviour of other functions? If so, which?

no


@ClaudiaComito ClaudiaComito added communication bug Something isn't working labels Dec 12, 2022
@ClaudiaComito ClaudiaComito added this to the 1.2.2 milestone Dec 12, 2022
@codecov (bot) commented on Dec 23, 2022

Codecov Report

Merging #1058 (d62a64d) into release/1.2.x (6158fa9) will increase coverage by 0.03%.
The diff coverage is 94.28%.

@@                Coverage Diff                @@
##           release/1.2.x    #1058      +/-   ##
=================================================
+ Coverage          91.76%   91.80%   +0.03%     
=================================================
  Files                 65       65              
  Lines              10024    10075      +51     
=================================================
+ Hits                9199     9249      +50     
- Misses               825      826       +1     
Flag Coverage Δ
unit 91.80% <94.28%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
heat/core/linalg/basics.py 95.44% <85.71%> (-0.09%) ⬇️
heat/core/communication.py 96.23% <100.00%> (+0.02%) ⬆️
heat/core/dndarray.py 96.79% <100.00%> (+0.01%) ⬆️
heat/core/logical.py 100.00% <100.00%> (ø)
heat/core/manipulations.py 98.63% <100.00%> (-0.01%) ⬇️
heat/core/tests/test_suites/basic_test.py 98.07% <100.00%> (+0.01%) ⬆️
heat/core/constants.py 100.00% <0.00%> (ø)
heat/core/trigonometrics.py 100.00% <0.00%> (ø)
heat/core/random.py 99.67% <0.00%> (+<0.01%) ⬆️
... and 4 more


@ClaudiaComito ClaudiaComito marked this pull request as ready for review December 23, 2022 12:55
@mtar (Collaborator) left a comment

Thank you @ClaudiaComito

You added extra lines for arrays with different splits when the number of MPI processes is one. It doesn't make much sense to pass a split argument when no split will take place on a single process. What do you think about disallowing splits, or automatically setting the value to None, when only a single process is involved at array creation time? It would save us some tests/checks.

heat/core/communication.py (three outdated review threads, all resolved)
# simple case, contiguous memory can be transmitted as is
if is_contiguous is None:
    # determine local contiguity
    is_contiguous = obj.is_contiguous()
@mtar (Collaborator) commented on this snippet:

What happens if the value differs across processes? How likely is that?

@ClaudiaComito (Contributor, Author) replied:

Thanks @mtar, that's a great question. The obvious case in which this can happen is a permutation, and that is dealt with in this PR. Outside of that, we simply fall back to the previous implementation.

I could add a global check that sets is_contiguous to False if the local contiguity flags are not homogeneous across processes (see the sketch below).
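
A rough sketch of what such a global check could look like, using mpi4py directly; the function name and signature are hypothetical and not part of this PR:

from mpi4py import MPI
import torch

def globally_contiguous(local_tensor: torch.Tensor, comm=MPI.COMM_WORLD) -> bool:
    # Only treat the data as contiguous if every process agrees; a single
    # non-contiguous shard forces the non-contiguous code path on all ranks.
    return comm.allreduce(local_tensor.is_contiguous(), op=MPI.LAND)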

@ClaudiaComito (Contributor, Author) replied:

> Thank you @ClaudiaComito
>
> You added extra lines for arrays with different splits when the number of MPI processes is one. It doesn't make much sense to pass a split argument when no split will take place on a single process. What do you think about disallowing splits, or automatically setting the value to None, when only a single process is involved at array creation time? It would save us some tests/checks.

This is a general discussion worth having, maybe not re: this bug fix.
My main argument against setting all splits to None when running on 1 MPI process is that it would be confusing for users while they are testing their code (potentially on 1 process, or even interactively).

Anyway, let's discuss it in a separate Issue.

As far as I'm concerned, I'm done with this PR.

@ClaudiaComito (Contributor, Author) added:

> Thanks @mtar, that's a great question. The obvious case in which this can happen is a permutation, and that is dealt with in this PR. Outside of that, we simply fall back to the previous implementation.
>
> I could add a global check that sets is_contiguous to False if the local contiguity flags are not homogeneous across processes.

I've decided not to add (yet another) global check for contiguous status for now, as I can't think of an appropriate edge case to test it. We are already testing column-first memory layout operations. If anybody can think of something, let me know.
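
For context, a sketch of the kind of column-first memory layout operation the tests exercise; the shapes and arrays here are made up for illustration and not taken from the test suite:

import heat as ht

a = ht.random.randn(7, 3, split=0)
b = ht.random.randn(7, 5, split=0)
# a.T is a column-first, non-contiguous view; matmul on split, transposed
# operands may exercise the Allgatherv path touched by this PR.
c = ht.matmul(a.T, b)
print(c.shape)  # (3, 5)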

heat/core/linalg/tests/test_basics.py (outdated review thread, resolved)
ClaudiaComito and others added 5 commits January 10, 2023 07:28
Co-authored-by: mtar <m.tarnawa@fz-juelich.de>
Co-authored-by: mtar <m.tarnawa@fz-juelich.de>
Co-authored-by: mtar <m.tarnawa@fz-juelich.de>
@ClaudiaComito ClaudiaComito merged commit 73e6204 into release/1.2.x Jan 19, 2023
@mtar mtar deleted the bugs/#1057-Allgatherv-contiguity-mismatch branch February 28, 2024 10:22
Labels: bug (Something isn't working), communication
Projects: none yet
2 participants