Fix edge-case contiguity mismatch for Allgatherv #1058
Conversation
Codecov Report
```diff
@@             Coverage Diff              @@
##       release/1.2.x    #1058      +/-  ##
=============================================
+ Coverage        91.76%   91.80%   +0.03%
=============================================
  Files               65       65
  Lines            10024    10075      +51
=============================================
+ Hits              9199     9249      +50
- Misses             825      826       +1
```
Thank you @ClaudiaComito
You added extra lines for arrays with different splits when the number of MPI processes is one. It doesn't make much sense to pass an argument when the split won't take place on one process. What do you think about disallowing splits, or automatically setting the value to None, when only a single process is involved at array creation time? It would save us some tests/checks.
```python
# simple case, contiguous memory can be transmitted as is
if is_contiguous is None:
    # determine local contiguity
    is_contiguous = obj.is_contiguous()
```
What happens if the value is different on the processes? How likely is it?
Thanks @mtar, that's a great question. The obvious case in which this might happen is a dimension permutation, and that is dealt with in this PR. Outside of that, we simply fall back to the previous implementation.
I could add a global check that sets is_contiguous to False if the local contiguities differ among processes.
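A minimal sketch of such a global check (an assumption, not the PR's implementation): reduce the per-process contiguity flags so that every rank agrees on a single is_contiguous value before building the Allgatherv buffers.

```python
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_chunk = torch.arange(6).reshape(2, 3)    # stand-in for the local torch tensor
local_flag = int(local_chunk.is_contiguous())  # 1 if contiguous, 0 otherwise

# MPI.MIN over the 0/1 flags acts as a logical AND across ranks:
# a single non-contiguous rank forces the strided path everywhere.
is_contiguous = bool(comm.allreduce(local_flag, op=MPI.MIN))
```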
> What do you think about disallowing splits or automatically setting the value to None when only a single process is involved at array creation time?
This is a general discussion worth having, maybe not re: this bug fix.
My main argument against setting all splits to None when running on 1 MPI process is that it would be confusing for users while they are testing their code (potentially on 1 process, or even interactively).
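For illustration, the kind of interactive single-process session meant here (assuming heat's usual array-creation API): with one process the array is not actually distributed, but the requested split attribute is preserved.

```python
import heat as ht

x = ht.arange(10, split=0)  # split along dim 0, even when run on a single process
print(x.split)              # 0 -- the requested split is kept, nothing is redistributed
print(x.larray)             # the full local torch tensor lives on this one process
```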
Anyway, let's discuss it in a separate Issue.
As far as I'm concerned, I'm done with this PR.
> I could add a global check that sets is_contiguous to False if the local contiguities differ among processes.
I've decided not to add (yet another) global check for contiguous status for now, as I can't think of an appropriate edge case to test it. We are already testing for column-first memory layout operations. If anybody can think of something, let me know.
Co-authored-by: mtar <m.tarnawa@fz-juelich.de>
Description
Function `communication.mpi_type_and_elements_of()` calculates the type and number of elements that will be sent/received in an `Allgatherv` call. How the number of elements is calculated depends on whether the input object (most likely a torch Tensor) is contiguous. In some edge cases, i.e. in case of a singleton split dimension, `torch.Tensor.is_contiguous()` might return `True` on one process while it is `False` on others. This results in a mismatch of the send/recv element counts among processes in `Allgatherv` and a resulting deadlock (see #1057).

Issue/s resolved: #1057
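The singleton-dimension edge case can be illustrated in plain torch (an assumed reproduction, not code taken from this PR): after a dimension permutation, a local chunk whose split dimension has size 1 still reports contiguous memory, while a larger chunk does not.

```python
import torch

# Two local chunks of a tensor split along dim 0: one of size 2, one of size 1.
chunk_rank0 = torch.arange(6).reshape(2, 3).permute(1, 0)  # (2, 3) -> (3, 2)
chunk_rank1 = torch.arange(3).reshape(1, 3).permute(1, 0)  # (1, 3) -> (3, 1)

print(chunk_rank0.is_contiguous())  # False: the permutation produced strided memory
print(chunk_rank1.is_contiguous())  # True: the size-1 dimension leaves the layout unchanged

# If each rank derived its Allgatherv element counts from its own flag,
# the counts would disagree across ranks and the collective would hang.
```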
Changes proposed:
- `is_contiguous` boolean as kwarg for `communication.as_buffer()` and `communication.mpi_type_and_elements_of()`. It is set to False on all processes if the object's dimensions have been permuted, independently of the size of the dimension (see the sketch after this list).
- Changes in `dndarray.resplit_()`, `manipulations.resplit()`, `linalg.matmul()`, and tests.
- In `test_suites.basic_tests.TestCase.assert_array_equal`, local tensors are compared to the relevant slices of the numpy reference array, instead of gathering the distributed DNDarray every time. TODO: the same should be implemented in "Avoid unnecessary `gather`ing of distributed operand in mixed distributed/non-distributed `logical` functions" (#1064).
- `ht.allclose` now works on operands with different dtypes as well. Related to "ht.allclose should factor in the dtype of the inputs when determining the limits" (#889).
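A simplified, hypothetical sketch of the kwarg pattern from the first bullet above (not heat's actual signature or implementation): the caller that performed the permutation overrides the per-process contiguity detection so that every rank takes the same code path.

```python
import torch

def mpi_type_and_elements_of_sketch(obj, is_contiguous=None):
    # hypothetical stand-in for the real function; names and return value are illustrative
    if is_contiguous is None:
        # no caller decision: fall back to the local (per-process) check
        is_contiguous = obj.is_contiguous()
    # the real function would build the MPI datatype and element counts here;
    # this sketch only reports which code path would be taken
    return "contiguous path" if is_contiguous else "strided (vector datatype) path"

local = torch.arange(6).reshape(2, 3).permute(1, 0)
print(mpi_type_and_elements_of_sketch(local))                       # decided per process
print(mpi_type_and_elements_of_sketch(local, is_contiguous=False))  # forced by the caller
```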
Type of change
Memory requirements
NA
Performance
Due Diligence
Does this change modify the behaviour of other functions? If so, which?
no