
Fixed initialization of DNDarrays communicator in some routines #1075

Conversation

@AsRaNi1 (Contributor) commented Jan 16, 2023

Description

Previously, in some HeAT routines, e.g. matmul, new DNDarrays (usually the output of the routine) were initialized without properly specifying a communicator.
In matmul, the output array c = a @ b was initialized as

c = factories.zeros(c_shape, split=a.split, dtype=c_type, device=a.device)

I have added `comm=a.comm` here and at other comparable places.
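
With the fix, the call reads:

c = factories.zeros(c_shape, split=a.split, dtype=c_type, device=a.device, comm=a.comm)
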
Issue/s resolved: #1074

Changes proposed:

Fixed the error previously raised by the code snippet in #1074.

Type of change

Bug fix

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Title of PR is suitable for corresponding CHANGELOG entry

Does this change modify the behaviour of other functions? If so, which?

no

@AsRaNi1 (Contributor, Author) commented Jan 18, 2023

@ClaudiaComito I don't understand how to initialize codecov.

Markus-Goetz previously approved these changes Jan 23, 2023

@Markus-Goetz (Member):

@AsRaNi1 thank you for the addition to the codebase. Do you mind fixing the merge conflicts?

@ClaudiaComito (Contributor):

@AsRaNi1 indeed, thank you so much.

We have been discussing the issue of activating codecov for external/first-time contributors; @mtar is looking into it.

In the meantime:

  • There are many more instances in the codebase where factories.array() is called without specifying the communicator. I know the issue didn't mention those, but would you consider fixing those as well? It would be awesome.

  • How do we test this fix? @mrfh92 @Markus-Goetz

Thanks all!

@mrfh92 (Collaborator) commented Jan 23, 2023

  • As further examples: I think percentile and resplit have the same problem.
  • I think testing is not difficult (from a conceptual point of view): in principle, it suffices to also run any routine that has been tested on the default communicator MPI.COMM_WORLD on a subcommunicator. Unfortunately, this would roughly double the CI time if one just adds a second test on roughly "half of MPI.COMM_WORLD" (or something similar)... Otherwise, one could replace the tests on MPI.COMM_WORLD with tests on a subcommunicator only. The latter should suffice to catch all potential errors, but I don't like the idea of not testing the most important case explicitly...

@ClaudiaComito (Contributor):

  • As further examples: I think percentile and resplit have the same problem.

    • I think testing is not difficult (from a conceptual point of view): in principle, it suffices to also run any routine that has been tested on the default communicator MPI.COMM_WORLD on a subcommunicator. Unfortunately, this would roughly double the CI time if one just adds a second test on roughly "half of MPI.COMM_WORLD" (or something similar)... Otherwise, one could replace the tests on MPI.COMM_WORLD with tests on a subcommunicator only. The latter should suffice to catch all potential errors, but I don't like the idea of not testing the most important case explicitly...

I was thinking more of a test in which some operation is performed on a subset of the available processes.

@mrfh92 (Collaborator) commented Jan 24, 2023

This is exactly what I meant, but maybe I expressed it in a somewhat lengthy and unclear way.

@ClaudiaComito (Contributor):

This is exactly what I meant, but maybe I expressed it in a somewhat lengthy and unclear way.

No problem, my main point is that we don't need to test this on every operation. For example, in the test_communication module we could have a test_subcomm function that runs a matrix multiplication on 2 processes only.
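
Something along these lines, as a rough sketch (hypothetical code: it assumes mpi4py's Split and heat's MPICommunication wrapper, and the name test_subcomm is illustrative only):

    # Sketch: run a matmul on a 2-process subcommunicator and check that
    # the result inherits that subcommunicator instead of COMM_WORLD.
    from mpi4py import MPI
    import heat as ht
    from heat.core.communication import MPICommunication

    def test_subcomm():
        world = MPI.COMM_WORLD
        # ranks 0 and 1 form the subcommunicator, all other ranks drop out
        color = 0 if world.rank < 2 else MPI.UNDEFINED
        sub = world.Split(color, key=world.rank)
        if sub != MPI.COMM_NULL:
            comm = MPICommunication(sub)
            a = ht.ones((4, 4), split=0, comm=comm)
            b = ht.ones((4, 4), split=1, comm=comm)
            c = a @ b
            # before this fix, c silently fell back to the default communicator
            assert c.comm is a.comm
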

@AsRaNi1 (Contributor, Author) commented Jan 24, 2023

@AsRaNi1 indeed, thank you so much.

We have been discussing the issue of activating codecov for external/first-time contributors; @mtar is looking into it.

In the meantime:

  • There are many more instances in the codebase where factories.array() is called without specifying the communicator. I know the issue didn't mention those, but would you consider fixing those as well? It would be awesome.
  • How do we test this fix? @mrfh92 @Markus-Goetz

Thanks all!

Yes, sure!
I'll look into it.
Thank you!

@AsRaNi1 (Contributor, Author) commented Feb 4, 2023

@Markus-Goetz please review this PR.

Markus-Goetz previously approved these changes Feb 6, 2023
@AsRaNi1 (Contributor, Author) commented Feb 13, 2023

heat/nn/data_parallel.py:244 in private method `_nonblocking_hook`:
        D202: No blank lines allowed after function docstring (found 1)
heat/nn/data_parallel.py:282 in private method `_forward_hook`:
        D202: No blank lines allowed after function docstring (found 1)

The above error comes up in pre-commit.ci while testing. I tried to make the changes in data_parallel.py, but black reformats the file and then pydocstyle throws the error.

@ClaudiaComito (Contributor):

heat/nn/data_parallel.py:244 in private method `_nonblocking_hook`:
        D202: No blank lines allowed after function docstring (found 1)
heat/nn/data_parallel.py:282 in private method `_forward_hook`:
        D202: No blank lines allowed after function docstring (found 1)

The above error comes up in pre-commit.ci while testing. I tried to make the changes in data_parallel.py, but black reformats the file and then pydocstyle throws the error.

@AsRaNi1 #1103 is in the queue and will be merged into main when ready. Please update your branch afterwards; at least the pre-commit failure should then be fixed. Thanks again, @Mystic-Slice!

@AsRaNi1 (Contributor, Author) commented Feb 15, 2023

heat/nn/data_parallel.py:244 in private method `_nonblocking_hook`:
        D202: No blank lines allowed after function docstring (found 1)
heat/nn/data_parallel.py:282 in private method `_forward_hook`:
        D202: No blank lines allowed after function docstring (found 1)

The above error comes up in pre-commit.ci while testing. I tried to make the changes in data_parallel.py, but black reformats the file and then pydocstyle throws the error.

@AsRaNi1 #1103 is in the queue and will be merged into main when ready. Please update your branch afterwards; at least the pre-commit failure should then be fixed. Thanks again, @Mystic-Slice!

I have updated the branch.

@github-actions:

Thank you for the PR!

@mrfh92 (Collaborator) commented Mar 27, 2023

Unfortunately, the detailed CI results are not available to external contributors, so I'm posting the result here.
The CI found an error, most likely in resplit:

>           new_arr = factories.array(
                gathered, is_split=axis, device=arr.device, arr=arr.comm, dtype=arr.dtype
            )
E           TypeError: array() got an unexpected keyword argument 'arr'
heat/core/manipulations.py:3387: TypeError

ClaudiaComito added this to the 1.3.0 milestone Mar 29, 2023

@AsRaNi1 (Contributor, Author) commented Mar 30, 2023

Unfortunately, the detailed CI results are not available to external contributors, so I'm posting the result here. The CI found an error, most likely in resplit:

>           new_arr = factories.array(
                gathered, is_split=axis, device=arr.device, arr=arr.comm, dtype=arr.dtype
            )
E           TypeError: array() got an unexpected keyword argument 'arr'
heat/core/manipulations.py:3387: TypeError

I'll make the changes!

changed `arr=arr.comm` to `comm=arr.comm`
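
With that change, the failing call in heat/core/manipulations.py becomes:

    new_arr = factories.array(
        gathered, is_split=axis, device=arr.device, comm=arr.comm, dtype=arr.dtype
    )
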

@mrfh92 (Collaborator) commented May 22, 2023

@mtar The CI failed for AMD and CUDA, and I don't know what the problem is because no heat-related errors are reported... Is it something regarding the CI itself?


@mtar (Collaborator) commented May 24, 2023

@mtar The CI failed for AMD and CUDA, and I don't know what the problem is because no heat-related errors are reported... Is it something regarding the CI itself?

Fixed.

@mrfh92 (Collaborator) left a review:

From my point of view, the changes are fine. Since the CI passes as well, I recommend merging.

@mrfh92 (Collaborator) commented May 25, 2023

@ClaudiaComito From my point of view, this PR is ready to merge. Do you agree?

ClaudiaComito changed the title from "Fixed initialization of DNDarrays in some routines" to "Fixed initialization of DNDarrays communicator in some routines" May 25, 2023
@ClaudiaComito (Contributor) commented May 25, 2023

@ClaudiaComito From my point of view, this PR is ready to merge. Do you agree?

@mrfh92 Yes, please go ahead as soon as the CI is through. Congrats @AsRaNi1 and thanks for your contribution!

mrfh92 merged commit 5fd3af5 into helmholtz-analytics:main May 25, 2023
@mrfh92 (Collaborator) commented May 25, 2023

@AsRaNi1 Thanks for your work! :)

Successfully merging this pull request may close the following issue:

[Bug]: Communicator not properly initialized when creating new DNDarrays in some routines (#1074)