ADIOS2's MPI communication questions #3741

Closed
liangwang0734 opened this issue Aug 5, 2023 · 8 comments

@liangwang0734

liangwang0734 commented Aug 5, 2023

Dear ADIOS2 devs/community, the following questions are mostly regarding a bug in our user code that uses BP5; they may or may not be related to ADIOS2, but I hope you may kindly offer some thoughts.

First, does ADIOS2 use MPI_COMM_WORLD anywhere during BP5 IO, or does it always stick to its own communicator? Does it use zero as a tag in many places?

Recently, in our code that uses ADIOS2 BP5 as the output file format, we encountered a strange bug.

  • When we use BP5, we have to change the tag of the MPI_Irecv/MPI_Isend calls in one particular part of our user code to a nonzero number; otherwise, we get somewhat random failures in some MPI calls (we have tested mostly with very large or "strange" numbers as the tag, and we have also checked that the user code's Isend/Irecv pairs and data sizes look OK).
  • When we disable output or use HDF5 for output, no problem occurs.

Though it sounds extremely unlikely, I still would like to get any of your thoughts on possible conflicts between users' MPI calls and the internal communication within ADIOS2/BP5.

Thank you very much.

@pnorbert
Contributor

pnorbert commented Aug 5, 2023 via email

@liangwang0734
Author

liangwang0734 commented Aug 5, 2023

Hi @pnorbert, thank you for your prompt reply. I will investigate further following your suggestions.

May I also know where MPI_Irecv/MPI_Isend are called during BP5 output? I tried to add prints in source/adios2/helper/adiosCommMPI.cpp but the Irecv and Isend functions do not seem to be called there.

Also, for a quick check, is there a way to change a "master" value for the tags?

@liangwang0734
Author

@pnorbert Using the NULL engine, the run seems fine.

@pnorbert
Contributor

pnorbert commented Aug 6, 2023

Interesting. Even though I don't understand why the tag 0 should matter when using different communicators, you may try to change BP5 tags in the Isend/Recv pairs in

a->m_AggregatorChainComm.Recv(

Change the 0 tag in a->m_AggregatorChainComm.Isend(...) and a->m_AggregatorChainComm.Recv(...) calls in this file and rebuild adios. There are 2 instances of each of them in this file. This file corresponds to the default aggregation mode in BP5.
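
As background to the point about tag 0 and separate communicators: MPI matches a receive to a message by communicator, source rank, and tag together, so equal tags on different communicators should not collide. Below is a minimal sketch of that rule as a toy model in plain C++. `Envelope`, `Mailbox`, `kAnySource`, `kAnyTag`, and the context numbers are hypothetical names made up for illustration; nothing here is MPI or ADIOS2 code.

```cpp
#include <cassert>
#include <deque>
#include <optional>

// Toy model only: these types are hypothetical and are not MPI or ADIOS2 APIs.
struct Envelope {
    int context; // stands in for the communicator the message was sent on
    int source;  // rank of the sender
    int tag;     // user-chosen message tag
    int payload; // the data carried by the message
};

constexpr int kAnySource = -1; // plays the role of MPI_ANY_SOURCE
constexpr int kAnyTag = -1;    // plays the role of MPI_ANY_TAG

class Mailbox {
public:
    void send(const Envelope& e) { pending_.push_back(e); }

    // Returns the payload of the first pending message whose envelope
    // matches the posted receive, or nothing if no message matches.
    std::optional<int> recv(int context, int source, int tag) {
        for (auto it = pending_.begin(); it != pending_.end(); ++it) {
            const bool sameContext = it->context == context; // no wildcard exists
            const bool sourceOk = source == kAnySource || it->source == source;
            const bool tagOk = tag == kAnyTag || it->tag == tag;
            if (sameContext && sourceOk && tagOk) {
                const int payload = it->payload;
                pending_.erase(it);
                return payload;
            }
        }
        return std::nullopt;
    }

private:
    std::deque<Envelope> pending_;
};

// Returns true when two messages that share tag 0 but live on different
// contexts are each delivered only to the receive posted on their own context.
bool tagZeroIsIsolatedAcrossContexts() {
    const int userContext = 1;    // hypothetical: the application's communicator
    const int libraryContext = 2; // hypothetical: a library-private communicator
    Mailbox box;
    box.send({libraryContext, /*source=*/3, /*tag=*/0, /*payload=*/111});
    box.send({userContext, /*source=*/3, /*tag=*/0, /*payload=*/222});

    // The user receive skips the library message even though it was sent first.
    const auto user = box.recv(userContext, 3, 0);
    // Even a fully wildcarded receive stays inside its own context.
    const auto library = box.recv(libraryContext, kAnySource, kAnyTag);
    return user == 222 && library == 111;
}
```

In this model a collision on tag 0 can only happen when both sides share a context, which is why a tag-dependent failure across separate communicators is surprising.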

@liangwang0734
Author

liangwang0734 commented Aug 6, 2023

Thank you, @pnorbert. I did some tests following your suggestion.

Changing the tags in the various variants of the BP5 writer didn't seem to help.

However, the problem seems to go away after I change the tag in the Isend/Recv calls in MPIShmChain::HandshakeLinks_Start.

PS: I noticed that ADIOS2's communicator calls have additional "hints". Is there a way to print them?

@liangwang0734
Author

@pnorbert I think this is caused by a bug in release_28 that is now fixed! In the release_28 version of HandshakeLinks_Start, rank 0 had the incorrect origin (source rank) for Recv.

@pnorbert
Contributor

pnorbert commented Aug 6, 2023 via email

@liangwang0734
Author

We have been using v2.8 because some preliminary tests with v2.9 failed, but we plan to switch to v2.9 once we have time to investigate the causes.

Thank you very much for the useful suggestions! I'm going to close this issue now.
