
Bump Parthenon to latest develop (2024-02-15) #84

Merged
15 commits merged into main from forrestglines/parthenon-pr930 on Feb 21, 2024

Conversation

@forrestglines (Contributor)

Update Parthenon to PR930

@forrestglines (Contributor, Author)

@par-hermes format

@forrestglines (Contributor, Author)

Actually, this is causing issues for the cluster setup; do not merge.

forrestglines changed the title from "Updated Parthenon to PR930" to "WIP: Updated Parthenon to PR930" on Aug 30, 2023
@pgrete (Contributor) commented Sep 26, 2023

FYI, I just bumped to the current Parthenon develop. Let's see how this works.

@BenWibking (Contributor)

I'm a bit confused as to why this is merging into the donut branch. The main branch doesn't have these changes either, does it?

Is this PR still relevant, or has it been superseded?

@pgrete (Contributor) commented Feb 15, 2024

> I'm a bit confused as to why this is merging into the donut branch. The main branch doesn't have these changes either, does it?
>
> Is this PR still relevant, or has it been superseded?

Yes, the PR is still relevant; I had put it on hold while tracing back the issue I saw on Lumi.
I ran further tests earlier today, and while I am reliably able to reproduce errors like

cycle=0 time=0.0000000000000000e+00 dt=3.5156250000000001e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=2.43e-02 wsec_step=8.55e-01
Memory access fault by GPU node-5 (Agent handle: 0x82126c0) on address 0x14e1b1300000. Reason: Unknown.
srun: error: nid005016: task 1: Aborted

with the Parthenon advection example on just two ranks (the fault does not show up on a single rank), I was not able to reproduce them locally or on Frontier.
Moreover, the core dump points into /opt/rocm/lib/libhsa-runtime64.so.1.
This gives me reason to believe that there is an issue with the software stack on Lumi and that it's not an issue on the Parthenon side.

Bottom line: Let me clean up this PR (tomorrow) and then we can merge with the latest Parthenon develop in place.
Does this sound like a plan?

@BenWibking (Contributor)

I think the core dump will point to that because that's where the error handler is, not necessarily because the underlying bug is there.

(We've been debugging an issue in Quokka that produces that error message. We thought it was an out-of-bounds memory access, which it usually is. However, it turned out to be a compiler bug that appears for specific ROCm versions 🙃)

Yes, that plan sounds good to me.
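[Editor's note: for readers unfamiliar with this failure mode, below is a minimal, hypothetical sketch — not code from Quokka, Parthenon, or AthenaPK — of the kind of out-of-bounds device write that typically produces this HSA fault. It assumes a Kokkos build with the HIP backend enabled.]

```cpp
// Hypothetical reproducer: deliberately writes one element past the end of a
// device View. On ROCm this usually surfaces as
// "Memory access fault by GPU node-N ... Reason: Unknown."
#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1024;
    Kokkos::View<double *> a("a", n);
    // Off-by-one upper bound: the iteration i == n writes past the end of a.
    Kokkos::parallel_for(
        "oob_write", Kokkos::RangePolicy<>(0, n + 1),
        KOKKOS_LAMBDA(const int i) { a(i) = 1.0; });
    Kokkos::fence();  // the asynchronous fault is raised at the latest here
  }
  Kokkos::finalize();
  return 0;
}
```

Because the runtime raises the fault asynchronously, the resulting backtrace typically ends in libhsa-runtime64.so rather than at the offending kernel, which is why the core dump mentioned above is not conclusive on its own.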

@BenWibking (Contributor)

@pgrete just want to check if this is ready to go?

pgrete changed the base branch from pgrete/donut to main on February 19, 2024
pgrete changed the title from "WIP: Updated Parthenon to PR930" to "Bump Parthenon to latest develop (2024-02-15)" on Feb 19, 2024
@pgrete (Contributor) commented Feb 19, 2024

> @pgrete just want to check if this is ready to go?

I double-checked on Friday that the (other) crashes I observed are no longer reproducible. They were also not reproducible with the binary that triggered them in the first place, and given that those were all interconnect-related issues, I assume the machine happened to be unstable when I ran the original tests. (Note: this was a second error mode, separate from the "Memory access fault by GPU" above.)

I have now updated to the latest Parthenon develop and fixed all interface changes (fingers crossed).
I also took this as an opportunity to introduce a CHANGELOG (where I highlighted the changes in the Parthenon API).

Bottom line: this is ready for review and merge @BenWibking @forrestglines

@BenWibking (Contributor) left a review


There appear to be some unrelated cluster setup changes (although they are trivial). Everything else looks good to me.

@pgrete (Contributor) commented Feb 19, 2024

> There appear to be some unrelated cluster setup changes (although they are trivial). Everything else looks good to me.

Yes, I added those (to be used for the histograms).
In general, I'd like to separate the "derived fields" into a more central location as many fields are reusable across problem generators.

@BenWibking (Contributor)

@forrestglines Did the cluster test runs with this PR work?

@forrestglines (Contributor, Author) commented Feb 21, 2024

> @forrestglines Did the cluster test runs with this PR work?

Yes -- a Perseus test cluster run did not hit any issues, whereas I think it did before. I think this is safe to merge.

@BenWibking (Contributor)

@pgrete do you want to merge this now?

BenWibking merged commit 3cc587a into main on Feb 21, 2024
4 checks passed
pgrete deleted the forrestglines/parthenon-pr930 branch on March 18, 2024