try to make a ROCm big kernel reproducer #1477

Closed
zingale opened this issue Feb 11, 2024 · 8 comments

Comments

@zingale
Member

zingale commented Feb 11, 2024

ROCm seems to have trouble with large kernels, leading to memory issues. We can try to create a reproducer using test_react: start with a small net and generate progressively bigger nets (via pynucastro) until we find a size that breaks things. We might also be able to strip out neutrinos, the EOS, and other bits.

@BenWibking
Collaborator

Just for future reference: this appears to depend on whether we run the primordial_chem network from the application code or standalone (https://github.com/AMReX-Astro/Microphysics/tree/main/unit_test/burn_cell_primordial_chem). The latter does not trigger the memory fault we see when running the network in Quokka.

@zingale
Member Author

zingale commented Feb 11, 2024

We should try test_react instead of burn_cell, and also make the box bigger. Right now it seems that n_cell = 16, so we should try 32**3 or 64**3.

@yut23
Collaborator

yut23 commented Feb 20, 2024

Turning off force-inlining of all functions in the kernel appears to fix the memory issues in Castro. We could try something like constexpr_for<0, N>([](int) { big_function(); }); to make an arbitrarily large kernel.
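A minimal host-only sketch of that idea, assuming C++17 and a GCC- or Clang-compatible compiler. Everything named here (`REPRO_INLINE`, `REPRO_NO_FORCE_INLINE`, this `constexpr_for`, `big_function`, `big_kernel`, `self_check`) is a hypothetical stand-in written for illustration, not the AMReX or Microphysics definitions. A real reproducer would presumably put the unrolled body inside a GPU kernel launch and use a genuinely expensive function in place of the trivial one below.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical toggle: build with -DREPRO_NO_FORCE_INLINE to leave inlining
// decisions to the compiler instead of forcing them.
#if !defined(REPRO_NO_FORCE_INLINE) && (defined(__GNUC__) || defined(__clang__))
#define REPRO_INLINE inline __attribute__((always_inline))
#else
#define REPRO_INLINE inline
#endif

// Compile-time loop: calls f(std::integral_constant<int, I>{}) for each I in
// [Begin, End), so every iteration is a separate instantiation.
template <int Begin, int End, class F>
REPRO_INLINE void constexpr_for(const F& f) {
    if constexpr (Begin < End) {
        f(std::integral_constant<int, Begin>{});
        constexpr_for<Begin + 1, End>(f);
    }
}

// Stand-in for an expensive routine (e.g. a network right-hand side).
// Templating on I gives each unrolled iteration its own copy of the body.
template <int I>
REPRO_INLINE double big_function(double x) {
    return x + static_cast<double>(I + 1);
}

// "Kernel" body whose code size grows with N once everything is inlined.
template <int N>
double big_kernel(double x) {
    constexpr_for<0, N>([&](auto i) {
        x = big_function<decltype(i)::value>(x);
    });
    return x;
}

// Small usage example: four unrolled calls add 1 + 2 + 3 + 4 to the input.
inline void self_check() {
    assert(big_kernel<4>(0.0) == 10.0);
}
```

Raising N (and making `big_function` heavier) grows the fully inlined body, while building with `-DREPRO_NO_FORCE_INLINE` gives a comparison case without forced inlining.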

@BenWibking
Collaborator

@psharda found that turning off force-inlining also fixes the memory issues in Quokka.

@BenWibking
Collaborator

More context on the compiler bug here: https://discourse.llvm.org/t/how-to-verify-correct-regalloc-for-a-kernel/80811

TL;DR: the underlying issue is well understood by the compiler developers, and it is supposed to be fixed by this LLVM PR: llvm/llvm-project#93526

Should we close this issue?

@zingale
Member Author

zingale commented Aug 24, 2024

Very nice. Does this mean a future ROCm version will have this fix?

@BenWibking
Collaborator

Since the ROCm compiler is derived from the upstream LLVM sources, I think, in principle, yes. No idea when this will be.

@zingale zingale closed this as completed Aug 25, 2024
@zingale
Member Author

zingale commented Aug 25, 2024

Closing this, since it seems to be recognized as an LLVM bug with a fix in a PR.
