Fix maskhalo bug in dynamics #517
… decompositions. In CICE-Consortium#491, an unmasked halo update was changed to a masked halo update. This affects only padded decompositions with maskhalo_dyn=true and was picked up by an exact restart failure
It would be nice to get this reviewed and approved today to get it into weekend testing. But this bug is unlikely to affect anyone in the short term, so it can also wait if needed.
Thanks @apcraig. This looks like the right solution, but I'm not sure I understand why the change from an unmasked to a masked halo update at that point in the code would cause exact restart to fail. Can you explain a little more, just so I understand the code better? The original commit where I made this change was e85bca8, and I remember thinking "oh, the masked haloupdate is not done there, with my change it will be done but I guess that's an optimization...", so something was clearly missing from my understanding.
@phil-blain, I was sort of asking myself the same question when I identified the difference. I can't remember the details about why that first halo update should not be masked. It's obviously a subtle issue that seems to create problems only in very special cases. The other thing is that this haloupdate is not part of the subcycling (in evp and eap anyway), so the cost savings are relatively small. If we want to dig deeper into this, we could. It would probably require comparing various fields within the subroutine as the model advances to see where and how the differences are introduced. I'm not sure it's worth the effort right now, but I am open to raising the priority and pursuing this question.

When I implemented the maskhalo feature, I tried to identify the proper mask and test for bit-for-bit results, focusing especially on places where a mask could be reused for multiple halo updates. Since this was a single halo update outside the dynamics subcycling, the cost benefit was relatively small. I suspect I recognized that the masked halo was a problem in this haloupdate, but I may never have tried to understand why because the performance implication was small. It may be as simple as forcing zeros into the unused halo to ensure it behaves properly, but that is just speculation. Again, we could look into this more if there is interest.
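The speculation about unused halo cells can be pictured with a toy model. The sketch below is not CICE code and is not a confirmed explanation of the restart failure: `halo_update`, the array shapes, and the "leftover" value 9.0 are all hypothetical, chosen only to show how a masked update can leave halo contents dependent on what was there before, while an unmasked update cannot.

```python
import numpy as np

def halo_update(field, neighbor_edge, halo_mask=None):
    """Toy halo update (hypothetical, not the CICE routine).

    Copies a neighbor's edge values into the last column of `field`,
    which plays the role of the halo. With a mask, only rows where the
    mask is True are overwritten; the other halo cells keep whatever
    they held before the call.
    """
    out = field.copy()
    if halo_mask is None:
        out[:, -1] = neighbor_edge
    else:
        out[halo_mask, -1] = neighbor_edge[halo_mask]
    return out

neighbor_edge = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([True, True, False, False])  # assumed mask, for illustration

# "Continuous" run: the halo still holds leftover values from earlier work.
block_continuous = np.zeros((4, 3))
block_continuous[:, -1] = 9.0

# "Restarted" run: the halo starts from zeros.
block_restart = np.zeros((4, 3))

# Unmasked: every halo cell is overwritten, so both runs agree.
unmasked_same = np.array_equal(
    halo_update(block_continuous, neighbor_edge),
    halo_update(block_restart, neighbor_edge),
)

# Masked: skipped halo cells keep their old contents, so the runs differ.
masked_same = np.array_equal(
    halo_update(block_continuous, neighbor_edge, mask),
    halo_update(block_restart, neighbor_edge, mask),
)

print(unmasked_same, masked_same)
```

In this toy setting, zeroing the halo before the masked update would make the two runs agree again, which is the kind of remedy speculated about above. Whether that applies to the actual dynamics code is untested here.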
OK, I understand. Let's not pursue this too far; we can always revisit if need be. "Premature optimization is the root of all evil", after all.
I don't completely understand this either, but here's a hypothesis:
PR checklist
Fix maskhalo bug in dynamics introduced in "dynamics: add implicit VP solver" (#491)
apcraig
Full test suite passed on cheyenne. This fixes the issue with "restart gx3 8x2x8x10x20 droundrobin maskhalo" on the current trunk. https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#b1b1f9d49b3a228cff3fc9b07ac4a22d13f2da39
Fix maskhalo bug in dynamics, only appears when running padded decompositions. In #491, an unmasked halo update was changed to a masked halo update. This affects only padded decompositions with maskhalo_dyn=true and was picked up by an exact restart failure.
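As a rough illustration of why the failing test would count as a padded decomposition, the arithmetic can be sketched as follows. Two assumptions are made here and are not stated in this PR: that the gx3 grid is 100 x 116 cells, and that the `8x10` in the test name `8x2x8x10x20` is the block size in x and y. The helper `padding` is hypothetical.

```python
import math

def padding(n_global, block_size):
    """Return (number of blocks, padded cells) along one dimension."""
    n_blocks = math.ceil(n_global / block_size)
    pad = n_blocks * block_size - n_global
    return n_blocks, pad

# Assumed gx3 global grid size (not stated in this PR).
NX_GLOBAL, NY_GLOBAL = 100, 116
# Assumed block size, read from the test name "8x2x8x10x20".
BLOCK_X, BLOCK_Y = 8, 10

blocks_x, pad_x = padding(NX_GLOBAL, BLOCK_X)
blocks_y, pad_y = padding(NY_GLOBAL, BLOCK_Y)

print(f"x: {blocks_x} blocks, {pad_x} padded columns")
print(f"y: {blocks_y} blocks, {pad_y} padded rows")
```

Under these assumptions neither dimension divides evenly by the block size, so the last block in each direction carries cells that lie outside the global grid.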