Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor global sums to improve performance #780

Closed
apcraig opened this issue Oct 27, 2022 · 6 comments · Fixed by #824
Closed

Refactor global sums to improve performance #780

apcraig opened this issue Oct 27, 2022 · 6 comments · Fixed by #824
Assignees

Comments

@apcraig
Copy link
Contributor

apcraig commented Oct 27, 2022

The global sums have traditionally been used only for diagnostics in the log file. The VP solver now also uses global sums, see #774. The CICE generic global sums could be refactored to improve performance, shifting if statements associated with optional arguments relative to loops and similar.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 28, 2023

@phil-blain, I'm doing some timing tests with the global sums including with VP. I'm trying to understand this section of the vp, https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L3348. The haloupdate is not done when bfbflag='off'. Why is this and is it related to how the computation of the global_sum is done when bfbflag='off' here, https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L2556?

The reason I ask is that the performance and results change in rather surprising ways if I comment out one or the other bfbflag='off' logic in VP. One goal is to improve the global sum performance in the infrastructure such that there is no performance degradatation in VP when using it. But I think I also need to understand if/when/why the haloupdate is executed at line 3348 under various settings of the bfbflag. Thanks.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 28, 2023

Some Dynamics timing data from various tests. I'm setting the VP global_sum to "local" or "lsum8", and I'm setting the haloUpdate at line 3348 to "off" or "on" manually via code mods. Here's the timing

sum=local, halo=on, 24.3s
sum=lsum8, halo=on, 28.1s
sum=local, halo=off, 34.0s
sum=lsum8, halo=off, 41.3s

No answers are bit-for-bit, as expected. The timing is pretty surprising though, turning the halo off slows down the VP code significantly. That seems more important than the global sum implementation. Is that halo update actually important for the preconditioning and efficiency of the solver? Maybe I'm doing something wrong in my testing? Again, turning on the haloUpdate at line 3348 is always better than having it off in overall time.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 28, 2023

@phil-blain
Copy link
Member

Another question since I'm looking at the VP code, is there a bug at these lines? Should the RHS be workspace_y?

https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L3123 https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L3531

Yes, these look like copy-paste errors, should be workspace_y on the RHS indeed.

@phil-blain
Copy link
Member

@phil-blain, I'm doing some timing tests with the global sums including with VP. I'm trying to understand this section of the vp, https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L3348. The haloupdate is not done when bfbflag='off'. Why is this [...] ?

So at line 3348:

! Update workspace with boundary values
! NOTE: skipped for efficiency since this is just a preconditioner
! unless bfbflag is active
if (bfbflag /= 'off') then

we are in the pgmres subroutine which is used as a preconditioner for the linear solver (subroutine fgmres). The idea behind skipping these halo updates here was that since it's just a preconditionner, we don't care about an exactly correct answer, we just want to bring the iterate a little bit closer to the solution so that the FGMRES solver can do its job more easily. In #774 I added the halo updates depending on bfbflag to allow bit4bit reproducibility if desired.

I must say I did not think of testing what you tested above, i.e. that it might make the overall runtime faster to always perform these halo updates. My hypothesis for explaining this behaviour is that the preconditioner yields a better starting point with these halo updates than without, so the FGMRES solver can reach the solution faster (and maybe similarly for the nonlinear solver, it depends on the tolerances)... @JFLemieux73 do you think this makes sense ?

We could verify this using the monitor_{nonlin,fgmres,pgmres} settings in the namelist and looking at the tolerances and number of iterations at each level.

and is it related to how the computation of the global_sum is done when bfbflag='off' here, https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L2556 ?

No, it's not related. At 2556:

! Use local summation result unless bfbflag is active
if (bfbflag == 'off') then
dot_product = global_sum(sum(dot), distrb_info)
else
dot_product = global_sum(prod, distrb_info, field_loc_NEcorner)
endif

we are in global_dot_product and when b4b reproducibility is desired we have to do a global reduction, so we call global_sum on the prod array. If we don't care for reproducibility we simply do a local reduction and call global_sum on the scalar locally reduced values.

@apcraig
Copy link
Contributor Author

apcraig commented Mar 29, 2023

Thanks @phil-blain.

I'll fix the two typos workspace copies in line 3123 and 3531, no problem.

Regarding performance, I did a couple runs with/without the haloupdate on for otherwise the same settings with the VP monitor namelists set to true. Again, I'm not sure what's going on, but looking at the monitor output, with haloupdate "off" on the precondition, the iter_fgmres is typically in the 20-30 range. For each fgm iteration, pgmres seems to do 5 iterations. When I run the same test with haloupdate "on" on the precondition, the iter_fgmres is about half, say 10-15 range, the pgmres stays the same at 5 iterations per fgmres iteration. So I think that does confirm that the haloupdate "on" reduces the number of iterations and seems to reduce the overall time even though we pay for the haloupdate. Are there some other tests I can/should do? The test I'm running is with "dynpicard" (kdyn=3,algo_nonlin='picard'). Haven't checked performance of other vp settings. Should we remove this check on bfbflag on the haloupdate on the preconditioner?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants