Refactor global sums to improve performance #780

apcraig · 2022-10-27T16:33:53Z

The global sums have traditionally been used only for diagnostics in the log file. The VP solver now also uses global sums, see #774. The CICE generic global sums could be refactored to improve performance, shifting if statements associated with optional arguments relative to loops and similar.

apcraig · 2023-03-28T20:02:37Z

@phil-blain, I'm doing some timing tests with the global sums including with VP. I'm trying to understand this section of the vp, https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L3348. The haloupdate is not done when bfbflag='off'. Why is this and is it related to how the computation of the global_sum is done when bfbflag='off' here, https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L2556?

The reason I ask is that the performance and results change in rather surprising ways if I comment out one or the other bfbflag='off' logic in VP. One goal is to improve the global sum performance in the infrastructure such that there is no performance degradatation in VP when using it. But I think I also need to understand if/when/why the haloupdate is executed at line 3348 under various settings of the bfbflag. Thanks.

apcraig · 2023-03-28T20:31:24Z

Some Dynamics timing data from various tests. I'm setting the VP global_sum to "local" or "lsum8", and I'm setting the haloUpdate at line 3348 to "off" or "on" manually via code mods. Here's the timing

sum=local, halo=on, 24.3s
sum=lsum8, halo=on, 28.1s
sum=local, halo=off, 34.0s
sum=lsum8, halo=off, 41.3s

No answers are bit-for-bit, as expected. The timing is pretty surprising though, turning the halo off slows down the VP code significantly. That seems more important than the global sum implementation. Is that halo update actually important for the preconditioning and efficiency of the solver? Maybe I'm doing something wrong in my testing? Again, turning on the haloUpdate at line 3348 is always better than having it off in overall time.

apcraig · 2023-03-28T21:58:13Z

Another question since I'm looking at the VP code, is there a bug at these lines? Should the RHS be workspace_y?

https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L3123
https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L3531

phil-blain · 2023-03-29T11:50:31Z

Another question since I'm looking at the VP code, is there a bug at these lines? Should the RHS be workspace_y?

https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L3123 https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L3531

Yes, these look like copy-paste errors, should be workspace_y on the RHS indeed.

phil-blain · 2023-03-29T12:26:11Z

@phil-blain, I'm doing some timing tests with the global sums including with VP. I'm trying to understand this section of the vp, https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L3348. The haloupdate is not done when bfbflag='off'. Why is this [...] ?

So at line 3348:

CICE/cicecore/cicedyn/dynamics/ice_dyn_vp.F90

Lines 3345 to 3348 in 9424497

    
           ! Update workspace with boundary values 
        
           ! NOTE: skipped for efficiency since this is just a preconditioner 
        
           ! unless bfbflag is active 
        
           if (bfbflag /= 'off') then

we are in the pgmres subroutine which is used as a preconditioner for the linear solver (subroutine fgmres). The idea behind skipping these halo updates here was that since it's just a preconditionner, we don't care about an exactly correct answer, we just want to bring the iterate a little bit closer to the solution so that the FGMRES solver can do its job more easily. In #774 I added the halo updates depending on bfbflag to allow bit4bit reproducibility if desired.

I must say I did not think of testing what you tested above, i.e. that it might make the overall runtime faster to always perform these halo updates. My hypothesis for explaining this behaviour is that the preconditioner yields a better starting point with these halo updates than without, so the FGMRES solver can reach the solution faster (and maybe similarly for the nonlinear solver, it depends on the tolerances)... @JFLemieux73 do you think this makes sense ?

We could verify this using the monitor_{nonlin,fgmres,pgmres} settings in the namelist and looking at the tolerances and number of iterations at each level.

and is it related to how the computation of the global_sum is done when bfbflag='off' here, https://github.com/CICE-Consortium/CICE/blob/main/cicecore/cicedyn/dynamics/ice_dyn_vp.F90#L2556 ?

No, it's not related. At 2556:

CICE/cicecore/cicedyn/dynamics/ice_dyn_vp.F90

Lines 2555 to 2560 in 9424497

    
           ! Use local summation result unless bfbflag is active 
        
           if (bfbflag == 'off') then 
        
              dot_product = global_sum(sum(dot), distrb_info) 
        
           else 
        
              dot_product = global_sum(prod, distrb_info, field_loc_NEcorner) 
        
           endif

we are in global_dot_product and when b4b reproducibility is desired we have to do a global reduction, so we call global_sum on the prod array. If we don't care for reproducibility we simply do a local reduction and call global_sum on the scalar locally reduced values.

apcraig · 2023-03-29T15:37:38Z

Thanks @phil-blain.

I'll fix the two typos workspace copies in line 3123 and 3531, no problem.

Regarding performance, I did a couple runs with/without the haloupdate on for otherwise the same settings with the VP monitor namelists set to true. Again, I'm not sure what's going on, but looking at the monitor output, with haloupdate "off" on the precondition, the iter_fgmres is typically in the 20-30 range. For each fgm iteration, pgmres seems to do 5 iterations. When I run the same test with haloupdate "on" on the precondition, the iter_fgmres is about half, say 10-15 range, the pgmres stays the same at 5 iterations per fgmres iteration. So I think that does confirm that the haloupdate "on" reduces the number of iterations and seems to reduce the overall time even though we pay for the haloupdate. Are there some other tests I can/should do? The test I'm running is with "dynpicard" (kdyn=3,algo_nonlin='picard'). Haven't checked performance of other vp settings. Should we remove this check on bfbflag on the haloupdate on the preconditioner?

apcraig added the Software Engineering label Oct 27, 2022

apcraig self-assigned this Oct 27, 2022

apcraig mentioned this issue Apr 3, 2023

Update global reduction implementation to improve performance, fix VP bug #824

Merged

18 tasks

apcraig linked a pull request Apr 3, 2023 that will close this issue

Update global reduction implementation to improve performance, fix VP bug #824

Merged

18 tasks

apcraig closed this as completed in #824 Apr 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor global sums to improve performance #780

Refactor global sums to improve performance #780

apcraig commented Oct 27, 2022

apcraig commented Mar 28, 2023

apcraig commented Mar 28, 2023

apcraig commented Mar 28, 2023

phil-blain commented Mar 29, 2023

phil-blain commented Mar 29, 2023

apcraig commented Mar 29, 2023

Refactor global sums to improve performance #780

Refactor global sums to improve performance #780

Comments

apcraig commented Oct 27, 2022

apcraig commented Mar 28, 2023

apcraig commented Mar 28, 2023

apcraig commented Mar 28, 2023

phil-blain commented Mar 29, 2023

phil-blain commented Mar 29, 2023

apcraig commented Mar 29, 2023