Enhance MODE to use OpenMP to make the convolution step faster #2724

Closed
20 of 21 tasks
JohnHalleyGotway opened this issue Nov 2, 2023 · 8 comments · Fixed by #2726

@JohnHalleyGotway
Collaborator

JohnHalleyGotway commented Nov 2, 2023

Describe the Enhancement

MET#1926 added OpenMP to parallelize the computation of fractional coverage fields. The same approach can easily be applied to the convolution step in MODE. This issue is to reimplement ShapeData::conv_filter_circ(...) using the same OpenMP-wrapped algorithm employed by the fractional_coverage() utility function.
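As a rough, hedged sketch of the intended change (not MET's actual implementation; the function signature, accessor names, and missing-data handling are simplified placeholders), the convolution loop over the grid can be wrapped in an OpenMP parallel-for in the same style used by fractional_coverage():

    #include <omp.h>
    #include <vector>

    // Illustrative OpenMP-parallelized circular convolution filter.
    // Missing-data handling and MET's DataPlane/ShapeData interfaces are omitted.
    void conv_filter_circ_sketch(const std::vector<double> &in,
                                 std::vector<double> &out,
                                 int nx, int ny, int radius) {
       const int r2 = radius * radius;

    #pragma omp parallel for schedule(static)
       for(int y = 0; y < ny; y++) {
          for(int x = 0; x < nx; x++) {
             double sum = 0.0;
             int    n   = 0;
             // Average all points within the circular neighborhood,
             // skipping neighbors that fall off the grid.
             for(int dy = -radius; dy <= radius; dy++) {
                for(int dx = -radius; dx <= radius; dx++) {
                   if(dx*dx + dy*dy > r2) continue;
                   int xx = x + dx;
                   int yy = y + dy;
                   if(xx < 0 || xx >= nx || yy < 0 || yy >= ny) continue;
                   sum += in[yy*nx + xx];
                   n++;
                }
             }
             out[y*nx + x] = (n > 0 ? sum / n : 0.0);
          }
       }
    }

Each (x, y) output point is independent, so the outer loop can be split across threads with no synchronization.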

Time Estimate

1 day.

Sub-Issues

Consider breaking the enhancement down into sub-issues.
None needed.

Relevant Deadlines

List relevant project deadlines here or state NONE.

Funding Source

Define the source of funding and account keys here or state NONE.

Define the Metadata

Assignee

  • Select engineer(s) or no engineer required
  • Select scientist(s) or no scientist required

Labels

  • Review default alert labels
  • Select component(s)
  • Select priority
  • Select requestor(s)

Milestone and Projects

  • Select Milestone as the next official version or Backlog of Development Ideas
  • For the next official version, select the MET-X.Y.Z Development project

Define Related Issue(s)

Consider the impact to the other METplus components.

Enhancement Checklist

See the METplus Workflow for details.

  • Complete the issue definition above, including the Time Estimate and Funding Source.
  • Fork this repository or create a branch of develop.
    Branch name: feature_<Issue Number>_<Description>
  • Complete the development and test your changes.
  • Add/update log messages for easier debugging.
  • Add/update unit tests.
  • Add/update documentation.
  • Push local changes to GitHub.
  • Submit a pull request to merge into develop.
    Pull request: feature <Issue Number> <Description>
  • Define the pull request metadata, as permissions allow.
    Select: Reviewer(s) and Development issue
    Select: Milestone as the next official version
    Select: MET-X.Y.Z Development project for development toward the next official release
  • Iterate until the reviewer(s) accept and merge your changes.
  • Delete your fork or branch.
  • Close this issue.
@JohnHalleyGotway added the type: enhancement, priority: medium, alert: NEED ACCOUNT KEY, alert: NEED CYCLE ASSIGNMENT, component: code optimization, requestor: METplus Team, and MET: Feature Verification labels Nov 2, 2023
@JohnHalleyGotway added this to the MET 12.0.0 milestone Nov 2, 2023
@JohnHalleyGotway self-assigned this Nov 2, 2023
JohnHalleyGotway added a commit that referenced this issue Nov 2, 2023
@JohnHalleyGotway
Collaborator Author

Ran a simple test on seneca using HRRR data to compare 6-hour precip to itself.
Here are the timing results:

  1. MET version 11.1.0 -> 2:21.65
  2. Feature branch with OMP_NUM_THREADS = 1 -> 1:22.33
  3. Feature branch with OMP_NUM_THREADS = 2 -> 1:14.89
  4. Feature branch with OMP_NUM_THREADS = 4 -> 1:11.68
  5. Feature branch with OMP_NUM_THREADS = 8 -> 1:10.47

The improvement from 1 to 2 is the result of swapping in a much more efficient looping algorithm.
The further improvements are caused by increasing the number of threads used for the convolution step.
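For reference, the thread counts above were controlled with the standard OMP_NUM_THREADS environment variable. Inside the code, the effective thread count can also be queried or capped through the usual OpenMP runtime calls; this tiny snippet is illustrative only and not part of MET:

    #include <omp.h>
    #include <cstdio>

    int main() {
       // Reports the number of threads OpenMP will use for parallel regions.
       // This honors OMP_NUM_THREADS when it is set in the environment.
       std::printf("OpenMP max threads: %d\n", omp_get_max_threads());

       // A tool could also cap the thread count explicitly:
       // omp_set_num_threads(4);
       return 0;
    }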

JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
… object for each field. Still more work to do to reduce memory usage and also apply OpenMP to the ShapeData::select() function.
JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
…() to exactly reproduce existing results. There were some subtle diffs in the handling of missing data and points off the grid.
JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
…ikely a way to make the memory usage more efficient, but it'll require a tweak to the logic.
JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
@JohnHalleyGotway
Collaborator Author

JohnHalleyGotway commented Nov 3, 2023

@DanielAdriaansen, on 11/2/23, we discussed some additional refinements to MODE with the goal of minimizing unnecessary memory allocation. Currently, the fuzzy logic engine allocates memory for 1 copy of the entire domain for each forecast and observation object. In your case, running with 75 forecast and 75 observation objects on a 3km CONUS HRRR domain, that adds up to a lot of memory.

It was a little more involved than I expected, but the basic change we discussed is now in place on the feature_2724_mode_openmp branch. This GHA testing workflow ran without error and flagged no diffs (i.e. no red 'X'). Please re-run your test with this version of MODE:
seneca:/d1/personal/johnhg/MET/MET_development/MET-feature_2724_mode_openmp/bin/mode

Your run on 11/2/23 took just under 10 minutes to complete. Please let me know what the new runtime is.

I'll note that there is additional memory allocation done in the double-threshold merging step that could probably also be eliminated. I have an idea of how we could use STL maps to keep track of the simple object ids falling inside the merge objects and vice-versa. That should provide the information needed without allocating so many copies of the grid.
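Purely as a sketch of that idea (hypothetical names, not the planned implementation), the bookkeeping could be built up during a single pass over the grid instead of storing an extra copy of the grid per object:

    #include <map>
    #include <set>

    // Hypothetical bookkeeping for double-threshold merging:
    // which simple object ids fall inside each merge object, and vice versa.
    std::map<int, std::set<int>> simple_ids_by_merge;   // merge id  -> simple ids
    std::map<int, int>           merge_id_by_simple;    // simple id -> merge id

    // Called wherever a grid point has both a simple object id and a
    // merge object id defined.
    void record_overlap(int simple_id, int merge_id) {
       simple_ids_by_merge[merge_id].insert(simple_id);
       merge_id_by_simple[simple_id] = merge_id;
    }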

@DanielAdriaansen
Contributor

@JohnHalleyGotway I tested using
/d1/personal/johnhg/MET/MET_development/MET-feature_2724_mode_openmp/bin/mode with the same command as yesterday. Assuming that path is correct and the executable is updated, here are the results:

Yesterday:


real	9m57.844s
user	9m49.903s
sys	0m7.912s

This morning:

real	10m10.714s
user	10m4.256s
sys	0m6.404s

@JohnHalleyGotway
Collaborator Author

JohnHalleyGotway commented Nov 3, 2023

@DanielAdriaansen thanks for re-testing and passing along the test you're using on seneca. I suspect the slowness is ultimately caused by MODE looping over the input domain many, many, many times. Here are some thoughts.

  • Add some print statements and observe the slowness between them.
    • Your test case has 75 tiny forecast objects compared to the same set of 75 tiny observation objects.
    • For each forecast object, MODE selects the object (looping over the grid) and then sets the single object attributes (looping over the grid at least once more).
    • The set step is much slower than the select step.
    • After doing this for all the forecast objects, it does the same for the observation objects. And then does it for all the forecast clusters and observation clusters. In this case the clusters are very similar to the simple objects, but are slightly different.
    • The point is that we're doing a kinda slow thing 4 times... which adds up to 10 minutes of runtime.
  • Run your test command through the gprof profiler to identify big offenders.
    • Here are the results: gprof_output.txt
    • Over 50% of the run time is spent accessing data in calls to DataPlane::get(int, int), DataPlane::two_to_one(int, int, bool), and ShapeData::s_is_on(int, int, bool). The two_to_one and s_is_on functions range check x/y against Nx/Ny every single time, which slows them down. When we're accessing data inside a loop over Nx and Ny, range checking isn't necessary. Recommend avoiding the range checking wherever it is possible and safe by accessing the data directly through the buf() buffer instead (see the sketch after this list).
  • The DataPlane::sdebug_examine() function loops over the entire grid and can be slow.
    • Could replace log messages like this:
    mlog << Debug(4) << " Before fcst convolution: " << fcst_conv->sdebug_examine() << "\n";

    with this:
    if(mlog.verbosity_level() >= 4) mlog << Debug(4) << " Before fcst convolution: " << fcst_conv->sdebug_examine() << "\n";

    • That would avoid the computational cost we're currently incurring.
  • The ShapeData::n_objects() function calls split(), which loops over the entire grid and can be slow.
    • Could eliminate calls to n_objects() or remove that function entirely to prevent its use.
  • Most of the time MODE loops over a grid that is very sparse, often only containing a single object. We could potentially add logic (maybe in the select function) to keep track of the min/max x/y values that contain the object. Rather than looping over the entire grid, we'd only need to loop over the smaller extent of x/y values. But that'd be pretty involved.
  • More to be added.
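As referenced in the range-checking bullet above, here is a hedged sketch of the kind of change being suggested; the buffer layout and names are simplified stand-ins for the actual DataPlane/ShapeData interfaces:

    #include <vector>

    // Illustrative only: when the loop bounds already guarantee valid indices,
    // read the underlying buffer directly instead of a range-checked get(x, y).
    double sum_grid(const std::vector<double> &buf, int nx, int ny) {
       double sum = 0.0;
       for(int y = 0; y < ny; y++) {
          const double *row = buf.data() + static_cast<size_t>(y) * nx;  // contiguous row of nx values
          for(int x = 0; x < nx; x++) sum += row[x];
       }
       return sum;
    }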

JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
…were computing NMEP outputs. That was removed from ensemble-stat in MET version 11.1 but the OpenMP setup remained there. This removes it from ensemble-stat and updates the documentation to accurately indicate that OpenMP currently applies to gen-ens-prod, grid-stat, and now mode.
@TaraJensen added the reporting: DTC NOAA R2O label and removed the alert: NEED ACCOUNT KEY label Nov 3, 2023
JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
@JohnHalleyGotway removed the alert: NEED CYCLE ASSIGNMENT label Nov 3, 2023
@JohnHalleyGotway
Collaborator Author

Good news. This GHA run flagged no diffs. So my reimplementation of the double-thresholding to minimize memory use works.

JohnHalleyGotway added a commit that referenced this issue Nov 6, 2023
…ions to be more efficient by accessing the vector of data rather than the slower get(x,y) data accessor function.
JohnHalleyGotway added a commit that referenced this issue Nov 6, 2023
JohnHalleyGotway added a commit that referenced this issue Nov 6, 2023
…bosity level to avoid unnecessary loops through the data. Note that all calls to the logger actually create the log message, and the logger then decides whether or not to print it. Wrapping expensive debugging log messages in a verbosity-level check is more efficient.
@JohnHalleyGotway linked a pull request Nov 6, 2023 that will close this issue
JohnHalleyGotway added a commit that referenced this issue Nov 7, 2023
… diff a bit more efficient by accessing the data() array directly rather than range-checking with the data(x,y) accessor function.
JohnHalleyGotway added a commit that referenced this issue Nov 7, 2023
@JohnHalleyGotway changed the title from "Enhance MODE to use OpenMP for efficient computation of the convolution step" to "Enhance MODE to use OpenMP to make the convolution step faster" Nov 17, 2023
@bonnystrong

When this is running on HPC, please contact Jeff Duda at GSL to test performance.

@jprestop
Collaborator

jprestop commented Dec 4, 2023

@bonnystrong Which HPC would Jeff use to test performance?

@bonnystrong

bonnystrong commented Dec 4, 2023 via email
