Enhance MODE to use OpenMP to make the convolution step faster #2724

Closed
20 of 21 tasks
JohnHalleyGotway opened this issue Nov 2, 2023 · 8 comments · Fixed by #2726

@JohnHalleyGotway
Collaborator

JohnHalleyGotway commented Nov 2, 2023

Describe the Enhancement

MET#1926 added OpenMP to parallelize the computation of fractional coverage fields. The same approach can easily be applied to the convolution step in MODE. This issue is to reimplement ShapeData::conv_filter_circ(...) using the same OpenMP-wrapped algorithm employed by the fractional_coverage() utility function.
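As a rough, hedged sketch of the intended change (not MET's actual implementation; the function signature, accessor names, and missing-data handling are simplified placeholders), the convolution loop over the grid can be wrapped in an OpenMP parallel-for in the same style used by fractional_coverage():

    #include <omp.h>
    #include <vector>

    // Illustrative OpenMP-parallelized circular convolution filter.
    // Missing-data handling and MET's DataPlane/ShapeData interfaces are omitted.
    void conv_filter_circ_sketch(const std::vector<double> &in,
                                 std::vector<double> &out,
                                 int nx, int ny, int radius) {
       const int r2 = radius * radius;

    #pragma omp parallel for schedule(static)
       for(int y = 0; y < ny; y++) {
          for(int x = 0; x < nx; x++) {
             double sum = 0.0;
             int    n   = 0;
             // Average all points within the circular neighborhood,
             // skipping neighbors that fall off the grid.
             for(int dy = -radius; dy <= radius; dy++) {
                for(int dx = -radius; dx <= radius; dx++) {
                   if(dx*dx + dy*dy > r2) continue;
                   int xx = x + dx;
                   int yy = y + dy;
                   if(xx < 0 || xx >= nx || yy < 0 || yy >= ny) continue;
                   sum += in[yy*nx + xx];
                   n++;
                }
             }
             out[y*nx + x] = (n > 0 ? sum / n : 0.0);
          }
       }
    }

Each (x, y) output point is independent, so the outer loop can be split across threads with no synchronization.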

Time Estimate

1 day.

Sub-Issues

Consider breaking the enhancement down into sub-issues.
None needed.

Relevant Deadlines

List relevant project deadlines here or state NONE.

Funding Source

Define the source of funding and account keys here or state NONE.

Define the Metadata

Assignee

  • Select engineer(s) or no engineer required
  • Select scientist(s) or no scientist required

Labels

  • Review default alert labels
  • Select component(s)
  • Select priority
  • Select requestor(s)

Milestone and Projects

  • Select Milestone as the next official version or Backlog of Development Ideas
  • For the next official version, select the MET-X.Y.Z Development project

Define Related Issue(s)

Consider the impact to the other METplus components.

Enhancement Checklist

See the METplus Workflow for details.

  • Complete the issue definition above, including the Time Estimate and Funding Source.
  • Fork this repository or create a branch of develop.
    Branch name: feature_<Issue Number>_<Description>
  • Complete the development and test your changes.
  • Add/update log messages for easier debugging.
  • Add/update unit tests.
  • Add/update documentation.
  • Push local changes to GitHub.
  • Submit a pull request to merge into develop.
    Pull request: feature <Issue Number> <Description>
  • Define the pull request metadata, as permissions allow.
    Select: Reviewer(s) and Development issue
    Select: Milestone as the next official version
    Select: MET-X.Y.Z Development project for development toward the next official release
  • Iterate until the reviewer(s) accept and merge your changes.
  • Delete your fork or branch.
  • Close this issue.
@JohnHalleyGotway added the type: enhancement, priority: medium, alert: NEED ACCOUNT KEY, alert: NEED CYCLE ASSIGNMENT, component: code optimization, requestor: METplus Team, and MET: Feature Verification labels Nov 2, 2023
@JohnHalleyGotway added this to the MET 12.0.0 milestone Nov 2, 2023
@JohnHalleyGotway self-assigned this Nov 2, 2023
JohnHalleyGotway added a commit that referenced this issue Nov 2, 2023
@JohnHalleyGotway
Collaborator Author

Ran a simple test on seneca using HRRR data to compare 6-hour precip to itself.
Here are the timing results:

  1. MET version 11.1.0 -> 2:21.65
  2. Feature branch with OMP_NUM_THREADS = 1 -> 1:22.33
  3. Feature branch with OMP_NUM_THREADS = 2 -> 1:14.89
  4. Feature branch with OMP_NUM_THREADS = 4 -> 1:11.68
  5. Feature branch with OMP_NUM_THREADS = 8 -> 1:10.47

The improvement from 1 to 2 is the result of swapping in a much more efficient looping algorithm.
The further improvements are caused by increasing the number of threads used for the convolution step.
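For reference, the thread counts above were controlled with the standard OMP_NUM_THREADS environment variable. Inside the code, the effective thread count can also be queried or capped through the usual OpenMP runtime calls; this tiny snippet is illustrative only and not part of MET:

    #include <omp.h>
    #include <cstdio>

    int main() {
       // Reports the number of threads OpenMP will use for parallel regions.
       // This honors OMP_NUM_THREADS when it is set in the environment.
       std::printf("OpenMP max threads: %d\n", omp_get_max_threads());

       // A tool could also cap the thread count explicitly:
       // omp_set_num_threads(4);
       return 0;
    }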

JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
… object for each field. Still more work to do to reduce memory usage and also apply OpenMP to the ShapeData::select() function.
JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
…() to exactly reproduce existing results. There were some subtle diffs in the handling of missing data and points off the grid.
JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
…ikely a way to make the memory usage more efficient, but it'll require a tweak to the logic.
JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
@JohnHalleyGotway
Collaborator Author

JohnHalleyGotway commented Nov 3, 2023

@DanielAdriaansen, on 11/2/23, we discussed some additional refinements to MODE with the goal of minimizing unnecessary memory allocation. Currently, the fuzzy logic engine allocates memory for 1 copy of the entire domain for each forecast and observation object. In your case, running with 75 forecast and 75 observation objects on a 3km CONUS HRRR domain, that adds up to a lot of memory.

It was a little more involved than I expected, but the basic change we discussed is now in place on the feature_2724_mode_openmp branch. This GHA testing workflow ran without error and flagged no diffs (i.e. no red 'X'). Please re-run your test with this version of MODE:
seneca:/d1/personal/johnhg/MET/MET_development/MET-feature_2724_mode_openmp/bin/mode

Your run on 11/2/23 took just under 10 minutes to complete. Please let me know what the new runtime is.

I'll note that there is additional memory allocation done in the double-threshold merging step that could probably also be eliminated. I have an idea of how we could use STL maps to keep track of the simple object ids falling inside the merge objects and vice-versa. That should provide the information needed without allocating so many copies of the grid.
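Purely as a sketch of that idea (hypothetical names, not the planned implementation), the bookkeeping could be built up during a single pass over the grid instead of storing an extra copy of the grid per object:

    #include <map>
    #include <set>

    // Hypothetical bookkeeping for double-threshold merging:
    // which simple object ids fall inside each merge object, and vice versa.
    std::map<int, std::set<int>> simple_ids_by_merge;   // merge id  -> simple ids
    std::map<int, int>           merge_id_by_simple;    // simple id -> merge id

    // Called wherever a grid point has both a simple object id and a
    // merge object id defined.
    void record_overlap(int simple_id, int merge_id) {
       simple_ids_by_merge[merge_id].insert(simple_id);
       merge_id_by_simple[simple_id] = merge_id;
    }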

@DanielAdriaansen
Contributor

@JohnHalleyGotway I tested using
/d1/personal/johnhg/MET/MET_development/MET-feature_2724_mode_openmp/bin/mode with the same command as yesterday. Assuming that path is correct and the executable is updated, here are the results:

Yesterday:


real	9m57.844s
user	9m49.903s
sys	0m7.912s

This morning:

real	10m10.714s
user	10m4.256s
sys	0m6.404s

@JohnHalleyGotway
Collaborator Author

JohnHalleyGotway commented Nov 3, 2023

@DanielAdriaansen thanks for re-testing and passing along the test you're using on seneca. I suspect the slowness is ultimately caused by MODE looping over the input domain many, many, many times. Here are some thoughts.

  • Add some print statements and observe the slowness between them.
    • Your test case has 75 tiny forecast objects compared to the same set of 75 tiny observation objects.
    • For each forecast object, MODE selects the object (looping over the grid) and then sets the single object attributes (looping over the grid at least once more).
    • The set step is much slower than the select step.
    • After doing this for all the forecast objects, it does the same for the observation objects. And then does it for all the forecast clusters and observation clusters. In this case the clusters are very similar to the simple objects, but are slightly different.
    • The point is that we're doing a kinda slow thing 4 times... which adds up to 10 minutes of runtime.
  • Run your test command through the gprof profiler to identify big offenders.
    • Here are the results: gprof_output.txt
    • Over 50% of the run time is spent accessing data in calls to DataPlane::get(int, int), DataPlane::two_to_one(int, int, bool), and ShapeData::s_is_on(int, int, bool). The two_to_one and s_is_on functions range check x/y against Nx/Ny every single time, which slows them down. When we're accessing data inside a loop over Nx and Ny, range checking isn't necessary. Recommend avoiding the range checking wherever it is possible and safe by accessing the data directly through the buf() buffer instead (see the sketch after this list).
  • The DataPlane::sdebug_examine() function loops over the entire grid and can be slow.
    • Could replace log messages like this:
    mlog << Debug(4) << " Before fcst convolution: " << fcst_conv->sdebug_examine() << "\n";

    with this:
    if(mlog.verbosity_level() >= 4) mlog << Debug(4) << " Before fcst convolution: " << fcst_conv->sdebug_examine() << "\n";

    • That would avoid the computational cost we're currently incurring.
  • The ShapeData::n_objects() function calls split(), which loops over the entire grid and can be slow.
    • Could eliminate calls to n_objects() or remove that function entirely to prevent its use.
  • Most of the time MODE loops over a grid that is very sparse, often only containing a single object. We could potentially add logic (maybe in the select function) to keep track of the min/max x/y values that contain the object. Rather than looping over the entire grid, we'd only need to loop over the smaller extent of x/y values. But that'd be pretty involved.
  • More to be added.
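As referenced in the range-checking bullet above, here is a hedged sketch of the kind of change being suggested; the buffer layout and names are simplified stand-ins for the actual DataPlane/ShapeData interfaces:

    #include <vector>

    // Illustrative only: when the loop bounds already guarantee valid indices,
    // read the underlying buffer directly instead of a range-checked get(x, y).
    double sum_grid(const std::vector<double> &buf, int nx, int ny) {
       double sum = 0.0;
       for(int y = 0; y < ny; y++) {
          const double *row = buf.data() + static_cast<size_t>(y) * nx;  // contiguous row of nx values
          for(int x = 0; x < nx; x++) sum += row[x];
       }
       return sum;
    }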

JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
…were computing NMEP outputs. That was removed from ensemble-stat in MET version 11.1 but the OpenMP setup remained there. This removes it from ensemble-stat and updates the documentation to accurately indicate that OpenMP currently applies to gen-ens-prod, grid-stat, and now mode.
@TaraJensen added the reporting: DTC NOAA R2O label and removed the alert: NEED ACCOUNT KEY label Nov 3, 2023
JohnHalleyGotway added a commit that referenced this issue Nov 3, 2023
@JohnHalleyGotway removed the alert: NEED CYCLE ASSIGNMENT label Nov 3, 2023
@JohnHalleyGotway
Collaborator Author

Good news. This GHA run flagged no diffs. So my reimplementation of the double-thresholding to minimize memory use works.

JohnHalleyGotway added a commit that referenced this issue Nov 6, 2023
…ions to be more efficient by accessing the vector of data rather than the slower get(x,y) data accessor function.
JohnHalleyGotway added a commit that referenced this issue Nov 6, 2023
JohnHalleyGotway added a commit that referenced this issue Nov 6, 2023
…bosity level to avoid unnecessary loops through the data. Note that all calls to the logger actually create the log message, and the logger then decides whether or not to print it. Wrapping expensive debugging log messages in a verbosity-level check is more efficient.
@JohnHalleyGotway linked a pull request Nov 6, 2023 that will close this issue
JohnHalleyGotway added a commit that referenced this issue Nov 7, 2023
… diff a bit more efficient by accessing the data() array directly rather than range-checking with the data(x,y) accessor function.
JohnHalleyGotway added a commit that referenced this issue Nov 7, 2023
@JohnHalleyGotway changed the title from "Enhance MODE to use OpenMP for efficient computation of the convolution step" to "Enhance MODE to use OpenMP to make the convolution step faster" Nov 17, 2023
@bonnystrong

When this is running on HPC, please contact Jeff Duda at GSL to test performance.

@jprestop
Collaborator

jprestop commented Dec 4, 2023

@bonnystrong Which HPC would Jeff use to test performance?

@bonnystrong

bonnystrong commented Dec 4, 2023 via email
