-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Intrinsics: Faster cuberoot scaling functions #557
Intrinsics: Faster cuberoot scaling functions #557
Conversation
00fd227
to
4927caf
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## dev/gfdl #557 +/- ##
============================================
+ Coverage 37.20% 37.22% +0.02%
============================================
Files 271 271
Lines 80355 80384 +29
Branches 14985 14985
============================================
+ Hits 29897 29925 +28
Misses 44902 44902
- Partials 5556 5557 +1 ☔ View full report in Codecov by Sentry. |
4927caf
to
0cadb77
Compare
Some comments from @Hallberg-NOAA
|
I added two commits which address the requested changes:
The second will probably change answers (again), and it does add some more weird variables to hold the cube values, so we can decide if it's really worth the fight. |
5781236
to
2b35272
Compare
I forgot to add the test case that we talked about. I also added @Hallberg-NOAA as a coauthor to the rescale refactor commit. |
Good thing I added those tests... looking into it now. |
2b35272
to
6e4d5c1
Compare
This seems to be passing now, with no regression on GitHub. (Consistent since GNU seems to always do |
I think that we need to add one more test of the cuberoot function - of |
be4f2c2
to
cc36692
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have examined these changes, and (to the extent that I can follow all of the clever bit-manipulations) I am convinced that they are correct. Moreover, I think that we now have pretty thorough testing of this new capability that would detect anything that is not working as intended. I am therefore approving this code.
I am still a little nervous about whether our assumptions that we will have 0.5 <= root_asx < 1.0
on line 112, or whether there is a tiny (of order 1 in 10^16) chance that we would actually generate some case where root_asx == 1.0
, and hence that our a our assumption that the base-2 exponent of root_asx
is always -1 would be wrong, and that we do not have any code to handle this case. Adding such a test should not be too hard (previous version of the code did add the extra test), but it would make this new routine slower. I think that we have done our best to test for this, and it appears not to occur, but I am still nervous anyway.
FWIW there is an opportunity to write the exponent as |
You were right: This value: Ed: We can write it in the test as |
I am withdrawing the approval in light of the further investigations showing that the current code is not quite right for specific values very near to 1.
cc36692
to
3f93358
Compare
I've addressed the I also changed the unit test so that it's actually capable of detecting this problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am now happy to approve this valuable and thorough PR without any further concerns.
This patch replaces the intrinsic-based exponent rescaling with explicit bit manipulation of the floating point number. This appears to produce a ~2.5x speedup of the solver, reducing its time from embarassingly slow to disappointingly slow. It is slightly faster than the GNU cbrt function, but still about 3x slower than the Intel SVML cbrt function. Timings (s) (16M array, -O3 -mavx -mfma) | Solver | -O2 | -O3 | |---------------------|-------|-------| | GNU x**1/3 | 0.225 | 0.198 | | GNU cuberoot before | 0.418 | 0.412 | | GNU cuberoot after | 0.208 | 0.187 | | Intel x**1/3 | 0.068 | 0.067 | | Intel before | 0.514 | 0.507 | | Intel after | 0.213 | 0.189 | At least one issue here is that Intel SVML is using fast vectorized logic operators whereas the Fortran intrinsics are replaced with slower legacy scalar versions. Not sure there is much we could even do about that without complaining to vendors. Also, I'm sure there's magic in their solvers which we are not capturing. Regardless, I think this is a major improvement. I do not believe it will change answers, but probably a good idea to verify this and get it in before committing any solutions using cuberoot().
Some modifications were made to the cuberoot rescale and descale functions: * The machine parameters were moved from function to module parameters. This could dangerously expose them to other functions, but it prevents multiple definitions of the same numbers. * The exponent is now cube-rooted in rescale rather than descale. * The exponent expressions are broken into more explicit steps, rather than combining multiple steps and assumptions into a single expression. * The bias is no longer assumed to be a multiple of three. This is true for double precision but not single precision. A new test of quasi-random number was also added to the cuberoot test suite. These numbers were able to detect the differences in GNU and Intel compiler output. A potential error in the return value of the test was also fixed. The volatile test of 1 - 0.5*ULP has been added. The cube root of this value rounds to 1, and needs to be handled carefully. The unit test function `cuberoot(v**3)` was reversed to `cuberoot(v)**`, to include testing of this value. (Cubing would wipe out the anomaly.)
In separate testing, we observed that Intel would use the `pow()` function to evaluate the cubes of some numbers, causing different answers with GNU. In this patch, I replace the cubic x**3 operations with explicit x*x*x multiplication, which appears to avoid this substitution. Well, for the moment, at least.
3f93358
to
da9dd8a
Compare
This PR has passed pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/22161. |
This patch replaces the intrinsic-based exponent rescaling with explicit bit manipulation of the floating point number.
This appears to produce a ~2.5x speedup of the solver, reducing its time from embarassingly slow to disappointingly slow. It is slightly faster than the GNU cbrt function, but still about 3x slower than the Intel SVML cbrt function.
Timings (s) (16M array, -O3 -mavx -mfma, obviously many details omitted...)
At least one issue here is that Intel SVML is using fast vectorized logic operators whereas the Fortran intrinsics are replaced with slower legacy scalar versions. Not sure there is much we could even do about that without complaining to vendors.
Also, I'm sure there's magic in their solvers which we are not capturing. Regardless, I think this is a major improvement.
I do not believe it will change answers, but probably a good idea to verify this and get it in before committing any solutions using cuberoot().