use vendored version of cupy.pad with added performance optimizations #482

Merged: 3 commits, Feb 2, 2023

grlee77 (Contributor) commented Jan 24, 2023

Overview

This version provides faster, elementwise kernel implementations for common padding modes.

It lives under _vendored because most of pad.py is copied from CuPy itself. The only new parts there are the _use_elementwise_kernel utility and the conditional branch taken when it evaluates to True. The newly written code is mostly in pad_elementwise.py.

I could potentially further refactor pad.py to remove most of the code and just call out to cupy.pad instead whenever we aren't using the elementwise kernels.

This version should also be submitted upstream to CuPy itself.

Padding performance is substantially improved for modes edge, symmetric, reflect and wrap. In most places where cuCIM uses padding, it is not the bottleneck, but the change should still provide a small performance improvement in several places. I ran some benchmarks, and the largest impact I saw was around a 25% reduction in run time for chan_vese.
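For intuition, each of these four modes amounts to a rule that maps an output index to an input index along an axis, which is what allows one output element to be computed independently of the others. A minimal pure-Python sketch of those rules for a single axis follows. `source_index` and `pad_1d` are hypothetical names, this is not the code in pad_elementwise.py, and it assumes the default (even) reflection behaviour of `numpy.pad`.

```python
def source_index(i, n, pad, mode):
    """Map output index i to an input index for one axis of length n."""
    j = i - pad  # position relative to the start of the unpadded data
    if mode == "edge":
        return min(max(j, 0), n - 1)  # clamp to the nearest border
    if mode == "wrap":
        return j % n  # periodic repetition
    if mode == "symmetric":
        j %= 2 * n  # mirror that repeats the border element
        return j if j < n else 2 * n - 1 - j
    if mode == "reflect":
        if n == 1:
            return 0
        period = 2 * (n - 1)  # mirror that does not repeat the border
        j %= period
        return j if j < n else period - j
    raise ValueError(f"unsupported mode: {mode}")


def pad_1d(values, pad, mode):
    """Pad a list on both sides by looking up each output element."""
    n = len(values)
    return [values[source_index(i, n, pad, mode)] for i in range(n + 2 * pad)]


data = [1, 2, 3, 4, 5]
for mode in ("edge", "symmetric", "reflect", "wrap"):
    print(mode, pad_1d(data, 2, mode))
# edge [1, 1, 1, 2, 3, 4, 5, 5, 5]
# symmetric [2, 1, 1, 2, 3, 4, 5, 5, 4]
# reflect [3, 2, 1, 2, 3, 4, 5, 4, 3]
# wrap [4, 5, 1, 2, 3, 4, 5, 1, 2]
```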

Benchmark Results (vs. cupy.pad)

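In the table below, the next-to-last column ("accel.") is the overall acceleration observed, i.e. the old duration divided by the new one. It is large for small 2D or 3D images (>5x) and becomes relatively small for larger images (e.g. ~10% for C-ordered 4k images and ~3% for F-ordered ones). For two of the (256, 256, 256) cases with pad_width 16 (symmetric and wrap), the new implementation is slightly slower (accel. < 1).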

The final column only relates to the amount of time spent on the host. That "accel. CPU" number always strongly favors the new implementation. It has lower host overhead because everything is done in a single kernel call rather than potentially using multiple kernels for each axis in turn. This kernel launch overhead explains why the overall benefit is much higher for the smaller image sizes.
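To illustrate the difference between padding one axis at a time and doing everything in a single pass, here is a small CPU-only sketch using nested lists and edge mode only. The function names are hypothetical and this is not the CUDA kernel from the PR. On the GPU, each pass in the first function would correspond to at least one kernel launch plus an intermediate array, whereas the second function does one lookup per output element.

```python
def clamp(j, n):
    """Clamp index j into the valid range [0, n - 1] (edge mode)."""
    return min(max(j, 0), n - 1)


def pad_axis_by_axis(image, pad):
    """Two passes: pad along axis 1, then pad the result along axis 0."""
    # Pass 1 (axis 1) produces an intermediate, wider image.
    widened = [
        [row[clamp(c - pad, len(row))] for c in range(len(row) + 2 * pad)]
        for row in image
    ]
    # Pass 2 (axis 0) produces the final image from the intermediate one.
    return [
        list(widened[clamp(r - pad, len(widened))])
        for r in range(len(widened) + 2 * pad)
    ]


def pad_single_pass(image, pad):
    """One pass: every output element looks up its source directly."""
    rows, cols = len(image), len(image[0])
    return [
        [image[clamp(r - pad, rows)][clamp(c - pad, cols)]
         for c in range(cols + 2 * pad)]
        for r in range(rows + 2 * pad)
    ]


image = [[1, 2], [3, 4]]
assert pad_axis_by_axis(image, 1) == pad_single_pass(image, 1)
print(pad_single_pass(image, 1))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```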

| shape | pad_width | dtype | mode | order | duration, old (ms) | duration, new (ms) | accel. | accel. CPU |
|---|---|---|---|---|---|---|---|---|
| (256, 256) | 2 | uint8 | edge | C | 0.1278 | 0.0230 | 5.563 | 6.298 |
| (256, 256) | 2 | uint8 | symmetric | C | 0.1286 | 0.0230 | 5.583 | 6.268 |
| (256, 256) | 2 | uint8 | reflect | C | 0.1294 | 0.0236 | 5.479 | 6.165 |
| (256, 256) | 2 | uint8 | wrap | C | 0.1246 | 0.0228 | 5.468 | 6.149 |
| (256, 256) | 16 | uint8 | edge | C | 0.1276 | 0.0229 | 5.563 | 6.269 |
| (256, 256) | 16 | uint8 | symmetric | C | 0.1305 | 0.0231 | 5.645 | 6.366 |
| (256, 256) | 16 | uint8 | reflect | C | 0.1300 | 0.0235 | 5.539 | 6.220 |
| (256, 256) | 16 | uint8 | wrap | C | 0.1270 | 0.0228 | 5.568 | 6.268 |
| (256, 256) | 2 | uint8 | edge | F | 0.1300 | 0.0234 | 5.567 | 6.281 |
| (256, 256) | 2 | uint8 | symmetric | F | 0.1291 | 0.0236 | 5.471 | 6.157 |
| (256, 256) | 2 | uint8 | reflect | F | 0.1294 | 0.0238 | 5.427 | 6.080 |
| (256, 256) | 2 | uint8 | wrap | F | 0.1254 | 0.0234 | 5.363 | 6.043 |
| (256, 256) | 16 | uint8 | edge | F | 0.1279 | 0.0232 | 5.506 | 6.315 |
| (256, 256) | 16 | uint8 | symmetric | F | 0.1294 | 0.0236 | 5.472 | 6.319 |
| (256, 256) | 16 | uint8 | reflect | F | 0.1300 | 0.0239 | 5.434 | 6.262 |
| (256, 256) | 16 | uint8 | wrap | F | 0.1262 | 0.0238 | 5.310 | 6.134 |
| (1024, 1024) | 2 | uint8 | edge | C | 0.1279 | 0.0255 | 5.020 | 6.287 |
| (1024, 1024) | 2 | uint8 | symmetric | C | 0.1285 | 0.0258 | 4.980 | 6.259 |
| (1024, 1024) | 2 | uint8 | reflect | C | 0.1286 | 0.0263 | 4.888 | 6.118 |
| (1024, 1024) | 2 | uint8 | wrap | C | 0.1253 | 0.0255 | 4.905 | 6.170 |
| (1024, 1024) | 16 | uint8 | edge | C | 0.1277 | 0.0258 | 4.947 | 6.270 |
| (1024, 1024) | 16 | uint8 | symmetric | C | 0.1286 | 0.0261 | 4.931 | 6.296 |
| (1024, 1024) | 16 | uint8 | reflect | C | 0.1280 | 0.0264 | 4.845 | 6.132 |
| (1024, 1024) | 16 | uint8 | wrap | C | 0.1249 | 0.0260 | 4.798 | 6.095 |
| (1024, 1024) | 2 | uint8 | edge | F | 0.1289 | 0.0581 | 2.217 | 6.084 |
| (1024, 1024) | 2 | uint8 | symmetric | F | 0.1304 | 0.0586 | 2.227 | 6.064 |
| (1024, 1024) | 2 | uint8 | reflect | F | 0.1331 | 0.0590 | 2.257 | 6.059 |
| (1024, 1024) | 2 | uint8 | wrap | F | 0.1278 | 0.0586 | 2.180 | 5.994 |
| (1024, 1024) | 16 | uint8 | edge | F | 0.1299 | 0.0604 | 2.149 | 6.238 |
| (1024, 1024) | 16 | uint8 | symmetric | F | 0.1315 | 0.0607 | 2.168 | 6.255 |
| (1024, 1024) | 16 | uint8 | reflect | F | 0.1309 | 0.0614 | 2.133 | 6.070 |
| (1024, 1024) | 16 | uint8 | wrap | F | 0.1275 | 0.0606 | 2.103 | 6.105 |
| (4096, 4096) | 2 | uint8 | edge | C | 0.1291 | 0.1143 | 1.130 | 6.202 |
| (4096, 4096) | 2 | uint8 | symmetric | C | 0.1296 | 0.1132 | 1.145 | 6.183 |
| (4096, 4096) | 2 | uint8 | reflect | C | 0.1295 | 0.1151 | 1.125 | 6.064 |
| (4096, 4096) | 2 | uint8 | wrap | C | 0.1266 | 0.1138 | 1.112 | 6.029 |
| (4096, 4096) | 16 | uint8 | edge | C | 0.1295 | 0.1157 | 1.119 | 6.212 |
| (4096, 4096) | 16 | uint8 | symmetric | C | 0.1301 | 0.1150 | 1.131 | 6.208 |
| (4096, 4096) | 16 | uint8 | reflect | C | 0.1302 | 0.1168 | 1.115 | 6.088 |
| (4096, 4096) | 16 | uint8 | wrap | C | 0.1272 | 0.1153 | 1.103 | 6.065 |
| (4096, 4096) | 2 | uint8 | edge | F | 0.6624 | 0.6433 | 1.030 | 6.228 |
| (4096, 4096) | 2 | uint8 | symmetric | F | 0.6639 | 0.6438 | 1.031 | 6.133 |
| (4096, 4096) | 2 | uint8 | reflect | F | 0.6640 | 0.6441 | 1.031 | 6.003 |
| (4096, 4096) | 2 | uint8 | wrap | F | 0.6638 | 0.6454 | 1.028 | 6.037 |
| (4096, 4096) | 16 | uint8 | edge | F | 0.6909 | 0.6713 | 1.029 | 6.318 |
| (4096, 4096) | 16 | uint8 | symmetric | F | 0.6915 | 0.6717 | 1.029 | 6.229 |
| (4096, 4096) | 16 | uint8 | reflect | F | 0.6919 | 0.6724 | 1.029 | 6.082 |
| (4096, 4096) | 16 | uint8 | wrap | F | 0.6923 | 0.6720 | 1.030 | 6.136 |
| (40, 40, 40) | 2 | uint8 | edge | C | 0.2057 | 0.0239 | 8.610 | 9.765 |
| (40, 40, 40) | 2 | uint8 | symmetric | C | 0.2014 | 0.0241 | 8.357 | 9.450 |
| (40, 40, 40) | 2 | uint8 | reflect | C | 0.1999 | 0.0245 | 8.169 | 9.227 |
| (40, 40, 40) | 2 | uint8 | wrap | C | 0.1969 | 0.0237 | 8.299 | 9.405 |
| (40, 40, 40) | 16 | uint8 | edge | C | 0.2028 | 0.0235 | 8.633 | 9.760 |
| (40, 40, 40) | 16 | uint8 | symmetric | C | 0.2000 | 0.0255 | 7.844 | 9.502 |
| (40, 40, 40) | 16 | uint8 | reflect | C | 0.1988 | 0.0250 | 7.946 | 9.339 |
| (40, 40, 40) | 16 | uint8 | wrap | C | 0.1948 | 0.0248 | 7.871 | 9.371 |
| (40, 40, 40) | 2 | uint8 | edge | F | 0.1980 | 0.0248 | 7.994 | 9.322 |
| (40, 40, 40) | 2 | uint8 | symmetric | F | 0.1963 | 0.0250 | 7.840 | 9.159 |
| (40, 40, 40) | 2 | uint8 | reflect | F | 0.1952 | 0.0253 | 7.729 | 8.985 |
| (40, 40, 40) | 2 | uint8 | wrap | F | 0.1898 | 0.0251 | 7.567 | 8.847 |
| (40, 40, 40) | 16 | uint8 | edge | F | 0.1997 | 0.0331 | 6.035 | 9.161 |
| (40, 40, 40) | 16 | uint8 | symmetric | F | 0.1964 | 0.0349 | 5.622 | 8.393 |
| (40, 40, 40) | 16 | uint8 | reflect | F | 0.1967 | 0.0339 | 5.793 | 8.808 |
| (40, 40, 40) | 16 | uint8 | wrap | F | 0.1924 | 0.0334 | 5.762 | 8.909 |
| (100, 100, 100) | 2 | uint8 | edge | C | 0.2042 | 0.0288 | 7.101 | 9.676 |
| (100, 100, 100) | 2 | uint8 | symmetric | C | 0.1994 | 0.0317 | 6.294 | 9.334 |
| (100, 100, 100) | 2 | uint8 | reflect | C | 0.2007 | 0.0302 | 6.634 | 9.287 |
| (100, 100, 100) | 2 | uint8 | wrap | C | 0.1946 | 0.0308 | 6.315 | 9.179 |
| (100, 100, 100) | 16 | uint8 | edge | C | 0.2023 | 0.0369 | 5.483 | 9.468 |
| (100, 100, 100) | 16 | uint8 | symmetric | C | 0.2012 | 0.0451 | 4.465 | 9.197 |
| (100, 100, 100) | 16 | uint8 | reflect | C | 0.2006 | 0.0411 | 4.886 | 9.073 |
| (100, 100, 100) | 16 | uint8 | wrap | C | 0.1958 | 0.0429 | 4.561 | 9.060 |
| (100, 100, 100) | 2 | uint8 | edge | F | 0.1996 | 0.0630 | 3.167 | 9.158 |
| (100, 100, 100) | 2 | uint8 | symmetric | F | 0.1962 | 0.0636 | 3.084 | 8.816 |
| (100, 100, 100) | 2 | uint8 | reflect | F | 0.1957 | 0.0638 | 3.068 | 8.725 |
| (100, 100, 100) | 2 | uint8 | wrap | F | 0.1908 | 0.0636 | 2.999 | 8.719 |
| (100, 100, 100) | 16 | uint8 | edge | F | 0.2041 | 0.0995 | 2.052 | 9.048 |
| (100, 100, 100) | 16 | uint8 | symmetric | F | 0.2055 | 0.1039 | 1.978 | 8.858 |
| (100, 100, 100) | 16 | uint8 | reflect | F | 0.2040 | 0.1038 | 1.965 | 8.821 |
| (100, 100, 100) | 16 | uint8 | wrap | F | 0.1989 | 0.1071 | 1.858 | 8.757 |
| (256, 256, 256) | 2 | uint8 | edge | C | 0.2063 | 0.1495 | 1.380 | 9.652 |
| (256, 256, 256) | 2 | uint8 | symmetric | C | 0.2065 | 0.1613 | 1.280 | 9.647 |
| (256, 256, 256) | 2 | uint8 | reflect | C | 0.2055 | 0.1540 | 1.334 | 9.328 |
| (256, 256, 256) | 2 | uint8 | wrap | C | 0.1997 | 0.1569 | 1.273 | 9.326 |
| (256, 256, 256) | 16 | uint8 | edge | C | 0.2090 | 0.1973 | 1.060 | 9.704 |
| (256, 256, 256) | 16 | uint8 | symmetric | C | 0.2113 | 0.2419 | 0.873 | 9.573 |
| (256, 256, 256) | 16 | uint8 | reflect | C | 0.2131 | 0.2124 | 1.003 | 9.351 |
| (256, 256, 256) | 16 | uint8 | wrap | C | 0.2076 | 0.2311 | 0.899 | 9.410 |
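
As a reading aid for the "accel." column, the sketch below shows how a speedup factor relates to a percentage reduction in run time, using the displayed durations from the first row. The helper names are hypothetical. The result differs slightly from the table's 5.563, presumably because the table was computed from unrounded timings.

```python
def acceleration(old_ms, new_ms):
    """Speedup factor: how many times faster the new timing is."""
    return old_ms / new_ms


def runtime_reduction_percent(old_ms, new_ms):
    """Percentage of the old run time that was saved."""
    return 100.0 * (old_ms - new_ms) / old_ms


# First table row: (256, 256), pad_width 2, edge, C order.
old_ms, new_ms = 0.1278, 0.0230
print(round(acceleration(old_ms, new_ms), 2))            # 5.56
print(round(runtime_reduction_percent(old_ms, new_ms)))  # 82

# A value below 1 means the new implementation was slower, as in the
# (256, 256, 256), pad_width 16, symmetric row.
print(acceleration(0.2113, 0.2419) < 1)                  # True
```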

@grlee77 grlee77 added improvement Improves an existing functionality non-breaking Introduces a non-breaking change performance Performance improvement labels Jan 24, 2023
@grlee77 grlee77 requested a review from a team as a code owner January 24, 2023 18:33
@gigony gigony added this to the v23.02.00 milestone Jan 24, 2023
codecov-commenter commented Feb 1, 2023

Codecov Report

Base: 92.95% // Head: 92.89% // Decreases project coverage by -0.06% ⚠️

Coverage data is based on head (0c84ca2) compared to base (a8d9690).
Patch coverage: 90.32% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-23.02     #482      +/-   ##
================================================
- Coverage         92.95%   92.89%   -0.06%     
================================================
  Files               130      131       +1     
  Lines              9775     9905     +130     
================================================
+ Hits               9086     9201     +115     
- Misses              689      704      +15     
Impacted Files Coverage Δ
python/cucim/src/cucim/skimage/color/colorconv.py 97.97% <ø> (ø)
...thon/cucim/src/cucim/skimage/filters/_fft_based.py 92.00% <50.00%> (+0.16%) ⬆️
...on/cucim/src/cucim/skimage/filters/_median_hist.py 81.40% <83.33%> (-0.22%) ⬇️
...cucim/src/cucim/skimage/measure/_colocalization.py 86.20% <86.20%> (ø)
...m/skimage/registration/_phase_cross_correlation.py 94.82% <91.48%> (-1.69%) ⬇️
python/cucim/src/cucim/skimage/filters/_median.py 81.96% <91.66%> (+0.48%) ⬆️
...im/src/cucim/core/operations/morphology/_pba_2d.py 92.08% <100.00%> (+0.03%) ⬆️
...im/src/cucim/core/operations/morphology/_pba_3d.py 95.79% <100.00%> (+0.01%) ⬆️
python/cucim/src/cucim/skimage/_shared/utils.py 77.40% <100.00%> (+0.58%) ⬆️
...hon/cucim/src/cucim/skimage/exposure/_adapthist.py 97.45% <100.00%> (+0.02%) ⬆️
... and 11 more

gigony (Contributor) left a comment

Looks good to me!

There is a typo but I think it can be corrected later if we don't want to spend extra time triggering CI/CD again.

python/cucim/src/cucim/skimage/feature/tests/test_blob.py (review thread: outdated, resolved)
This version provides faster, elementwise kernel implementations for common padding modes.
This version should also be submitted upstream to CuPy itself.
apply isort

fix typo
@jakirkham (Member)

/merge

@rapids-bot rapids-bot bot merged commit 7fd07d0 into rapidsai:branch-23.02 Feb 2, 2023