use vendored version of cupy.pad with added performance optimizations #482

grlee77 · 2023-01-24T18:33:21Z

Overview

This version provides faster, elementwise kernel implementations for common padding modes.

It is under _vendored because most of pad.py is copied from CuPy itself. The only new part there is the _use_elementwise_kernel utility and the conditional branch where it evaluates to True. The newly written code is mostly in pad_elementwise.py.

I could potentially further refactor pad.py to remove most of the code and just call out to cupy.pad instead whenever we aren't using the elementwise kernels.
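The dispatch described above can be sketched roughly as follows. This is only an illustration: `use_elementwise_kernel`, `make_pad`, `ELEMENTWISE_MODES` and the rule "any extra keyword argument falls back" are hypothetical stand-ins, not the actual contents of `pad.py` or the real criteria used by `_use_elementwise_kernel`.

```python
# Hypothetical sketch of the fast-path / fallback dispatch. The names and
# the selection rule are assumptions made for illustration only.
ELEMENTWISE_MODES = frozenset({"edge", "symmetric", "reflect", "wrap"})


def use_elementwise_kernel(mode, has_extra_kwargs=False):
    """Return True when the single-kernel path is assumed to apply."""
    return mode in ELEMENTWISE_MODES and not has_extra_kwargs


def make_pad(fast_path, fallback):
    """Build a pad() that picks the fast path when it is assumed to apply."""

    def pad(array, pad_width, mode="constant", **kwargs):
        if use_elementwise_kernel(mode, has_extra_kwargs=bool(kwargs)):
            # One kernel launch handles every axis at once.
            return fast_path(array, pad_width, mode)
        # Everything else goes to the generic implementation.
        return fallback(array, pad_width, mode, **kwargs)

    return pad
```

With a structure like this, the refactoring mentioned above would amount to passing `cupy.pad` as `fallback` instead of keeping a copied implementation.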

This version should also be submitted upstream to CuPy itself.

Padding performance is substantially improved for the modes edge, symmetric, reflect and wrap. In most places where cuCIM uses padding it is not the bottleneck, but the change should still provide a small performance improvement in several places. In the benchmarks I ran, the largest impact was a roughly 25% reduction in run time for chan_vese.
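To make the four modes concrete, here is a minimal pure-Python sketch of the idea behind an elementwise formulation: each output element independently computes which input element it reads. It is not the CUDA code in `pad_elementwise.py`; `source_index` and `pad_1d` are hypothetical names, and only one axis is shown.

```python
def source_index(j, n, mode):
    """Map coordinate j (relative to the unpadded axis) into [0, n)."""
    if mode == "edge":
        # Clamp to the nearest valid index.
        return min(max(j, 0), n - 1)
    if mode == "wrap":
        # Periodic extension; Python's % is non-negative for n > 0.
        return j % n
    if mode == "symmetric":
        # Mirror about the array boundary, repeating the edge sample.
        k = j % (2 * n)
        return k if k < n else 2 * n - 1 - k
    if mode == "reflect":
        # Mirror about the edge sample, without repeating it.
        if n == 1:
            return 0
        period = 2 * (n - 1)
        k = j % period
        return k if k < n else period - k
    raise ValueError(f"unsupported mode: {mode!r}")


def pad_1d(values, before, after, mode):
    """Pad a 1D sequence by looking up one source index per output element."""
    n = len(values)
    return [values[source_index(i - before, n, mode)]
            for i in range(n + before + after)]


for mode in ("edge", "symmetric", "reflect", "wrap"):
    print(mode, pad_1d([1, 2, 3, 4], 2, 2, mode))
```

For `[1, 2, 3, 4]` padded by 2 on each side this gives `[1, 1, 1, 2, 3, 4, 4, 4]` (edge), `[2, 1, 1, 2, 3, 4, 4, 3]` (symmetric), `[3, 2, 1, 2, 3, 4, 3, 2]` (reflect) and `[3, 4, 1, 2, 3, 4, 1, 2]` (wrap). Because every output element is independent, the same lookup can be applied along all axes inside a single kernel.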

Benchmark Results (vs. `cupy.pad`)

In the following, the next-to-last column is the overall acceleration observed. It is large for small 2D or 3D images (>5x) and becomes relatively small for larger images (e.g. ~10% for 4k images).

The final column only relates to the amount of time spent on the host. That "accel. CPU" number always strongly favors the new implementation. It has lower host overhead because everything is done in a single kernel call rather than potentially using multiple kernels for each axis in turn. This kernel launch overhead explains why the overall benefit is much higher for the smaller image sizes.
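As a reading aid for the table, the short sketch below recomputes the overall acceleration and the corresponding run-time reduction from two rows. It assumes that "accel." is the old duration divided by the new one, which is consistent with the listed values; the helper names are hypothetical.

```python
def speedup(old_ms, new_ms):
    """Acceleration factor, assumed to be old duration / new duration."""
    return old_ms / new_ms


def runtime_reduction_percent(old_ms, new_ms):
    """Percentage of the old run time that is saved."""
    return 100.0 * (1.0 - new_ms / old_ms)


# (old ms, new ms) copied from two rows of the table: pad_width 2, edge, C order.
rows = {
    "(256, 256)": (0.1278, 0.0230),
    "(4096, 4096)": (0.1291, 0.1143),
}

for shape, (old_ms, new_ms) in rows.items():
    print(f"{shape}: {speedup(old_ms, new_ms):.2f}x, "
          f"{runtime_reduction_percent(old_ms, new_ms):.1f}% less time")
```

The recomputed factors can differ from the table in the last digits, presumably because the table was generated from unrounded timings.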

shape	pad_width	dtype	mode	order	duration, old (ms)	duration, new (ms)	accel.	accel. CPU
(256, 256)	2	uint8	edge	C	0.1278	0.0230	5.563	6.298
(256, 256)	2	uint8	symmetric	C	0.1286	0.0230	5.583	6.268
(256, 256)	2	uint8	reflect	C	0.1294	0.0236	5.479	6.165
(256, 256)	2	uint8	wrap	C	0.1246	0.0228	5.468	6.149
(256, 256)	16	uint8	edge	C	0.1276	0.0229	5.563	6.269
(256, 256)	16	uint8	symmetric	C	0.1305	0.0231	5.645	6.366
(256, 256)	16	uint8	reflect	C	0.1300	0.0235	5.539	6.220
(256, 256)	16	uint8	wrap	C	0.1270	0.0228	5.568	6.268
(256, 256)	2	uint8	edge	F	0.1300	0.0234	5.567	6.281
(256, 256)	2	uint8	symmetric	F	0.1291	0.0236	5.471	6.157
(256, 256)	2	uint8	reflect	F	0.1294	0.0238	5.427	6.080
(256, 256)	2	uint8	wrap	F	0.1254	0.0234	5.363	6.043
(256, 256)	16	uint8	edge	F	0.1279	0.0232	5.506	6.315
(256, 256)	16	uint8	symmetric	F	0.1294	0.0236	5.472	6.319
(256, 256)	16	uint8	reflect	F	0.1300	0.0239	5.434	6.262
(256, 256)	16	uint8	wrap	F	0.1262	0.0238	5.310	6.134
(1024, 1024)	2	uint8	edge	C	0.1279	0.0255	5.020	6.287
(1024, 1024)	2	uint8	symmetric	C	0.1285	0.0258	4.980	6.259
(1024, 1024)	2	uint8	reflect	C	0.1286	0.0263	4.888	6.118
(1024, 1024)	2	uint8	wrap	C	0.1253	0.0255	4.905	6.170
(1024, 1024)	16	uint8	edge	C	0.1277	0.0258	4.947	6.270
(1024, 1024)	16	uint8	symmetric	C	0.1286	0.0261	4.931	6.296
(1024, 1024)	16	uint8	reflect	C	0.1280	0.0264	4.845	6.132
(1024, 1024)	16	uint8	wrap	C	0.1249	0.0260	4.798	6.095
(1024, 1024)	2	uint8	edge	F	0.1289	0.0581	2.217	6.084
(1024, 1024)	2	uint8	symmetric	F	0.1304	0.0586	2.227	6.064
(1024, 1024)	2	uint8	reflect	F	0.1331	0.0590	2.257	6.059
(1024, 1024)	2	uint8	wrap	F	0.1278	0.0586	2.180	5.994
(1024, 1024)	16	uint8	edge	F	0.1299	0.0604	2.149	6.238
(1024, 1024)	16	uint8	symmetric	F	0.1315	0.0607	2.168	6.255
(1024, 1024)	16	uint8	reflect	F	0.1309	0.0614	2.133	6.070
(1024, 1024)	16	uint8	wrap	F	0.1275	0.0606	2.103	6.105
(4096, 4096)	2	uint8	edge	C	0.1291	0.1143	1.130	6.202
(4096, 4096)	2	uint8	symmetric	C	0.1296	0.1132	1.145	6.183
(4096, 4096)	2	uint8	reflect	C	0.1295	0.1151	1.125	6.064
(4096, 4096)	2	uint8	wrap	C	0.1266	0.1138	1.112	6.029
(4096, 4096)	16	uint8	edge	C	0.1295	0.1157	1.119	6.212
(4096, 4096)	16	uint8	symmetric	C	0.1301	0.1150	1.131	6.208
(4096, 4096)	16	uint8	reflect	C	0.1302	0.1168	1.115	6.088
(4096, 4096)	16	uint8	wrap	C	0.1272	0.1153	1.103	6.065
(4096, 4096)	2	uint8	edge	F	0.6624	0.6433	1.030	6.228
(4096, 4096)	2	uint8	symmetric	F	0.6639	0.6438	1.031	6.133
(4096, 4096)	2	uint8	reflect	F	0.6640	0.6441	1.031	6.003
(4096, 4096)	2	uint8	wrap	F	0.6638	0.6454	1.028	6.037
(4096, 4096)	16	uint8	edge	F	0.6909	0.6713	1.029	6.318
(4096, 4096)	16	uint8	symmetric	F	0.6915	0.6717	1.029	6.229
(4096, 4096)	16	uint8	reflect	F	0.6919	0.6724	1.029	6.082
(4096, 4096)	16	uint8	wrap	F	0.6923	0.6720	1.030	6.136
(40, 40, 40)	2	uint8	edge	C	0.2057	0.0239	8.610	9.765
(40, 40, 40)	2	uint8	symmetric	C	0.2014	0.0241	8.357	9.450
(40, 40, 40)	2	uint8	reflect	C	0.1999	0.0245	8.169	9.227
(40, 40, 40)	2	uint8	wrap	C	0.1969	0.0237	8.299	9.405
(40, 40, 40)	16	uint8	edge	C	0.2028	0.0235	8.633	9.760
(40, 40, 40)	16	uint8	symmetric	C	0.2000	0.0255	7.844	9.502
(40, 40, 40)	16	uint8	reflect	C	0.1988	0.0250	7.946	9.339
(40, 40, 40)	16	uint8	wrap	C	0.1948	0.0248	7.871	9.371
(40, 40, 40)	2	uint8	edge	F	0.1980	0.0248	7.994	9.322
(40, 40, 40)	2	uint8	symmetric	F	0.1963	0.0250	7.840	9.159
(40, 40, 40)	2	uint8	reflect	F	0.1952	0.0253	7.729	8.985
(40, 40, 40)	2	uint8	wrap	F	0.1898	0.0251	7.567	8.847
(40, 40, 40)	16	uint8	edge	F	0.1997	0.0331	6.035	9.161
(40, 40, 40)	16	uint8	symmetric	F	0.1964	0.0349	5.622	8.393
(40, 40, 40)	16	uint8	reflect	F	0.1967	0.0339	5.793	8.808
(40, 40, 40)	16	uint8	wrap	F	0.1924	0.0334	5.762	8.909
(100, 100, 100)	2	uint8	edge	C	0.2042	0.0288	7.101	9.676
(100, 100, 100)	2	uint8	symmetric	C	0.1994	0.0317	6.294	9.334
(100, 100, 100)	2	uint8	reflect	C	0.2007	0.0302	6.634	9.287
(100, 100, 100)	2	uint8	wrap	C	0.1946	0.0308	6.315	9.179
(100, 100, 100)	16	uint8	edge	C	0.2023	0.0369	5.483	9.468
(100, 100, 100)	16	uint8	symmetric	C	0.2012	0.0451	4.465	9.197
(100, 100, 100)	16	uint8	reflect	C	0.2006	0.0411	4.886	9.073
(100, 100, 100)	16	uint8	wrap	C	0.1958	0.0429	4.561	9.060
(100, 100, 100)	2	uint8	edge	F	0.1996	0.0630	3.167	9.158
(100, 100, 100)	2	uint8	symmetric	F	0.1962	0.0636	3.084	8.816
(100, 100, 100)	2	uint8	reflect	F	0.1957	0.0638	3.068	8.725
(100, 100, 100)	2	uint8	wrap	F	0.1908	0.0636	2.999	8.719
(100, 100, 100)	16	uint8	edge	F	0.2041	0.0995	2.052	9.048
(100, 100, 100)	16	uint8	symmetric	F	0.2055	0.1039	1.978	8.858
(100, 100, 100)	16	uint8	reflect	F	0.2040	0.1038	1.965	8.821
(100, 100, 100)	16	uint8	wrap	F	0.1989	0.1071	1.858	8.757
(256, 256, 256)	2	uint8	edge	C	0.2063	0.1495	1.380	9.652
(256, 256, 256)	2	uint8	symmetric	C	0.2065	0.1613	1.280	9.647
(256, 256, 256)	2	uint8	reflect	C	0.2055	0.1540	1.334	9.328
(256, 256, 256)	2	uint8	wrap	C	0.1997	0.1569	1.273	9.326
(256, 256, 256)	16	uint8	edge	C	0.2090	0.1973	1.060	9.704
(256, 256, 256)	16	uint8	symmetric	C	0.2113	0.2419	0.873	9.573
(256, 256, 256)	16	uint8	reflect	C	0.2131	0.2124	1.003	9.351
(256, 256, 256)	16	uint8	wrap	C	0.2076	0.2311	0.899	9.410

codecov-commenter · 2023-02-01T22:00:12Z

Codecov Report

Base: 92.95% // Head: 92.89% // Decreases project coverage by -0.06% ⚠️

Coverage data is based on head (0c84ca2) compared to base (a8d9690).
Patch coverage: 90.32% of modified lines in pull request are covered.

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-23.02     #482      +/-   ##
================================================
- Coverage         92.95%   92.89%   -0.06%     
================================================
  Files               130      131       +1     
  Lines              9775     9905     +130     
================================================
+ Hits               9086     9201     +115     
- Misses              689      704      +15

Impacted Files	Coverage Δ
python/cucim/src/cucim/skimage/color/colorconv.py	`97.97% <ø> (ø)`
...thon/cucim/src/cucim/skimage/filters/_fft_based.py	`92.00% <50.00%> (+0.16%)`	⬆️
...on/cucim/src/cucim/skimage/filters/_median_hist.py	`81.40% <83.33%> (-0.22%)`	⬇️
...cucim/src/cucim/skimage/measure/_colocalization.py	`86.20% <86.20%> (ø)`
...m/skimage/registration/_phase_cross_correlation.py	`94.82% <91.48%> (-1.69%)`	⬇️
python/cucim/src/cucim/skimage/filters/_median.py	`81.96% <91.66%> (+0.48%)`	⬆️
...im/src/cucim/core/operations/morphology/_pba_2d.py	`92.08% <100.00%> (+0.03%)`	⬆️
...im/src/cucim/core/operations/morphology/_pba_3d.py	`95.79% <100.00%> (+0.01%)`	⬆️
python/cucim/src/cucim/skimage/_shared/utils.py	`77.40% <100.00%> (+0.58%)`	⬆️
...hon/cucim/src/cucim/skimage/exposure/_adapthist.py	`97.45% <100.00%> (+0.02%)`	⬆️
... and 11 more


gigony

Looks good to me!

There is a typo but I think it can be corrected later if we don't want to spend extra time triggering CI/CD again.

python/cucim/src/cucim/skimage/feature/tests/test_blob.py


jakirkham · 2023-02-02T08:38:19Z

/merge

grlee77 added the improvement, non-breaking, and performance labels Jan 24, 2023

grlee77 requested a review from a team as a code owner January 24, 2023 18:33

gigony added this to the v23.02.00 milestone Jan 24, 2023

gigony approved these changes Feb 2, 2023


grlee77 added 2 commits February 2, 2023 01:40

use vendored version of pad instead of cupy.pad

72a64da

This version provides faster, elementwise kernel implementations for common padding modes. This version should also be submitted upstream to CuPy itself.

flake8 fix

b557cf8

apply isort fix typo

grlee77 force-pushed the pad-elementwise branch from ab5a4db to b557cf8 Compare February 2, 2023 06:40

Merge branch 'branch-23.02' into pad-elementwise

0c84ca2

rapids-bot bot merged commit 7fd07d0 into rapidsai:branch-23.02 Feb 2, 2023
