Accelerate the performance of topk for CPU side #12085

ciyongch · 2018-08-08T12:15:33Z

Description

Optimize the CPU performance of the topk operator, as requested in issue #10205.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

topk

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

@pengzhao-intel @xinyu-intel

asmushetzel · 2018-08-08T21:43:52Z

Can you please elaborate a bit:

- What speedups do you see? There are some numbers in [Operator] Accelerate the CPU side performance of topk #10205, but they are all for a single case, k=3. We should make sure that other cases (k=N/2 etc.) are also good.
- What actually contributes to the speedup? It is not clear from the code what makes it more efficient.
- It would be nice if you could also leave some comments in the code that explain this, to help other people understand why certain things are done the way they are now.

ciyongch · 2018-08-09T07:00:03Z

@asmushetzel thanks for your comments.

Actually, I ran some further tests with different configurations of the topk op, which should help show how much improvement this enhancement achieves. Here is more performance data comparing the optimized version with the out-of-box version. The numbers in the speedup columns are speedup factors; bigger is better. All the cases look good, especially axis=3 (equivalent to -1, since the input is a 4D ndarray).

		ret_type=value	ret_type=indices	ret_type=both	ret_type=mask
axis	k	speedup	speedup	speedup	speedup
0	1	2.52	2.60	2.49	1.00
0	3	2.54	2.61	2.63	1.04
0	5	2.58	2.54	2.59	1.16
0	10	2.52	2.58	2.58	1.09
1	1	3.72	3.69	3.73	1.04
1	3	3.75	3.77	3.66	1.10
2	1	2.40	2.35	2.30	1.11
2	3	2.34	2.31	2.24	1.10
2	5	2.29	2.30	2.24	1.19
2	10	2.33	2.33	2.34	1.09
2	100	2.36	2.43	2.40	1.10
2	500	2.51	2.49	2.43	1.10
2	1000	2.44	2.37	2.34	1.04
3	1	67.29	70.23	61.32	5.22
3	3	64.53	64.05	61.82	5.32
3	5	61.06	61.72	52.28	4.85
3	10	43.06	50.45	49.08	4.28
3	100	21.83	22.04	21.10	2.44
3	500	10.36	10.35	10.30	1.55
3	1000	8.54	9.10	8.08	1.35

Regarding the code changes, I made two minor changes and kept the main logic the same as in the current version.
- Apply the modulus calculation only to ret_indices instead of the full indices after calling TopkSort(), which removes a lot of redundant computation when k is smaller than the number of elements. So I changed indices = F<mshadow_op::mod>(indices, element_num); to ret_indices = F<mshadow_op::mod>(ret_indices, element_num);. That is why ret_type=[value, indices, both] shows a larger speedup than ret_type=mask.
- In the case of axis=-1, there is no need to transpose the source (input) data; flattening it to 1D is enough. FlatTo1D is much more efficient than reshape since it only changes the Shape and does not move any data. I noticed that in the current CPU TopkSort() function, the work Tensor is only used to store the temporary result, whose content is then copied back to sorted_data. If we pass the flattened data to TopkSort() via the work Tensor and copy the top k values from that flattened data into sorted_data directly, there is no need to keep an additional temporary Tensor or to copy a second time. For the case axis!=-1, both transpose and reshape are required to convert a 3D Tensor to a 1D Tensor. That is why the results for axis=-1 are much better than in the other cases.
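The first change can be illustrated with a minimal pure-Python sketch. This is not the mshadow code from the PR; the function names (`sorted_flat_indices`, `topk_mod_all`, `topk_mod_kept`) are hypothetical and the sort is a plain stand-in for TopkSort(). It only shows that reducing flat indices to within-row indices after slicing the top k gives the same result with batch * k modulo operations instead of batch * element_num.

```python
def sorted_flat_indices(rows):
    """Stand-in for the sorting step (not the real TopkSort()).

    Returns flat (global) indices such that position b * element_num + j
    holds the flat index of the j-th largest value of row b.
    """
    element_num = len(rows[0])
    flat = [v for row in rows for v in row]
    out = []
    for b in range(len(rows)):
        start = b * element_num
        out.extend(sorted(range(start, start + element_num),
                          key=lambda i: flat[i], reverse=True))
    return out, element_num


def topk_mod_all(rows, k):
    """Baseline order: modulo on every index, then keep k per row."""
    idx, n = sorted_flat_indices(rows)
    local = [i % n for i in idx]  # batch * element_num modulo operations
    result = [local[b * n:b * n + k] for b in range(len(rows))]
    return result, len(idx)


def topk_mod_kept(rows, k):
    """Changed order: keep k per row first, then modulo on the kept indices."""
    idx, n = sorted_flat_indices(rows)
    kept = [idx[b * n:b * n + k] for b in range(len(rows))]
    result = [[i % n for i in row] for row in kept]  # batch * k modulo operations
    return result, len(rows) * k


rows = [[0.1, 0.9, 0.5, 0.3], [7.0, 2.0, 8.0, 1.0]]
print(topk_mod_all(rows, 2))   # second tuple element is the modulo count
print(topk_mod_kept(rows, 2))
```

Both functions return the same top-k indices; only the number of modulo operations differs, which matches the observation that the gain shrinks as k approaches the number of elements.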

Hope this answers your question. I can add some comments in the code if there are no other concerns.

pengzhao-intel · 2018-08-10T09:28:47Z

@asmushetzel any other comments?

@marcoabreu @szha could you help review and merge the PR? We hope it can be included in 1.3, so that sockeye can benefit from it.

asmushetzel · 2018-08-10T12:57:25Z

@ciyongch Nice work! Reviewed it and looks good to me.

According to the code changes, the massive speedup for axis=3 and ret_type=indices/values/both is mainly attributable to the change concerning the modulo computation. This means that a single modulo operation on the indices somehow accounted for 90% of the entire runtime, which is pretty insane. Is this a weak spot of the CPU design? Or is it because mod operations do not benefit from AVX (just speculating)? Can you guys from Intel comment on this a bit more?

marcoabreu · 2018-08-10T14:22:55Z

Thanks for this great improvement! Unfortunately, I'm currently quite swamped and unable to review this pull request.

szha · 2018-08-10T17:40:18Z

@asmushetzel thanks for the review. @pengzhao-intel I will take a look shortly.

pengzhao-intel · 2018-08-12T08:41:49Z

@asmushetzel Thanks for the comments.
Actually, the modulo instruction is much slower than ADD, MUL, or FMA on popular hardware such as CPUs and GPUs. In practice, we have to avoid redundant modulo operations on the software side :)

@szha, Thanks, btw, your new head portrait is really cool and looks like the big boss :)

ciyongch · 2018-08-12T15:27:53Z

@asmushetzel Sorry for the late response. As @pengzhao-intel mentioned above, the modulo and division instructions are not as efficient as ADD/MUL. For the topk op, transpose and modulo are the most time-consuming operations, followed by reshape, while the cost of the TopkSort() function depends on the value of k.
In the case of axis=3 and k=1, there is no need to transpose, so the modulo operation takes most of the time, but its proportion decreases as k grows.
In the case of ret_type=value/indices/both and axis=0/1/2, both transpose and modulo are executed. After removing the redundant modulo operations we see some speedup, but not as much as for axis=3, because transpose, the other time-consuming operation, remains.
Since many neural networks use the axis=3 case, they will benefit a lot from this improvement.
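Why the last axis needs no transpose can be sketched in plain Python, assuming a row-major (C-order) layout. The helper names `flat_offsets_along_axis` and `is_contiguous` are hypothetical and not part of MXNet. The sketch lists, for each 1-D slice along a given axis, the flat buffer offsets of its elements: along the last axis every slice is a contiguous run, so a flat view is enough, while along any other axis the elements are strided and have to be moved first.

```python
from itertools import product


def flat_offsets_along_axis(shape, axis):
    """Flat offsets of every 1-D slice along `axis` for a row-major array."""
    ndim = len(shape)
    axis %= ndim  # allow negative axes such as -1

    # Row-major strides measured in elements: the last dimension has stride 1.
    strides = [1] * ndim
    for d in range(ndim - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]

    other_dims = [range(n) for d, n in enumerate(shape) if d != axis]
    slices = []
    for fixed in product(*other_dims):
        idx = list(fixed)
        idx.insert(axis, 0)  # start of the slice along `axis`
        base = sum(i * s for i, s in zip(idx, strides))
        slices.append([base + j * strides[axis] for j in range(shape[axis])])
    return slices


def is_contiguous(offsets):
    """True if the offsets form one unbroken run in the flat buffer."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))


shape = (2, 3, 4)
for axis in range(len(shape)):
    slices = flat_offsets_along_axis(shape, axis)
    print(axis, slices[0], all(is_contiguous(s) for s in slices))
```

For the shape used here, only the last axis yields contiguous slices, which is consistent with transpose being the remaining cost for axis=0/1/2.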

szha

LGTM. We may want to add dtype template to topk when there are relevant use cases.

ciyongch · 2018-08-13T04:17:11Z

@szha Thanks for your review :)
Once there is a dtype template for topk, the temp_workspace Tensor on CPU could be simplified.

With PR apache/mxnet#12085, MXNet's topk performs better than the numpy version. To leverage this performance boost, switch to MXNet's topk on CPU devices when doing inference.


ciyongch requested a review from anirudh2290 as a code owner August 8, 2018 12:15

ciyongch mentioned this pull request Aug 8, 2018

[Operator] Accelerate the CPU side performance of topk #10205

Closed

eric-haibin-lin added the Performance, Operator, and pr-awaiting-review labels Aug 9, 2018

ciyongch added 2 commits August 13, 2018 09:47

Accelerate the performance of topk for CPU side

0f4bd8c

Add comments for the code changes

62cbff2

ciyongch force-pushed the master branch from 442a6e8 to 62cbff2 August 13, 2018 01:50

szha approved these changes Aug 13, 2018

szha merged commit 95dd95c into apache:master Aug 13, 2018

ciyongch mentioned this pull request Aug 13, 2018

Change to use MXNet's topk for CPUs in inference awslabs/sockeye#506

Closed

8 tasks

leezu mentioned this pull request Aug 16, 2018

topk regression #12197

Closed

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018

Accelerate the performance of topk for CPU side (apache#12085)

7e456fe

* Accelerate the performance of topk for CPU side * Add comments for the code changes

pengzhao-intel mentioned this pull request Oct 12, 2018

Fix Flaky Topk #12798

Merged

6 tasks
