This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Multithreaded Inference Support #16654

Merged

Conversation

@anirudh2290 (Member) commented Oct 28, 2019

Description

For background on this PR, please see the related GitHub issue: #16431. Below is a brief overview of the PR.

Brief Overview

  1. This PR adds a multithreaded inference interface (EXPERIMENTAL) for MXNet. The approach is to add a thread-safe version of CachedOp with the minimal set of features required for inference.
  2. Adds a C API to create the thread-safe version of CachedOp: MXCreateCachedOpEX takes an additional parameter thread_safe. When thread_safe is set to true, invocations on the cached op handle invoke the thread-safe version of CachedOp (a usage sketch follows this list).
  /*!
   * \brief create cached operator, allows to choose thread_safe version
   * of cachedop
   */
  MXNET_DLL int MXCreateCachedOpEX(SymbolHandle handle,
                                   int num_flags,
                                   const char** keys,
                                   const char** vals,
                                   CachedOpHandle *out,
                                   bool thread_safe DEFAULT(false));
  3. Adds tests verifying thread safety of the engine and of parallel inference with the thread-safe version of CachedOp, across different models, different numbers of threads, and different numbers of inferences per thread.
  4. Model testing has been limited to: resnet18, resnet50_v1, and resnet152_v1.
  5. Adds CI stages and CI tests for running the thread safety tests. Please see capi-cpp-package GPU in the unix-gpu stage.
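
For illustration, below is a minimal sketch (not part of the PR) of how the thread-safe handle could be driven from several threads through the C API. It assumes only the MXCreateCachedOpEX signature quoted above; loading the symbol, preparing NDArray inputs, and the actual invoke call (e.g. MXInvokeCachedOpEx) are collapsed into the hypothetical run_inference helper.

  #include <mxnet/c_api.h>

  #include <thread>
  #include <vector>

  // Hypothetical helper: in a real program this would prepare NDArray inputs
  // and invoke the cached op handle (e.g. via MXInvokeCachedOpEx) num_iters times.
  void run_inference(CachedOpHandle op, int num_iters) {
    // omitted for brevity
  }

  int main() {
    SymbolHandle sym = nullptr;  // assumption: the model symbol has already been
                                 // loaded, e.g. from a *-symbol.json file
    const char* keys[] = {"static_alloc"};
    const char* vals[] = {"true"};
    CachedOpHandle op = nullptr;

    // Create the thread-safe version of CachedOp (thread_safe = true).
    MXCreateCachedOpEX(sym, 1, keys, vals, &op, true);

    // All worker threads share the same handle and run inference concurrently.
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
      workers.emplace_back(run_inference, op, 100);
    }
    for (auto& t : workers) {
      t.join();
    }

    MXFreeCachedOp(op);  // release the handle
    return 0;
  }

The full, working version of this pattern is in the cpp-package example and the thread safety tests added by this PR.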

Feedback/Acknowledgements

Below is the feedback/help received from offline code/design reviews:

  1. @nswamy - Early input and offline design discussions related to this project, specifically with respect to dmlc::ThreadLocalStore.
  2. @frankfliu - Explained the requirements and use case for the MXNet backend from a customer perspective; code review suggestions to add a thread-safe API for creation only and to decide whether to invoke the regular or thread-safe CachedOp based on the handle.
  3. @andreaolgiati - Suggestion to document the current threading model of MXNet (still a TODO), request to add support for dynamic shapes for the NLP case (not supported yet), and suggestion to document the long-term plan for thread-safe inference in MXNet (documented in the Limitations and Future Work sections of the doc; the required GitHub issues will be added).
  4. @lanking520, @larroy, @zachgk - Suggestion to add a new API for the thread-safe CachedOp version, since modifying the existing one would break backward compatibility.
  5. @frankfliu and @oorqueda - Suggestion for continuous maintenance (addition of support) of the supported-models table; it will be maintained in the doc and the RFC issue.
  6. @samskalicky - Offline code review suggestions regarding code comments for the locking strategy and for the changes made for the thread-safe CachedOp.
  7. @Vikas-kum - Better document the performance benefits (still a TODO).

Thanks to all above for the suggestions, review and feedback.

Thanks to @marcoabreu and @arcadiaphy for the valuable comments on the RFC.

Additionally, thanks to @roywei, @access2rohit , @stu1130 , @ChaiBapchya , @rondogency , @yuxihu, @vrakesh, @keerthanvasist, @josephevans for code/design review.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@anirudh2290 anirudh2290 force-pushed the multithreaded_inference_backend_support branch 9 times, most recently from defe1da to ccd2d7d Compare October 29, 2019 23:00
include/mxnet/c_api.h (outdated review comment, resolved)
@anirudh2290 anirudh2290 force-pushed the multithreaded_inference_backend_support branch 7 times, most recently from 4c8e3ad to 66d51bc Compare October 30, 2019 05:59
@anirudh2290 anirudh2290 force-pushed the multithreaded_inference_backend_support branch from 66d51bc to 1359ec8 Compare October 30, 2019 20:46
@anirudh2290 anirudh2290 force-pushed the multithreaded_inference_backend_support branch from 0d765b4 to 58a0790 Compare October 31, 2019 05:04
@anirudh2290 (Member, Author) commented:

> On the other hand, I am not sure if we can accept this cached op without bulking and subgraph support. Subgraph is used for multiple accelerators for MXNet inference. Also, bulking is essential to inference speed on GPUs. For instance, we saw a 20% perf regression when the bulking scope was smaller than the optimal scope (#9055), and it would be even more significant if no bulking is enabled at all. I am not sure if this can be shipped with these unknown limitations. Thinking about how to better support thread-safe bulking and subgraph may affect the implementation and design.

  1. I have added bulking support.

  2. For subgraphing, if I am correct, MXNet still doesn't support the subgraph API with Gluon. It is still supported only with Symbol, and the MXNET_SUBGRAPH_BACKEND env variable works only with the graph executor. What I meant to say is that the subgraph param in CachedOp is not supported. You can, however, use a symbol that has already been converted and had its nodes replaced with subgraph ops together with the cached_op_threadsafe version; I have added a test to demonstrate that. I think this should address your concern with respect to subgraphing.

> Do we have evidence that the newly introduced cached op is performant? I'm concerned about the mutex for the whole forward function: that means no thread can concurrently push operations to the engine. What I observed is that this could take a long time depending on the model used.

Although the mutex in forward only serializes pushing the ops, which may introduce delays between threads, the execution of these ops is scheduled in parallel, especially when there is no dependency between the ops from different threads, and this should improve performance.
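
To make the point concrete, here is a rough sketch of the locking pattern described above. The names are hypothetical and do not correspond to the actual member functions in this PR; it only illustrates that the critical section covers pushing ops, not executing them.

  #include <mutex>

  class ThreadSafeCachedOpSketch {
   public:
    void Forward(/* inputs, outputs */) {
      {
        // Serialize only the setup/push phase: the engine queues the ops and
        // returns immediately, so the critical section stays short.
        std::lock_guard<std::mutex> lock(mutex_);
        PushOperatorsToEngine();
      }
      // Outside the lock, the engine executes ops from different threads in
      // parallel whenever they have no mutual dependencies; each caller waits
      // only on its own outputs.
      WaitForOutputs();
    }

   private:
    void PushOperatorsToEngine() { /* hypothetical: async pushes to the dependency engine */ }
    void WaitForOutputs() { /* hypothetical: wait on this call's own outputs */ }
    std::mutex mutex_;
  };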

https://github.com/awslabs/djl ran some benchmarks and saw up to 2x improved throughput with the naive engine. I was also able to see performance improvements of up to 1.7x when increasing the number of worker threads on the threaded engine for CPU.

@anirudh2290 anirudh2290 force-pushed the multithreaded_inference_backend_support branch from e74ac4d to 36ae782 Compare January 18, 2020 03:12
@anirudh2290 (Member, Author) commented:

I have addressed all the comments here. Please let me know if you have additional comments.

@eric-haibin-lin (Member) left a comment:

@TaoLv @PatricZhao could you help review this PR, since it's related to inference?

@pengzhao-intel (Contributor) commented:

We're on a long vacation and will be back in the office on Feb 10th.

@cjolivier01 (Member) left a comment:

Thanks so much for doing this! Badly needed feature!

@anirudh2290 (Member, Author) commented:

Thanks a lot for your kind words, Chris! Appreciate your dedication to the MXNet community!

@cjolivier01 (Member) commented Jan 31, 2020 via email.

@larroy (Contributor) commented Jan 31, 2020:

Did you run performance benchmarks to verify that there's no regression? As I understand it, this change could have an impact on engine performance. It would be useful to run the usual models we have in the performance dashboards.

@anirudh2290 anirudh2290 merged commit b1e4911 into apache:master Feb 1, 2020
@leezu (Contributor) commented Feb 4, 2020:

@anirudh2290 this seems to have broken C++ examples. See #17514

@eric-haibin-lin (Member) commented:

@zhreshold FYI

@leezu leezu mentioned this pull request Feb 8, 2020
@samskalicky samskalicky mentioned this pull request Feb 13, 2020
zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
* Add cached op threadsafe version with corresponding C APIs, CPP Package changes, CI changes and tests

* Fix download cmd in runtime_functions

* Add CI changes

* Add stage

Fix indentation

* Fix lint

* Change to DEFAULT for C API

* Fix mxnet_unit_tests path

* export correct LD_LIBRARY_PATH

* Add cpp include dirs

* Build test with USE_CPP_PACKAGE

* Add cached op threadsafe version with corresponding C APIs, CPP Package changes, CI changes and tests

* Fix download cmd in runtime_functions

* Merge

* change mkldnn lib name

* Add static_alloc, static_Shape support

* Address review comments

* Make GetCachedOpThreadSafeState similar to cached_op

* Address review comments: comments for locking strategy

* multithreaded inference tutorial

* [Estimator] handle composite metrics in estimator (apache#16676)

* handle composite metrics in estimator

* fix composite metric case in handlers

* remove unused import

* [Estimator] refactor estimator to allow overriding evaluate/fit of a batch (apache#16678)

* refactor estimator to allow overriding evaluate/fit of a batch

* add doc to explain call structure and how to override

* fix and doc

* Pointwise fusion for GPU (apache#15167)

* Beginning of RTC of pointwise ops

* Code generation from the given JSON

* add initial simple_partition_pass and use it for pointwise fusion

* fix the fusion, use a symbol.Copy() at the beginning of binding function, use the name of input nodes in the cuda code

* Fixes

* Adding support for attribute inference for backward nodes when fusing

* keep proper input ordering for fused Op

* instantiate the indexed_graph before starting the subgraph replacement, return a new graph to reset the indexed_graph

* Fuse backward

* fix ordering of subgraph node inputs using subgraph topological ordering instead of main graph topological ordering, add tvm.patch

* exclude forward node fusion during the fusion of the nodes in the backward graph

* Dealing with fused backward nodes inferattr

* use subgraph.indexed_graph() instead of main for _FusedOpHelper nodes node_id, invert control_deps loop to modify topology of subgraph before calling its indexed_graph(), check that all node of the first DFSVisit are actually in the subgraph

* Adding support for other reqs in codegen

* Fix

* Cleaning

* Change the TVM submodule

* More cleaning

* Making linter happy

* Do fusion only if default context is GPU

* Fixes for tests
Add powerscalar and rpowerscalar, fix return type of zero and one
Cleaning, fixing lint
Go back to proper TVM submodule

* Fix the TVM commit

* Fix lint

* Guard fusion with MXNET_USE_CUDA

* Fix

* Fix clang-tidy

* Add erf and erfinv backward

* Gluon support for fusion

* Cleaning

* Cleaning and allow shape/type change in FusedOp

* Fixing Gluon bugs

* Fixing after rebase

* Fixing race condition and guarding against races when using NVRTC

* Cleaning and renaming FusedOp to _FusedOp

* Going easy on Windows compiler

* Disable fusion on Windows for now

* Refactor InferAttr and InferShapeAttr

* Added slice and half2 support to FusedOp

* Fix lint errors

* Added multiple types support for vector loading/storing

* add slice fusion when it's at the beginning of subgraphs

* Removed constant ndim assumption in fused op

* Fix memory alignment issue in slice for FusedOp

* Fixes

* Fix lint errors

* Do not include cuda_fp16.h

* Refactor fused op op lists

* Make linter happy

* Changes from review

* Fixes after rebase

* Expand FusedOp support for slice

* Fix for fp16 _zeros and _ones

* Fix

* Moving aux functions to unnamed namespace and detail namespace -> fusion
namespace

* Disabling fusion if it alters topological order of inputs

* Print code only when env variable is set

* Fix

* Fix lint and 2 tests that specify the same names for multiple inputs

* Fixes from review and disabling fusion of slice with non-default step

* Add amp_cast to fusion, fixes

* Add amp_multicast and its backward to the list of support ops

* Apply wording suggestions from code review

Co-Authored-By: Aaron Markham <markhama@amazon.com>

* Apply wording suggestions from code review

Co-Authored-By: Aaron Markham <markhama@amazon.com>

* Make clearer comment

* Adding punctuation and capitalization to \brief descriptions

* Fix

* Fix

* Add backward_cast to fusion

* Adding unittests for fusion. Fix for erfinv_grad

* Adding slice ops and add_n to tests

* Fixes from review

* Setting inplace option

* Fix lint

* Storing double in half

* Retrigger CI

* Slight relaxing of the relative tolerance in the test

* Move the env variable check to the end

* Fix a race condition between InferShape and scheduled Forward

* Fix flakey test_fusion test involving fp32 erfinv op.

* Fix from review

* Added broadcast_like and slice_like to fused op

* Minor fix and cleanup

* Added negative axis support in slice_axis, temporarily disabled fusion of slice_like and broadcast_like

* Added axes support to slice_like

* Added axis support to broadcast_like

* Add fast_load_slice function to fused op code

* Added runtime switch for choosing fast and slow slice kernel

* Fix lint and warning

* Going easy on Windows compiler (again)

* Fix slice_like

* Debug broadcast_like fusion

* Fix lint

* Fix lint

* Trigger CI

* Get rid of the initializer list

* Fix backward calls with different gradient type

* avoid cycle when adding node specific for inputs of subgraph for pointwise fusion

* Fix lint

* Add namespace to the fusion implementations

* Set launch bounds on the fused kernel

* Fix NumPy tests

* Test showcasing an issue fixed in PR apache#16553

* Cast scalars to FP32 and perform (a*1.0/b) instead of (a/b)

Fix lint errors

Fix lint

* Fix a bug in cycle detection for inputs only op in pointwise fusion

* Add comments to simple_partition_pass.h file

* fix install dir (apache#16690)

* [numpy] add numpy operator : append (apache#16564)

* add operator : append ; fix op concatenate when axis = None

* pylint disable

remove mistake

disable pylint

* Initializer.__eq__ (apache#16680)

* fix binary dependencies in CD and nightly (apache#16693)

* [MKL-DNN] Add mxnet mkldnn cmake tutorial (apache#16688)

* add mxnet mkldnn cmake instruction

* improve doc

* OMP->OpenMP

* Revert "[MKLDNN]Fix reorder2default (apache#16602)" (apache#16697)

This reverts commit dd4eaf5.

* [Estimator] refactor estimator and clarify docs (apache#16694)

* refactor estimator and clarify docs

* fix info message and test

* clean up after releasing logging handler

* Eliminate common expressions (apache#15657)

* Eliminate common expressions from a graph

* Guarding against optimizing out stateful ops and ops that require
resource

* Fix lint

* Added THasDeterministicOutput to multiple ops

* DDebug eliminate common expr

* Added test

* Expose get_optimized_symbol

* Fix

* Fix 2

* Add doc to the Python call

* Add env var MXNET_ELIMINATE_COMMON_EXPR, default true

* Add comments, improve readability of eliminate_common_expr_pass.cc

* Expand testing

* Lower priority of THasDeterministicOutput attr for equal Node test

* Change mx.gpu() to mx.cpu() in tests

* Skip CSE test on Windows (as env variable setting during test does not work there)

* Add missing import sys

* Add missing import logging

* Backport of apache#16711, apache#16737, apache#16408 to 1.6 branch (apache#16763)

* support mixed-precision true_divide (apache#16711)

* [MKLDNN] use dim_t instead of int in slice/transpose operators (apache#16737)

* use dim_t instead of int

* fix same issue in pooling

* rebase code

* trigger CI

* Add MXNet Ops for fast multihead attention (apache#16408)

* add MXNet Ops for fast multihead attention

* add cutlass as 3rdparty dependency

* add cutlass to compilation flags

* remove all cutlass stuff

* add better error message and description and remove cutlass from compilation flags

* change credit for the approach since the code have changed

* fix typos

* correct another typo

* Add all the cuda/cublas helper functions

* remove tests using kAddTo

* only use cublasStridedBatchedGemm if CUDA >= 9.1

* add equivalent mxnet code in description of mha ops

* remove a wrong copy-paste

* add _contrib for namespace and add GPU only on description

* add warning in bwd_ignore_zero_init description, also test with fp32

* add error return if bwd_ignore_zero_init is used without MXNET_EXEC_ENABLE_ADDTO

* remove std::move for clang

* remove bwd_ignore_zero_init flag

* remove bwd_ignore_zero_init in test_operator_gpu.py

* fix typo

* fix another typo

* Removed unrelated test

* Add example and documentation for multi threaded inference

* Add LICENSE

* Add get_model.py

* Add license for README

* Refactor cached op and cached op threadsafe

* Add limitation

* Add tests for naive engine

* Add latest test changes

* Thread Safety tests in NaiveEngine mode

* Thread Safety tests update

* Update thread safety tests, add unsupported use cases

* Changes to doc and refactor

* Fix todo owner, indentation and mx_float->float

* Refactor cached op code, remove num_threads arg from example

* Fix lint

* Fix warning

* Add back cython, required for unix-gpu build

* Fix for windows

* Add bulking support for thread safe cached op version

* Add support for subgraph testing

* import mxnet before calling get_backend_symbol

* Fix symbol json name

* Refactor DynamicForward

* Add comments

* Add DMLC_ATTRIBUTE_UNUSED

* Fix use_naive_run issue

* Fix lint

* Revert unittest_cpp to old test since it doesn't test thread safety

* Fix doc

Co-authored-by: Sheng Zha <szha@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: Tao Lv <tao.a.lv@intel.com>
Co-authored-by: JiangZhaoh <54654391+JiangZhaoh@users.noreply.github.com>
Co-authored-by: Leonard Lausen <leonard@lausen.nl>
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: Zhennan Qin <zhennan.qin@intel.com>
rondogency pushed a commit to rondogency/incubator-mxnet that referenced this pull request Jul 2, 2020
Labels: pr-awaiting-review (PR is waiting for code review)