This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Multithreaded Inference Support #16654

Merged

Conversation

@anirudh2290 (Member) commented Oct 28, 2019

Description

For background on this PR, please see the related GitHub issue: #16431. Below is a brief overview of the PR.

Brief Overview

  1. This PR adds a multithreaded inference interface (EXPERIMENTAL) for MXNet. The approach is to add a thread-safe version of CachedOp with the minimal set of features required for inference.
  2. Adds a C API to create the thread-safe version of CachedOp: MXCreateCachedOpEX takes an additional parameter thread_safe. When thread_safe is set to true, invocations on the cached op handle invoke the thread-safe version of CachedOp (a usage sketch follows this list).
  /*!
   * \brief create cached operator, allows to choose thread_safe version
   * of cachedop
   */
  MXNET_DLL int MXCreateCachedOpEX(SymbolHandle handle,
                                   int num_flags,
                                   const char** keys,
                                   const char** vals,
                                   CachedOpHandle *out,
                                   bool thread_safe DEFAULT(false));
  3. Adds tests verifying thread safety of the engine and of parallel inference with the thread-safe version of CachedOp, across different models, different numbers of threads, and different numbers of inferences per thread.
  4. Model testing has been limited to: resnet18, resnet50_v1, and resnet152_v1.
  5. Adds CI stages and CI tests for running the thread safety tests. Please see capi-cpp-package GPU in the unix-gpu stage.
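
For illustration, below is a minimal sketch (not part of the PR) of how the thread-safe handle could be driven from several threads through the C API. It assumes only the MXCreateCachedOpEX signature quoted above; loading the symbol, preparing NDArray inputs, and the actual invoke call (e.g. MXInvokeCachedOpEx) are collapsed into the hypothetical run_inference helper.

  #include <mxnet/c_api.h>

  #include <thread>
  #include <vector>

  // Hypothetical helper: in a real program this would prepare NDArray inputs
  // and invoke the cached op handle (e.g. via MXInvokeCachedOpEx) num_iters times.
  void run_inference(CachedOpHandle op, int num_iters) {
    // omitted for brevity
  }

  int main() {
    SymbolHandle sym = nullptr;  // assumption: the model symbol has already been
                                 // loaded, e.g. from a *-symbol.json file
    const char* keys[] = {"static_alloc"};
    const char* vals[] = {"true"};
    CachedOpHandle op = nullptr;

    // Create the thread-safe version of CachedOp (thread_safe = true).
    MXCreateCachedOpEX(sym, 1, keys, vals, &op, true);

    // All worker threads share the same handle and run inference concurrently.
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
      workers.emplace_back(run_inference, op, 100);
    }
    for (auto& t : workers) {
      t.join();
    }

    MXFreeCachedOp(op);  // release the handle
    return 0;
  }

The full, working version of this pattern is in the cpp-package example and the thread safety tests added by this PR.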

Feedback/Acknowledgements

Below is the feedback/help received from offline code/design reviews:

  1. @nswamy - Early input and offline design discussions related to this project, specifically with respect to dmlc::ThreadLocalStore.
  2. @frankfliu - Explained the requirements and use case for the MXNet backend from a customer perspective; code review suggestions to add a thread-safe API for creation only and to decide whether to invoke the regular or thread-safe CachedOp based on the handle.
  3. @andreaolgiati - Suggestion to document the current threading model of MXNet (still a TODO), request to add support for dynamic shapes for the NLP case (not supported yet), and suggestion to document the long-term plan for thread-safe inference in MXNet (documented in the Limitations and Future Work sections of the doc; the required GitHub issues will be added).
  4. @lanking520, @larroy, @zachgk - Suggestion to add a new API for the thread-safe CachedOp version, since modifying the existing one would break backward compatibility.
  5. @frankfliu and @oorqueda - Suggestion for continuous maintenance (addition of support) of the supported-models table; it will be maintained in the doc and the RFC issue.
  6. @samskalicky - Offline code review suggestions regarding code comments for the locking strategy and for the changes made for the thread-safe CachedOp.
  7. @Vikas-kum - Better document the performance benefits (still a TODO).

Thanks to all above for the suggestions, review and feedback.

Thanks to @marcoabreu and @arcadiaphy for the valuable comments on the RFC.

Additionally, thanks to @roywei, @access2rohit , @stu1130 , @ChaiBapchya , @rondogency , @yuxihu, @vrakesh, @keerthanvasist, @josephevans for code/design review.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@anirudh2290 anirudh2290 force-pushed the multithreaded_inference_backend_support branch 9 times, most recently from defe1da to ccd2d7d Compare October 29, 2019 23:00
include/mxnet/c_api.h (outdated review comment, resolved)
@anirudh2290 anirudh2290 force-pushed the multithreaded_inference_backend_support branch 7 times, most recently from 4c8e3ad to 66d51bc Compare October 30, 2019 05:59
@anirudh2290 anirudh2290 force-pushed the multithreaded_inference_backend_support branch from 66d51bc to 1359ec8 Compare October 30, 2019 20:46
@anirudh2290 anirudh2290 force-pushed the multithreaded_inference_backend_support branch from 0d765b4 to 58a0790 Compare October 31, 2019 05:04
@anirudh2290 (Member, Author) commented:

> On the other hand, I am not sure if we can accept this cached op without bulking and subgraph support. Subgraph is used for multiple accelerators for MXNet inference. Also, bulking is essential to inference speed on GPUs. For instance, we saw a 20% perf regression when the bulking scope was smaller than the optimal scope (#9055), and it would be even more significant if no bulking is enabled at all. I am not sure if this can be shipped with these unknown limitations. Thinking about how to better support thread-safe bulking and subgraph may affect the implementation and design.

  1. I have added bulking support.

  2. For subgraphing, if I am correct, MXNet still doesn't support the subgraph API with Gluon. It is still supported only with Symbol, and the MXNET_SUBGRAPH_BACKEND env variable works only with the graph executor. What I meant to say is that the subgraph param in CachedOp is not supported. You can, however, use a symbol that has already been converted and had its nodes replaced with subgraph ops together with the cached_op_threadsafe version; I have added a test to demonstrate that. I think this should address your concern with respect to subgraphing.

> Do we have evidence that the newly introduced cached op is performant? I'm concerned about the mutex for the whole forward function: that means no thread can concurrently push operations to the engine. What I observed is that this could take a long time depending on the model used.

Although the mutex in forward only serializes pushing the ops, which may introduce delays between threads, the execution of these ops is scheduled in parallel, especially when there is no dependency between the ops from different threads, and this should improve performance.
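
To make the point concrete, here is a rough sketch of the locking pattern described above. The names are hypothetical and do not correspond to the actual member functions in this PR; it only illustrates that the critical section covers pushing ops, not executing them.

  #include <mutex>

  class ThreadSafeCachedOpSketch {
   public:
    void Forward(/* inputs, outputs */) {
      {
        // Serialize only the setup/push phase: the engine queues the ops and
        // returns immediately, so the critical section stays short.
        std::lock_guard<std::mutex> lock(mutex_);
        PushOperatorsToEngine();
      }
      // Outside the lock, the engine executes ops from different threads in
      // parallel whenever they have no mutual dependencies; each caller waits
      // only on its own outputs.
      WaitForOutputs();
    }

   private:
    void PushOperatorsToEngine() { /* hypothetical: async pushes to the dependency engine */ }
    void WaitForOutputs() { /* hypothetical: wait on this call's own outputs */ }
    std::mutex mutex_;
  };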

https://github.com/awslabs/djl ran some benchmarks and saw up to 2x improved throughput with the naive engine. I was also able to see performance improvements of up to 1.7x when increasing the number of worker threads on the threaded engine for CPU.

@anirudh2290 anirudh2290 force-pushed the multithreaded_inference_backend_support branch from e74ac4d to 36ae782 Compare January 18, 2020 03:12
@anirudh2290 (Member, Author) commented:

I have addressed all the comments here. Please let me know if you have additional comments.

@eric-haibin-lin (Member) left a comment:

@TaoLv @PatricZhao could you help review this PR, since it's related to inference?

@pengzhao-intel (Contributor) commented:

We're on a long vacation and will be back in the office on Feb 10th.

@cjolivier01 (Member) left a comment:

Thanks so much for doing this! Badly needed feature!

@anirudh2290 (Member, Author) commented:

Thanks a lot for your kind words, Chris! Appreciate your dedication to the MXNet community!

@cjolivier01 (Member) commented Jan 31, 2020 via email.

@larroy (Contributor) commented Jan 31, 2020:

Did you run performance benchmarks to verify that there's no regression? As I understand it, this change could have an impact on engine performance. It would be useful to run the usual models we have in the performance dashboards.

@anirudh2290 anirudh2290 merged commit b1e4911 into apache:master Feb 1, 2020
@leezu (Contributor) commented Feb 4, 2020:

@anirudh2290 this seems to have broken C++ examples. See #17514

@eric-haibin-lin (Member) commented:

@zhreshold FYI

@leezu leezu mentioned this pull request Feb 8, 2020
@samskalicky samskalicky mentioned this pull request Feb 13, 2020
zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
* Add cached op threadsafe version with corresponding C APIs, CPP Package changes, CI changes and tests

* Fix download cmd in runtime_functions

* Add CI changes

* Add stage

Fix indentation

* Fix lint

* Change to DEFAULT for C API

* Fix mxnet_unit_tests path

* export correct LD_LIBRARY_PATH

* Add cpp include dirs

* Build test with USE_CPP_PACKAGE

* Add cached op threadsafe version with corresponding C APIs, CPP Package changes, CI changes and tests

* Fix download cmd in runtime_functions

* Merge

* change mkldnn lib name

* Add static_alloc, static_Shape support

* Address review comments

* Make GetCachedOpThreadSafeState similar to cached_op

* Address review comments: comments for locking strategy

* multithreaded inference tutorial

* [Estimator] handle composite metrics in estimator (apache#16676)

* handle composite metrics in estimator

* fix composite metric case in handlers

* remove unused import

* [Estimator] refactor estimator to allow overriding evaluate/fit of a batch (apache#16678)

* refactor estimator to allow overriding evaluate/fit of a batch

* add doc to explain call structure and how to override

* fix and doc

* Pointwise fusion for GPU (apache#15167)

* Beginning of RTC of pointwise ops

* Code generation from the given JSON

* add initial simple_partition_pass and use it for pointwise fusion

* fix the fusion, use a symbol.Copy() at the beginning of binding function, use the name of input nodes in the cuda code

* Fixes

* Adding support for attribute inference for backward nodes when fusing

* keep proper input ordering for fused Op

* instantiate the indexed_graph before starting the subgraph replacement, return a new graph to reset the indexed_graph

* Fuse backward

* fix ordering of subgraph node inputs using subgraph topological ordering instead of main graph topological ordering, add tvm.patch

* exclude forward node fusion during the fusion of the nodes in the backward graph

* Dealing with fused backward nodes inferattr

* use subgraph.indexed_graph() instead of main for _FusedOpHelper nodes node_id, invert control_deps loop to modify topology of subgraph before calling its indexed_graph(), check that all node of the first DFSVisit are actually in the subgraph

* Adding support for other reqs in codegen

* Fix

* Cleaning

* Change the TVM submodule

* More cleaning

* Making linter happy

* Do fusion only if default context is GPU

* Fixes for tests
Add powerscalar and rpowerscalar, fix return type of zero and one
Cleaning, fixing lint
Go back to proper TVM submodule

* Fix the TVM commit

* Fix lint

* Guard fusion with MXNET_USE_CUDA

* Fix

* Fix clang-tidy

* Add erf and erfinv backward

* Gluon support for fusion

* Cleaning

* Cleaning and allow shape/type change in FusedOp

* Fixing Gluon bugs

* Fixing after rebase

* Fixing race condition and guarding against races when using NVRTC

* Cleaning and renaming FusedOp to _FusedOp

* Going easy on Windows compiler

* Disable fusion on Windows for now

* Refactor InferAttr and InferShapeAttr

* Added slice and half2 support to FusedOp

* Fix lint errors

* Added multiple types support for vector loading/storing

* add slice fusion when it's at the beginning of subgraphs

* Removed constant ndim assumption in fused op

* Fix memory alignment issue in slice for FusedOp

* Fixes

* Fix lint errors

* Do not include cuda_fp16.h

* Refactor fused op op lists

* Make linter happy

* Changes from review

* Fixes after rebase

* Expand FusedOp support for slice

* Fix for fp16 _zeros and _ones

* Fix

* Moving aux functions to unnamed namespace and detail namespace -> fusion
namespace

* Disabling fusion if it alters topological order of inputs

* Print code only when env variable is set

* Fix

* Fix lint and 2 tests that specify the same names for multiple inputs

* Fixes from review and disabling fusion of slice with non-default step

* Add amp_cast to fusion, fixes

* Add amp_multicast and its backward to the list of support ops

* Apply wording suggestions from code review

Co-Authored-By: Aaron Markham <markhama@amazon.com>

* Apply wording suggestions from code review

Co-Authored-By: Aaron Markham <markhama@amazon.com>

* Make clearer comment

* Adding punctuation and capitalization to \brief descriptions

* Fix

* Fix

* Add backward_cast to fusion

* Adding unittests for fusion. Fix for erfinv_grad

* Adding slice ops and add_n to tests

* Fixes from review

* Setting inplace option

* Fix lint

* Storing double in half

* Retrigger CI

* Slight relaxing of the relative tolerance in the test

* Move the env variable check to the end

* Fix a race condition between InferShape and scheduled Forward

* Fix flakey test_fusion test involving fp32 erfinv op.

* Fix from review

* Added broadcast_like and slice_like to fused op

* Minor fix and cleanup

* Added negative axis support in slice_axis, temporarily disabled fusion of slice_like and broadcast_like

* Added axes support to slice_like

* Added axis support to broadcast_like

* Add fast_load_slice function to fused op code

* Added runtime switch for choosing fast and slow slice kernel

* Fix lint and warning

* Going easy on Windows compiler (again)

* Fix slice_like

* Debug broadcast_like fusion

* Fix lint

* Fix lint

* Trigger CI

* Get rid of the initializer list

* Fix backward calls with different gradient type

* avoid cycle when adding node specific for inputs of subgraph for pointwise fusion

* Fix lint

* Add namespace to the fusion implementations

* Set launch bounds on the fused kernel

* Fix NumPy tests

* Test showcasing an issue fixed in PR apache#16553

* Cast scalars to FP32 and perform (a*1.0/b) instead of (a/b)

Fix lint errors

Fix lint

* Fix a bug in cycle detection for inputs only op in pointwise fusion

* Add comments to simple_partition_pass.h file

* fix install dir (apache#16690)

* [numpy] add numpy operator : append (apache#16564)

* add operator : append ; fix op concatenate when axis = None

* pylint disable

remove mistake

disable pylint

* Initializer.__eq__ (apache#16680)

* fix binary dependencies in CD and nightly (apache#16693)

* [MKL-DNN] Add mxnet mkldnn cmake tutorial (apache#16688)

* add mxnet mkldnn cmake instruction

* improve doc

* OMP->OpenMP

* Revert "[MKLDNN]Fix reorder2default (apache#16602)" (apache#16697)

This reverts commit dd4eaf5.

* [Estimator] refactor estimator and clarify docs (apache#16694)

* refactor estimator and clarify docs

* fix info message and test

* clean up after releasing logging handler

* Eliminate common expressions (apache#15657)

* Eliminate common expressions from a graph

* Guarding against optimizing out stateful ops and ops that require
resource

* Fix lint

* Added THasDeterministicOutput to multiple ops

* DDebug eliminate common expr

* Added test

* Expose get_optimized_symbol

* Fix

* Fix 2

* Add doc to the Python call

* Add env var MXNET_ELIMINATE_COMMON_EXPR, default true

* Add comments, improve readability of eliminate_common_expr_pass.cc

* Expand testing

* Lower priority of THasDeterministicOutput attr for equal Node test

* Change mx.gpu() to mx.cpu() in tests

* Skip CSE test on Windows (as env variable setting during test does not work there)

* Add missing import sys

* Add missing import logging

* Backport of apache#16711, apache#16737, apache#16408 to 1.6 branch (apache#16763)

* support mixed-precision true_divide (apache#16711)

* [MKLDNN] use dim_t instead of int in slice/transpose operators (apache#16737)

* use dim_t instead of int

* fix same issue in pooling

* rebase code

* trigger CI

* Add MXNet Ops for fast multihead attention (apache#16408)

* add MXNet Ops for fast multihead attention

* add cutlass as 3rdparty dependency

* add cutlass to compilation flags

* remove all cutlass stuff

* add better error message and description and remove cutlass from compilation flags

* change credit for the approach since the code have changed

* fix typos

* correct another typo

* Add all the cuda/cublas helper functions

* remove tests using kAddTo

* only use cublasStridedBatchedGemm if CUDA >= 9.1

* add equivalent mxnet code in description of mha ops

* remove a wrong copy-paste

* add _contrib for namespace and add GPU only on description

* add warning in bwd_ignore_zero_init description, also test with fp32

* add error return if bwd_ignore_zero_init is used without MXNET_EXEC_ENABLE_ADDTO

* remove std::move for clang

* remove bwd_ignore_zero_init flag

* remove bwd_ignore_zero_init in test_operator_gpu.py

* fix typo

* fix another typo

* Removed unrelated test

* Add example and documentation for multi threaded inference

* Add LICENSE

* Add get_model.py

* Add license for README

* Refactor cached op and cached op threadsafe

* Add limitation

* Add tests for naive engine

* Add latest test changes

* Thread Safety tests in NaiveEngine mode

* Thread Safety tests update

* Update thread safety tests, add unsupported use cases

* Changes to doc and refactor

* Fix todo owner, indentation and mx_float->float

* Refactor cached op code, remove num_threads arg from example

* Fix lint

* Fix warning

* Add back cython, required for unix-gpu build

* Fix for windows

* Add bulking support for thread safe cached op version

* Add support for subgraph testing

* import mxnet before calling get_backend_symbol

* Fix symbol json name

* Refactor DynamicForward

* Add comments

* Add DMLC_ATTRIBUTE_UNUSED

* Fix use_naive_run issue

* Fix lint

* Revert unittest_cpp to old test since it doesn't test thread safety

* Fix doc

Co-authored-by: Sheng Zha <szha@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: Tao Lv <tao.a.lv@intel.com>
Co-authored-by: JiangZhaoh <54654391+JiangZhaoh@users.noreply.github.com>
Co-authored-by: Leonard Lausen <leonard@lausen.nl>
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: Zhennan Qin <zhennan.qin@intel.com>
rondogency pushed a commit to rondogency/incubator-mxnet that referenced this pull request Jul 2, 2020
Labels: pr-awaiting-review (PR is waiting for code review)