This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Discussion] 1.6.0 Roadmap #15589

Closed
szha opened this issue Jul 18, 2019 · 21 comments

@szha
Member

szha commented Jul 18, 2019

Let's start a discussion here about the roadmap towards 1.6.0. We are looking for:

New features that are useful to your research and development.
Improvements and patches to existing features.
If you have any item that you'd like to propose to have in the roadmap, please do:

Create (or locate existing) issue/pull request for the item, note the issue/pull request number.
Comment in this issue: 1) the above issue number, 2) one sentence of what the item is about and why it's useful to you.
Indicate whether you'd be willing to help out on the item.
Share the ETA if you're driving the item and have a guesstimate on when it will be done.

Feel free to include items that weren't included in past roadmap discussions that you still wish to include in this release.
cc @apache/mxnet-committers

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature

@szha szha pinned this issue Jul 18, 2019
@szha szha added the Roadmap label Jul 18, 2019
@sandeep-krishnamurthy
Contributor

  1. Can we work towards INT64 support (Large Tensor) as the DEFAULT? This would make it easier for large-tensor use cases such as DGL and recommendation-system models to use MXNet; today, they need to compile from source. (CC @apeforest @access2rohit)

@anirudhacharya
Member

anirudhacharya commented Jul 18, 2019

  1. Can we work towards INT64 support (Large Tensor) as the DEFAULT? This would make it easier for large-tensor use cases such as DGL and recommendation-system models to use MXNet; today, they need to compile from source. (CC @apeforest @access2rohit)

A few regressions have been introduced since large tensor support was added. For example, some hierarchical attention networks that trained fine up to 1.3.1 now produce NaNs in the weight vectors and gradient calculations. We should ensure all regressions related to this feature are fixed.

Related issues and PRs that might be relevant (the above issue is in addition to what is discussed in the links below):

@pengzhao-intel
Contributor

@sandeep-krishnamurthy Agreed on the INT64 enhancements.
We plan to upgrade MKLDNN to 1.0 in r1.6 so that the MKLDNN backend can work with INT64 indexing.

@pengzhao-intel
Contributor

pengzhao-intel commented Jul 19, 2019

CPU-related proposals for 1.6 (to be updated over the next several days):

  1. WIP, MKL-DNN upgrade to 1.0

  2. DONE, MKL-DNN Subgraph Fusion as default @ZhennanQin [MKLDNN] Enable subgraph backend mkldnn by default. #15518

  3. DONE, Quantization API improvement @xinyu-intel [MKLDNN] Enhance Quantization APIs and Tutorial #15448

  4. WIP, RNN (vRNN/LSTM/GRU) with MKL-DNN acceleration @ciyongch @zixuanweeei

  5. Enable new models, like NCF @xinyu-intel

@wkcn
Member

wkcn commented Jul 19, 2019

Agreed on the INT64 enhancements, too.
For the regression problem, a good solution is to choose the index data type based on the data size at the performance bottleneck (see the sketch below).
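
A minimal sketch of the idea in Python (illustrative only: the helper name is made up, and the real dispatch would live in the C++ backend):

import numpy as np

# Keep the fast int32 index path for small tensors and fall back to
# int64 only when the element count actually requires it.
def index_dtype(num_elements):
    if num_elements > np.iinfo(np.int32).max:
        return np.int64
    return np.int32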

The following features are useful for research and development:

  1. Need register_backward_hook() function in mxnet #15411
    backward_hook is useful for watching gradients during debugging, and for modifying gradients, e.g. normalizing them (see the sketch after this item).
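
For context, here is a minimal sketch of the manual workaround available in Gluon today (assuming a toy dense net); the proposed register_backward_hook() would run such logic automatically during the backward pass:

import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(4)
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

x = mx.nd.random.uniform(shape=(2, 8))
with autograd.record():
    loss = net(x).sum()
loss.backward()

# Watch and normalize gradients by hand between backward() and step().
for name, param in net.collect_params().items():
    grad = param.grad()
    print(name, grad.norm().asscalar())    # watch gradients for debugging
    grad[:] = grad / (grad.norm() + 1e-8)  # modify them, e.g. normalize
trainer.step(batch_size=2)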

@Neutron3529
Contributor

Neutron3529 commented Jul 26, 2019

I think making (pre-trained) NNs directly editable may be helpful.
For example, say I train an NN whose input is (data, aux_data) and whose output is 'predict'.
As training progresses, I may find that aux_data is useless and may introduce additional bias in both training and testing.
Fortunately, since L2 regularization pushes all the params related to aux_data near 0, I can directly set those params to 0.
But the net with the zeroed parameters is slower than a new net built without aux_data and loaded with the pre-trained params.
I think it would be useful to enable direct editing of NNs.
Something like this would be nice:

import mxnet as mx

batch_size=1
data=mx.sym.var('data',shape=(batch_size,1))
aux_data=mx.sym.var('aux_data',shape=(batch_size,1))
combine=mx.sym.Concat(data,aux_data,dim=1)
out=mx.sym.FullyConnected(combine,num_hidden=3).softmax()
net=mx.mod.Module(symbol=out,data_names=('data','aux_data'),label_names=None)
#(train and find that aux_data is useless)
net.sym#proposed: `.sym` returns the names of the symbols used in the net
net.pop('aux_data')#proposed: removes aux_data from the net; the net then no longer needs `Concat`, and the params for `out` shrink by 3 since `aux_data` is removed

Since it is very inconvenient to change the params in a ParameterDict (I only know that using mx.init.Constant may help, but that is too cumbersome for me), adding a pop method to the net would help. Today the workaround looks like the sketch below.
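
For reference, a minimal sketch of that workaround with today's Gluon API (assuming a toy Dense block where input column 1 is fed by aux_data); the proposed net.pop('aux_data') would remove those weights instead of merely zeroing them:

import mxnet as mx
from mxnet import gluon

net = gluon.nn.Dense(3, in_units=2)   # column 0: data, column 1: aux_data
net.initialize()

w = net.weight
new_w = w.data().copy()
new_w[:, 1] = 0                       # zero every weight fed by aux_data
w.set_data(new_w)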


What's more, I think MXNet needs a relocatable .dll/.so file.
libmxnet.dll is too large and takes too much time to load on my Windows 10 machine.

I asked how to decrease the dll size (since there are too many archs in the single .dll file, and what I want is just -arch=sm_61 for my GTX 1060).

The reply was to use nvprune; I tried it and it gives me an error:

nvprune fatal   : Input file 'libmxnet.dll' not relocatable

Making libmxnet.dll/libmxnet.so relocatable would make it possible to further decrease the size of the dll file, and might also decrease the import time.
It seems the symbols are contained in libmxnet.lib, but I cannot merge those symbols into libmxnet.dll. If someone finds a way to redistribute a libmxnet.dll with symbols, importing mxnet in Python may take less time.

@sandeep-krishnamurthy
Contributor

sandeep-krishnamurthy commented Aug 1, 2019

It would be nice to add support for default checkpointing in the Estimator API. For reference, the TF Estimator APIs do provide default checkpointing of models trained with them, which is very useful for users.

Also, can we plan to graduate the Estimator API in MXNet 1.6? It is a super useful API for MXNet users.

@roywei - Any comments?
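
A hedged sketch of what this could look like; the CheckpointHandler below is the contrib event-handler API as I recall it (its exact signature is an assumption), and the proposal amounts to attaching such a handler by default:

from mxnet import gluon
from mxnet.gluon.contrib.estimator import Estimator, CheckpointHandler

net = gluon.nn.Dense(10)
net.initialize()
loss = gluon.loss.SoftmaxCrossEntropyLoss()
est = Estimator(net=net, loss=loss)

# Explicit today; default checkpointing would make this implicit.
checkpoint = CheckpointHandler(model_dir='./checkpoints', model_prefix='my_model')
# est.fit(train_data=train_loader, epochs=2, event_handlers=[checkpoint])
# (train_loader is a placeholder for a gluon.data.DataLoader)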

@apeforest
Contributor

Thanks to @ChaiBapchya we now have performance comparison data between int32 and int64: https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit#gid=843443107

@ptrendx
Member

ptrendx commented Aug 29, 2019

We have multiple improvements to BERT inference and training speed that we would like to be part of the 1.6 release:

@roywei
Member

roywei commented Aug 29, 2019

Moving the nightly-failure fixes from the 1.5.1 scope to 1.6.0, as they are failing on the master branch, not the 1.5.x branch.
#15613 (comment)

Nightly test failure that needs to be fixed:
#15374

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/395/pipeline/

@szha
Member Author

szha commented Sep 13, 2019

@reminisce @haojin2 given that numpy operators will be a major topic in the 1.6 release, it would be great if you could add what you intend to include in this release.

@hkingtswcbyy

If I want to use mxnet.gluon.nn.Conv2D to get a depthwise conv layer, I need to explicitly set the groups argument. Can this be inferred automatically?
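
For reference, this is what the question describes today (a sketch assuming a 32-channel input; a depthwise conv means groups == in_channels, which currently must be spelled out by hand):

from mxnet.gluon import nn

# Depthwise convolution today: groups must equal the input channel count.
depthwise = nn.Conv2D(channels=32, kernel_size=3, groups=32, in_channels=32)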

@reminisce
Contributor

reminisce commented Sep 29, 2019

As for NumPy compatibility, we would like to add:

  1. A set of NumPy operators, mostly with CPU/GPU forward/backward support. Number TBD.
  2. A new ndarray class that replicates NumPy's ndarray in most ways. (Differences will be documented.)
  3. NumPy op integration with Gluon components, including layers, model zoo, parameters, data loaders, etc.
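
A minimal sketch of the intended usage, assuming these land as the mxnet.np / mxnet.npx namespaces (as they eventually did in 1.6):

from mxnet import np, npx
npx.set_np()             # activate NumPy-compatible semantics

a = np.ones((2, 3))
b = a * 2 + 1            # NumPy-style operators on the new ndarray class
print(b.shape, type(b))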

@KellenSunderland
Contributor

TRT is now working from the CPP package. I think to consider it a released feature we'd want to update the documentation and possibly target the new TRT version (6).

@anirudh2290
Member

I am working on an interface for multi-threaded inference in MXNet and it would be great if it could go into 1.6.

@szha
Member Author

szha commented Oct 5, 2019

@anirudh2290 this sounds like a larger change. Would you link to the RFC for it?

@anirudh2290
Member

@szha yes, I am planning to add an RFC this week.

@leezu
Contributor

leezu commented Oct 21, 2019

For reference, users are confused that Large Tensor Support was enabled in MXNet 1.4 and then disabled again in 1.5. Reference: dmlc/gluon-nlp#981
For 1.6, can we improve the error message to suggest that people compile with Large Tensor Support?
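
Illustrative only: the kind of guard an improved message implies. The real check lives in the C++ backend and the helper below is made up; USE_INT64_TENSOR_SIZE is the actual build flag:

INT32_MAX = 2**31 - 1

def check_tensor_size(num_elements, int64_enabled):
    if num_elements > INT32_MAX and not int64_enabled:
        raise ValueError(
            "Tensor size exceeds 2^31-1 elements; rebuild MXNet with "
            "USE_INT64_TENSOR_SIZE=1 to enable large tensor support.")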

@access2rohit
Contributor

@leezu I am working on it. Here is the WIP PR: #16570

@JonTanS
Contributor

JonTanS commented Nov 18, 2019

Hi, I was doing some testing of mxnet 1.6.x and 1.5.1 and noticed some performance issues in training; you can find more details here: #16845
