This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Discussion] 1.6.0 Roadmap #15589

Closed
szha opened this issue Jul 18, 2019 · 21 comments

@szha
Member

szha commented Jul 18, 2019

Let's start a discussion here about the roadmap towards 1.6.0. We are looking for:

New features that are useful to your research and development.
Improvements and patches to existing features.
If you have any item that you'd like to propose to have in the roadmap, please do:

Create (or locate existing) issue/pull request for the item, note the issue/pull request number.
Comment in this issue: 1) the above issue number, 2) one sentence of what the item is about and why it's useful to you.
Indicate whether you'd be willing to help out on the item.
Share the ETA if you're driving the item and have a guesstimate on when it will be done.

Feel free to include items that weren't included in past roadmap discussions that you still wish to include in this release.
cc @apache/mxnet-committers

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature

@szha szha pinned this issue Jul 18, 2019
@szha szha added the Roadmap label Jul 18, 2019
@sandeep-krishnamurthy
Contributor

  1. Can we work towards INT64 support (Large Tensor) as the DEFAULT? This would make it easier for large-tensor use cases such as DGL and recommendation-system models to use MXNet; today, they need to compile from source. (CC @apeforest @access2rohit)

@anirudhacharya
Member

anirudhacharya commented Jul 18, 2019

  1. Can we work towards INT64 support (Large Tensor) as the DEFAULT? This would make it easier for large-tensor use cases such as DGL and recommendation-system models to use MXNet; today, they need to compile from source. (CC @apeforest @access2rohit)

A few regressions have been introduced since large tensor support was added. For example, some hierarchical attention networks that trained fine up to 1.3.1 now produce NaNs in the weight vectors and gradient calculations. We should ensure all regressions related to this feature are fixed.

Related issues and PRs that might be relevant (the above issue is in addition to what is discussed in the links below):

@pengzhao-intel
Contributor

@sandeep-krishnamurthy Agreed on the INT64 enhancements.
We plan to upgrade MKLDNN to 1.0 in r1.6 so that the MKLDNN backend can work with INT64 indexing.

@pengzhao-intel
Contributor

pengzhao-intel commented Jul 19, 2019

CPU-related proposals for 1.6 (to be updated over the next several days):

  1. WIP, MKL-DNN upgrade to 1.0

  2. DONE, MKL-DNN Subgraph Fusion as default @ZhennanQin [MKLDNN] Enable subgraph backend mkldnn by default. #15518

  3. DONE, Quantization API improvement @xinyu-intel [MKLDNN] Enhance Quantization APIs and Tutorial #15448

  4. WIP, RNN (vRNN/LSTM/GRU) with MKL-DNN acceleration @ciyongch @zixuanweeei

  5. Enable new models, like NCF @xinyu-intel

@wkcn
Member

wkcn commented Jul 19, 2019

Agreed on the INT64 enhancements, too.
For the regression problem, a good solution is to choose the index data type based on the data size at the performance bottleneck (see the sketch below).
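
A minimal sketch of the idea in Python (illustrative only: the helper name is made up, and the real dispatch would live in the C++ backend):

import numpy as np

# Keep the fast int32 index path for small tensors and fall back to
# int64 only when the element count actually requires it.
def index_dtype(num_elements):
    if num_elements > np.iinfo(np.int32).max:
        return np.int64
    return np.int32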

The following features are useful for research and development:

  1. Need register_backward_hook() function in mxnet #15411
    backward_hook is useful for watching gradients during debugging, and for modifying gradients, e.g. normalizing them (see the sketch after this item).
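
For context, here is a minimal sketch of the manual workaround available in Gluon today (assuming a toy dense net); the proposed register_backward_hook() would run such logic automatically during the backward pass:

import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(4)
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

x = mx.nd.random.uniform(shape=(2, 8))
with autograd.record():
    loss = net(x).sum()
loss.backward()

# Watch and normalize gradients by hand between backward() and step().
for name, param in net.collect_params().items():
    grad = param.grad()
    print(name, grad.norm().asscalar())    # watch gradients for debugging
    grad[:] = grad / (grad.norm() + 1e-8)  # modify them, e.g. normalize
trainer.step(batch_size=2)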

@Neutron3529
Contributor

Neutron3529 commented Jul 26, 2019

I think making (pre-trained) NNs directly editable may be helpful.
For example, say I train an NN whose input is (data, aux_data) and whose output is 'predict'.
As training progresses, I may find that aux_data is useless and may introduce additional bias in both training and testing.
Fortunately, since L2 regularization pushes all the params related to aux_data near 0, I can directly set those params to 0.
But the net with the zeroed parameters is slower than a new net built without aux_data and loaded with the pre-trained params.
I think it would be useful to enable direct editing of NNs.
Something like this would be nice:

import mxnet as mx

batch_size=1
data=mx.sym.var('data',shape=(batch_size,1))
aux_data=mx.sym.var('aux_data',shape=(batch_size,1))
combine=mx.sym.Concat(data,aux_data,dim=1)
out=mx.sym.FullyConnected(combine,num_hidden=3).softmax()
net=mx.mod.Module(symbol=out,data_names=('data','aux_data'),label_names=None)
#(train and find that aux_data is useless)
net.sym#proposed: `.sym` returns the names of the symbols used in the net
net.pop('aux_data')#proposed: removes aux_data from the net; the net then no longer needs `Concat`, and the params for `out` shrink by 3 since `aux_data` is removed

Since it is very inconvenient to change the params in a ParameterDict (I only know that using mx.init.Constant may help, but that is too cumbersome for me), adding a pop method to the net would help. Today the workaround looks like the sketch below.
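
For reference, a minimal sketch of that workaround with today's Gluon API (assuming a toy Dense block where input column 1 is fed by aux_data); the proposed net.pop('aux_data') would remove those weights instead of merely zeroing them:

import mxnet as mx
from mxnet import gluon

net = gluon.nn.Dense(3, in_units=2)   # column 0: data, column 1: aux_data
net.initialize()

w = net.weight
new_w = w.data().copy()
new_w[:, 1] = 0                       # zero every weight fed by aux_data
w.set_data(new_w)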


What's more, I think MXNet needs a relocatable .dll/.so file.
libmxnet.dll is too large and takes too much time to load on my Windows 10 machine.

I asked how to decrease the dll size (since there are too many archs in the single .dll file, and what I want is just -arch=sm_61 for my GTX 1060).

The reply was to use nvprune; I tried it and it gives me an error:

nvprune fatal   : Input file 'libmxnet.dll' not relocatable

Making libmxnet.dll/libmxnet.so relocatable would make it possible to further decrease the size of the dll file, and might also decrease the import time.
It seems the symbols are contained in libmxnet.lib, but I cannot merge those symbols into libmxnet.dll. If someone finds a way to redistribute a libmxnet.dll with symbols, importing mxnet in Python may take less time.

@sandeep-krishnamurthy
Contributor

sandeep-krishnamurthy commented Aug 1, 2019

It would be nice to add support for default checkpointing in the Estimator API. For reference, the TF Estimator APIs do provide default checkpointing of models trained with them, which is very useful for users.

Also, can we plan to graduate the Estimator API in MXNet 1.6? It is a super useful API for MXNet users.

@roywei - Any comments?
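
A hedged sketch of what this could look like; the CheckpointHandler below is the contrib event-handler API as I recall it (its exact signature is an assumption), and the proposal amounts to attaching such a handler by default:

from mxnet import gluon
from mxnet.gluon.contrib.estimator import Estimator, CheckpointHandler

net = gluon.nn.Dense(10)
net.initialize()
loss = gluon.loss.SoftmaxCrossEntropyLoss()
est = Estimator(net=net, loss=loss)

# Explicit today; default checkpointing would make this implicit.
checkpoint = CheckpointHandler(model_dir='./checkpoints', model_prefix='my_model')
# est.fit(train_data=train_loader, epochs=2, event_handlers=[checkpoint])
# (train_loader is a placeholder for a gluon.data.DataLoader)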

@apeforest
Contributor

Thanks to @ChaiBapchya we now have performance comparison data between int32 and int64: https://docs.google.com/spreadsheets/d/1GpdNquQb71Is5B-li99JDuiLeEZd-eSjHIIowzGrwxc/edit#gid=843443107

@ptrendx
Member

ptrendx commented Aug 29, 2019

We have multiple improvements to BERT inference and training speed that we would like to be part of the 1.6 release:

@roywei
Member

roywei commented Aug 29, 2019

Moving the nightly-failure fixes from the 1.5.1 scope to 1.6.0, as they are failing on the master branch, not the 1.5.x branch.
#15613 (comment)

Nightly test failure that needs to be fixed:
#15374

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/395/pipeline/

@szha
Member Author

szha commented Sep 13, 2019

@reminisce @haojin2 given that numpy operators will be a major topic in the 1.6 release, it would be great if you could add what you intend to include in this release.

@hkingtswcbyy

If I want to use mxnet.gluon.nn.Conv2D to get a depthwise conv layer, I need to explicitly set the groups argument. Can this be inferred automatically?
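
For reference, this is what the question describes today (a sketch assuming a 32-channel input; a depthwise conv means groups == in_channels, which currently must be spelled out by hand):

from mxnet.gluon import nn

# Depthwise convolution today: groups must equal the input channel count.
depthwise = nn.Conv2D(channels=32, kernel_size=3, groups=32, in_channels=32)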

@reminisce
Contributor

reminisce commented Sep 29, 2019

As for NumPy compatibility, we would like to add:

  1. A set of NumPy operators, mostly with CPU/GPU forward/backward support. Number TBD.
  2. A new ndarray class that replicates NumPy's ndarray in most ways. (Differences will be documented.)
  3. NumPy op integration with Gluon components, including layers, model zoo, parameters, data loaders, etc.
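
A minimal sketch of the intended usage, assuming these land as the mxnet.np / mxnet.npx namespaces (as they eventually did in 1.6):

from mxnet import np, npx
npx.set_np()             # activate NumPy-compatible semantics

a = np.ones((2, 3))
b = a * 2 + 1            # NumPy-style operators on the new ndarray class
print(b.shape, type(b))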

@KellenSunderland
Contributor

TRT is now working from the CPP package. I think to consider it a released feature we'd want to update the documentation and possibly target the new TRT version (6).

@anirudh2290
Member

I am working on an interface for multi-threaded inference in MXNet and it would be great if it could go into 1.6.

@szha
Member Author

szha commented Oct 5, 2019

@anirudh2290 this sounds like a larger change. Would you link to the RFC for it?

@anirudh2290
Member

@szha yes, I am planning to add an RFC this week.

@leezu
Contributor

leezu commented Oct 21, 2019

For reference, users are confused that Large Tensor Support was enabled in MXNet 1.4 and then disabled again in 1.5. Reference: dmlc/gluon-nlp#981
For 1.6, can we improve the error message to suggest that people compile with Large Tensor Support?
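
Illustrative only: the kind of guard an improved message implies. The real check lives in the C++ backend and the helper below is made up; USE_INT64_TENSOR_SIZE is the actual build flag:

INT32_MAX = 2**31 - 1

def check_tensor_size(num_elements, int64_enabled):
    if num_elements > INT32_MAX and not int64_enabled:
        raise ValueError(
            "Tensor size exceeds 2^31-1 elements; rebuild MXNet with "
            "USE_INT64_TENSOR_SIZE=1 to enable large tensor support.")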

@access2rohit
Contributor

@leezu I am working on it. Here is the WIP PR: #16570

@JonTanS
Contributor

JonTanS commented Nov 18, 2019

Hi, I was doing some testing of mxnet 1.6.x and 1.5.1 and noticed some performance issues in training; you can find more details here: #16845
