[RFC] Introducing NumPy-compatible coding experience into MXNet #14253

reminisce · 2019-02-25T23:30:36Z

Motivation

Today deep learning scientists spend majority of their time on data processing, debugging tensor algorithms, and tuning model parameters, instead of architecting models from scratch by themselves as a result from the abundant pre-trained models existing in many deep learning model zoos. This has highlighted the usability of tensor APIs as a key factor for a framework to be widely adopted.

MXNet was firstly designed with the focus on memory efficiency, computation throughput and scalability. The usability problems begin to show up nowadays when more and more models demonstrate dynamic natures, e.g. unknown-shape tensors before runtime, control flow depending on a runtime result, etc. Here we highlight the most frequent complaints about usability from users.

Scalar tensors (aka zero-dim tensors) are not supported. For example, given a = [0, 1, 2], a[1] will generate an NDArray of shape (1,), instead of () as in NumPy.
Zero-size tensor is not supported. For example, a tensor of shape (0, 16, 256) cannot be passed to an operator, because our system currently treats 0, the first dimension size, as unknown, rather than a concrete number.
Many operators' signatures and functionality are not NumPy compatible, e.g. nd.dot vs. np.dot, nd.concatenate vs. np.concatenate, etc.
Many NumPy operators are missing. See the reference link to GitHub issues.
Operators whose outputs' shapes can only be determined at runtime are not supported, e.g. data[data < 0] cannot run.
Diverged programming experience due to the separation of imperative and symbolic operators registered under mxnet.ndarray and mxnet.symbol.
Control flow operators are hard to use. Users have to understand the complicated signatures of control flow operators, instead of writing native Python code using for, while, if/else, etc.
For example, we have learned (in a hard way) that it does not make a lot of sense to ask users to write code like the following to perform a cumulative sum.

def sum(state, i):
    s = state + data[i]
    return s, [s, i + 1]

def sum_cond(state, i):
    return i < 4
    
out, state = F.contrib.while_loop(sum_cond, sum, [F.zeros((1)), F.zeros((1))],
                                  max_iterations=5)

Instead, users should be able to just write native Python code as the following and if required, let the framework serialize it into a computation graph for optimization and deployment.

data = np.arange(5)
out = 0
i = 0
while i < 5:
    out = out + data[i]

It is not hard to figure out that all of the above pain points can be summarized as a result from lack of NumPy-compatible coding experience in MXNet. While addressing the problems of better support of control flow operators and a consolidated coding style for writing imperative and symbolic code with more flexibility requires introducing fundamental changes into the codebase for building new infrastructures, such as a new graph IR and executor, which is extremely non-trivial and should be executed with a long-term plan, we can, at the moment, improve the usability by fixing the issue of zero-dim/size tensors and implementing NumPy operators in MXNet. Please allow us to discuss how to achieve these short-term goals in the following.

Support of zero-dim and zero-size tensors

What's the problem?

Zero-dim and zero-size tensors are valid tensors in NumPy. The former, whose shapes are (), represent scalars in numpy.ndarray format. The latter, which have one or multiple zero dimension sizes in shapes, can be useful as a placeholder for many ndarray operations, such as concatenating a zero-size ndarray with another ndarray. MXNet does not support them due to the reserved semantics of empty shape () and shapes with zero dimension sizes indicating unknown shape information. Such information need to be filled out during the shape inference stage in order to move forward to tensor computations later.

How to resolve the problem?

We can first change the current semantics to comply with NumPy definition.

Change the definition of unknown shapes from ndim = 0 to ndim = -1 in TShape class.
Change the definition of unknown dimension sizes from dim_size = 0 to dim_size = -1 in TShape class.

After this, we need to scan all over the codebase to modify the code accordingly where shape.ndim() == 0 and shape.Size() == 0 is used to perform unknown shape checks.

Please note that although MXNet's shape is a type inheriting from nnvm::Tuple, which is often used to represent an list-like object, such as axis=(1, 2, 3), we will not change the meaning of an empty tuple. This separation of definitions for empty shape and empty tuple keeps the their roles clearly decoupled.

We propose to breakdown the efforts into the following steps.

Copy tuple.h from NNVM to MXNet and rename nnvm::TShape to mxnet::TShape.
Replace all the places in MXNet where nnvm::Tuple and nnvm::TShape are used with mxnet::Tuple and mxnet::TShape, respectively.
Change the definition of TShape in tuple.h to use ndim = -1 to indicate unknown shapes and dim_size = -1 to indicate unknown shape dim sizes.
Modify all the existing shape inference and utility functions where ndim == 0 and dim_size == 0 is used to accommodate the above changes.
Modify NNVM passes, InferShape, PlanMemory, and Gradient, where nnvm::TShape is used, to accommodate the above changes.
Add sufficient unit tests.

How is backward compatibility guaranteed?

By default, we do not change the original definition of output shapes in shape inference functions; we just change ndim==0 to ndim==-1 for unknown shape verification. No backward compatibility issues are expected for all but one case, NDArray indexing. To elaborate, the current behavior determines that x[i] always returns a tensor with ndim >= 1. We can keep the current behavior unchanged and implement a global switch for users to turn on for expecting NumPy-compatible results.

Previous discussion of this topic can be seen here.

Implementation of NumPy operators

What to do?

To address the problems of operator incompatibility with NumPy and alleviate the pain of diverged programming experience due to the operator namespace separation: mxnet.ndarray and mxnet.symbol, we propose creating a new namespace mxnet.numpy, adopting operator APIs from NumPy, and implementing those operator APIs under the namespace. mxnet.numpy should provide the same imperative programming experience as NumPy and will gradually replace all the non-neural-network operators in the current codebase. While implementing NumPy operators in MXNet, it is possible for us to leverage TVM to generate high-performance kernels (ref.).

Can `mxnet.numpy` operators be used in Gluon for hybridization?

The newly implemented NumPy operators can still be accessed through the module (ndarray/symbol) delegate F in Gluon, e.g. F.numpy.dot. This works because the new operators are still registered under mxnet.ndarray and mxnet.symbol behind the scene. It is just that users are encouraged to access NumPy operator APIs through mxnet.numpy to write pure imperative code and Gluon APIs for achieving hybrid coding experience.

Where to contribute code?

A dev branch has been opened for this proposal.
https://github.com/apache/incubator-mxnet/tree/numpy

@junrushao1994 @szha @eric-haibin-lin @zheng-da @yzhliu

The text was updated successfully, but these errors were encountered:

mxnet-label-bot · 2019-02-25T23:30:39Z

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature

junrushao · 2019-02-25T23:36:10Z

+1 for this RFC.

Numpy compatibility has been long existing desire from both developers and users. It is very meaningful if we could make it possible.

apeforest · 2019-02-25T23:52:37Z

+1 for this RFC.

The inconsistent APIs even within MXNet operators itself caused much confusing for users. It will be a great improvement in usability if we can make MXNet APIs compatible with Numpy.

I would suggest that we establish a formal review process for PRs that includes API change or addition to prevent from creating inconsistent APIs in the future.

ifeherva · 2019-02-25T23:59:28Z

+1 for this RFC.

I especially like the numpy namespace proposal, that will help cleaning up a lot of thing.

My experience is that the major blocker for numpy compatibility (and bad user experience) is due to the lack of dynamic shape inference. I cannot wait to have that out.

Anyways, since I wrote a handful of operators already I am very happy to lend a hand in getting fully numpy-compatible once dynamic shape inference is done.

nickguletskii · 2019-02-26T07:40:08Z

+1 for handling zero-size arrays.

I'm not that concerned about numpy compatibility, but the lack of zero-size arrays is something that I would like to see fixed, since the current situation means that empty arrays have to be carefully padded to not cause any problems.

lanking520 · 2019-02-26T21:10:21Z

+1 for this RFC.

The consistent experience would also help JVM language binding to be in sync with Python. It reduce the bar for users familar with Python to write the same thing in Scala.

wkcn · 2019-02-27T08:38:52Z

+1 for this RFC.

It will be more flexible to use MXNet, especially in slicing, and I hope mx.numpy could eliminate the divergence between mx.nd and mx.sym. : )

I wonder how to implement mx.numpy: using Python ast module to extract the abstract syntax tree then run them on JIT, or implement it on Python entirely? We should also focus on the deployment of mx.numpy.

I do not think F.numpy.dot is a good idea, since it is confusing that mx.numpy, mx.nd.numpy and mx.sym.numpy all exist. We only need mx.numpy to support mx.numpy.dot(a_nd, b_nd) and mx.numpy.dot(a_sym, b_sym).

reminisce · 2019-02-27T18:21:36Z

@wkcn All of what you have said make sense. :)

Gluon APIs, GluonNLP and GluonCV highly depend on the current MXNet infra. So we have to execute it in an organized and steady stream in order not to break backward compatibility. Current NNVM has its own limitations in expressing dynamic shapes and control flow operators. We will eventually need a new IR (Relay is an option) to do AST transformation.

anirudh2290 · 2019-02-28T03:24:57Z

Thanks for the RFC!

It is just that users are encouraged to access NumPy operator APIs through mxnet.numpy to write pure imperative code and Gluon APIs for achieving hybrid coding experience.

Earlier mxnet.ndarray was supposed to give you the experience of writing pure imperative code. Why can't we add the operators under this namespace and make the interface changes for existing operators ? Is there a list of operators which have diverged APIs for numpy and ndarray and can it be timed with 2.0 release?

We can keep the current behavior unchanged and implement a global switch for users to turn on for expecting NumPy-compatible results.

If I understand correctly, even when using numpy namespace you need to toggle this switch(probably an env variable?) to obtain the correct slicing ? Have you also considered implementing a seperating numpy ndarray from base with specific functions for slicing like __getitem__ implemented to avoid using this switch.

szha · 2019-02-28T03:45:53Z

@anirudh2290

Why can't we add the operators under this namespace and make the interface changes for existing operators ?

We can. However, there exist some operators in mxnet.ndarray whose names are the same as numpy counterparts while the behavior are slightly different, this means they cannot exist in the same namespace if we want to preserve backward compatibility. On the other hand, 2.0 is a good opportunity for fixing many of the existing problems besides the operator behaviors, so we'd likely want to take the time. Thus, to start now, having a new namespace would be the most straightforward way to go.

Have you also considered implementing a seperating numpy ndarray

Yes. Creating different array types means we'd start to see diverging user code, with some in ndarray and some in numpy ndarray, which would become harder to migrate later.

TaoLv · 2019-02-28T06:33:22Z

@reminisce @szha NumPy has reference/view and stride in its NDArray structure whille MXNet.NDArray doesn't have. How does this impact the design of NumPy-compatible coding experience?

junrushao · 2019-02-28T06:41:00Z

@TaoLv In neural nets, once you do backprop, you cannot overwrite data because it destroys checkpointing.

TaoLv · 2019-02-28T07:32:54Z

Not sure I understand the checkpointing. Can you explain a bit more? I think we have memory planning pass to decide whether the data can be overwritten? Also there are NumPy-based framework like Theano and Chainer.

reminisce · 2019-02-28T18:02:50Z

@TaoLv MXNet can have the same concept as in NumPy for view with the implementation of strides. But I think it's not the first priority for us to do so, because they are rarely useful in training (maybe useful in data preprocessing). @junrushao1994 's point is that in-place assignment is invalid in BP as it will wipe out pre-stored autograd information. This is consistent with other DL frameworks.

apeforest · 2019-02-28T18:46:12Z

We can. However, there exist some operators in mxnet.ndarray whose names are the same as numpy counterparts while the behavior are slightly different, this means they cannot exist in the same namespace if we want to preserve backward compatibility.

Do we really have to carry this burden of backward compatibility all the way beyond 2.0? I feel existing operators are confusing enough that 2.0 maybe a good time for us to make the API clean and easy to use. Would adding a new name space mx.numpy to the existing mx.sym and mx.ndarray cause more confusion to new users?

reminisce · 2019-02-28T21:05:46Z

@apeforest Because MXNet guarantees backward compatibility, those two namespaces have to be kept till 2.0. Adding namespace numpy lowers the bar for data scientists from NumPy community to use the DL framework. As for the framework itself, the purpose is to deemphasize the difference between mxnet.symbol and mxnet.ndarray in this major release. To retire those two namespaces in 1.x.x, one practical thing in the future we can do is register all ops under the namespaces like numpy, nn, etc. with unified interfaces supporting both NDArray and Symbol arguments, and in Gluon, we can remove the second-level module delegate F..

apeforest · 2019-02-28T23:22:09Z

@reminisce I am fine with keeping those two namespaces till 2.0 for backward compatibility. Starting from 2.0, I feel we may want to just drop mx.ndarray, mx.symbol and make mx.numpy the only name space to users. I like the unified interface idea you propopsed.

mouryarishik · 2019-04-05T12:39:35Z

+1 for this RFC.

larroy · 2019-05-22T07:55:27Z

What's the plan regarding: "Instead, users should be able to just write native Python code as the following and if required, let the framework serialize it into a computation graph for optimization and deployment." I would get the python AST and convert it to a computational graph, seems that part is not described into detail, I guess is a long-term phase.

szha · 2020-07-31T18:28:59Z

This feature has been made available as experimental feature 1.6 and will be supported in 2.0. Thanks to everyone who contributed to this major feature

reminisce added the RFC Post requesting for comments label Feb 25, 2019

reminisce mentioned this issue Feb 27, 2019

Move TShape definition and necessary passes out of NNVM #14266

Closed

junrushao mentioned this issue Feb 27, 2019

[MXNET-1330] Bring nnvm::Tuple to mxnet::Tuple #14270

Merged

8 tasks

This was referenced Mar 4, 2019

[numpy] Shape support scalar tensor #14315

Merged

[numpy] Implement NumPy operators #14327

Closed

reminisce mentioned this issue Apr 10, 2019

[numpy] Support zero-dim and zero-size tensors in MXNet #14661

Merged

anirudh2290 mentioned this issue Apr 15, 2019

[bug]mx.nd.stack couldn't stack scalars #14594

Closed

yzhliu mentioned this issue Jul 5, 2019

[RFC] Integrate TVM into Apache MXNet #15465

Open

wkcn closed this as completed Aug 29, 2019

wkcn reopened this Aug 29, 2019

szha mentioned this issue Sep 13, 2019

[RFC] Apache MXNet 2.0 Roadmap #16167

Open

eric-haibin-lin mentioned this issue Sep 25, 2019

Inconsistency between ndarray and symbol when performing division #10377

Open

This was referenced Oct 4, 2019

[RFC] Deferred compute in imperative interface to unify imperative and symbolic interface #16375

Closed

[RFC] Deferred compute in imperative interface to unify imperative and symbolic interface #16376

Closed

leezu mentioned this issue Oct 29, 2019

[Refactor] TransformerDecoder dmlc/gluon-nlp#976

Merged

5 tasks

lostella mentioned this issue Nov 5, 2019

mxnet 1.5.0 compatibility awslabs/gluonts#245

Closed

leezu mentioned this issue Nov 20, 2019

TransformerEncoder expects valid_length float dtype, but BERTSentenceTransform returns int dtype dmlc/gluon-nlp#1014

Open

leezu mentioned this issue Dec 7, 2019

Concatenate with empty array fails. #17005

Closed

leezu mentioned this issue Jan 31, 2020

tf.boolean_mask equivalent in MxNet #7443

Closed

szha mentioned this issue Feb 24, 2020

[RFC] MXNet 2.0 API Deprecation #17676

Open

roywei mentioned this issue Apr 25, 2020

Unknown shape in flatten op. #18165

Open

leezu mentioned this issue May 11, 2020

Support for zero-dimensional tensors is broken #18271

Closed

szha closed this as completed Jul 31, 2020

szha mentioned this issue Aug 15, 2020

[Development] MXNet 2.0 Update #18931

Open

This was referenced Dec 16, 2020

Floating point exception in mxnet.contrib.ndarray.arange_like #19683

Open

Segmentation Fault in mxnet.ndarray.split_v2 #19684

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Introducing NumPy-compatible coding experience into MXNet #14253

[RFC] Introducing NumPy-compatible coding experience into MXNet #14253

reminisce commented Feb 25, 2019

mxnet-label-bot commented Feb 25, 2019

junrushao commented Feb 25, 2019

apeforest commented Feb 25, 2019 •

edited

Loading

ifeherva commented Feb 25, 2019 •

edited

Loading

nickguletskii commented Feb 26, 2019

lanking520 commented Feb 26, 2019

wkcn commented Feb 27, 2019

reminisce commented Feb 27, 2019

anirudh2290 commented Feb 28, 2019 •

edited

Loading

szha commented Feb 28, 2019

TaoLv commented Feb 28, 2019

junrushao commented Feb 28, 2019

TaoLv commented Feb 28, 2019

reminisce commented Feb 28, 2019

apeforest commented Feb 28, 2019 •

edited

Loading

reminisce commented Feb 28, 2019

apeforest commented Feb 28, 2019 •

edited

Loading

mouryarishik commented Apr 5, 2019

larroy commented May 22, 2019

szha commented Jul 31, 2020 •

edited

Loading

[RFC] Introducing NumPy-compatible coding experience into MXNet #14253

[RFC] Introducing NumPy-compatible coding experience into MXNet #14253

Comments

reminisce commented Feb 25, 2019

Motivation

Support of zero-dim and zero-size tensors

What's the problem?

How to resolve the problem?

How is backward compatibility guaranteed?

Implementation of NumPy operators

What to do?

Can mxnet.numpy operators be used in Gluon for hybridization?

Where to contribute code?

mxnet-label-bot commented Feb 25, 2019

junrushao commented Feb 25, 2019

apeforest commented Feb 25, 2019 • edited Loading

ifeherva commented Feb 25, 2019 • edited Loading

nickguletskii commented Feb 26, 2019

lanking520 commented Feb 26, 2019

wkcn commented Feb 27, 2019

reminisce commented Feb 27, 2019

anirudh2290 commented Feb 28, 2019 • edited Loading

szha commented Feb 28, 2019

TaoLv commented Feb 28, 2019

junrushao commented Feb 28, 2019

TaoLv commented Feb 28, 2019

reminisce commented Feb 28, 2019

apeforest commented Feb 28, 2019 • edited Loading

reminisce commented Feb 28, 2019

apeforest commented Feb 28, 2019 • edited Loading

mouryarishik commented Apr 5, 2019

larroy commented May 22, 2019

szha commented Jul 31, 2020 • edited Loading

Can `mxnet.numpy` operators be used in Gluon for hybridization?

apeforest commented Feb 25, 2019 •

edited

Loading

ifeherva commented Feb 25, 2019 •

edited

Loading

anirudh2290 commented Feb 28, 2019 •

edited

Loading

apeforest commented Feb 28, 2019 •

edited

Loading

apeforest commented Feb 28, 2019 •

edited

Loading

szha commented Jul 31, 2020 •

edited

Loading