This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Large Tensor] Add LT support for NN optimizers and 1 activation function #17444

Merged
4 commits merged into apache:master on Feb 10, 2020

Conversation

ChaiBapchya
Contributor

@ChaiBapchya ChaiBapchya commented Jan 27, 2020

Description

Add large tensor support to optimizers and 1 activation function

  • hard_sigmoid
  • adam_update
  • ftml_update
  • mp_sgd_mom_update
  • mp_sgd_update
  • rmsprop_update
  • rmspropalex_update
  • sgd_mom_update
  • sgd_update
  • signsgd_update
  • signum_update
  • nagmom
  • mp_nagmom
  • lamb
  • mp_lamb
  • ftrl
  • adagrad

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Code is well-documented:
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • modified: src/operator/optimizer_op-inl.h
  • modified: src/operator/tensor/elemwise_unary_op.h

Comments

Tested hard_sigmoid with an LT input: Pass

>>> import mxnet as mx
>>> mx.nd.hard_sigmoid(data=mx.nd.random_normal(shape=(1, 2**32 + 1)))

[[0.9424413 0.6548008 0.7086881 ... 0.53579605 0.37985992 0.20645571]]
<NDArray 1x4294967297 @cpu(0)>

The rest of the *_update functions can't be verified numerically with random_normal inputs, since they return NaNs even for shapes smaller than 2**32, so their values weren't checked.
However, they no longer give a segmentation fault, which was the original problem caused by the lack of large tensor support.
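
For completeness, one way to spot-check a *_update op numerically would be to use constant inputs, which should avoid the NaNs seen with random_normal. A minimal sketch follows; it is illustrative only (not part of this PR), and it needs tens of GB of RAM since each float32 array of this size is about 16 GB:

import mxnet as mx

n = 2**32 + 1
# Constant inputs: every element goes through the same arithmetic,
# so the output should be a single finite value repeated (no NaNs, no SIGSEGV).
w = mx.nd.ones(shape=(n,))
g = mx.nd.ones(shape=(n,))
m = mx.nd.zeros(shape=(n,))
v = mx.nd.zeros(shape=(n,))
out = mx.nd.adam_update(w, g, m, v, lr=0.01)
print(out)  # expect one constant value close to 1.0 across all 4294967297 elements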

@access2rohit
Contributor

@ChaiBapchya can you paste the opperf test run logs indicating these ops run fine without giving SIGSEGV?

@access2rohit
Contributor

access2rohit commented Jan 28, 2020

Looks like a lot of ops didn't use the correct type for the index that maps over input values. LGTM, but I would like to see logs confirming this doesn't segfault.

@szha
Member

szha commented Jan 30, 2020

cc @szhengac

@szhengac
Contributor

I think this needs to be tested by training a large model.

@ChaiBapchya
Contributor Author

ChaiBapchya commented Jan 31, 2020

@szhengac Which model? Which dataset? Can you give some specifics?
Thanks

@szhengac
Contributor

> @szhengac Which model? Which dataset? Can you give some specifics?
> Thanks

I think a toy example with a very wide dense layer is good.
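
For illustration, a minimal sketch of such a toy test is below. The layer width, batch size, and optimizer settings are hypothetical and not taken from this PR; the weight only exercises the large tensor path once it exceeds 2**32 elements, which needs well over 16 GB of memory plus optimizer state:

import mxnet as mx
from mxnet import autograd, gluon

# One very wide dense layer: with 2**16 input features and 2**16 + 1 units the
# weight matrix has (2**16) * (2**16 + 1) > 2**32 elements, hitting the int64 index path.
net = gluon.nn.Dense(units=2**16 + 1)
net.initialize(mx.init.Xavier())
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.01})
loss_fn = gluon.loss.L2Loss()

x = mx.nd.random.normal(shape=(8, 2**16))
y = mx.nd.random.normal(shape=(8, 2**16 + 1))
for _ in range(3):  # a few steps are enough to exercise the updated optimizer kernels
    with autograd.record():
        loss = loss_fn(net(x), y)
    loss.backward()
    trainer.step(batch_size=8)
    print(loss.mean().asscalar())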

@ChaiBapchya
Contributor Author

So I tested MXNet (built from source using this branch) with these build flags:

python -c "from mxnet.runtime import feature_list; print(feature_list())"
[✔ CUDA, ✔ CUDNN, ✖ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✔ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✔ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✔ INT64_TENSOR_SIZE, ✔ SIGNAL_HANDLER, ✖ DEBUG, ✖ TVM_OP]

Results for training 10 epochs on 8 GPUs:

INFO:root:[Epoch 0] train=0.120292 val=0.158000 loss=6.658037 time: 109.734473
INFO:root:[Epoch 1] train=0.167548 val=0.179600 loss=2.297145 time: 92.212359
INFO:root:[Epoch 2] train=0.210777 val=0.237700 loss=2.109626 time: 92.110430
INFO:root:[Epoch 3] train=0.240705 val=0.255700 loss=2.032153 time: 92.476469
INFO:root:[Epoch 4] train=0.262039 val=0.273600 loss=1.976788 time: 94.570572
INFO:root:[Epoch 5] train=0.279728 val=0.302300 loss=1.915808 time: 91.655044
INFO:root:[Epoch 6] train=0.295393 val=0.309900 loss=1.868357 time: 94.903087
INFO:root:[Epoch 7] train=0.312901 val=0.331600 loss=1.825083 time: 94.501921
INFO:root:[Epoch 8] train=0.330889 val=0.334100 loss=1.788333 time: 95.653459
INFO:root:[Epoch 9] train=0.344211 val=0.349900 loss=1.757741 time: 94.065917

Is this fine?

@ChaiBapchya
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]
@apeforest

> @ChaiBapchya can you paste the opperf test run logs indicating these ops run fine without giving SIGSEGV?

>>> import mxnet as mx
>>> mx.nd.signum_update(weight=mx.nd.random_normal(shape=(2**32 + 1)), grad=mx.nd.random_normal(shape=(2**32 + 1)), mom=mx.nd.random_normal(shape=(2**32 + 1)), lr=.01)

[ 2.2022064   0.7840038   1.0334405  ...  0.18898012 -0.5907004
 -1.4777215 ]
<NDArray 4294967297 @cpu(0)>
>>> mx.nd.signsgd_update(weight=mx.nd.random_normal(shape=(2**32 + 1)), grad=mx.nd.random_normal(shape=(2**32 + 1)), lr=.01)

[ 0.15278001  1.7198559   0.14636855 ...  0.3357248  -0.22160508
  1.5340825 ]
<NDArray 4294967297 @cpu(0)>
>>> mx.nd.sgd_update(weight=mx.nd.random_normal(shape=(2**32 + 1)), grad=mx.nd.random_normal(shape=(2**32 + 1)), lr=.01)

[ 1.6252067   0.22516885  0.00959079 ... -0.688654    0.6969211
  0.00631838]
<NDArray 4294967297 @cpu(0)>
>>> mx.nd.sgd_mom_update(weight=mx.nd.random_normal(shape=(2**32 + 1)), grad=mx.nd.random_normal(shape=(2**32 + 1)), mom=mx.nd.random_normal(shape=(2**32 + 1)), lr=.01)

[ 0.9833377  -0.75289315  0.58504266 ... -1.0496317  -0.08228261
 -1.7657199 ]
<NDArray 4294967297 @cpu(0)>
>>> mx.nd.rmspropalex_update(weight=mx.nd.random_normal(shape=(2**32 + 1)), grad=mx.nd.random_normal(shape=(2**32 + 1)), n=mx.nd.random_normal(shape=(2**32 + 1)), lr=.01, g=mx.nd.random_normal(shape=(2**32 + 1)), delta=mx.nd.random_normal(shape=(2**32 + 1)))

[2.5003266         nan        nan ...        nan        nan 0.13144751]
<NDArray 4294967297 @cpu(0)>
>>> mx.nd.mp_sgd_update(weight=mx.nd.random_normal(shape=(2**32 + 1)), grad=mx.nd.random_normal(shape=(2**32 + 1)), weight32=mx.nd.random_normal(shape=(2**32 + 1)), lr=.01)

[ 1.1050267   0.6508057   0.13951734 ... -0.73946345  0.55659974
  1.9047947 ]
<NDArray 4294967297 @cpu(0)>
>>> mx.nd.mp_sgd_mom_update(weight=mx.nd.random_normal(shape=(2**32 + 1)), grad=mx.nd.random_normal(shape=(2**32 + 1)), mom=mx.nd.random_normal(shape=(2**32 + 1)), weight32=mx.nd.random_normal(shape=(2**32 + 1)), lr=.01)

[ 0.8880665  -1.852293    1.0043188  ... -0.5858472   0.554819
  0.26844773]
<NDArray 4294967297 @cpu(0)>
>>> mx.nd.ftml_update(weight=mx.nd.random_normal(shape=(2**32 + 1)), grad=mx.nd.random_normal(shape=(2**32 + 1)), d=mx.nd.random_normal(shape=(2**32 + 1)), v=mx.nd.random_normal(shape=(2**32 + 1)), z=mx.nd.random_normal(shape=(2**32 + 1)), lr=.01, t=1)

[ 0.05790505 -0.819279           nan ...         nan         nan
         nan]
<NDArray 4294967297 @cpu(0)>
>>> mx.nd.adam_update(weight=mx.nd.random_normal(shape=(2**32 + 1)), grad=mx.nd.random_normal(shape=(2**32 + 1)), mean=mx.nd.random_normal(shape=(2**32 + 1)), var=mx.nd.random_normal(shape=(2**32 + 1)), lr=.01)

[       nan -1.8923444        nan ...  1.6588118        nan        nan]
<NDArray 4294967297 @cpu(0)>

Previously these all gave SIGSEGV; now they don't. @access2rohit

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Jan 31, 2020
@szhengac
Contributor

szhengac commented Jan 31, 2020

> So I tested MXNet (built from source using this branch) with these build flags … Results for training 10 epochs on 8 GPUs … Is this fine?

Can you also test the optimizer op with a large sparse tensor? Currently, SGD, Adagrad, Adam, and FTRL support row_sparse weight and gradient.

@ChaiBapchya
Contributor Author

>>> import mxnet as mx
>>> from mxnet.test_utils import *
>>> w = rand_ndarray((2**32+1,1), 'row_sparse', density=1)
>>> mx.nd.adam_update(w,w,w,w,lr=0.1)
[00:00:47] ../src/executor/../operator/../common/utils.h:472: Optimizer with lazy_update = True detected. Be aware that lazy update with row_sparse gradient is different from standard update, and may lead to different empirical results. See https://mxnet.apache.org/api/python/optimization/optimization.html for more details.

<RowSparseNDArray 4294967297x1 @cpu(0)>
>>> a=mx.nd.adam_update(w,w,w,w,lr=0.1)
>>> a

<RowSparseNDArray 4294967297x1 @cpu(0)>

@ChaiBapchya
Contributor Author

@szhengac

Thanks for the help with passing a sparse array. This works, as we discussed offline:

import mxnet as mx
from mxnet.test_utils import *
w = rand_ndarray((2**32+1,1), 'row_sparse', density=1)
g = rand_ndarray((2**32+1,1), 'row_sparse', density=1)
m = mx.nd.zeros((2**32+1,1), stype='row_sparse')
v = mx.nd.zeros((2**32+1,1), stype='row_sparse')
ans=mx.nd.adam_update(w,g,m,v,lr=0.1)
ans.data.asnumpy()

Output

array([[ 0.19461787],
       [-0.3031752 ],
       [-0.18570909],
       ...,
       [        nan],
       [        nan],
       [        nan]], dtype=float32)

Contributor

@apeforest apeforest left a comment


LGTM

@apeforest apeforest merged commit b65db3c into apache:master Feb 10, 2020
zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
…tion (apache#17444)

* fix hard sigmoid

* change int i to index_t i for all Kernel Map functions

* fix lint

* size t indext fix
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
…tion (apache#17444)

* fix hard sigmoid

* change int i to index_t i for all Kernel Map functions

* fix lint

* size t indext fix
Labels
pr-awaiting-review PR is waiting for code review