[PERFORMANCE] [v1.x] Layer normalization code from Marian for CPU #19601
Conversation
Experiment with OMP_NUM_THREADS=4, times in seconds, on a c5.12xlarge:

| batch x channel | New code | MKL |
|---|---|---|
| 1 x 32 | 0.0000288 | 0.0000278 |
| 128 x 32 | 0.0000308 | 0.0000311 |
| 2560 x 32 | 0.0000712 | 0.0000672 |
| 4096 x 32 | 0.0000946 | 0.0000910 |
| 8192 x 32 | 0.0001597 | 0.0001523 |
| 16384 x 32 | 0.0002905 | 0.0002619 |
| 1 x 64 | 0.0000264 | 0.0000256 |
| 128 x 64 | 0.0000339 | 0.0000330 |
| 2560 x 64 | 0.0000829 | 0.0000972 |
| 4096 x 64 | 0.0001137 | 0.0001356 |
| 8192 x 64 | 0.0002027 | 0.0002435 |
| 16384 x 64 | 0.0003715 | 0.0004639 |
| 1 x 128 | 0.0000262 | 0.0000263 |
| 128 x 128 | 0.0000325 | 0.0000389 |
| 2560 x 128 | 0.0001074 | 0.0001580 |
| 4096 x 128 | 0.0001505 | 0.0002336 |
| 8192 x 128 | 0.0002861 | 0.0004481 |
| 16384 x 128 | 0.0005648 | 0.0008613 |
| 1 x 256 | 0.0000273 | 0.0000276 |
| 128 x 256 | 0.0000390 | 0.0000431 |
| 2560 x 256 | 0.0001533 | 0.0002811 |
| 4096 x 256 | 0.0002258 | 0.0004300 |
| 8192 x 256 | 0.0004300 | 0.0008464 |
| 16384 x 256 | 0.0010436 | 0.0017613 |
| 1 x 512 | 0.0000256 | 0.0000302 |
| 128 x 512 | 0.0000408 | 0.0000551 |
| 2560 x 512 | 0.0002444 | 0.0005225 |
| 4096 x 512 | 0.0003828 | 0.0008147 |
| 8192 x 512 | 0.0008832 | 0.0017192 |
| 16384 x 512 | 0.0058463 | 0.0074497 |
| 1 x 768 | 0.0000252 | 0.0000308 |
| 128 x 768 | 0.0000450 | 0.0000676 |
| 2560 x 768 | 0.0003440 | 0.0007719 |
| 4096 x 768 | 0.0005890 | 0.0013346 |
| 8192 x 768 | 0.0014946 | 0.0026145 |
| 16384 x 768 | 0.0089495 | 0.0113557 |
| 1 x 1024 | 0.0000285 | 0.0000308 |
| 128 x 1024 | 0.0000487 | 0.0000786 |
| 2560 x 1024 | 0.0004614 | 0.0010190 |
| 4096 x 1024 | 0.0008083 | 0.0017376 |
| 8192 x 1024 | 0.0059020 | 0.0075588 |
| 16384 x 1024 | 0.0116553 | 0.0146855 |

Benchmark program

```python
import mxnet as mx
import time

def time_procedure(shape, count):
    data = mx.nd.random_uniform(shape=shape, low=-1.0, high=1.0)
    factors = mx.nd.random_uniform(shape=(shape[-1],))
    mx.nd.waitall()
    begin = time.time()
    for i in range(0, count):
        out = mx.nd.LayerNorm(data, factors, factors)
    mx.nd.waitall()
    return (time.time() - begin) / count

count = 200
for channel in [32, 64, 128, 256, 512, 768, 1024]:
    for batch in [1, 128, 2560, 4096, 8192, 16384]:
        s = (batch, channel)
        timing = time_procedure(s, count)
        print("{:5d}x{:5d} | {:.7f}".format(s[0], s[1], timing))
```
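As a quick sanity check on the claimed speedups, the ratios for a few rows of the table above can be computed directly. This is a sketch; the row selection is mine, and the times are copied from the table:

```python
# Speedup of the new kernel over MKL for a few (batch, channel) rows,
# with (new_code, mkl) times in seconds copied from the table above.
rows = {
    (1, 32):     (0.0000288, 0.0000278),  # tiny batch: MKL slightly ahead
    (2560, 512): (0.0002444, 0.0005225),
    (4096, 768): (0.0005890, 0.0013346),  # widest gap among these rows
}
for (batch, channel), (new_s, mkl_s) in sorted(rows.items()):
    speedup = mkl_s / new_s  # >1 means the new code is faster than MKL
    print("{:5d}x{:4d}: {:.2f}x".format(batch, channel, speedup))
```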
Hey @kpuatamazon, thanks for submitting the PR.
CI supported jobs: [clang, windows-cpu, centos-cpu, miscellaneous, windows-gpu, unix-gpu, edge, sanity, unix-cpu, website, centos-gpu]
Lint broken?
Jenkins CI successfully triggered: [sanity]
@mxnet-bot run ci [sanity] Maybe #19604 fixed lint?
Jenkins CI successfully triggered: [sanity]
@mxnet-bot run ci [centos-cpu, centos-gpu, clang, edge, miscellaneous, unix-cpu, unix-gpu, website, windows-cpu, windows-gpu] These have been "Expected" for days; it seems the results got lost.
@mxnet-bot run ci [website] The bot didn't respond, is anybody home?
@mxnet-bot run ci [unix-cpu] #19081 seed 675318784 causes the test to fail in v1.x as well.
Jenkins CI successfully triggered: [unix-cpu]
I restarted the CI jobs a few times; it looks like it's passing now. Is it possible that the MKL implementation's performance might improve in the future? Should we keep it and hide it behind a build flag, making the Marian implementation the default?
Hi @samskalicky, as requested the MKL implementation is now behind a build option. My one-day-a-week contract ends 31 December 2020, so this is partly a goodbye and I hope to get this in. I will be in today and probably 28 December. Afterwards, I am just @kpu.
Thanks for adding the MKL option, LGTM!
I've merged the latest v1.x in and added the requested changes. Today is my last day. Hope it works.
What are the next steps for this PR? Is this ready to be merged? |
Description
Adds a CPU kernel for LayerNorm that handles the common case of axis = -1. This is based upon the implementation from Marian at https://github.com/marian-nmt/marian-dev/blob/3b468e462809fe42a01a717c8d9307c465e6c35e/src/tensors/cpu/tensor_operators.cpp#L1047-L1087 .
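For reference, the computation this operator performs (normalize over the last axis, then scale and shift) can be sketched in NumPy. This is a sketch of the operator's math, not the PR's C++ kernel, and the epsilon default is an assumption:

```python
import numpy as np

def layer_norm(data, gamma, beta, eps=1e-5):
    # Normalize each row over the last axis (the axis=-1 fast path this
    # PR targets), then apply the learned scale (gamma) and shift (beta).
    mean = data.mean(axis=-1, keepdims=True)
    var = data.var(axis=-1, keepdims=True)
    return (data - mean) / np.sqrt(var + eps) * gamma + beta

x = np.random.rand(4, 8).astype(np.float32)
out = layer_norm(x, np.ones(8, np.float32), np.zeros(8, np.float32))
print(out.mean(axis=-1))  # each row's mean is ~0 after normalization
```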
Compared to the MXNet-internal generic implementation, the kernel is 1.6-29x faster. When used in Sockeye, end-to-end translation is 14% faster.
Compared to the MKL implementation, the kernel is 0.9-2.28x as fast. Marian's kernel is faster than MKL for all tested channel widths greater than 32.
Checklist
Essentials
There is an existing test `test_operator.py:test_layer_norm` that covers this well, and it passes.

Changes
Benchmarks
Speed
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_MKLDNN=ON -DUSE_CUDA=OFF -DUSE_TVM_OP=OFF -DUSE_MKL_IF_AVAILABLE=OFF -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 -GNinja
except for the MKL case, when `-DUSE_MKL_IF_AVAILABLE=ON` was used
export OMP_NUM_THREADS=4
Benchmark program
Here are the results (in seconds). Yes, I included the first run. Make your JIT faster.
AWS Sockeye
Observed a 14% speed up in end-to-end machine translation with Sockeye. Sockeye 2.2 (29795b82) on a c5.12xlarge with
export OMP_NUM_THREADS=4
translating a test set. Compiled on Ubuntu 18 with
cmake -DCMAKE_BUILD_TYPE=Release -DUSE_MKLDNN=ON -DUSE_CUDA=OFF -DUSE_TVM_OP=OFF -DUSE_MKL_IF_AVAILABLE=OFF -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 -GNinja ..
Note: no MKL.

Before
After
The above runs were done as normal, without the profiler. I then turned the profiler on. We can see that LayerNorm is consuming a substantial amount of time:
Before
After
The new implementation is 7.21x as fast on average according to the profiler.
The number of LayerNorm invocations changes by 0.02% because beam search iterations are impacted by tie breaking.
Unit test
Before: 62.210s
After: 61.321s
But note that unit tests spend most of their time comparing results rather than running the kernels.
Comments