Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Discussion] Overhead in MXNet Execution #14883

Open
szha opened this issue May 5, 2019 · 12 comments
Open

[Discussion] Overhead in MXNet Execution #14883

szha opened this issue May 5, 2019 · 12 comments

Comments

@szha
Copy link
Member

szha commented May 5, 2019

Hi,

I'm starting this thread to collect issues that are likely caused by various overhead in MXNet execution so that we can focus the efforts in solving them. The focus is everything other than the operators and kvstores.

If you've seen performance issues that you found were due to the overhead in MXNet execution (e.g. ctypes/FFI call, asynchronous execution, graph construction overhead, memory release and clean-up), please either report it in comments or link relevant issue.

If you're unsure, please create a performance issue report in another github issue instead.

Be sure to include these details:

  • the performance issue.
  • what you tried to solve it.
  • timings you've collected if any.
@mxnet-label-bot

This comment has been minimized.

@rvardimon
Copy link

#15148

@Jerryzcn
Copy link
Contributor

Jerryzcn commented Aug 30, 2019

MXNet function call overhead is quite high
NumPy call overhead: 0.7 microsecond
TVM call overhead: 3.4 microsecond
MXNet call overhead: 28.3 microsecond

Code attached for benchmarking the function call overhead.

import numpy as np
from matplotlib import pyplot as plt
from IPython import display

def benchmark(func, n_start, n_end, n_stride=1):
    avg_times, sizes = [], (2**np.arange(n_start, n_end, n_stride))
    np.random.seed(0)
    for size in sizes:
        avg_times.append(func(size))
    return sizes, np.array(avg_times)

def np_copy(size):
    x = np.random.normal(size=size).astype('float32')
    y = np.empty_like(x)
    res = %timeit -o -q np.copyto(y, x)
    return res.average

_, times = benchmark(np_copy, 1, 8)
print('NumPy call overhead: %.1f microsecond'% (times.mean()*1e6,))

import tvm

def tvm_copy(size):
    x = np.random.normal(size=size).astype('float32')
    y = np.empty_like(x)
    x, y = tvm.nd.array(x), tvm.nd.array(y)
    res = %timeit -o -q x.copyto(y)
    return res.average

_, times = benchmark(tvm_copy, 1, 8)
print('TVM call overhead: %.1f microsecond'% (times.mean()*1e6,))

import mxnet as mx

def mx_copy(size):
    x = np.random.normal(size=size).astype('float32')
    y = np.empty_like(x)
    x, y = mx.nd.array(x), mx.nd.array(y)
    res = %timeit -o -q x.copyto(y)
    return res.average

_, times = benchmark(mx_copy, 1, 8)
print('MXNet call overhead: %.1f microsecond'% (times.mean()*1e6,))

@Jerryzcn
Copy link
Contributor

Jerryzcn commented Aug 30, 2019

issue also reported here: #8112

@sxjscience
Copy link
Member

Some profiling results:
https://gist.github.com/sxjscience/f069d49b250820be4483debfa9a8d4b5

Here I used NaiveEngine

MXNet: 25.77195848737444
JAX: 58.22208949497768
Numpy: 1.0632106236049108
PyTorch: 13.021196637834823
TensorFlow: 21.734169551304408

@sxjscience
Copy link
Member

The overhead is 33.87945038931711 after we used the default Engine (comment NaiveEngine line)

@sxjscience
Copy link
Member

sxjscience commented Sep 11, 2019

In P3.2x:

With NaiveEngine:

MXNet: 49.990619931902195
JAX: 132.80538150242398
TVM: 4.161664417811803
Numpy: 1.8179416656494143
PyTorch: 20.977394921439036
TensorFlow: 47.540392194475444

With ThreadedEngine

MXNet: 61.272689274379196
JAX: 141.04713712419783
TVM: 4.231657300676618
Numpy: 1.918860844203404
PyTorch: 20.21517072405134
TensorFlow: 44.17456899370466

@szha
Copy link
Member Author

szha commented Nov 27, 2019

One idea that gained some popularity after discussion is to introduce an engine-less mode to MXNet, in which the operators are exposed in API and dispatched in a similar way as pytorch.

Given that Naive Engine option should be quite close to this already, we should verify the overhead in the Naive Engine mode and judge the necessity based on that result. Since the target is to measure the overhead, we will want to control the performance difference in operators.

@Kh4L
Copy link
Contributor

Kh4L commented Nov 27, 2019

@sxjscience what do the differences between Naive and ThreadedEngine reflect in the other frameworks?

@eric-haibin-lin
Copy link
Member

proposal in #17097

@tqchen
Copy link
Member

tqchen commented Dec 19, 2019

i just would like to comment a bit about the tvm related benchmark. The copyto function was not really a good benchmark for TVM's FFI cost mainly because that function was implemented directly via ctypes because we didn't find it to be a bottleneck. This functionality could have been turned into cython(which is used to support the core functionality), which could speedup the overhead of the call.

The current best way test the FFI is through Packed function calls. PackedFunc is the place where most of the API functions are exposed, here I cross posted a quick benchmark on #17097

Note that it is important to compile TVM with Cython by typing make cython3 in the root folder. TVM by default uses cython if it is available, but we use the TVM_FFI env variable to make sure it is the case.

import timeit

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop(x, y)')
timer.timeit(1)
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat)

Results on my laptop(macbook pro 13inch)

$ TVM_FFI=cython python3 test.py
9.749500000000299e-08
$ TVM_FFI=ctypes python3 test.py
7.226789000000011e-06

@tqchen
Copy link
Member

tqchen commented Dec 19, 2019

To followup this again, the patch apache/tvm#4549 fixed the copyto issue by cythonize it. Here is an updated script to add benchmark of copyto

import timeit

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop(1, 2)')
timer.timeit(1)
num_repeat = 1000
print("TVM-nop:", timer.timeit(num_repeat) / num_repeat)


setup = """
import numpy as np
import tvm
x = np.random.normal(size=1).astype('float32')
y = np.empty_like(x)
"""

timer = timeit.Timer(setup=setup,
                     stmt='np.copyto(y, x)')
timer.timeit(1)
print("numpy.copyto:", timer.timeit(num_repeat) / num_repeat)


setup = """
import numpy as np
import tvm
x = np.random.normal(size=1).astype('float32')
y = np.empty_like(x)
x = tvm.nd.array(x)
y = tvm.nd.array(y)
"""

timer = timeit.Timer(setup=setup,
                     stmt='x.copyto(y)')
timer.timeit(1)
print("tvm.copyto:", timer.timeit(num_repeat) / num_repeat)

Results on my laptop(macbook pro 13inch)

$TVM_FFI=ctypes python test.py
TVM-nop: 5.864649e-06
numpy.copyto: 3.7303100000002364e-07
tvm.copyto: 8.198690000000008e-07

$ TVM_FFI=cython python test.py
TVM-nop: 1.1101499999999209e-07
numpy.copyto: 2.519029999999978e-07
tvm.copyto: 2.3881700000000226e-07

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants