[Discussion] Overhead in MXNet Execution #14883
MXNet's function call overhead is quite high. A script for benchmarking the function call overhead is attached.
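The attached script is not reproduced here, but the methodology can be sketched with `timeit`. The snippet below is a minimal stand-in (an assumption, not the reporter's actual benchmark): it compares a plain Python call with a ctypes call into libc, since ctypes is the same FFI layer the MXNet Python frontend goes through.

```python
import ctypes
import timeit

# Minimal sketch (not the attached MXNet script): per-call cost of a
# plain Python call vs. a ctypes FFI call. CDLL(None) returns a handle
# to the current process on POSIX systems, exposing libc symbols.
libc = ctypes.CDLL(None)

def py_nop():
    pass

n = 100_000
t_py = timeit.timeit(py_nop, number=n) / n
t_ffi = timeit.timeit(libc.getpid, number=n) / n
print(f"python call: {t_py * 1e9:.1f} ns/call")
print(f"ctypes call: {t_ffi * 1e9:.1f} ns/call")
```

The ctypes call is typically several times slower than the pure-Python call, which is the kind of per-call gap the benchmarks in this thread try to quantify.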
Issue also reported here: #8112
Some profiling results, using NaiveEngine:

[profiling output not preserved]

The overhead on a p3.2xlarge instance, with NaiveEngine and with ThreadedEngine:

[measurements not preserved]
One idea that gained some popularity in the discussion is to introduce an engine-less mode to MXNet, in which the operators are exposed in the API and dispatched in a similar way to PyTorch. Given that the NaiveEngine option should already be quite close to this, we should first measure the overhead in NaiveEngine mode and judge the necessity of an engine-less mode from that result. Since the target is to measure the overhead, we will want to control for performance differences in the operators themselves.
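The trade-off behind an engine-less mode can be illustrated with a toy model (an assumption for illustration, not MXNet's actual engine code): dispatching an operator directly versus wrapping each call in a per-call task object, mimicking the bookkeeping an asynchronous engine adds.

```python
import timeit

def op(x, y):
    # Stand-in for a cheap operator whose cost is dominated by dispatch.
    return x + y

# "Engine-less": call the operator directly.
def direct():
    return op(1, 2)

# Toy "engine": allocate a task object per call before running it,
# mimicking per-call scheduling bookkeeping.
class Task:
    __slots__ = ("fn", "args")
    def __init__(self, fn, args):
        self.fn, self.args = fn, args
    def run(self):
        return self.fn(*self.args)

def via_engine():
    return Task(op, (1, 2)).run()

n = 200_000
t_d = timeit.timeit(direct, number=n) / n
t_e = timeit.timeit(via_engine, number=n) / n
print(f"direct: {t_d * 1e9:.0f} ns/call")
print(f"engine: {t_e * 1e9:.0f} ns/call")
```

A real engine also pays for dependency tracking and thread hand-off, so this understates the gap; it only shows why the measurement should hold operator cost constant.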
@sxjscience what do the differences between NaiveEngine and ThreadedEngine correspond to in the other frameworks?
Proposal in #17097
I just would like to comment a bit on the TVM-related benchmark. The `copyto` function was not really a good benchmark for TVM's FFI cost, mainly because that function was implemented directly via ctypes; we didn't find it to be a bottleneck. It could have been moved to Cython (which is used for the core functionality), which would reduce the call overhead. The current best way to test the FFI is through PackedFunc calls, since PackedFunc is the place where most of the API functions are exposed. Here I cross-posted a quick benchmark on #17097.

Note that it is important to compile TVM with Cython by running `make cython3` in the root folder. TVM uses Cython by default if it is available, but we use the `TVM_FFI` environment variable to make sure that is the case.

```python
import timeit

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup, stmt='nop(x, y)')
timer.timeit(1)  # warm up
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat)
```

Results on my laptop (13-inch MacBook Pro):

[results not preserved]
To follow up on this again: the patch apache/tvm#4549 fixed the `copyto` issue by cythonizing it. Here is an updated script that also benchmarks `copyto`:

```python
import timeit

# TVM nop call: measures FFI overhead only.
setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup, stmt='nop(1, 2)')
timer.timeit(1)  # warm up
num_repeat = 1000
print("TVM-nop:", timer.timeit(num_repeat) / num_repeat)

# NumPy copyto as a baseline.
setup = """
import numpy as np
import tvm
x = np.random.normal(size=1).astype('float32')
y = np.empty_like(x)
"""
timer = timeit.Timer(setup=setup, stmt='np.copyto(y, x)')
timer.timeit(1)  # warm up
print("numpy.copyto:", timer.timeit(num_repeat) / num_repeat)

# TVM copyto after the Cython fix.
setup = """
import numpy as np
import tvm
x = np.random.normal(size=1).astype('float32')
y = np.empty_like(x)
x = tvm.nd.array(x)
y = tvm.nd.array(y)
"""
timer = timeit.Timer(setup=setup, stmt='x.copyto(y)')
timer.timeit(1)  # warm up
print("tvm.copyto:", timer.timeit(num_repeat) / num_repeat)
```

Results on my laptop (13-inch MacBook Pro):

[results not preserved]
Hi,
I'm starting this thread to collect issues that are likely caused by various kinds of overhead in MXNet execution, so that we can focus our efforts on solving them. The scope is everything other than the operators and KVStore.
If you've seen performance problems that you found were due to overhead in MXNet execution (e.g. ctypes/FFI calls, asynchronous execution, graph construction overhead, memory release and clean-up), please either report them in the comments or link the relevant issue.
If you're unsure, please create a performance issue report in another GitHub issue instead.
Be sure to include these details: