[Numpy] Fix collect_params().zero_grad() in gluon numpy interface #16716
Conversation
python/mxnet/gluon/parameter.py (Outdated)

```python
            for ele in arr:
                ele[:] = 0
        else:
            mx.nd.reset_arrays(*arr, num_arrays=len(arr))
```
why not always use in-place assign?
I'm not sure why we used reset_arrays before. I guess it would be faster when there are multiple arrays.
if that's the case, then we need its equivalent in the npx namespace @reminisce
Can you add an alias _npi_reset_arrays in reset_arrays.cc?
I've checked the source code. The new approach should be fine as long as we use cudaMemsetAsync for implementing ele[()] = 0. In fact, reset_arrays.cc lies in the contrib folder and there is no need to add it to numpy.
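Concretely, the approach described above would look roughly like the following hedged sketch. The helper name `zero_grad_sketch` and the `arrays_by_ctx` argument are illustrative, not the actual parameter.py code; per the later discussion, `ele[()] = 0` is used for numpy-style arrays and the fused reset_arrays op is kept for classic ndarrays.

```python
import mxnet as mx
from mxnet.util import is_np_array

def zero_grad_sketch(arrays_by_ctx):
    """Hypothetical helper mirroring the zeroing logic discussed in this thread.

    `arrays_by_ctx` maps each context to the list of gradient arrays held on it.
    """
    for arr in arrays_by_ctx.values():
        if is_np_array():
            # numpy-style ndarrays: plain in-place assignment per gradient
            for ele in arr:
                ele[()] = 0
        else:
            # classic ndarrays: keep the fused contrib op that zeroes many arrays at once
            mx.nd.reset_arrays(*arr, num_arrays=len(arr))
```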
Sounds good to me.
I don't think this is worth having a diverging implementation. If reset_arrays is not useful then we should stay away from it.
Shall we move away from reset_arrays in the old ndarray too?
I actually do not know why we've used reset_arrays. This op should be in contrib, yet it's now in the main API. I think this is somewhat out of the scope of this PR.
@reminisce @szha I've added the test. Should be ready for review.
@sxjscience This concern is not addressed yet.
I'm not convinced that we should have divergence in implementation for such a simple task as resetting the gradients.
Because we need to use a[:] = 0 for the original ndarray and a[()] = 0 for the new numpy ndarray, we are not able to share the implementation.
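For illustration, a minimal example of the two assignment styles. This assumes mxnet.numpy follows NumPy's indexing rules, where `[:]` is rejected on 0-dim arrays; that is my reading of why `[()]` is needed, and the variable names are made up.

```python
import mxnet as mx
from mxnet import np, npx

npx.set_np()  # enable the numpy-style front end

classic = mx.nd.ones((3,))
classic[:] = 0             # classic NDArray: slice assignment zeroes the array

dense_grad = np.ones((3,))
dense_grad[()] = 0         # numpy-style ndarray: the empty-tuple index zeroes any rank,
scalar_grad = np.array(1.0)
scalar_grad[()] = 0        # including 0-dim arrays, where `scalar_grad[:] = 0` would raise
```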
I guess the main purpose is to accelerate the initialization of a huge number of NDArrays. Adding a reset_arrays op is appropriate as a short-term solution. However, numpy serves as the first step of our longer-term evolution and we should consider solving it in a different way.
I think
@szha If these operators are executed in bulk mode there will be no StreamSynchronize in-between. I'm actually investigating the root cause of the overhead and I think it could be solved. Also, we could later fuse these initialization kernels into a single kernel. Adding a new operator is not the best way and would not suit our long-term goal, i.e., MXNet 2.0.
I'm personally not fond of using the operator
@reminisce No. That is not how it works, and please do not undo performance optimizations because of such false presumptions (especially since this is not user-facing code).

Let me give you a little bit of data and context - when pretraining BERT, zeroing of gradient arrays (which happened once per few iterations) took ~5% of time because of this approach of launching each zeroing as a separate operator. That is ridiculously high overhead. The actual GPU cost of this zeroing was minuscule, but the cost of the operator launch and sync at the end is 10+x that. About the fact that the new implementation uses cudaMemsetAsync per array - frankly, that was a "good enough" fix for that (BERT) use case. The actual full fix would mean writing an aggregated operator that zeroes all the arrays inside a single kernel - I bet this would increase the performance of zeroing gradients by an additional 2x or more.

This comment:
together with
scare me, frankly. Doing numpy-like fully imperative execution has no chance of being actually fast, and leaning in that direction will not make MXNet any better. Gluon seems like a good compromise - sure, do imperative if you are just debugging, but hybridize when you are actually serious about getting something deployable. And the fix proposed in this PR - if this is a NumPy-like array then do the slow thing - is super bad practice, especially since, as I understand it, the numpy style will be promoted going forward.
@ptrendx I think you have misunderstood my comment. What I mean is that we could try to fuse these zeroing operators into a single one instead of writing a new operator to do that.
@sxjscience How would you like to fuse them without writing a new operator?
@ptrendx For example, we can later try to hybridize part of the imperative code using the NVRTC approach that you used in the fused op, or rely on TVM. I'm not here to blame the reset_arrays op. My concern is that this feature is in
@ptrendx The problem of the
@ptrendx The performance overhead in your benchmark really comes from the FFI and pushing ops to the async engine. It becomes more obvious when the kernel execution is negligible. We are working on reducing the operator calling overhead. Apart from that, just from pure code analysis,
I'm not sure what is the
Leaning towards numpy does not mean throwing away the mixing of symbolic and imperative. It's more about making the front-end interface numpy-like. Also, doing slow things is not always bad. Sometimes it improves readability and makes the package more researcher-friendly.
Another big factor that may contribute to the slowdown of assigning zeros is through
Ok, let me address those comments 1 point at a time :-).
@ptrendx
To avoid too frequent CUDA stream synchronization for the arrays to be zeroed without introducing a new operator, I think we can put the assignment loop into the bulk scope so that there should be only one stream synchronization in the end.

```python
with mx.engine.bulk(len(arrays)):
    for arr in arrays:
        arr[:] = 0
```
@reminisce Huh, I did not know about that way of trying to bulk the imperative execution. You are right that if it worked well then that would solve this issue. Unfortunately, I tested it with this script:
and got those results:
(I also tried with the
@ptrendx What's your suggestion for this PR? I think we can create another PR to promote reset_arrays to the main API and also add numpy support if we find that it's the best solution.

In fact, a similar problem also happens for global gradient clipping, in which we need to calculate the L2 norms of all the parameters. Currently, it's implemented as the summation of individual L2 norms: I have had the plan to write a large global_norm kernel to accelerate this part. The problem with this approach in general is that it won't scale. Consider the case of an arbitrary optimization algorithm, e.g., ADAM. ADAM does not use the block-wise structure of the parameters and updates them by treating all parameters + gradients as one vector. We can certainly accelerate the speed by concatenating the parameters together and directly writing a kernel for that. The
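To make the gradient-clipping example concrete, here is a rough sketch of the "summation of individual L2 norms" approach described above; the helper name is made up, and only existing mx.nd ops are used.

```python
import mxnet as mx

def global_norm_sketch(arrays):
    """Illustrative only: global L2 norm via per-array norms, one operator launch per array."""
    # every `norm` (and the squaring) is pushed to the engine as its own op,
    # so the per-operator overhead discussed in this thread scales with the
    # number of parameter arrays
    squared_norms = [a.norm() ** 2 for a in arrays]
    return mx.nd.sqrt(mx.nd.add_n(*squared_norms))
```

A fused global_norm kernel would replace the whole loop and the add_n with a single launch, which is the scalability concern raised above.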
@ptrendx Thanks for the script. I think a large part of overhead for zeroing ndarrays individually in Python comes from ndarray indexing, FFI, and pushing operators to the async engine. I modified your script a little bit to demonstrate the point.

```python
import mxnet as mx
import time

arrays = [mx.nd.ones((100,100), ctx=mx.gpu()) for _ in range(500)]
for a in arrays:
    a[:] = 0
num_repeats = 10

mx.nd.waitall()
start = time.time()
#for _ in range(num_repeats):
for a in arrays:
    mx.nd.zeros(a.shape, out=a)
end = time.time()
#print("async push per `mx.nd.zeros`: Elapsed ", (end - start) / num_repeats / len(arrays))
print("async push per `mx.nd.zeros`: Elapsed ", (end - start) / len(arrays))

mx.nd.waitall()
start = time.time()
for _ in range(num_repeats):
    for a in arrays:
        mx.nd.zeros(a.shape, out=a)
mx.nd.waitall()
end = time.time()
#print("normal: Elapsed ", (end - start) / num_repeats)
print("normal: Elapsed ", (end - start))

mx.nd.waitall()
start = time.time()
for _ in range(num_repeats):
    with mx.engine.bulk(len(arrays)):
        for a in arrays:
            mx.nd.zeros(a.shape, out=a)
mx.nd.waitall()
end = time.time()
#print("bulk: Elapsed ", (end - start) / num_repeats)
print("bulk: Elapsed ", (end - start))

mx.nd.waitall()
start = time.time()
for _ in range(100):
    mx.nd.reset_arrays(*arrays, num_arrays=len(arrays))
end = time.time()
print("async push per `reset_arrays`: Elapsed ", (end - start) / 100)
#print("reset_arrays: Elapsed ", (end - start) / num_repeats)

mx.nd.waitall()
start = time.time()
for _ in range(num_repeats):
    mx.nd.reset_arrays(*arrays, num_arrays=len(arrays))
mx.nd.waitall()
end = time.time()
print("reset_arrays: Elapsed ", (end - start))
#print("reset_arrays: Elapsed ", (end - start) / num_repeats)
```

and got results
If you calculate the overhead of invoking zeroing 500 ndarrays with 10 repeats (roughly excluding the kernel execution time), it's
I agree in this situation, we should keep
@szha Is it good to merge?
@szha @ptrendx From the discussion, I think we have aligned to keep