[TOPI] Fix GPU Dynamic Op Schedule #7117
Conversation
A couple of nitpicks, but overall, it looks great, awesome work.
mod,
[np_indices_result, np_valid_box_count],
only_vm=False,
disable_targets=["nvptx"],
This tests the empty output VM change 👍
Why disable nvptx?
There is an issue causing a segfault from dynamic NMS on nvptx, and in general we need Thrust for any dynamic-shape sorting. For now, nvptx is not ready for these operations.
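(For reference, a minimal sketch of how one can probe for Thrust at runtime before relying on it for dynamic-shape sorting; this assumes the usual contrib registration name and is not code from this PR.)

import tvm

# Sketch: returns True only if TVM was built with the CUDA Thrust contrib,
# so callers can fall back or skip targets that cannot sort dynamic shapes.
def thrust_available():
    return tvm.get_global_func("tvm.contrib.thrust.sort", allow_missing=True) is not None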
Makes sense. I'm trying to fix the default sort kernel in #7099, if you want to take a look.
@@ -199,6 +199,15 @@ def test_any_concat():
ref = np.concatenate([x_np - 3.0, y_np * 5.0], axis=0)
check_result([x_np, y_np], mod, ref)

num_inputs = 25
x = [relay.var("x", shape=(relay.Any(),), dtype="float32") for _ in range(num_inputs)]
z = relay.op.concatenate(x, axis=0)
This tests the injective schedule 👍
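For context, a plausible completion of this test hunk, following the check_result pattern used earlier in test_any.py (the random input sizes below are illustrative, not taken from the PR):

mod = tvm.IRModule()
mod["main"] = relay.Function(x, z)
# 25 dynamically shaped 1-D inputs exercise the fused injective (concatenate) schedule.
x_np = [np.random.uniform(size=(np.random.randint(1, 10),)).astype("float32") for _ in range(num_inputs)]
ref = np.concatenate(x_np, axis=0)
check_result(x_np, mod, ref)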
@@ -754,7 +782,22 @@ def non_max_suppression(
)
score_axis = score_index
score_shape = (batch_size, num_anchors)
score_tensor = te.compute(score_shape, lambda i, j: data[i, j, score_axis], tag=tag.ELEMWISE)
data_buf = tvm.tir.decl_buffer(data.shape, data.dtype, "data_buf", data_alignment=8)
This looks fine, but I'm a little surprised it's necessary. Do you have a test case that breaks the current code, or is this mostly for performance?
When the NMS workload is large, as in RCNN models, the general CUDA injective schedule can still cause runtime errors even with the improvements in this PR. In general, any dynamic injective op can hit runtime issues with the current uniform CUDA injective schedule.
This problem is not directly related to NMS, but to the CUDA injective schedule. Later we might need to revisit this part for GPU dynamic ops and come up with a better, more general solution (together with more tests).
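To illustrate the kind of fix involved, here is a minimal sketch of a bounded CUDA injective schedule. This is an illustration of the technique, not the PR's actual code; num_thread and max_block are assumed values, and in practice the thread count would come from the target's max_num_threads.

import tvm
from tvm import te

def schedule_injective_bounded(out, num_thread=1024, max_block=256):
    # Sketch: bind an elementwise op so blockIdx.x never exceeds max_block,
    # even when the fused extent is dynamic or very large.
    s = te.create_schedule(out.op)
    fused = s[out].fuse(*s[out].op.axis)
    # Outer split: each (block, thread) pair walks its remaining elements
    # sequentially via xo, so the launched grid stays bounded.
    xo, xi = s[out].split(fused, factor=num_thread * max_block)
    bx, tx = s[out].split(xi, factor=num_thread)
    s[out].reorder(bx, tx, xo)
    s[out].bind(bx, te.thread_axis("blockIdx.x"))
    s[out].bind(tx, te.thread_axis("threadIdx.x"))
    return s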
@@ -194,6 +197,8 @@ def _callback(op):

if cfg.is_fallback:
    N, F, Y, X = get_const_tuple(conv.shape)
    if not isinstance(N, int):
        N = 1
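As a quick standalone illustration of why the guard is needed (n here is a symbolic size var standing in for the relay.Any() batch dimension the dynamic path produces):

import tvm
from tvm import te
from tvm.topi.utils import get_const_tuple

n = te.size_var("n")                      # stands in for a dynamic (Any) batch dim
conv = te.placeholder((n, 64, 56, 56), name="conv")
N, F, Y, X = get_const_tuple(conv.shape)
# N comes back as a symbolic expression, not a Python int,
# so the fallback config conservatively assumes a batch size of 1.
print(isinstance(N, int))                 # False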
Can we add a test that hits this change?
Yeah, we do have a test for this. I've now enabled all targets for it.
LGTM
Thanks @mbrookhart
* Fix GPU dynamic op schedules
* Fix dynamic shape nms
* Fix
* Fix test format
This PR limits the resources used by dynamic-shape GPU kernels to avoid runtime errors. It also skips CallPacked in the VM when a kernel has only one output and that output is empty, e.g. of shape (1, 0, 6).

After this PR, TF and PT object detection models should be runnable on Nvidia GPUs.
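As a small illustration of the empty-output condition (a hypothetical helper, not the VM's actual code):

# A tensor is "empty" when any dimension is zero, e.g. shape (1, 0, 6);
# there is nothing to compute, so the VM can skip the packed call for it.
def is_empty(shape):
    return any(int(dim) == 0 for dim in shape)

print(is_empty((1, 0, 6)))   # True  -> skip the kernel
print(is_empty((1, 5, 6)))   # False -> run the kernel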
@zhiics @Laurawly @mbrookhart