Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TRACE_EVAL > 1 #1835

Merged
merged 4 commits into from
Jun 14, 2023
Merged

Fix TRACE_EVAL > 1 #1835

merged 4 commits into from
Jun 14, 2023

Conversation

umangyadav
Copy link
Member

TRACE_EVAL assumes all instructions always produce results on Target.
It tries to copy results from the target explicitly.

Some instructions like NMS , ROIAlign are run on the host in which case the following would break.

https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/c900e382a1344ae286166708a5f77e5fe709370c/src/program.cpp#L558

For example when trying to run test_nms with MIGRAPHX_TRACE_EVAL=2, it fails on copy_from().

root@3355683a62d6:~/repo/AMDMIGraphX/build(fix_trace_eval)# ./bin/test_verify "test_nms"
[   RUN    ] test_nms
Run instruction: @0 = @literal{0} -> float_type, {1}, {0}, target_id=0
Time: 0.01743ms, 0.01776ms
Run instruction: @1 = @literal{0.5} -> float_type, {1}, {0}, target_id=0
Time: 0.00055ms, 0.00063ms
Run instruction: @2 = @literal{4} -> int64_type, {1}, {0}, target_id=0
Time: 0.00089ms, 0.00097ms
Run instruction: @3 = @literal{0.9, 0.75, 0.6, 0.95, 0.5, 0.3} -> float_type, {1, 1, 6}, {6, 6, 1}, target_id=0
Time: 0.00067ms, 0.00075ms
Run instruction: boxes = @param:boxes -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
Time: 0.01173ms, 0.01186ms
Run instruction: @5 = ref::nonmaxsuppression[center_point_box=1,use_dyn_output=0](boxes,@3,@2,@1,@0) -> int64_type, {6, 3}, {3, 1}, target_id=0
Time: 0.15231ms, 0.15256ms
Output has normal, zero
Output: 0, 0, 3, 0, 0, ..., 0, 0, 0, 0, 0
Run instruction: @0 = check_context::migraphx::gpu::context  -> float_type, {}, {}, target_id=0
Time: 0.01013ms, 0.01098ms
Run instruction: boxes = @param:boxes -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
Time: 0.00129ms, 0.00185ms
Run instruction: @2 = hip::copy_from_gpu(boxes) -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
Time: 4.32311ms, 4.33639ms
Output has normal
Output: -0.25, 0.75, -0.625, -0.0625, 0.625, ..., -0.75, 0.1875, -0.25, -0.5625, -0.5625
Run instruction: @3 = hip::hip_copy_literal[id=main:@literal:0] -> float_type, {1}, {0}, target_id=0
Time: 0.01117ms, 0.01179ms
Output has zero
Output: 0
Run instruction: @4 = hip::hip_copy_literal[id=main:@literal:1] -> float_type, {1}, {0}, target_id=0
Time: 0.00061ms, 0.00122ms
Output has normal
Output: 0.5
Run instruction: @5 = hip::hip_copy_literal[id=main:@literal:2] -> int64_type, {1}, {0}, target_id=0
Time: 0.00039ms, 0.00076ms
Output has normal
Output: 4
Run instruction: @6 = hip::hip_copy_literal[id=main:@literal:3] -> float_type, {1, 1, 6}, {6, 6, 1}, target_id=0
Time: 0.00034ms, 0.00071ms
Output has normal
Output: 0.9, 0.75, 0.6, 0.95, 0.5, 0.3
Run instruction: @7 = hip::copy_from_gpu(@3) -> float_type, {1}, {0}, target_id=0
Time: 0.03816ms, 0.05154ms
Output has zero
Output: 0
Run instruction: @8 = hip::copy_from_gpu(@4) -> float_type, {1}, {0}, target_id=0
Time: 0.02761ms, 0.04129ms
Output has normal
Output: 0.5
Run instruction: @9 = hip::copy_from_gpu(@5) -> int64_type, {1}, {0}, target_id=0
Time: 0.03735ms, 0.05101ms
Output has normal
Output: 4
Run instruction: @10 = hip::copy_from_gpu(@6) -> float_type, {1, 1, 6}, {6, 6, 1}, target_id=0
Time: 0.02604ms, 0.03947ms
Output has normal
Output: 0.9, 0.75, 0.6, 0.95, 0.5, 0.3
Run instruction: @11 = hip::sync_stream(@2,@10,@9,@8,@7) -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
Time: 0.0096ms, 0.00994ms
Output has normal
Output: -0.25, 0.75, -0.625, -0.0625, 0.625, ..., -0.75, 0.1875, -0.25, -0.5625, -0.5625
Run instruction: @12 = nonmaxsuppression[center_point_box=1,use_dyn_output=0](@11,@10,@9,@8,@7) -> int64_type, {6, 3}, {3, 1}, target_id=0
Time: 0.02351ms, 0.02394ms

module: "main"
@0 = check_context::migraphx::gpu::context  -> float_type, {}, {}, target_id=0
boxes = @param:boxes -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
@2 = hip::copy_from_gpu(boxes) -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
@3 = hip::hip_copy_literal[id=main:@literal:0] -> float_type, {1}, {0}, target_id=0
@4 = hip::hip_copy_literal[id=main:@literal:1] -> float_type, {1}, {0}, target_id=0
@5 = hip::hip_copy_literal[id=main:@literal:2] -> int64_type, {1}, {0}, target_id=0
@6 = hip::hip_copy_literal[id=main:@literal:3] -> float_type, {1, 1, 6}, {6, 6, 1}, target_id=0
@7 = hip::copy_from_gpu(@3) -> float_type, {1}, {0}, target_id=0
@8 = hip::copy_from_gpu(@4) -> float_type, {1}, {0}, target_id=0
@9 = hip::copy_from_gpu(@5) -> int64_type, {1}, {0}, target_id=0
@10 = hip::copy_from_gpu(@6) -> float_type, {1, 1, 6}, {6, 6, 1}, target_id=0
@11 = hip::sync_stream(@2,@10,@9,@8,@7) -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
@12 = nonmaxsuppression[center_point_box=1,use_dyn_output=0](@11,@10,@9,@8,@7) -> int64_type, {6, 3}, {3, 1}, target_id=0
main:#output_0 = @param:main:#output_0 -> int64_type, {6, 3}, {3, 1}, target_id=0
@14 = hip::copy_to_gpu(@12,main:#output_0) -> int64_type, {6, 3}, {3, 1}, target_id=0
@15 = @return(@14), target_id=0


FAILED: test_nms
    what(): /home/umayadav/repo/AMDMIGraphX/src/targets/gpu/hip.cpp:152: read_from_gpu: Copy from gpu failed: invalid argument

Aborted (core dumped)

@TedThemistokleous TedThemistokleous added the bugfix Fixes a bug found in the code. label Jun 12, 2023
@CharlieL7
Copy link
Collaborator

Fixes the issue I was seeing with MIGRAPHX_TRACE_EVAL=2.

@codecov
Copy link

codecov bot commented Jun 12, 2023

Codecov Report

Merging #1835 (0ba0103) into develop (193f105) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

❗ Current head 0ba0103 differs from pull request most recent head 7fc1756. Consider uploading reports for the commit 7fc1756 to get more accurate results

@@             Coverage Diff             @@
##           develop    #1835      +/-   ##
===========================================
- Coverage    91.55%   91.52%   -0.03%     
===========================================
  Files          419      419              
  Lines        15481    15486       +5     
===========================================
  Hits         14173    14173              
- Misses        1308     1313       +5     
Impacted Files Coverage Δ
src/program.cpp 72.01% <0.00%> (-0.65%) ⬇️

@migraphx-bot
Copy link
Collaborator

migraphx-bot commented Jun 13, 2023

Test Batch Rate new
7fc175
Rate old
193f10
Diff Compare
torchvision-resnet50 64 893.45 893.68 -0.03%
torchvision-resnet50_fp16 64 5,252.57 5,280.96 -0.54%
torchvision-densenet121 32 1,122.80 1,122.87 -0.01%
torchvision-densenet121_fp16 32 3,271.22 3,271.80 -0.02%
torchvision-inceptionv3 32 591.69 591.19 0.08%
torchvision-inceptionv3_fp16 32 2,500.18 2,507.94 -0.31%
cadene-inceptionv4 16 328.42 328.75 -0.10%
cadene-resnext64x4 16 392.50 395.39 -0.73%
slim-mobilenet 64 7,120.83 7,117.35 0.05%
slim-nasnetalarge 64 159.56 159.56 0.00%
slim-resnet50v2 64 1,088.10 1,087.90 0.02%
bert-mrpc-onnx 8 717.44 717.26 0.02%
bert-mrpc-tf 1 368.06 369.35 -0.35%
pytorch-examples-wlang-gru 1 290.53 298.12 -2.55%
pytorch-examples-wlang-lstm 1 301.61 306.08 -1.46%
torchvision-resnet50_1 1 91.32 91.66 -0.37%
torchvision-inceptionv3_1 1 127.63 128.65 -0.80%
cadene-dpn92_1 1 333.48 333.16 0.10%
cadene-resnext101_1 1 235.91 235.44 0.20%
slim-vgg16_1 1 53.29 53.27 0.04%
slim-mobilenet_1 1 1,444.63 1,467.45 -1.55%
slim-inceptionv4_1 1 97.29 97.59 -0.31%
onnx-taau-downsample 1 316.12 315.83 0.09%
dlrm-criteoterabyte 1 20.96 21.00 -0.18%
dlrm-criteoterabyte_fp16 1 38.67 38.70 -0.07%
agentmodel 1 5,554.91 5,594.28 -0.70%
unet_fp16 2 52.35 52.42 -0.13%

This build is OK for merge ✅

@migraphx-bot
Copy link
Collaborator


    :white_check_mark:bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

    :white_check_mark:bert-mrpc-tf: PASSED: MIGraphX meets tolerance

    :white_check_mark:pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

    :white_check_mark:pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

    :white_check_mark:torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

    :white_check_mark:torchvision-inceptionv3_1: PASSED: MIGraphX meets tolerance

🔴cadene-dpn92_1: FAILED: MIGraphX is not within tolerance - check verbose output


    :white_check_mark:cadene-resnext101_1: PASSED: MIGraphX meets tolerance

    :white_check_mark:slim-vgg16_1: PASSED: MIGraphX meets tolerance

    :white_check_mark:slim-mobilenet_1: PASSED: MIGraphX meets tolerance

    :white_check_mark:slim-inceptionv4_1: PASSED: MIGraphX meets tolerance

    :white_check_mark:dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

    :white_check_mark:agentmodel: PASSED: MIGraphX meets tolerance

    :white_check_mark:unet: PASSED: MIGraphX meets tolerance

@umangyadav umangyadav merged commit 5bf067e into develop Jun 14, 2023
@umangyadav umangyadav deleted the fix_trace_eval branch June 14, 2023 16:39
TedThemistokleous pushed a commit that referenced this pull request Jun 23, 2023
* add fix for the trace_eval

* Add throw for the debug builds

* Formatting

---------

Co-authored-by: Chris Austen <causten@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix Fixes a bug found in the code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants