Fix TRACE_EVAL > 1 #1835

umangyadav · 2023-06-12T20:48:25Z

TRACE_EVAL assumes all instructions always produce results on Target.
It tries to copy results from the target explicitly.

Some instructions like NMS , ROIAlign are run on the host in which case the following would break.

https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/c900e382a1344ae286166708a5f77e5fe709370c/src/program.cpp#L558

For example when trying to run test_nms with MIGRAPHX_TRACE_EVAL=2, it fails on copy_from().

root@3355683a62d6:~/repo/AMDMIGraphX/build(fix_trace_eval)# ./bin/test_verify "test_nms"
[   RUN    ] test_nms
Run instruction: @0 = @literal{0} -> float_type, {1}, {0}, target_id=0
Time: 0.01743ms, 0.01776ms
Run instruction: @1 = @literal{0.5} -> float_type, {1}, {0}, target_id=0
Time: 0.00055ms, 0.00063ms
Run instruction: @2 = @literal{4} -> int64_type, {1}, {0}, target_id=0
Time: 0.00089ms, 0.00097ms
Run instruction: @3 = @literal{0.9, 0.75, 0.6, 0.95, 0.5, 0.3} -> float_type, {1, 1, 6}, {6, 6, 1}, target_id=0
Time: 0.00067ms, 0.00075ms
Run instruction: boxes = @param:boxes -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
Time: 0.01173ms, 0.01186ms
Run instruction: @5 = ref::nonmaxsuppression[center_point_box=1,use_dyn_output=0](boxes,@3,@2,@1,@0) -> int64_type, {6, 3}, {3, 1}, target_id=0
Time: 0.15231ms, 0.15256ms
Output has normal, zero
Output: 0, 0, 3, 0, 0, ..., 0, 0, 0, 0, 0
Run instruction: @0 = check_context::migraphx::gpu::context  -> float_type, {}, {}, target_id=0
Time: 0.01013ms, 0.01098ms
Run instruction: boxes = @param:boxes -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
Time: 0.00129ms, 0.00185ms
Run instruction: @2 = hip::copy_from_gpu(boxes) -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
Time: 4.32311ms, 4.33639ms
Output has normal
Output: -0.25, 0.75, -0.625, -0.0625, 0.625, ..., -0.75, 0.1875, -0.25, -0.5625, -0.5625
Run instruction: @3 = hip::hip_copy_literal[id=main:@literal:0] -> float_type, {1}, {0}, target_id=0
Time: 0.01117ms, 0.01179ms
Output has zero
Output: 0
Run instruction: @4 = hip::hip_copy_literal[id=main:@literal:1] -> float_type, {1}, {0}, target_id=0
Time: 0.00061ms, 0.00122ms
Output has normal
Output: 0.5
Run instruction: @5 = hip::hip_copy_literal[id=main:@literal:2] -> int64_type, {1}, {0}, target_id=0
Time: 0.00039ms, 0.00076ms
Output has normal
Output: 4
Run instruction: @6 = hip::hip_copy_literal[id=main:@literal:3] -> float_type, {1, 1, 6}, {6, 6, 1}, target_id=0
Time: 0.00034ms, 0.00071ms
Output has normal
Output: 0.9, 0.75, 0.6, 0.95, 0.5, 0.3
Run instruction: @7 = hip::copy_from_gpu(@3) -> float_type, {1}, {0}, target_id=0
Time: 0.03816ms, 0.05154ms
Output has zero
Output: 0
Run instruction: @8 = hip::copy_from_gpu(@4) -> float_type, {1}, {0}, target_id=0
Time: 0.02761ms, 0.04129ms
Output has normal
Output: 0.5
Run instruction: @9 = hip::copy_from_gpu(@5) -> int64_type, {1}, {0}, target_id=0
Time: 0.03735ms, 0.05101ms
Output has normal
Output: 4
Run instruction: @10 = hip::copy_from_gpu(@6) -> float_type, {1, 1, 6}, {6, 6, 1}, target_id=0
Time: 0.02604ms, 0.03947ms
Output has normal
Output: 0.9, 0.75, 0.6, 0.95, 0.5, 0.3
Run instruction: @11 = hip::sync_stream(@2,@10,@9,@8,@7) -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
Time: 0.0096ms, 0.00994ms
Output has normal
Output: -0.25, 0.75, -0.625, -0.0625, 0.625, ..., -0.75, 0.1875, -0.25, -0.5625, -0.5625
Run instruction: @12 = nonmaxsuppression[center_point_box=1,use_dyn_output=0](@11,@10,@9,@8,@7) -> int64_type, {6, 3}, {3, 1}, target_id=0
Time: 0.02351ms, 0.02394ms

module: "main"
@0 = check_context::migraphx::gpu::context  -> float_type, {}, {}, target_id=0
boxes = @param:boxes -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
@2 = hip::copy_from_gpu(boxes) -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
@3 = hip::hip_copy_literal[id=main:@literal:0] -> float_type, {1}, {0}, target_id=0
@4 = hip::hip_copy_literal[id=main:@literal:1] -> float_type, {1}, {0}, target_id=0
@5 = hip::hip_copy_literal[id=main:@literal:2] -> int64_type, {1}, {0}, target_id=0
@6 = hip::hip_copy_literal[id=main:@literal:3] -> float_type, {1, 1, 6}, {6, 6, 1}, target_id=0
@7 = hip::copy_from_gpu(@3) -> float_type, {1}, {0}, target_id=0
@8 = hip::copy_from_gpu(@4) -> float_type, {1}, {0}, target_id=0
@9 = hip::copy_from_gpu(@5) -> int64_type, {1}, {0}, target_id=0
@10 = hip::copy_from_gpu(@6) -> float_type, {1, 1, 6}, {6, 6, 1}, target_id=0
@11 = hip::sync_stream(@2,@10,@9,@8,@7) -> float_type, {1, 6, 4}, {24, 4, 1}, target_id=0
@12 = nonmaxsuppression[center_point_box=1,use_dyn_output=0](@11,@10,@9,@8,@7) -> int64_type, {6, 3}, {3, 1}, target_id=0
main:#output_0 = @param:main:#output_0 -> int64_type, {6, 3}, {3, 1}, target_id=0
@14 = hip::copy_to_gpu(@12,main:#output_0) -> int64_type, {6, 3}, {3, 1}, target_id=0
@15 = @return(@14), target_id=0


FAILED: test_nms
    what(): /home/umayadav/repo/AMDMIGraphX/src/targets/gpu/hip.cpp:152: read_from_gpu: Copy from gpu failed: invalid argument

Aborted (core dumped)

CharlieL7 · 2023-06-12T21:53:10Z

Fixes the issue I was seeing with MIGRAPHX_TRACE_EVAL=2.

codecov · 2023-06-12T22:18:47Z

Codecov Report

Merging #1835 (0ba0103) into develop (193f105) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

❗ Current head 0ba0103 differs from pull request most recent head 7fc1756. Consider uploading reports for the commit 7fc1756 to get more accurate results

@@             Coverage Diff             @@
##           develop    #1835      +/-   ##
===========================================
- Coverage    91.55%   91.52%   -0.03%     
===========================================
  Files          419      419              
  Lines        15481    15486       +5     
===========================================
  Hits         14173    14173              
- Misses        1308     1313       +5

Impacted Files	Coverage Δ
src/program.cpp	`72.01% <0.00%> (-0.65%)`	⬇️

migraphx-bot · 2023-06-13T03:57:17Z

Test	Batch	Rate new 7fc175	Rate old 193f10	Diff	Compare
torchvision-resnet50	64	893.45	893.68	-0.03%	✅
torchvision-resnet50_fp16	64	5,252.57	5,280.96	-0.54%	✅
torchvision-densenet121	32	1,122.80	1,122.87	-0.01%	✅
torchvision-densenet121_fp16	32	3,271.22	3,271.80	-0.02%	✅
torchvision-inceptionv3	32	591.69	591.19	0.08%	✅
torchvision-inceptionv3_fp16	32	2,500.18	2,507.94	-0.31%	✅
cadene-inceptionv4	16	328.42	328.75	-0.10%	✅
cadene-resnext64x4	16	392.50	395.39	-0.73%	✅
slim-mobilenet	64	7,120.83	7,117.35	0.05%	✅
slim-nasnetalarge	64	159.56	159.56	0.00%	✅
slim-resnet50v2	64	1,088.10	1,087.90	0.02%	✅
bert-mrpc-onnx	8	717.44	717.26	0.02%	✅
bert-mrpc-tf	1	368.06	369.35	-0.35%	✅
pytorch-examples-wlang-gru	1	290.53	298.12	-2.55%	✅
pytorch-examples-wlang-lstm	1	301.61	306.08	-1.46%	✅
torchvision-resnet50_1	1	91.32	91.66	-0.37%	✅
torchvision-inceptionv3_1	1	127.63	128.65	-0.80%	✅
cadene-dpn92_1	1	333.48	333.16	0.10%	✅
cadene-resnext101_1	1	235.91	235.44	0.20%	✅
slim-vgg16_1	1	53.29	53.27	0.04%	✅
slim-mobilenet_1	1	1,444.63	1,467.45	-1.55%	✅
slim-inceptionv4_1	1	97.29	97.59	-0.31%	✅
onnx-taau-downsample	1	316.12	315.83	0.09%	✅
dlrm-criteoterabyte	1	20.96	21.00	-0.18%	✅
dlrm-criteoterabyte_fp16	1	38.67	38.70	-0.07%	✅
agentmodel	1	5,554.91	5,594.28	-0.70%	✅
unet_fp16	2	52.35	52.42	-0.13%	✅

This build is OK for merge ✅

migraphx-bot · 2023-06-13T03:57:19Z

:white_check_mark:bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

:white_check_mark:bert-mrpc-tf: PASSED: MIGraphX meets tolerance

:white_check_mark:pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

:white_check_mark:pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

:white_check_mark:torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

:white_check_mark:torchvision-inceptionv3_1: PASSED: MIGraphX meets tolerance

🔴cadene-dpn92_1: FAILED: MIGraphX is not within tolerance - check verbose output

:white_check_mark:cadene-resnext101_1: PASSED: MIGraphX meets tolerance

:white_check_mark:slim-vgg16_1: PASSED: MIGraphX meets tolerance

:white_check_mark:slim-mobilenet_1: PASSED: MIGraphX meets tolerance

:white_check_mark:slim-inceptionv4_1: PASSED: MIGraphX meets tolerance

:white_check_mark:dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

:white_check_mark:agentmodel: PASSED: MIGraphX meets tolerance

:white_check_mark:unet: PASSED: MIGraphX meets tolerance

* add fix for the trace_eval * Add throw for the debug builds * Formatting --------- Co-authored-by: Chris Austen <causten@users.noreply.github.com>

add fix for the trace_eval

c381cc3

umangyadav requested review from kahmed10, CharlieL7 and TedThemistokleous June 12, 2023 20:48

umangyadav added 2 commits June 12, 2023 20:55

Add throw for the debug builds

858d427

Formatting

02b2e74

TedThemistokleous assigned umangyadav Jun 12, 2023

TedThemistokleous added the bugfix Fixes a bug found in the code. label Jun 12, 2023

TedThemistokleous approved these changes Jun 12, 2023

View reviewed changes

CharlieL7 approved these changes Jun 12, 2023

View reviewed changes

kahmed10 approved these changes Jun 12, 2023

View reviewed changes

Merge branch 'develop' into fix_trace_eval

7fc1756

umangyadav merged commit 5bf067e into develop Jun 14, 2023

umangyadav deleted the fix_trace_eval branch June 14, 2023 16:39

TedThemistokleous mentioned this pull request Jun 23, 2023

MIGRAPHX_TRACE_EVAL=1 with DLM test_parity_gelu changes behavior. #1877

Closed

TedThemistokleous pushed a commit that referenced this pull request Jun 23, 2023

Fix TRACE_EVAL > 1 (#1835)

abfcd4c

* add fix for the trace_eval * Add throw for the debug builds * Formatting --------- Co-authored-by: Chris Austen <causten@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix TRACE_EVAL > 1 #1835

Fix TRACE_EVAL > 1 #1835

umangyadav commented Jun 12, 2023

CharlieL7 commented Jun 12, 2023

codecov bot commented Jun 12, 2023 •

edited

Loading

migraphx-bot commented Jun 13, 2023 •

edited

Loading

migraphx-bot commented Jun 13, 2023

Fix TRACE_EVAL > 1 #1835

Fix TRACE_EVAL > 1 #1835

Conversation

umangyadav commented Jun 12, 2023

CharlieL7 commented Jun 12, 2023

codecov bot commented Jun 12, 2023 • edited Loading

Codecov Report

migraphx-bot commented Jun 13, 2023 • edited Loading

migraphx-bot commented Jun 13, 2023

codecov bot commented Jun 12, 2023 •

edited

Loading

migraphx-bot commented Jun 13, 2023 •

edited

Loading