[Large Tensor] Implemented LT flag for OpPerf testing #17449
Conversation
minor changes. Rest looks good. Good job!
Thanks for the contribution!
opperf (just by name) indicates this utility is used to test the performance of operators. We could leverage its implementation to test large tensor correctness, but I am not sure if we should add this as a parameter to this utility. What value does it bring to users, and who is going to use it?
If we are the only users who just need it to test large tensor correctness, we should keep this in a private branch. If we want to expose this functionality to users (again, please think about who the customers are and how they would use it), it would be better to extract it into a separate function such as run_large_tensor_test or similar.
@apeforest thanks for your feedback! The purpose of this flag would not only be to test operator functionality on large tensor data, but also to test the actual performance of each operator on large tensor data (which falls within the mission of opperf). With this in mind, I believe it makes sense to add this as a parameter to the utility. This would be valuable to users who are interested in debugging their models' performance at the operator level on large tensor data, thereby helping users create more efficient models when handling high-dimensional data. I can also refactor this into a more general function if that is preferred. If the consensus is that this would be better as a private branch, I can move in that direction instead.
Can users specify custom shapes to test the performance of large tensors instead of using a param? That gives more freedom to users.
@apeforest
This flag serves as a quick way of testing ops on large tensors.
So yes, both are separate use cases, and both are possible.
Can you think of a use case where a customer would want such a quick way instead of specifying a custom shape to test an operator? If I were a customer and wanted to know whether an operator would meet the requirements of my input tensor (which could be large), I would just specify the shape and test it rather than use a flag.
With this flag, users could effectively avoid having to create their own custom inputs for each operator, potentially saving them a significant amount of time and effort if they are testing multiple ops. The flag wouldn't be particularly useful if the customer has a specific input tensor shape in mind, but there must also be cases where customers want a quick way of obtaining a more general outlook on the performance of operators under large tensor conditions (e.g. for evaluating op performance differences across different machines and different input sizes). Would changing the flag's name help clarify its purpose?
@@ -39,6 +39,8 @@ def run_rearrange_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Please specify explicitly here that the tensor size is over 2^32.
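For reference, here is a hedged sketch of how that description might be made explicit; the surrounding signature and defaults are illustrative, not the committed code:

```python
import mxnet as mx

def run_rearrange_operators_benchmarks(ctx=mx.cpu(), dtype='float32',
                                       profiler='native', large_tensor='off',
                                       warmup=25, runs=100):
    """Runs benchmarks for rearrange operators.

    Parameters
    ----------
    ctx: mx.ctx, default mx.cpu()
        Context to run benchmarks
    dtype: str, default 'float32'
        Precision to use for benchmarks
    large_tensor: str, default 'off'
        When set to 'on', inputs are sized so each tensor holds more than
        2^32 elements (requires an MXNet build with int64 tensor support)
    """
    # Body omitted; this sketch only illustrates the parameter documentation.
    pass
```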
@@ -48,6 +48,8 @@ def run_mx_binary_broadcast_operators_benchmarks(ctx=mx.cpu(), dtype='float32',
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Same here.
@@ -75,6 +77,8 @@ def run_mx_binary_element_wise_operators_benchmarks(ctx=mx.cpu(), dtype='float32
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Same here.
@@ -44,6 +44,8 @@ def run_gemm_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='nativ
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Same here.
"transpose_a": True, | ||
"transpose_b": True}], | ||
warmup=warmup, runs=runs, profiler=profiler) | ||
if large_tensor == "on": |
What happens if this flag is ON and the user also specifies custom shapes (which are small tensors)?
The purpose of this flag wouldn't be for use on user-specified shapes, it would be for general category and full suite testing of operator performance on input data with dimensions >= 2^32. If the user wanted to test individual operators with custom shapes, they would use run_performance_test() and add their custom data as input - they wouldn't use the flag in that case, as the run_performance_test() function doesn't take in the large_tensor flag as an argument.
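For context, testing an individual operator on a user-chosen (here deliberately small) shape would look roughly like this; the op and shapes are illustrative, and the call follows the existing run_performance_test pattern:

```python
import mxnet as mx
from mxnet import nd
from benchmark.opperf.utils.benchmark_utils import run_performance_test

# Benchmark one operator on a custom shape, independent of the large_tensor flag.
result = run_performance_test(nd.add,
                              run_backward=True,
                              dtype='float32',
                              ctx=mx.cpu(),
                              inputs=[{"lhs": (1024, 1024),
                                       "rhs": (1024, 1024)}],
                              warmup=10,
                              runs=50)
print(result)
```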
@@ -45,6 +45,8 @@ def run_activation_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler=
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Same here.
"transpose_a": True, | ||
"transpose_b": True}], | ||
warmup=warmup, runs=runs, profiler=profiler) | ||
else: |
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
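A rough sketch of that refactor, assuming the branch only selects the inputs and both paths share a single benchmark call (op and shapes are placeholders, not the PR's actual defaults):

```python
import mxnet as mx
from mxnet import nd
from benchmark.opperf.utils.benchmark_utils import run_performance_test

large_tensor = "on"  # the flag under discussion; "off" picks the regular inputs

# Only the inputs differ between the two code paths, so build them first...
if large_tensor == "on":
    # any shape whose element count exceeds 2^32 qualifies as a large tensor
    inputs = [{"lhs": (2**16, 2**16 + 1), "rhs": (2**16, 2**16 + 1)}]
else:
    inputs = [{"lhs": (1024, 1024), "rhs": (1024, 1024)}]

# ...and keep a single shared call site for the benchmark itself.
result = run_performance_test(nd.add, run_backward=True, dtype='float32',
                              ctx=mx.cpu(), inputs=inputs, warmup=1, runs=1)
```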
],
warmup=warmup,
runs=runs)
else:
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
"moving_var": (3,)}], | ||
warmup=warmup, | ||
runs=runs) | ||
else: |
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
],
warmup=warmup,
runs=runs)
else:
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
],
warmup=warmup,
runs=runs)
else:
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
],
warmup=warmup,
runs=runs)
else:
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
@@ -46,6 +46,8 @@ def run_optimizer_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Be more specific please.
@@ -44,6 +44,8 @@ def run_mx_random_sampling_operators_benchmarks(ctx=mx.cpu(), dtype='float32', p
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Be more specific please.
@@ -41,6 +41,8 @@ def run_mx_reduction_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profile
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Be more specific please.
@@ -39,6 +39,8 @@ def run_sorting_searching_operators_benchmarks(ctx=mx.cpu(), dtype='float32', pr
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Be more specific please.
@@ -45,6 +45,8 @@ def run_mx_unary_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='n
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Be more specific please.
Force-pushed from 11a8eb9 to 87c18fc
Force-pushed from dc54f7b to d91dca8
…oggins/incubator-mxnet into opperf_large_tensor_flag
While the full opperf suite was run initially (and has been linked in the description), could you paste opperf results after commit 256ad70? Because right now, with master (CUDA, cuDNN ON):
The PR which introduced lamb_update_phase1 to opperf (#17542) worked with CUDA/cuDNN ON.
@ChaiBapchya thanks for pointing this out. When I ran my tests with this PR on Friday, #17400 hadn't been merged into master yet, so the conflicts did not appear. I believe your PR will fix these issues - thanks for your contribution!
* Passing large_tensor parameter down
* Adding large tensor testing functionality for convolutional operators
* Added large tensor test functionality for conv ops
* Fixing sizing for conv ops
* Added gemm large tensor, print on conv
* Updated input for gemm ops and print statements
* Fixed deconv large tensor test
* Added bias for deconv
* Added test functionality for nn_activation and nn_basic ops
* Fixed deconv bias, implemented large tensor test logic for general ops, added default data for large tensor test
* Dropped unnecessary print statements
* Fixed lint errors
* Added large_tensor parameter to existing function descriptions, added descriptions for functions missing descriptions
* Adding docs, changed large_tensor to int64_tensor for clarity
* Added warmup/runs to gemm ops, debugging process failure
* Resolved merge conflicts, added default params and input switching functionality
* Dynamic input handling for default inputs, additional custom data for int64
* Fixed RPD issue
* Everything through reduction ops working
* Random sampling & loss ops working
* Added indices, depth, ravel_data in default_params
* Added indexing ops - waiting for merge on ravel
* Added optimizer ops
* All misc ops working
* All NN Basic ops working
* Fixed LT input for ROIPooling
* Refactored NN Conv tests
* Added test for inline optimizer ops
* Dropping extra tests to decrease execution time
* Switching to inline tests for RNN to support additional modes
* Added state_cell as NDArray param, removed linalg testing for int64 tensor
* Cleaned up styling
* Fixed conv and deconv tests
* Retrigger CI for continuous build
* Cleaned up GEMM op inputs
* Dropped unused param from default_params
Description
Completely reworked this PR to establish compatibility with the current master. In the weeks since this PR was originally created, over 100 ops have been added to OpPerf, so I added functionality for testing each one with large tensor (dimension >= 2**32) data while ensuring that the suite still worked properly on standard data.
I tested my changes extensively, merging my remaining PRs into this branch during testing to ensure that the full test suite worked with int64 tensor data on every op once all my kernel-level fixes were included.
This PR adds a flag (int64-tensor) and relevant default data to OpPerf for every supported op, thereby allowing users to run the entire suite of opperf tests with int64 tensor data after they build MXNet with int64 tensor support. Please note that the full suite takes an extremely long time (over one day) to run to completion on a machine with 748 GB of RAM, even with warmup=1 and runs=1.
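For illustration, invoking one of the per-category runners with the new parameter might look like the sketch below; the module path and the final parameter name (large_tensor vs. int64_tensor) are assumptions based on this PR's description and commit history:

```python
import mxnet as mx
# Assumed location of the GEMM category runner inside the opperf package.
from benchmark.opperf.nd_operations.gemm_operations import run_gemm_operators_benchmarks

# Requires an MXNet build with int64 tensor support; expect very long runtimes
# even with warmup=1 and runs=1, as noted above.
results = run_gemm_operators_benchmarks(ctx=mx.cpu(), dtype='float32',
                                        int64_tensor='on',  # renamed from large_tensor
                                        warmup=1, runs=1)
print(results)
```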
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Results
Full OpPerf Suite (CPU) - Small Tensor
Full OpPerf Suite (CPU) - Int64 Tensor w/changes from cumsum, multi_lars, and RNN PRs