[Large Tensor] Implemented LT flag for OpPerf testing #17449
Conversation
minor changes. Rest looks good. Good job!
Thanks for the contribution!
opperf (just by name) indicates this utility is used to test the performance of operators. We could leverage its implementation to test large tensor correctness, but I am not sure if we should add this as a parameter to this utility. What value does it bring to users, and who is going to use it?
If we are the only users who just need it to test large tensor correctness, we should keep this in a private branch. If we want to expose this functionality to users (again, please think about who the customers are and how they would use it), it would be better to extract it into a separate function such as run_large_tensor_test or similar.
@apeforest thanks for your feedback! The purpose of this flag would not only be to test operator functionality on large tensor data, but also to test the actual performance of each operator on large tensor data (which falls within the mission of opperf). With this in mind, I believe it makes sense to add this as a parameter to the utility. This would be valuable to users who are interested in debugging their models' performance at the operator level on large tensor data, thereby helping users create more efficient models when handling high-dimensional data. I can also refactor this into a more general function if that is preferred. If the consensus is that this would be better as a private branch, I can move in that direction instead.
Can users specify custom shapes to test the performance of large tensors instead of using a param? That gives more freedom to users.
@apeforest
This flag serves as a quick way of testing ops on large tensors.
So yes, both are separate use cases, and both are possible.
Can you think of a use case where a customer would want such a quick way instead of specifying a custom shape to test an operator? If I were a customer and wanted to know whether an operator would meet the requirements of my input tensor (which could be large), I would just specify the shape and test it rather than use a flag.
With this flag, users could effectively avoid having to create their own custom inputs for each operator, potentially saving them a significant amount of time and effort if they are testing multiple ops. The flag wouldn't be particularly useful if the customer has a specific input tensor shape in mind, but there must also be cases where customers want a quick way of obtaining a more general outlook on the performance of operators under large tensor conditions (e.g. for evaluating op performance differences across different machines and different input sizes). Would changing the flag's name help clarify its purpose?
@@ -39,6 +39,8 @@ def run_rearrange_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Please specify explicitly here that the tensor size is over 2^32.
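For reference, here is a hedged sketch of how that description might be made explicit; the surrounding signature and defaults are illustrative, not the committed code:

```python
import mxnet as mx

def run_rearrange_operators_benchmarks(ctx=mx.cpu(), dtype='float32',
                                       profiler='native', large_tensor='off',
                                       warmup=25, runs=100):
    """Runs benchmarks for rearrange operators.

    Parameters
    ----------
    ctx: mx.ctx, default mx.cpu()
        Context to run benchmarks
    dtype: str, default 'float32'
        Precision to use for benchmarks
    large_tensor: str, default 'off'
        When set to 'on', inputs are sized so each tensor holds more than
        2^32 elements (requires an MXNet build with int64 tensor support)
    """
    # Body omitted; this sketch only illustrates the parameter documentation.
    pass
```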
@@ -48,6 +48,8 @@ def run_mx_binary_broadcast_operators_benchmarks(ctx=mx.cpu(), dtype='float32',
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Same here.
@@ -75,6 +77,8 @@ def run_mx_binary_element_wise_operators_benchmarks(ctx=mx.cpu(), dtype='float32
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Same here.
@@ -44,6 +44,8 @@ def run_gemm_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='nativ
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Same here.
"transpose_a": True, | ||
"transpose_b": True}], | ||
warmup=warmup, runs=runs, profiler=profiler) | ||
if large_tensor == "on": |
What happens if this flag is ON and the user also specifies custom shapes (which are small tensors)?
The purpose of this flag wouldn't be for use on user-specified shapes, it would be for general category and full suite testing of operator performance on input data with dimensions >= 2^32. If the user wanted to test individual operators with custom shapes, they would use run_performance_test() and add their custom data as input - they wouldn't use the flag in that case, as the run_performance_test() function doesn't take in the large_tensor flag as an argument.
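For context, testing an individual operator on a user-chosen (here deliberately small) shape would look roughly like this; the op and shapes are illustrative, and the call follows the existing run_performance_test pattern:

```python
import mxnet as mx
from mxnet import nd
from benchmark.opperf.utils.benchmark_utils import run_performance_test

# Benchmark one operator on a custom shape, independent of the large_tensor flag.
result = run_performance_test(nd.add,
                              run_backward=True,
                              dtype='float32',
                              ctx=mx.cpu(),
                              inputs=[{"lhs": (1024, 1024),
                                       "rhs": (1024, 1024)}],
                              warmup=10,
                              runs=50)
print(result)
```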
@@ -45,6 +45,8 @@ def run_activation_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler=
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Same here.
"transpose_a": True, | ||
"transpose_b": True}], | ||
warmup=warmup, runs=runs, profiler=profiler) | ||
else: |
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
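A rough sketch of that refactor, assuming the branch only selects the inputs and both paths share a single benchmark call (op and shapes are placeholders, not the PR's actual defaults):

```python
import mxnet as mx
from mxnet import nd
from benchmark.opperf.utils.benchmark_utils import run_performance_test

large_tensor = "on"  # the flag under discussion; "off" picks the regular inputs

# Only the inputs differ between the two code paths, so build them first...
if large_tensor == "on":
    # any shape whose element count exceeds 2^32 qualifies as a large tensor
    inputs = [{"lhs": (2**16, 2**16 + 1), "rhs": (2**16, 2**16 + 1)}]
else:
    inputs = [{"lhs": (1024, 1024), "rhs": (1024, 1024)}]

# ...and keep a single shared call site for the benchmark itself.
result = run_performance_test(nd.add, run_backward=True, dtype='float32',
                              ctx=mx.cpu(), inputs=inputs, warmup=1, runs=1)
```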
],
warmup=warmup,
runs=runs)
else:
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
"moving_var": (3,)}], | ||
warmup=warmup, | ||
runs=runs) | ||
else: |
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
],
warmup=warmup,
runs=runs)
else:
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
],
warmup=warmup,
runs=runs)
else:
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
],
warmup=warmup,
runs=runs)
else:
It seems the only difference between the if and else branch is the inputs argument. Can we only generate different inputs in the if/else branch and pass them to the same operator function?
@@ -46,6 +46,8 @@ def run_optimizer_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Be more specific please.
@@ -44,6 +44,8 @@ def run_mx_random_sampling_operators_benchmarks(ctx=mx.cpu(), dtype='float32', p
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Be more specific please.
@@ -41,6 +41,8 @@ def run_mx_reduction_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profile
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Be more specific please.
@@ -39,6 +39,8 @@ def run_sorting_searching_operators_benchmarks(ctx=mx.cpu(), dtype='float32', pr
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Be more specific please.
@@ -45,6 +45,8 @@ def run_mx_unary_operators_benchmarks(ctx=mx.cpu(), dtype='float32', profiler='n
Context to run benchmarks
dtype: str, default 'float32'
Precision to use for benchmarks
large_tensor: str, default 'off'
Tensor size to use for tests
Be more specific please.
Force-pushed from 11a8eb9 to 87c18fc
Force-pushed from dc54f7b to d91dca8
…oggins/incubator-mxnet into opperf_large_tensor_flag
While the full opperf suite was run initially (and has been linked in the description), could you paste opperf results after commit 256ad70? Because right now, with master (CUDA, cuDNN ON):
The PR which introduced lamb_update_phase1 to opperf (#17542) worked with CUDA/cuDNN ON.
@ChaiBapchya thanks for pointing this out. When I ran my tests with this PR on Friday, #17400 hadn't been merged into master yet, so the conflicts did not appear. I believe your PR will fix these issues - thanks for your contribution!
* Passing large_tensor parameter down
* Adding large tensor testing functionality for convolutional operators
* Added large tensor test functionality for conv ops
* Fixing sizing for conv ops
* Added gemm large tensor, print on conv
* Updated input for gemm ops and print statements
* Fixed deconv large tensor test
* Added bias for deconv
* Added test functionality for nn_activation and nn_basic ops
* Fixed deconv bias, implemented large tensor test logic for general ops, added default data for large tensor test
* Dropped unnecessary print statements
* Fixed lint errors
* Added large_tensor parameter to existing function descriptions, added descriptions for functions missing descriptions
* Adding docs, changed large_tensor to int64_tensor for clarity
* Added warmup/runs to gemm ops, debugging process failure
* Resolved merge conflicts, added default params and input switching functionality
* Dynamic input handling for default inputs, additional custom data for int64
* Fixed RPD issue
* Everything through reduction ops working
* Random sampling & loss ops working
* Added indices, depth, ravel_data in default_params
* Added indexing ops - waiting for merge on ravel
* Added optimizer ops
* All misc ops working
* All NN Basic ops working
* Fixed LT input for ROIPooling
* Refactored NN Conv tests
* Added test for inline optimizer ops
* Dropping extra tests to decrease execution time
* Switching to inline tests for RNN to support additional modes
* Added state_cell as NDArray param, removed linalg testing for int64 tensor
* Cleaned up styling
* Fixed conv and deconv tests
* Retrigger CI for continuous build
* Cleaned up GEMM op inputs
* Dropped unused param from default_params
Description
Completely reworked this PR to establish compatibility with the current master. In the weeks since this PR was originally created, over 100 ops have been added to OpPerf, so I added functionality for testing each one with large tensor (dimension >= 2**32) data while ensuring that the suite still worked properly on standard data.
I tested my changes extensively, merging my remaining PRs into this branch during testing to ensure that the full test suite worked with int64 tensor data on every op once all my kernel-level fixes were included.
This PR adds a flag (int64-tensor) and relevant default data to OpPerf for every supported op, thereby allowing users to run the entire suite of opperf tests with int64 tensor data after they build MXNet with int64 tensor support. Please note that the full suite takes an extremely long time (over one day) to run to completion on a machine with 748 GB of RAM, even with warmup=1 and runs=1.
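For illustration, invoking one of the per-category runners with the new parameter might look like the sketch below; the module path and the final parameter name (large_tensor vs. int64_tensor) are assumptions based on this PR's description and commit history:

```python
import mxnet as mx
# Assumed location of the GEMM category runner inside the opperf package.
from benchmark.opperf.nd_operations.gemm_operations import run_gemm_operators_benchmarks

# Requires an MXNet build with int64 tensor support; expect very long runtimes
# even with warmup=1 and runs=1, as noted above.
results = run_gemm_operators_benchmarks(ctx=mx.cpu(), dtype='float32',
                                        int64_tensor='on',  # renamed from large_tensor
                                        warmup=1, runs=1)
print(results)
```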
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Results
Full OpPerf Suite (CPU) - Small Tensor
Full OpPerf Suite (CPU) - Int64 Tensor w/changes from cumsum, multi_lars, and RNN PRs