
add support for allowing some PyTorch tests to fail + print warning if one or more tests fail #2742

Merged — 7 commits merged into easybuilders:develop from pytorch_allow_failed_tests, Jul 7, 2022

Conversation

@boegel (Member) commented Jun 9, 2022

Draft because:

  • needs more testing
  • need to check whether 10 is a good default for max_failed_tests
  • maybe we should also look into a way of getting the warning pushed "up" so it's included in the comment that is posted for PR test reports (but that will require some changes in the framework too)

… of test command (and not fail on non-zero exit)
@boegel added this to the next release (4.5.6?) milestone on Jun 9, 2022
@boegel force-pushed the pytorch_allow_failed_tests branch 2 times, most recently from 2870f73 to 8638c59, June 9, 2022 07:36
@ocaisa (Member) commented Jun 9, 2022

In easybuilders/easybuild-framework#4001 there's the log parsing tool that might be useful here. Have you tried that on the log file to see what it would spit out?

@boegel (Member, Author) commented Jun 9, 2022

> In easybuilders/easybuild-framework#4001 there's the log parsing tool that might be useful here. Have you tried that on the log file to see what it would spit out?

Gave it a quick try; the result is pretty horrible tbh...
Here's the first and last block of the output produced (out of 10,000 lines in total...):

Log parsed and 1745 errors and 112 warnings found
     75       test_with_rpc_names (__main__.TestShardedTensorEnumerable) ... ok
     76       test_init_from_local_shards (__main__.TestShardedTensorFromLocalShards) ... ok
     77       test_init_from_local_shards_invalid_shards (__main__.TestShardedTensorFromLocalShards) ... ok
     78       test_init_from_local_shards_invalid_shards_gaps (__main__.TestShardedTensorFromLocalShards) ... ok
     79       test_init_from_local_shards_invalid_shards_overlap (__main__.TestShardedTensorFromLocalShards) ... ok
     80       test_init_from_local_shards_new_group (__main__.TestShardedTensorFromLocalShards) ... ok
  >> 81       test_serialize_and_deserialize (__main__.TestShardedTensorMetadata) ... /tmp/vsc40023/easybuild_build/PyTorch/1.10.0/foss-2021a/pytorch/test/distributed/_sharded_tensor/test_sharded_tensor.py:110: DeprecationWarning: `np.bool` is a deprecated alias for the
builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
     82       Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
     83         pickled_obj = pickle.dumps(expected_st_metadata)
     84       ok
     85
     86       ----------------------------------------------------------------------
     87       Ran 40 tests in 199.157s
     ...
  >> 52127    test_lower_graph_linear (quantization.eager.test_quantize_eager_ptq.TestQuantizeONNXExport) ... WARNING: The shape inference of _caffe2::Int8Quantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it
in symbolic function.
  >> 52128    WARNING: The shape inference of _caffe2::Int8FC type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52129    WARNING: The shape inference of _caffe2::Int8Dequantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52130    WARNING: The shape inference of _caffe2::Int8Quantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52131    WARNING: The shape inference of _caffe2::Int8GivenTensorFill type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52132    WARNING: The shape inference of _caffe2::Int8GivenIntTensorFill type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52133    WARNING: The shape inference of _caffe2::Int8FC type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52134    WARNING: The shape inference of _caffe2::Int8Dequantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52135    WARNING: The shape inference of _caffe2::Int8Quantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52136    WARNING: The shape inference of _caffe2::Int8GivenTensorFill type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52137    WARNING: The shape inference of _caffe2::Int8GivenIntTensorFill type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52138    WARNING: The shape inference of _caffe2::Int8FC type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52139    WARNING: The shape inference of _caffe2::Int8Dequantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
     52140    ok
     52141    test_qconv1d (quantization.core.test_quantized_op.TestQuantizedConv) ... ok
     52142    test_qconv1d_unpack (quantization.core.test_quantized_op.TestQuantizedConv) ... ok
     52143    test_qconv2d (quantization.core.test_quantized_op.TestQuantizedConv) ... ok
     52144    test_qconv2d_unpack (quantization.core.test_quantized_op.TestQuantizedConv) ... ok
     52145    test_qconv3d (quantization.core.test_quantized_op.TestQuantizedConv) ... ok

It's always going to be better to check for specific patterns in the output in a software-specific easyblock, imho, since you know what to look for (but I guess you're vulnerable to the output changing over time).
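
As an illustration of that kind of software-specific pattern check, here's a minimal sketch (hypothetical, not the actual easyblock code), assuming failing PyTorch test suites are reported on lines like "distributed/test_c10d_gloo failed!":

    import re

    def find_failed_tests(test_output):
        """Return names of test suites reported as failed in PyTorch test output."""
        # assumption: each failing suite is reported as "<name> failed!" on its own line
        return re.findall(r"^(\S+) failed!", test_output, re.M)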

@branfosj (Member) commented Jun 9, 2022

> In easybuilders/easybuild-framework#4001 there's the log parsing tool that might be useful here. Have you tried that on the log file to see what it would spit out?

One of the changes I need to make in that PR is that it should only happen on the configure and build steps. Generally it is a poor match for finding issues outside of those steps.

@boegel force-pushed the pytorch_allow_failed_tests branch from 8638c59 to 8cf908e, June 9, 2022 08:34
@casparvl (Contributor) commented:

From the EB chat I understood that nothing will be printed to stdout if the test step passes but has several failures. I'm not an expert on this part of the framework code, but just my 2 cents on this:

I think it would be preferable if we could get something printed. Ideally an explicit warning that the PyTorch easyblock accepts up to X failures, that there were Y failures in this particular build, and that the user should check them to make sure they are 'acceptable'. Without that, users would probably assume "oh, the test step completed, so everything has to be OK", even if there could still be 'valid' test failures in there.

@boegel (Member, Author) commented Jun 17, 2022

@casparvl As soon as the PyTorch test command fails (non-zero exit code), a warning message is printed (to stderr).
When one or more failing tests are found based on the regex, the test names are listed too.

See this output (from a test report in #15137):

WARNING: Test command had non-zero exit code (1)!


WARNING: 5 tests (out of 89230) failed:
* distributed/fsdp/test_fsdp_input
* distributed/fsdp/test_fsdp_memory
* distributed/fsdp/test_fsdp_overlap
* distributed/test_c10d_gloo
* test_autograd
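
A minimal sketch of the kind of logic that could produce warnings like the above (hypothetical names and regex, not the actual easyblock code):

    import re
    import sys

    def check_test_results(test_output, exit_code, max_failed_tests=10):
        """Print warnings about failed tests; only raise when too many failed."""
        if exit_code != 0:
            print("WARNING: Test command had non-zero exit code (%s)!" % exit_code, file=sys.stderr)

        # assumption: failing suites are reported as "<name> failed!" in the output
        failed_tests = sorted(set(re.findall(r"^(\S+) failed!", test_output, re.M)))
        if failed_tests:
            msg = "%d tests failed:\n" % len(failed_tests)
            msg += '\n'.join("* %s" % name for name in failed_tests)
            print("WARNING: " + msg, file=sys.stderr)

        if len(failed_tests) > max_failed_tests:
            raise RuntimeError("Too many failed tests (%d), more than maximum allowed (%d)!"
                               % (len(failed_tests), max_failed_tests))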

@boegel marked this pull request as ready for review on June 20, 2022 14:52
@casparvl (Contributor) commented:

Oh ok, that looks fine to me :)

Seeing that output, I'm only wondering one thing: should we give the user some advice (below the list of failing tests)? I.e. something along the lines of: "The PyTorch test suite often has a small number of failing tests. As long as there are no more than self.cfg['max_failed_tests'] of them, EasyBuild completes the installation, since it is often the tests themselves that are broken, not your installation. However, you may want to double-check that your installation works as expected." (I'm not actually sure how to phrase it, but some indication of what to do with this warning.)

@boegel (Member, Author) commented Jun 24, 2022

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.9.0-fosscuda-2020b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3902.accelgor.os - Linux RHEL 8.4, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA A100-SXM4-80GB, 510.73.08, Python 3.6.8
See https://gist.github.com/a52aeee7a6a9c7ba03bb5f9272d6f1a2 for a full test report.

@Flamefire (Contributor) commented:

I don't really like the approach of allowing ANY test to fail via this new option. Why not disable the tests which we know are failing as we did before?

@boegel (Member, Author) commented Jun 29, 2022

> I don't really like the approach of allowing ANY test to fail via this new option. Why not disable the tests which we know are failing as we did before?

Mainly because that's a very painful process; see how long easybuilders/easybuild-easyconfigs#15137 has been open due to flaky tests (and test reports that take half a day to run).
We're talking about 5 tests out of ~90k that are causing a lot of headaches.

If people want to be very strict and not allow any tests to fail, they can set max_failed_tests to 0 via a hook.
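
For example, a minimal sketch of such a hook (illustrative, not from this PR; parse_hook is part of EasyBuild's hook mechanism, and the hook file is passed via the --hooks option):

    # my_hooks.py -- use with: eb --hooks=my_hooks.py PyTorch-<version>.eb
    def parse_hook(ec, *args, **kwargs):
        """Be strict: don't allow any PyTorch tests to fail."""
        if ec.name == 'PyTorch':
            # override the easyblock's default for max_failed_tests
            ec['max_failed_tests'] = 0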

@branfosj (Member) commented:

Ideally we'd have no test failures, but we do see failures, and often we see a test fail on only one system / setup. I consider this PR a suitable compromise, with a suitable message going into the standard output that allows people to investigate further.

@Flamefire (Contributor) left a review comment:

> Ideally we'd have no test failures, but we do see failures, and often we see a test fail on only one system / setup. I consider this PR a suitable compromise, with a suitable message going into the standard output that allows people to investigate further.

I know. I just don't see the reason to run the tests if we allow any test to fail, rather than only those that are expected to fail / have failed before and ideally have an upstream issue to refer to.

Anyway, see the review. Basically: default to no expected failures, plus some improved handling of the error case. Especially not ignoring ec, as that could lead to a success when no tests have even started. So the only case where ec reports an error but the test step is allowed to succeed is when we have failed_test_cnt <= max_failed_tests.
Also a nit about the explicit re.compile, which I wouldn't use.
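
To illustrate the re.compile nit: for a one-off match, the pattern can be passed straight to re.findall, which compiles and caches it internally (illustrative snippet, not the PR's code):

    import re

    test_output = "test_autograd failed!\ndistributed/test_c10d_gloo failed!\n"

    # instead of compiling explicitly:
    #   regex = re.compile(r"^(\S+) failed!", re.M)
    #   failed_tests = regex.findall(test_output)
    # pass the pattern to re.findall directly; the re module caches compiled patterns:
    failed_tests = re.findall(r"^(\S+) failed!", test_output, re.M)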

(6 review comments on easybuild/easyblocks/p/pytorch.py, since resolved)
boegel added 2 commits July 6, 2022 10:40
…nstead of printing warning if no failed tests are allowed, don't compile regex used for findall
@Flamefire (Contributor) left a review comment:

One more round, as I had trouble understanding what would be placed where in the message.

(3 review comments on easybuild/easyblocks/p/pytorch.py, since resolved)
@boegel (Member, Author) commented Jul 7, 2022

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.11.0-foss-2021a-CUDA-11.3.1.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3303.joltik.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 510.73.08, Python 3.6.8
See https://gist.github.com/c78cdbcbd6ed472981c0775973ac69c1 for a full test report.

…hen some PyTorch tests have failed

Co-authored-by: Alexander Grund <Flamefire@users.noreply.github.com>
casparvl pushed a commit to sara-nl/easybuild-easyconfigs that referenced this pull request Jul 7, 2022
@casparvl (Contributor) left a comment:

lgtm!

Thanks @boegel, this will hopefully help to get future PyTorch easyconfigs merged before PyTorch actually releases a new version ;-)

@casparvl merged commit ac68b02 into easybuilders:develop on Jul 7, 2022
@boegel deleted the pytorch_allow_failed_tests branch on July 7, 2022 16:15
@boegel (Member, Author) commented Jul 7, 2022

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.3.1-foss-2019b-Python-3.7.4.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3622.doduo.os - Linux RHEL 8.4, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/155048595da30306c187d3f31513dddf for a full test report.

@boegel changed the title from "allow some PyTorch tests to fail + print warning if one or more tests fail" to "add support for allowing some PyTorch tests to fail + print warning if one or more tests fail" on Jul 8, 2022