
add support for allowing some PyTorch tests to fail + print warning if one or more tests fail #2742

Merged — 7 commits merged into easybuilders:develop from pytorch_allow_failed_tests, Jul 7, 2022

Conversation

@boegel (Member) commented Jun 9, 2022

Draft because:

  • needs more testing
  • need to check whether 10 is a good default for max_failed_tests
  • maybe we should also look into a way of getting the warning pushed "up" so it's included in the comment that is posted for PR test reports (but that will require some changes in the framework too)

… of test command (and not fail on non-zero exit)
@boegel added this to the next release (4.5.6?) milestone on Jun 9, 2022
@boegel force-pushed the pytorch_allow_failed_tests branch 2 times, most recently from 2870f73 to 8638c59, June 9, 2022 07:36
@ocaisa (Member) commented Jun 9, 2022

In easybuilders/easybuild-framework#4001 there's the log parsing tool that might be useful here. Have you tried that on the log file to see what it would spit out?

@boegel (Member, Author) commented Jun 9, 2022

> In easybuilders/easybuild-framework#4001 there's the log parsing tool that might be useful here. Have you tried that on the log file to see what it would spit out?

Gave it a quick try; the result is pretty horrible tbh...
Here's the first and last block of the output produced (out of 10,000 lines in total...):

Log parsed and 1745 errors and 112 warnings found
     75       test_with_rpc_names (__main__.TestShardedTensorEnumerable) ... ok
     76       test_init_from_local_shards (__main__.TestShardedTensorFromLocalShards) ... ok
     77       test_init_from_local_shards_invalid_shards (__main__.TestShardedTensorFromLocalShards) ... ok
     78       test_init_from_local_shards_invalid_shards_gaps (__main__.TestShardedTensorFromLocalShards) ... ok
     79       test_init_from_local_shards_invalid_shards_overlap (__main__.TestShardedTensorFromLocalShards) ... ok
     80       test_init_from_local_shards_new_group (__main__.TestShardedTensorFromLocalShards) ... ok
  >> 81       test_serialize_and_deserialize (__main__.TestShardedTensorMetadata) ... /tmp/vsc40023/easybuild_build/PyTorch/1.10.0/foss-2021a/pytorch/test/distributed/_sharded_tensor/test_sharded_tensor.py:110: DeprecationWarning: `np.bool` is a deprecated alias for the
builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
     82       Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
     83         pickled_obj = pickle.dumps(expected_st_metadata)
     84       ok
     85
     86       ----------------------------------------------------------------------
     87       Ran 40 tests in 199.157s
     ...
  >> 52127    test_lower_graph_linear (quantization.eager.test_quantize_eager_ptq.TestQuantizeONNXExport) ... WARNING: The shape inference of _caffe2::Int8Quantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it
in symbolic function.
  >> 52128    WARNING: The shape inference of _caffe2::Int8FC type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52129    WARNING: The shape inference of _caffe2::Int8Dequantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52130    WARNING: The shape inference of _caffe2::Int8Quantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52131    WARNING: The shape inference of _caffe2::Int8GivenTensorFill type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52132    WARNING: The shape inference of _caffe2::Int8GivenIntTensorFill type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52133    WARNING: The shape inference of _caffe2::Int8FC type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52134    WARNING: The shape inference of _caffe2::Int8Dequantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52135    WARNING: The shape inference of _caffe2::Int8Quantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52136    WARNING: The shape inference of _caffe2::Int8GivenTensorFill type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52137    WARNING: The shape inference of _caffe2::Int8GivenIntTensorFill type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52138    WARNING: The shape inference of _caffe2::Int8FC type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
  >> 52139    WARNING: The shape inference of _caffe2::Int8Dequantize type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
     52140    ok
     52141    test_qconv1d (quantization.core.test_quantized_op.TestQuantizedConv) ... ok
     52142    test_qconv1d_unpack (quantization.core.test_quantized_op.TestQuantizedConv) ... ok
     52143    test_qconv2d (quantization.core.test_quantized_op.TestQuantizedConv) ... ok
     52144    test_qconv2d_unpack (quantization.core.test_quantized_op.TestQuantizedConv) ... ok
     52145    test_qconv3d (quantization.core.test_quantized_op.TestQuantizedConv) ... ok

It's always going to be better to check for specific patterns in the output in a software-specific easyblock, imho, since you know what to look for (but I guess you're vulnerable to the output changing over time).
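
As an illustration of that kind of software-specific pattern check, here's a minimal sketch (hypothetical, not the actual easyblock code), assuming failing PyTorch test suites are reported on lines like "distributed/test_c10d_gloo failed!":

    import re

    def find_failed_tests(test_output):
        """Return names of test suites reported as failed in PyTorch test output."""
        # assumption: each failing suite is reported as "<name> failed!" on its own line
        return re.findall(r"^(\S+) failed!", test_output, re.M)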

@branfosj (Member) commented Jun 9, 2022

> In easybuilders/easybuild-framework#4001 there's the log parsing tool that might be useful here. Have you tried that on the log file to see what it would spit out?

One of the changes I need to make in that PR is that it should only happen on the configure and build steps. Generally it is a poor match for finding issues outside of those steps.

@boegel force-pushed the pytorch_allow_failed_tests branch from 8638c59 to 8cf908e, June 9, 2022 08:34
@casparvl (Contributor) commented:

From the EB chat I understood that nothing will be printed to stdout if the test step passes but has several failures. I'm not an expert on this part of the framework code, but just my 2 cents on this:

I think it would be preferable if we could get something printed. Ideally an explicit warning that the PyTorch easyblock accepts up to X failures, that there were Y failures in this particular build, and that the user should check them to make sure they are 'acceptable'. Without that, users would probably assume "oh, the test step completed, so everything has to be OK", even if there could still be 'valid' test failures in there.

@boegel (Member, Author) commented Jun 17, 2022

@casparvl As soon as the PyTorch test command fails (non-zero exit code), a warning message is printed (to stderr).
When one or more failing tests are found based on the regex, the test names are listed too.

See this output (from a test report in #15137):

WARNING: Test command had non-zero exit code (1)!


WARNING: 5 tests (out of 89230) failed:
* distributed/fsdp/test_fsdp_input
* distributed/fsdp/test_fsdp_memory
* distributed/fsdp/test_fsdp_overlap
* distributed/test_c10d_gloo
* test_autograd
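
A minimal sketch of the kind of logic that could produce warnings like the above (hypothetical names and regex, not the actual easyblock code):

    import re
    import sys

    def check_test_results(test_output, exit_code, max_failed_tests=10):
        """Print warnings about failed tests; only raise when too many failed."""
        if exit_code != 0:
            print("WARNING: Test command had non-zero exit code (%s)!" % exit_code, file=sys.stderr)

        # assumption: failing suites are reported as "<name> failed!" in the output
        failed_tests = sorted(set(re.findall(r"^(\S+) failed!", test_output, re.M)))
        if failed_tests:
            msg = "%d tests failed:\n" % len(failed_tests)
            msg += '\n'.join("* %s" % name for name in failed_tests)
            print("WARNING: " + msg, file=sys.stderr)

        if len(failed_tests) > max_failed_tests:
            raise RuntimeError("Too many failed tests (%d), more than maximum allowed (%d)!"
                               % (len(failed_tests), max_failed_tests))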

@boegel marked this pull request as ready for review on June 20, 2022 14:52
@casparvl (Contributor) commented:

Oh ok, that looks fine to me :)

Seeing that output, I'm only wondering one thing: should we give the user some advice (below the list of failing tests)? I.e. something along the lines of: "The PyTorch test suite often has a small number of failing tests. As long as there are no more than self.cfg['max_failed_tests'] of them, EasyBuild completes the installation, since it is often the tests themselves that are broken, not your installation. However, you may want to double-check that your installation works as expected." (I'm not actually sure how to phrase it, but some indication of what to do with this warning.)

@boegel (Member, Author) commented Jun 24, 2022

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.9.0-fosscuda-2020b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3902.accelgor.os - Linux RHEL 8.4, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA A100-SXM4-80GB, 510.73.08, Python 3.6.8
See https://gist.github.com/a52aeee7a6a9c7ba03bb5f9272d6f1a2 for a full test report.

@Flamefire (Contributor) commented:

I don't really like the approach of allowing ANY test to fail via this new option. Why not disable the tests which we know are failing as we did before?

@boegel (Member, Author) commented Jun 29, 2022

> I don't really like the approach of allowing ANY test to fail via this new option. Why not disable the tests which we know are failing as we did before?

Mainly because that's a very painful process; see how long easybuilders/easybuild-easyconfigs#15137 has been open due to flaky tests (and test reports that take half a day to run).
We're talking about 5 tests out of ~90k that are causing a lot of headaches.

If people want to be very strict and not allow any tests to fail, they can set max_failed_tests to 0 via a hook.
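
For example, a minimal sketch of such a hook (illustrative, not from this PR; parse_hook is part of EasyBuild's hook mechanism, and the hook file is passed via the --hooks option):

    # my_hooks.py -- use with: eb --hooks=my_hooks.py PyTorch-<version>.eb
    def parse_hook(ec, *args, **kwargs):
        """Be strict: don't allow any PyTorch tests to fail."""
        if ec.name == 'PyTorch':
            # override the easyblock's default for max_failed_tests
            ec['max_failed_tests'] = 0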

@branfosj (Member) commented:

Ideally we'd have no test failures, but we do see failures, and often we see a test fail on only one system / setup. I consider this PR a suitable compromise, with a suitable message going into the standard output that allows people to investigate further.

@Flamefire (Contributor) left a review comment:

> Ideally we'd have no test failures, but we do see failures, and often we see a test fail on only one system / setup. I consider this PR a suitable compromise, with a suitable message going into the standard output that allows people to investigate further.

I know. I just don't see the reason to run the tests if we allow any test to fail, rather than only those that are expected to fail / have failed before and ideally have an upstream issue to refer to.

Anyway, see the review. Basically: default to no expected failures, plus some improved handling of the error case. Especially not ignoring ec, as that could lead to a success when no tests have even started. So the only case where ec reports an error but the test step is allowed to succeed is when we have failed_test_cnt <= max_failed_tests.
Also a nit about the explicit re.compile, which I wouldn't use.
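
To illustrate the re.compile nit: for a one-off match, the pattern can be passed straight to re.findall, which compiles and caches it internally (illustrative snippet, not the PR's code):

    import re

    test_output = "test_autograd failed!\ndistributed/test_c10d_gloo failed!\n"

    # instead of compiling explicitly:
    #   regex = re.compile(r"^(\S+) failed!", re.M)
    #   failed_tests = regex.findall(test_output)
    # pass the pattern to re.findall directly; the re module caches compiled patterns:
    failed_tests = re.findall(r"^(\S+) failed!", test_output, re.M)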

(6 review comments on easybuild/easyblocks/p/pytorch.py, since resolved)
boegel added 2 commits July 6, 2022 10:40
…nstead of printing warning if no failed tests are allowed, don't compile regex used for findall
@Flamefire (Contributor) left a review comment:

One more round, as I had trouble understanding what would be placed where in the message.

(3 review comments on easybuild/easyblocks/p/pytorch.py, since resolved)
@boegel (Member, Author) commented Jul 7, 2022

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.11.0-foss-2021a-CUDA-11.3.1.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3303.joltik.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 510.73.08, Python 3.6.8
See https://gist.github.com/c78cdbcbd6ed472981c0775973ac69c1 for a full test report.

…hen some PyTorch tests have failed

Co-authored-by: Alexander Grund <Flamefire@users.noreply.github.com>
casparvl pushed a commit to sara-nl/easybuild-easyconfigs that referenced this pull request Jul 7, 2022
@casparvl (Contributor) left a comment:

lgtm!

Thanks @boegel, this will hopefully help to get future PyTorch easyconfigs merged before PyTorch actually releases a new version ;-)

@casparvl merged commit ac68b02 into easybuilders:develop on Jul 7, 2022
@boegel deleted the pytorch_allow_failed_tests branch on July 7, 2022 16:15
@boegel (Member, Author) commented Jul 7, 2022

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.3.1-foss-2019b-Python-3.7.4.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3622.doduo.os - Linux RHEL 8.4, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/155048595da30306c187d3f31513dddf for a full test report.

@boegel changed the title from "allow some PyTorch tests to fail + print warning if one or more tests fail" to "add support for allowing some PyTorch tests to fail + print warning if one or more tests fail" on Jul 8, 2022