-
Notifications
You must be signed in to change notification settings - Fork 283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for allowing some PyTorch tests to fail + print warning if one or more tests fail #2742
add support for allowing some PyTorch tests to fail + print warning if one or more tests fail #2742
Conversation
… of test command (and not fail on non-zero exit)
2870f73
to
8638c59
Compare
In easybuilders/easybuild-framework#4001 there's the log parsing tool that might be useful here. Have you tried that on the log file to see what it would spit out? |
Gave it a quick try, the result is pretty horrible tbh...
It's always going to be better to check for specific patterns in output in a software-specific easyblock, imho, since you know what to look for (but I guess you're vulnerably to the output changing over time) |
One of the changes I need to make in that PR is that it should only happen on on the configure and build steps. Generally it is a poor match for finding issues outside of those steps. |
8638c59
to
8cf908e
Compare
8cf908e
to
da23e04
Compare
From the EB chat I understood that nothing will be printed to stdout if test step passes, but has several failures. I'm not an expert on this part of the framework code, but just my 2 cents on this: I think it would be preferable if we could get something to be printed. Ideally explicitely warning that the PT easyblock accepts up to X failures, that there were Y failures in this particular build, and that the user should check them to make sure he/she thinks they are 'acceptable'. Without that, users would probably assume "oh, test step completed, so everything has to be ok", even if there could still be 'valid' test failures in there. |
@casparvl As soon as the PyTorch test command fails (non-zero exit code), a warning message will be printed (to stderr). See this output (from a test report in #15137):
|
Oh ok, that looks fine to me :) Seeing that output, I'm only wondering one thing: should we give the user some advice (below the list of failing tests)? I.e. something along the lines of "The PyTorch test suite often has a small number of tests fail. As long as these are less than |
Test report by @boegel Overview of tested easyconfigs (in order)
Build succeeded for 1 out of 1 (1 easyconfigs in total) |
I don't really like the approach of allowing ANY test to fail via this new option. Why not disable the tests which we know are failing as we did before? |
Mainly because that's a very painful process, see how long easybuilders/easybuild-easyconfigs#15137 has been open due to flaky tests (and test reports that take half a day to run). If people want to be very strict and not allow any tests to fail, they can set |
Ideally we'd have no test failures, but we do see failures and often we see a test fail for only one system / setup. I consider this PR a suitable compromise and with a suitable message going into the standard output allowing people to investigate further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we'd have no test failures, but we do see failures and often we see a test fail for only one system / setup. I consider this PR a suitable compromise and with a suitable message going into the standard output allowing people to investigate further.
I know. I just don't see the reason to run the tests if we allow any test to fail, not only those that are expected to/have failed before and ideally have an upstream issue to refer to.
Anyway see the review. Basically default to no expected failures and some improved handling of the error case. Especially not ignoring ec
as that could lead to a success when no tests have even started. So only case where ec
reports and error but the test-step is allowed to succeed is when we have failed_test_cnt <= max_failed_tests
Also a nit about the explicit re.compile
which I wouldn't use
…nstead of printing warning if no failed tests are allowed, don't compile regex used for findall
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more round as I had trouble understanding what would be placed where in the message.
Test report by @boegel Overview of tested easyconfigs (in order)
Build succeeded for 1 out of 1 (1 easyconfigs in total) |
…hen some PyTorch tests have failed Co-authored-by: Alexander Grund <Flamefire@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
Thanks @boegel , this will hopefully help to get future PyTorch EasyConfigs merged before they actually release a new version ;-)
Test report by @boegel Overview of tested easyconfigs (in order)
Build succeeded for 1 out of 1 (1 easyconfigs in total) |
Draft because:
10
is a good default formax_failed_tests