Mark for flaky tests #814
Comments
I agree with everything you wrote, but I fear the backward compatibility implications of making currently passing test suites with flaky tests (marked as xfail) suddenly fail.
I'm not sure I understand - all I'm proposing is adding a new flaky marker.
Sorry, I meant about this bit:
Oh, sorry - now I understand. That's only about the summary color though - IMHO it's acceptable that those are red on xpassed tests, as long as the exit status is still 0. It should be a good reminder for people to start using the new flaky marker.
[Thanks, @The-Compiler, for keeping this discussion going.]
At first I agreed about this, but I'm having serious second thoughts[1]. I've just done an exhaustive search of the docs for related strings: "xfail", "xpass", "flaky", "intermittent". What I've found is that all description of xfail's usage says that it's for tests that cannot succeed -- not a single mention of tests that might not succeed.[2]

So I have to wonder: how appropriate is it to preserve backward compatibility for the benefit of those who used the feature contrary to its documentation, at the expense of those who used it as the documentation says it's intended to be used? I absolutely agree on the need to be able to mark flaky tests; but once such a marker is available, does xfail really have to remain so ambiguous that it's not really useful for its originally intended, and still documented, purpose?

[1] Prompted largely by my failure to come up with nearly as good a name as "xfail" for a way to mark hard-failing tests.

[2] The only references to xfail for flaky tests are pretty oblique:
Migration strategy

This is intended to minimize disruption to both camps.
This would leave us in (what I consider to be) a desirable end state:
Thanks for writing such a comprehensive proposal @esiegerman! Perhaps you could send this proposal to the mailing list? Most users don't subscribe to the repository issues, and I think the mailing list might be a better medium for the discussion.
Also see #259 for some related issues - it seems the behaviour that
I can't say I'm a huge fan of pytest implicitly encouraging people to leave intermittently failing tests in their test suite. A test that you don't care if it passes or fails? May as well be a skip. Or just delete it, it's not telling you anything! Intermittent fails are interesting and important because they tend to reveal where your tests have hidden dependencies. At least in the current state you are forced to confront them and hopefully find the cause of the flakiness.

To me this is a pretty big change to come from literally one comment from @hpk42 - I'd like to hear more about his use case exactly before this gets much more involved.

Sorry if this comes across as overly negative - but basically I think the rationale for inclusion needs a more rigorous examination. I'm open to being convinced. :)
Hi @pfctdayelise,

Flaky tests are usually introduced in a code base due to complex code being tested (say, multi-threaded or asynchronous code) or due to platform differences (works on Windows but occasionally breaks on Linux, and vice-versa). Sometimes the flakiness is due to a problem in the test itself, and the code it tests is fine and known to work. Flaky tests are of course a problem and should be dealt with as soon as possible in order to keep the test suite working and healthy.

The major problem with flaky tests is that they disrupt other developers' builds. It's frustrating for developers working on a feature branch to have their build fail because of a completely unrelated test which was working fine in the previous build. Sometimes the developers who originally worked on the failing test are around and can take a look right away, but more often than not they might be busy with something else, like being in a meeting or too entrenched in some other feature to help right away.

Our process when someone encounters a flaky test is to:
This process prevents flaky tests, which unfortunately do happen, from disrupting the workflow of the other developers, and makes it clear that there are issues to be solved in the test suite.

Having said all that, I think the proposal of having a specific mark for flaky tests makes sense.

Hope my points make some sense, and sorry for the long post! 😅
OK, what you have said makes sense, but it sounds like an argument for showing xpass results more prominently (which I think is what happened in #776?) (which I also agree with). Actually #716 sounds like xfail behaviour has recently changed. If this is the case, I don't know what's going on anywhere any more!

My understanding of Jenkins is that it expects your test suite to "pass" (all tests act as expected) or "fail". There is no in between. I don't know if Travis or other CI systems have more nuance there. If all tests act as expected, IME nobody looks at the results.

OK, so maybe what's being proposed is this:

Have I got it right?
Having thought about it some more, maybe another idea would be to have a strict=True/False parameter for the xfail mark. (If strict, an xpass result should break the build.) If a new mark is introduced, it would need to be mindful of how it interacts with things like the --runxfail option.
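For illustration, a minimal sketch of how the suggested strict=True/False parameter could look in a test file (pytest did later grow a parameter of exactly this shape on pytest.mark.xfail; the test bodies are made-up stand-ins):

```python
import pytest


@pytest.mark.xfail(strict=True, reason="hard expectation: must fail until the bug is fixed")
def test_strict_xfail():
    # With strict=True, an unexpected pass (XPASS) turns into a failure
    # of the run, so a silently "fixed" test cannot hide.
    assert 1 + 1 == 3


@pytest.mark.xfail(strict=False, reason="flaky: passes or fails depending on timing")
def test_lenient_xfail():
    # With strict=False, an unexpected pass is merely reported as XPASS
    # and the exit status stays 0.
    assert 1 + 1 == 2
```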
Yep. 😄
That's a good point. Perhaps keeping a single mark with a flag is the way to go.
xfail originally expected the test to fail: if it didn't fail and XPASSed instead, the test job failed with a non-zero exit code and was red. This was changed, apparently to handle flaky tests. I'm sure I found the reasoning for the change on the mailing list, and also the version it changed in, but sorry, I didn't keep the references.

There appears to be concern about breaking backward compatibility now that xfail is used for flaky tests and can either XFAIL or XPASS with a zero exit code and green. However, the flip side is that backward compatibility was already broken by that first change. Test suites that expected xfail to mean "the test must fail", and to fail the job with a non-zero exit code and red on an XPASS, are now broken. Even though it is a silent failure, those tests are no longer working as they were written.

So changing xfail back to the original strict behaviour means putting up with tests failing loudly because they are flaky and need the mark updated. Alternatively, keeping xfail with the current flaky behaviour means putting up with tests silently XPASSing, which hides problems and also needs the mark updated. One is loud and one is silent, but they are both important.

Given that both options have caused backward compatibility problems, I'd prefer the semantic naming to make sense. In #716 I suggested xfail and xfailorxpass for flaky. Having xfail and flaky is fine (and possibly better than xfailorxpass). If xfail is going to be overloaded with a flag and used for both, then perhaps we can have a global setting in conftest rather than a default of flaky.

On our project I changed our conftest to detect XPASS tests and set the exit code back to non-zero. We have over 10,000 tests in pytest which are run under Jenkins, and we need xfails to match the expected outcome, which is to fail; if for some reason a test XPASSes, we need the non-zero exit code so that we can be alerted to this as a failed build.
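For reference, one hedged sketch of the conftest approach described above - not the commenter's actual code; the hook and plugin names are real pytest API, but the exact behaviour of older pytest versions may differ:

```python
# conftest.py -- flip the exit status to non-zero whenever any test XPASSes.

def pytest_sessionfinish(session, exitstatus):
    reporter = session.config.pluginmanager.get_plugin("terminalreporter")
    if reporter is None:
        return
    # The terminal reporter collects reports of unexpectedly passing
    # xfail-marked tests under the "xpassed" category.
    if reporter.stats.get("xpassed"):
        session.exitstatus = 1
```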
+1 on xpass being considered a problem. That confused me when I first started using py.test.
I'm starting to think that breaking the compatibility is OK, as fixing it usually is simple enough. I would like to see this in the core. How about:
I volunteer to work on a PR once we agree on what should be implemented.
I like @pfctdayelise's suggestion of adding a strict parameter to the xfail mark. But I also think her argument for not introducing a flaky mark is fairly sound; would it make sense, and be possible, to implement the flaky mark outside of the core, i.e. as a plugin? We could still refer to that plugin in the official xfail documentation.
Preference for a separate marker, since both have clear, separate semantics. I would suggest staying consistent with the xfail options, so a flaky marker and a --runflaky option. That said, I'm fine with the option to overload xfail, though I'd like to see a default in conftest / ini so we can get back to the original behaviour from before it changed to allow any result.
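For illustration, a suite-wide default of the kind suggested here could live in the ini file; the snippet below uses the xfail_strict option that pytest eventually added (it did not exist at the time of this discussion):

```ini
# pytest.ini
[pytest]
xfail_strict = true
```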
Sorry, I didn't notice this issue before and just implemented my own plugin to deal with "flaky" tests: https://github.com/schettino72/pytest-ignore-flaky My approach is to report a flaky test normally if it passes, and as an expected failure (XFAIL) if it fails.
A harsh note as an opposing view - flaky tests are an indication of fundamental flaws. I'd prefer a lib with retry utilities that has test framework integration.
Agreed, but please note that
That would be useful, but as it stands we are close to having something that is also useful in its own right. (So that other thing, which would be useful, should not be considered a blocker to making smaller progress in the meantime.) XPASSes should be considered a fail (as they were several releases ago), and with the plugin's flaky mark they would not be considered a fail. So, given this plugin, it seems to me the task required is to make XPASSes count as failures, and to highlight this plugin in the release notes and docs.
Hmmm, agree... people who currently rely on
Yes, flaky tests indicate a flaw, but they are a fact of life that many projects have to live with. I think both the ignoring and the re-running approaches are useful. Let's say a test depends on an external resource that is temporarily unavailable: even if you re-try the test, it will keep failing, because the retries happen in a short span of time.
From my POV, a library that does exponential-backoff retries and then fails the test with an xfail reporting the external resource as gone missing is much preferable to rerunning a whole test n times and then getting a worse explanation.
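A hedged sketch of that idea, assuming a homemade decorator rather than any existing library: retry the test body with exponential backoff and, if the external resource never answers, report it via pytest.xfail instead of a plain failure.

```python
import functools
import time

import pytest


def retry_backoff(attempts=3, base_delay=0.5, resource="external resource"):
    """Hypothetical helper: retry on ConnectionError with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == attempts:
                        # Name the missing resource in the report instead of
                        # surfacing an opaque repeated failure.
                        pytest.xfail(f"{resource} unavailable after {attempts} attempts")
                    time.sleep(delay)
                    delay *= 2  # exponential backoff
        return wrapper
    return decorator


@retry_backoff(attempts=3, base_delay=0.1, resource="license server")
def test_talks_to_license_server():
    # Stand-in for a call that sometimes cannot reach the server.
    raise ConnectionError("server not reachable")
```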
I agree with the suggestion that XPASSes should cause a failing exit code. I personally believe that flaky tests should either be fixed or deleted, but I do understand the desire to ignore or retry tests marked as flaky. I would even consider using such a plugin myself, knowing that these ignores/retries would apply to tests that I have explicitly marked and therefore understand the cost. Having a reason supported in the mark, and included in the console/reports, would also be useful, to encourage understanding the cause of flakiness and to explain it to other test authors.
@RonnyPfannschmidt @schettino72 @davehunt There already exist multiple plugins with support for re-running flaky tests: https://github.com/gocept/pytest-rerunfailures is one. As someone who has to deal with a flaky product, I think flaky tests should be avoided at all costs. They only provide misinformation and lead to devs & testers ignoring actual failures. I would have to come down on the side of 'XPASS = FAIL', and not bother with any core pytest functionality for supporting flakiness. If people want support for it, there's plenty as plugins.
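For reference, re-run plugins of this kind are typically driven by a marker or a command-line flag; a minimal example based on pytest-rerunfailures' documented @pytest.mark.flaky(reruns=...) marker (the test body is a made-up stand-in):

```python
import random

import pytest


@pytest.mark.flaky(reruns=3)  # provided by the pytest-rerunfailures plugin
def test_occasionally_fails():
    # Re-run up to 3 times before the failure is reported as final.
    assert random.random() > 0.2
```

The same behaviour can also be applied suite-wide with pytest --reruns 3.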
These are my thoughts exactly. I like the idea of a
Just to add to the discussion, my opinion is that xfail should really fail if the test doesn't raise an error. Flaky tests shouldn't be run at all, as they don't add anything useful to the run (thus, the proper way here would be skipping the test if it's flaky) -- mixing something such as xfail (which to me reads as "expected to fail") with flakiness seems a bit weird.

As for backward compatibility, I wouldn't worry so much: if people marked things as xfail meaning "this is a flaky test", they can switch it to a skip with a simple find/replace.

Note that I'm OK with a flaky mark being added for that use case, if only so it isn't confused with a "skip because we don't run on this platform" (but it still needs to have the same meaning as skip: there's no reason to run a flaky test if, regardless of the outcome, it'll always pass).
Closing as superseded |
I'd like to continue the discussion from #776 here. What I said there:
I agree there should be some difference between "is expected to fail" and "is flaky".
I personally use xfail for "I found a bug while writing tests, or found a bug and wrote a test for it - but I don't want to fix it immediately, and I don't want my CI failing because of it". (I don't have any flaky tests, fortunately.)

IMHO, xpassed should be red, and a new pytest.mark.flaky marker should be added - then it's also immediately clear for someone reading the tests that this test is flaky, not expected to fail.

cc @esiegerman
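To make the proposal concrete, a hedged illustration of how the suggested (at that point hypothetical) marker could sit next to a genuine xfail; the marker's name and options were still under discussion, and the test bodies are made-up stand-ins:

```python
import pytest


@pytest.mark.xfail(reason="known bug: empty input is rejected")
def test_known_bug():
    # Genuinely expected to fail until the bug is fixed; an unexpected
    # pass should be surfaced loudly.
    assert int("") == 0


@pytest.mark.flaky(reason="times out sporadically on slow CI workers")
def test_sometimes_slow():
    # Hypothetical marker from this proposal: intermittent by nature, so
    # neither outcome should be treated like an xfail.
    assert sum(range(10)) == 45
```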