
Mark for flaky tests #814

Closed
The-Compiler opened this issue Jul 3, 2015 · 30 comments
Labels
type: proposal proposal for a new feature, often to gather opinions or design the API around the new feature

Comments

@The-Compiler
Member

I'd like to continue the discussion from #776 here. What I said there:

I agree there should be some difference between "is expected to fail" and "is flaky".

I personally use xfail for "I found a bug while writing tests, or found a bug and wrote a test for it - but I don't want to fix it immediately, and I don't want my CI failing because of it". (I don't have any flaky tests, fortunately)

IMHO, xpassed should be red, and a new pytest.mark.flaky marker should be added - then it's also immediately clear for someone reading the tests that this test is flaky, not expected to fail.
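
For illustration, here is roughly how the two marks would read side by side; pytest.mark.flaky does not exist at this point, so this is only a sketch of the proposal (the reasons and the issue number are made up):

    import pytest

    # Existing mark: expected to fail every time until the (hypothetical) bug is fixed.
    @pytest.mark.xfail(reason="known bug, see issue #1234")
    def test_known_bug():
        ...

    # Proposed mark (does not exist): fails intermittently for reasons not yet
    # understood; it should never break the build, only be reported.
    @pytest.mark.flaky(reason="intermittent timeout on slow CI workers")
    def test_sometimes_times_out():
        ...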

cc @esiegerman

@nicoddemus
Member

I agree with everything you wrote, but I fear the backward-compatibility implications: test suites that currently pass with flaky tests marked as xfail would start to fail after upgrading pytest... 😟

@The-Compiler
Member Author

I'm not sure I understand - all I'm proposing is adding a new pytest.mark.flaky mark in addition to the existing xfail one, and not changing xfail 😉

@nicoddemus
Member

Sorry, I meant about this bit:

IMHO, xpassed should be red, and a new...

@The-Compiler
Member Author

Oh, sorry - now I understand.

That's only about the summary color though - IMHO it's acceptable for those to be red on xpassed tests, as long as the exit status is still 0. It would be a good reminder for people to start using the flaky marker, without breaking anything that uses CI or similar.

@esiegerman

[Thanks, @The-Compiler, for keeping this discussion going.]

... backward compatibility implications ... [for] flaky tests

At first I agreed about this, but I'm having serious second thoughts[1].

I've just done an exhaustive search of the docs for related strings: "xfail", "xpass", "flaky", "intermittent". What I've found is that every description of xfail's usage says it's for tests that cannot succeed -- not a single mention of tests that might not succeed.[2]

So I have to wonder: how appropriate is it to preserve backward compatibility for the benefit of those who used the feature contrary to its documentation, at the expense of those who used it as the documentation says it's intended to be used?

I absolutely agree on the need to be able to mark flaky tests; but once such a marker is available, does xfail really have to remain so ambiguous that it's not really useful for its originally intended, and still documented, purpose?

[1] prompted largely by my failure to come up with nearly as good a name as "xfail" for a way to mark hard-failing tests

[2] The only references to xfail for flaky tests are pretty oblique:

  • one example reason string in a changelog entry from v1.3.2, discussing something else entirely:

    Funcarg factories can now dynamically apply a marker to a test invocation. This is for example useful if a factory provides parameters to a test which are expected-to-fail:

    def pytest_funcarg__arg(request):
        request.applymarker(py.test.mark.xfail(reason="flaky config"))
        ...

    def test_function(arg):
        ...

    And even this is ambiguous -- the reason string says "flaky", but the text says "expected-to-fail".

  • a couple of changelog entries refer to changes to xfail's behaviour that seem to make more sense in this context -- but that doesn't necessarily indicate a change of intention for xfail, just that even then, people were trying to shoehorn it into a use case for which it was never intended

Migration strategy

This is intended to minimize disruption to both camps.

  1. Initially:
    • Add a flaky() marker and function, explicitly documented as being for intermittently failing tests; the outcome of a test so marked (whether it passes or fails) should not affect either the summary bar's color or pytest's exit status. flaky() could have report_if_passes and report_if_fails boolean arguments (or, equivalently, report_if={"passes" | "fails" | "both"}) that determine whether a line is printed for it in the list at the end of the run. (It could even have warn_if, similarly; but fail_if is of course pointless.) A usage sketch of these names follows this list.
    • Add a boolean flaky parameter to xfail(); if True, treat the test as described above for flaky() (i.e. what xfail() does now); if False, treat a "passing" test as a hard failure -- red summary bar and non-zero exit status
    • Have that flaky= parameter default to True; this preserves existing behaviour for preexisting code
    • Issue: should the flaky= parameter be documented right from the start as a temporary, transitional feature?
  2. After some appropriate amount of time has passed -- and with much warning to users:
    • Flip the default value of xfail()'s flaky= parameter to False -- this makes xfail() with no parameter behave as one wants for a test that is always expected to fail
    • Deprecate flaky=True (but not flaky=False)
    • (optionally) Print a deprecation warning if True is passed
  3. After some further amount of time has passed:
    • Remove support and documentation for xfail(flaky=True) ("remove support" means: treat it as an error)
    • Document flaky=False as historical and with no effect
  4. (optionally) Deprecate xfail(flaky=False), and still later, remove the flaky= parameter entirely
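
A rough usage sketch of the names proposed above. None of these arguments exist in pytest; report_if and flaky= are the hypothetical parameters from steps 1 and 2:

    import pytest

    # Step 1: the new marker; it never affects the exit status, only the report.
    @pytest.mark.flaky(reason="intermittent network timeout", report_if="fails")
    def test_sometimes_fails():
        ...

    # Step 1: transitional xfail; flaky=True keeps today's lenient behaviour.
    @pytest.mark.xfail(reason="flaky, being investigated", flaky=True)
    def test_legacy_flaky():
        ...

    # Step 2 onward: flaky=False (later the default) makes an unexpected pass a
    # hard failure: red summary bar and non-zero exit status.
    @pytest.mark.xfail(reason="always fails until the parser is rewritten", flaky=False)
    def test_hard_xfail():
        ...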

This would leave us in (what I consider to be) a desirable end state:

  • there's a way to mark tests that fail intermittently
  • xfail isn't it :-)
  • xfail is back to meaning a test is expected to fail every time; if the test passes, it's treated exactly as severely as the failure of a normal (i.e. unmarked) test

@nicoddemus nicoddemus added the type: proposal proposal for a new feature, often to gather opinions or design the API around the new feature label Jul 5, 2015
@nicoddemus
Member

Thanks for writing such a comprehensive proposal @esiegerman!

Perhaps you could send this proposal to the mailing list? Most users don't subscribe to the repository issues, and I think the mailing list might be a better medium for the discussion.

@The-Compiler
Member Author

Also see #259 for some related issues - it seems the behaviour where an xpass exited as an error already existed in earlier versions?

@pfctdayelise
Contributor

I can't say I'm a huge fan of pytest implicitly encouraging people to leave intermittently failing tests in their test suite. A test that you don't care if it passes or fails? May as well be a skip. Or just delete it, it's not telling you anything!

Intermittent fails are interesting and important because they tend to reveal where your tests have hidden dependencies. At least in the current state you are forced to confront them and hopefully find the cause of the flakiness. To me this is a pretty big change, to come from literally one comment from @hpk42 - I'd like to hear more about his use case exactly before this gets much more involved.

Sorry if this comes across as overly negative - but basically I think the rationale for inclusion needs a more rigorous examination. I'm open to being convinced. :)

@nicoddemus
Member

Hi @pfctdayelise,

Flaky tests are usually introduced in a code base due to complex code being tested (say, multi-threaded or asynchronous code) or due to platform differences (works on Windows but occasionally breaks on Linux, and vice versa). Sometimes the flakiness is due to a problem in the test itself, and the code it tests is fine and known to work. Flaky tests are of course a problem and should be dealt with as soon as possible in order to keep the test suite working and healthy.

The major problem with flaky tests is that they disrupt other developers' builds. It's frustrating for developers working on a feature branch to have their build fail because of a completely unrelated test which was working fine in the previous build. Sometimes the developers who originally worked on the failing test are around and can take a look right away, but more often than not they are busy with something else, like being in a meeting or too entrenched in some other feature to help right away.

Our process when someone encounters a flaky test is to:

  1. First, see if you can get in touch with the person who wrote the test to check if they can take a look at the problem and attempt to fix it right away;
  2. If that person is not available, take a look at the test yourself and perhaps fix it; sometimes the cause of the flakiness is obvious (such as tests running in parallel that share a resource, like a file);
  3. As a last resort, open a ticket for the flaky test addressed to a developer and mark the test as xfail with a reason that includes the ticket number and any other relevant information (for example, as sketched below). This ticket is usually put at the top of the backlog so it can be worked on as soon as possible.
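
For example, step 3 typically ends up looking something like this (the ticket ID and reason are made up):

    import pytest

    # Last resort: park the flaky test behind xfail, pointing at the tracking ticket.
    @pytest.mark.xfail(reason="flaky: shares a temp file with test_export when run "
                              "in parallel; see ticket ABC-123")
    def test_import_report():
        ...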

This process prevents flaky tests, which unfortunately do happen, from disrupting the workflow of the other developers and makes it clear that there are issues to be solved in the test suite.

Having said all that, I think the proposal of having a specific mark for flaky, meaning "don't break the test suite if this fails, but note when it passes", is sound. It has a different purpose than a skip, which usually means "this test should be skipped because of reasons", the reasons being things like not being able to run on a certain platform, Python version, OS, etc. In other words, flaky tests should send the message "this should be looked into", while skipped tests are OK and not a problem. Having a mark for tests that are expected to fail and don't break the suite when they do, but that break the suite if they suddenly start to pass, also has its uses: in some situations you write a test for a feature which is failing right now, but once it is fixed the mark should be removed and the test promoted to a "normal" test. With xfail's current behaviour it is easy for developers to fail to notice this and leave the mark in there forever, eternally xpassing.

Hope my points make some sense, and sorry for the long post! 😅

@pfctdayelise
Contributor

OK, what you have said makes sense, but it sounds like an argument for showing xpass results more prominently (which I think is what happened in #776?), which I also agree with.

Actually #716 sounds like xfail behaviour has recently changed. If this is the case, I don't know what's going on anywhere any more!

My understanding of Jenkins is that it expects your test suite to "pass" (all tests act as expected) or "fail". There is no in between. I don't know if Travis or other CI systems have more nuance there. If all tests act as expected, IME nobody looks at the results.

OK so maybe what's being proposed is this:
CURRENTLY/PREVIOUSLY: xfail test that xpasses - happens silently (doesn't break a build)
#776: xfail test that xpasses - happens noisily (will break a build)
PROPOSAL HERE: xfail test that xpasses - happens silently - to be called 'flaky' (won't break a build)

Have I got it right?

@pfctdayelise
Contributor

Having thought about it some more, maybe another idea would be to have a strict=True/False parameter for the xfail mark. (If strict, an xpass result should break the build.) If a new mark is introduced, it would need to be mindful of how it interacts with things like the --runxfail option.
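
A sketch of what that suggestion could look like; at the time of this discussion xfail had no strict parameter (pytest later shipped essentially this idea, together with an xfail_strict ini option):

    import pytest

    # strict=True: an unexpected pass (XPASS) fails the run.
    @pytest.mark.xfail(strict=True, reason="parser chokes on empty input")
    def test_empty_input():
        ...

    # strict=False: an unexpected pass is only reported; the build stays green.
    @pytest.mark.xfail(strict=False, reason="occasionally passes on fast machines")
    def test_timing_sensitive():
        ...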

@nicoddemus
Member

Have I got it right?

Yep. 😄

If a new mark is introduced, it would need to be mindful of how it interacts with things like the --runxfail option.

That's a good point. Perhaps keeping a single mark with a flag is the way to go.

@dariuspbom

xfail originally expected the test to fail; if it didn't, and xpassed instead, the test job was failed with a non-zero exit code and red.

This was changed, apparently to handle flaky tests. I'm sure I found the reason for the change on the mailing list, and also the version it changed in; sorry, I didn't keep the references.

There appears to be concern about breaking backward compatibility now that xfail is used for flaky tests and can either xfail or xpass with a zero exit code and green.

However, the flip side is that backward compatibility was already broken by the first change. Test suites that expected xfail to mean the test must fail, and an xpass to fail the test job with a non-zero exit code and red, are now broken. Even though it is a silent failure, those tests no longer work as they were written.

So one option is changing xfail back to the original strict behaviour and putting up with tests failing loudly because they are flaky and need their mark updated.

The alternative is keeping xfail with the current flaky behaviour and putting up with tests silently xpassing, which hides problems and also needs the mark updated.

One is loud and one is silent, but they are both important.

Given that both options have caused backward compatibility problems I'd prefer the semantic naming to make sense. In #716 I suggested xfail and xfailorxpass for flaky. Having xfail and flaky is fine (and possibly better than xfailorxpass).

If xfail is going to be overloaded with a flag and used for both, then perhaps we can have a global setting in conftest rather than a default of flaky.

On our project I changed our conftest to detect xpassed tests and set the exit code back to non-zero. We have over 10,000 tests in pytest, which is run under Jenkins, and we need xfail to match the expected outcome, which is to fail; if for some reason a test xpasses, we need the non-zero exit code so that we can be alerted to it as a failed build.
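
A minimal sketch of that kind of conftest.py hook, not the poster's actual code; the hook and attribute names assume a reasonably recent pytest, where an xpassed call shows up as a passed report carrying a wasxfail attribute:

    # conftest.py
    xpassed_tests = []

    def pytest_runtest_logreport(report):
        # Collect XPASS results: a passed "call" report that still carries `wasxfail`.
        if report.when == "call" and report.passed and hasattr(report, "wasxfail"):
            xpassed_tests.append(report.nodeid)

    def pytest_sessionfinish(session, exitstatus):
        # Force a non-zero exit code so Jenkins marks the build as failed.
        if exitstatus == 0 and xpassed_tests:
            session.exitstatus = 1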

See also #716 #259 #810 #776

@barneygale
Contributor

+1 on xpass being considered a problem. That confused me when I first started using py.test.

@nicoddemus
Member

I'm starting to think that breaking compatibility is OK, as fixing it is usually simple enough.

I would like to see this in 2.8, but first we should reach a consensus and gather opinions from other pytest developers.

How about:

  • New mark, pytest.mark.flaky(reason, run=True): tests with this mark never break the build. I propose k as the status letter for it, and in the summary line we display just how many flaky tests executed, without any mention of whether they passed or not. For example:

    test_foo.py .x...k...k...
    
    == 10 passed, 2 flaky, 1 xfailed in 0.03 seconds ==
    

    No new command line option related to flaky tests, keeping --runxfail exactly as it is today (dealing with xfail markers only); if one wants to execute just flaky tests there's the option of using -m flaky.

  • pytest.mark.xfail: tests with this mark that pass break the build (return non-zero).

  • Optional: add a new ini-option that enables the new xfail behavior, defaulting to False. In 2.9, we change the default to True, so there would be plenty of time to update test suites that contain flaky tests.

I volunteer to work on a PR once we agree on what should be implemented.

@flub
Member

flub commented Jul 18, 2015

I like @pfctdayelise's suggestion of adding xfail(strict=True) so you can make xpass result in failures again. Maybe even a way to default to that in pytest.ini.

But I also think her argument for not introducing a flaky mark is fairly sound; would it make sense, and be possible, to implement the flaky mark outside of the core, i.e. as a plugin? We could still refer to that plugin in the official xfail documentation.

@dariuspbom

My preference is for a separate marker, since the two have clear, separate semantics. I would suggest staying consistent with the xfail options, so a flaky marker and a --runflaky option.

That said, I'm fine with the option to overload xfail, though I'd like to see a default in conftest / ini so we can get back to the original behaviour before it changed to allow any result.

@nicoddemus
Member

Just noticed that unittest has an @expectedFailure decorator with the same semantics proposed here (failures are OK, unexpected passes are not OK). I think this makes a stronger case for a separate marker.

@schettino72
Member

Sorry, I didn't notice this issue before and just implemented my own plugin to deal with "flaky" tests: https://github.com/schettino72/pytest-ignore-flaky

My approach is to report a flaky test normally as passed when it passes, and as xfail when it fails.
The implementation is really simple and backwards compatible.
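
For illustration, the core of such a plugin can be a single report hook along these lines. This is a simplified sketch, not the actual pytest-ignore-flaky code, and marker lookup and report attributes vary a bit between pytest versions:

    import pytest

    @pytest.hookimpl(hookwrapper=True)
    def pytest_runtest_makereport(item, call):
        outcome = yield
        report = outcome.get_result()
        if report.when == "call" and report.failed and item.get_closest_marker("flaky"):
            # Re-classify the failure the same way the xfail machinery does:
            # a skipped report carrying `wasxfail` is shown as "xfailed" and
            # does not fail the run.
            report.outcome = "skipped"
            report.wasxfail = "flaky test"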

@RonnyPfannschmidt
Member

A harsh note as an opposing view: flaky tests are an indication of fundamental flaws. I'd prefer a lib with retry utilities that has test framework integration.

@nicoddemus
Member

flaky tests are an indication of fundamental flaws

Agreed, but please note that xfail as it currently stands already works like flaky; that's why the proposal is to add a new mark or a strict parameter to xfail.

@pfctdayelise
Contributor

@RonnyPfannschmidt

A harsh note as an opposing view: flaky tests are an indication of fundamental flaws. I'd prefer a lib with retry utilities that has test framework integration

That would be useful, but as it stands we are close to having something that is also useful in its own right. (So that other thing, useful as it would be, should not be considered a blocker to making smaller progress in the meantime.) XPASSes should be considered a fail (as they were several releases ago), and with the plugin's flaky mark they would not be considered a fail.

So, given this plugin, it seems to me the task required is to make XPASSes considered a fail, and highlight this plugin in release notes and docs.

@nicoddemus
Member

So, given this plugin, it seems to me the task required is to make XPASSes considered a fail, and highlight this plugin in release notes and docs.

Hmmm, agreed... people who currently rely on xfail behaving like "flaky" will have to start using the plugin, but I think that's a reasonable trade-off.

@schettino72
Member

A harsh note as an opposing view: flaky tests are an indication of fundamental flaws. I'd prefer a lib with retry utilities that has test framework integration

Yes, flaky tests indicate a flaw, but they are a fact of life that many projects have to live with.
I agree this plugin should not be in core, because people really should use it only as a last resort.

I think both approaches, ignoring and re-running tests, are useful. Say a test depends on an external resource that is temporarily unavailable: even if you re-try the test it will keep failing, because the retries happen within a short span of time.

@RonnyPfannschmidt
Member

From my POV, a library that does exponential-backoff retries and then fails the test with an xfail reporting the external resource as gone missing is much preferable to rerunning a whole test n times and getting a worse explanation.
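
As a sketch of that idea (the names here are illustrative, not from any particular library): retry the external check with exponential backoff and, if it never recovers, xfail with a reason naming the missing resource:

    import time

    import pytest

    def wait_for(check, attempts=5, base_delay=0.5):
        """Retry `check` with exponential backoff; return True as soon as it succeeds."""
        for attempt in range(attempts):
            if check():
                return True
            time.sleep(base_delay * 2 ** attempt)
        return False

    def test_uses_external_service(service):  # `service` is a hypothetical fixture
        if not wait_for(service.is_up):
            pytest.xfail("external service did not come up after retries")
        assert service.fetch("ping") == "pong"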

@davehunt
Contributor

I agree with the suggestion that xpasses should cause a failing exit code. I personally believe that flaky tests should either be fixed or deleted, but I do understand the desire to ignore or retry tests marked as flaky. I would even consider using such a plugin myself, knowing that these ignores/retries would apply to tests that I have explicitly marked, and that I therefore understand the cost. Having a reason supported in the mark that's included in the console/reports would also be useful, to encourage understanding the cause of flakiness and to explain it to other test authors.

@astraw38

@RonnyPfannschmidt @schettino72 @davehunt

There already exist multiple plugins with support for re-running flaky tests: https://github.com/gocept/pytest-rerunfailures is one.

As someone who has to deal with a flaky product, I think flaky tests should be avoided at all costs. They only provide misinformation and lead to devs & testers ignoring actual failures. I would have to come in on the side of 'XPASS = FAIL', and not bother with any core pytest functionality for supporting flakiness. If people want support for it, there's plenty in plugins.

@codewarrior0
Contributor

A test that you don't care if it passes or fails? May as well be a skip. Or just delete it, it's not telling you anything!

These are my thoughts exactly.

I like the idea of a strict parameter to xfail, though. I like finding out when I accidentally fixed an xfailing test, so I can take off the xfail marker for the next time I break it.

@fabioz
Contributor

fabioz commented Dec 29, 2015

Just to add to the discussion, my opinion is that xfail should really fail if the test doesn't raise an error.

Flaky tests shouldn't be run at all, as they don't add anything useful to the run (thus the proper way here would be skipping the test if it's flaky). Making something such as xfail (which to me reads as "xfail: expected to fail") double as a flakiness marker seems a bit weird. As for backward compatibility, I wouldn't worry so much: if people marked things as xfail meaning "this is a flaky test", they can switch it to a skip with a simple find/replace.

Note that I'm OK with a flaky mark being added for that use case, as long as it isn't confused with a "skip because we don't run on this platform" (but it still needs to have the same meaning as skip: there's no reason to run a flaky test if, regardless of the outcome, it'll always pass).

@RonnyPfannschmidt
Member

Closing as superseded
