Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge rerun results when --flaky-test-attempts is used #456

Closed
bootstraponline opened this issue Jan 14, 2019 · 10 comments
Closed

Merge rerun results when --flaky-test-attempts is used #456

bootstraponline opened this issue Jan 14, 2019 · 10 comments

Comments

@bootstraponline
Copy link
Contributor

bootstraponline commented Jan 14, 2019

flakyTestAttempts runs tests multiple times. Currently any failures will be reported as a failure in Flank, even if the test passes on retry.

Fetch the test rerun results, and consider the full history of a test to determine if it's passed or not.

@gtroshin
Copy link

gtroshin commented Jan 18, 2019

Will it also fix Exited with code 1 in the end of Flank run when there is flaky-test-attempts and more retried tests passed than failed?

I went through the discussion in Slack. I have an idea of how it can works with xml report out:

  1. A test case runs at the first time. If it passes, Flank proceeds to the next test.
  2. If the test fails, Flank runs it again until it passes or fails x (flaky-test-attempts) times.
  3. The most frequent outcome is recorded as the test result.
  4. If the test result differs between test runs, the test is marked as unstable:
  • More passed attempts then failed: Passed - Unstable
  • More failed attempts then passed: Failed - Unstable

I have used similar flaky-test approach with another ui testing framework quite successful.

@bootstraponline
Copy link
Contributor Author

Will it also fix Exited with code 1 in the end of Flank run when there is flaky-test-attempts and more retried tests passed than failed?

Yep! I think the idea is if the test passes at least once then it's considered passed.

@valeraz valeraz self-assigned this Mar 6, 2019
@valeraz valeraz changed the title Report on last run when using flakyTestAttempts Merge rerun results when --flaky-test-attempts is used Mar 6, 2019
@valeraz
Copy link
Contributor

valeraz commented Mar 7, 2019

Current situation. If --flaky-test-attempts is set to >0 and any of the tests in the matrix fail, FTL will rerun the matrix and place the rerun results in a subdirectory called rerun_#. Here's a sample run with flaky-test-attempts=3 where one test always passed and one always failed:

results/2019-03-07_00-15-58.110000_IbGW
results/2019-03-07_00-15-58.110000_IbGW/JUnitReport.xml
results/2019-03-07_00-15-58.110000_IbGW/CostReport.txt
results/2019-03-07_00-15-58.110000_IbGW/HtmlErrorReport.html
results/2019-03-07_00-15-58.110000_IbGW/flank.yml
results/2019-03-07_00-15-58.110000_IbGW/MatrixResultsReport.txt
results/2019-03-07_00-15-58.110000_IbGW/shard_0
results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait
results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait/test_result_1.xml
results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait-rerun_2
results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait-rerun_2/test_result_1.xml
results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait-rerun_3
results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait-rerun_3/test_result_1.xml
results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait-rerun_1
results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait-rerun_1/test_result_1.xml
results/2019-03-07_00-15-58.110000_IbGW/matrix_ids.json

Right now, flank will aggregate all the result producing a JUnitReport.xml with 8 total tests and 4 failures. Instead of 8 result, we'd really like 2 results (one for each unique test executed).

Proposed solution:

  1. Merge the redundant test results from each shard into the top-level xml result file (results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait/test_result_1.xml) in the following way:
  • if a test passed in at least one of the attempts (initial or rerun), mark it as passed
  • if it failed in all attempts, mark as failed. There is a possible edge case that a test can fail in multiple ways. To keep it simple, I propose to use the last failure result for the <failure> element.
  1. If a test ultimately passed, but was flaky (i.e. failed either the first time or in at least once in one of the reruns), mark the test as flaky in the xml file like so: <testcase name="testPasses" classname="com.example.app.ExampleUiTest" time="0.0" "flaky="true">. A possible future optimization, we could report the flaky rate (number of failures/total attempts), but I'd like to keep the initial version simple.

An alternative to 2. could be to dump the flaky info to a separate file inside each shard, but to me, it seems like keeping it in one location would be desirable.

A couple of questions I haven't looked into yet:

  • will flank roll up the flaky="true" attributes to the top-level JUnitReport.xml? (if not, we'd need to do that)
  • what happens if repeatTests and flakyTestAttempts is used in the same config? I don't see a case where that would be needed. If so, we should probably add that validation to the doctor.

@winterDroid
Copy link
Contributor

Sounds like a great idea. But I'd suggest valuing the JUnit XSD specifications, meaning I think the (2) approach would be more suitable, even though (1) would be more userfriendly. Maybe we could also dump the flaky results into the properties of the report to contain e.g. a comma-separated list of all flaky testcases?

@bootstraponline
Copy link
Contributor Author

  • what happens if repeatTests and flakyTestAttempts is used in the same config?

I think we'd want to repeat them normally. It does seem like a weird use case.

  • will flank roll up the flaky="true" attributes to the top-level JUnitReport.xml? (if not, we'd need to do that)

I think modifying the report to include the data we need makes sense. The iOS JUnit is already custom from FTL so following JUnit XSD exactly isn't possible.

@valeraz
Copy link
Contributor

valeraz commented Mar 7, 2019

  • what happens if repeatTests and flakyTestAttempts is used in the same config?

I think we'd want to repeat them normally. It does seem like a weird use case.

I thought the whole idea of repeatTests was to measure flakiness of a test. It would seem redundant to use that in conjunction with flakyTestsAttempts, since you'd want the raw data anyway. Or am I missing another use case?

@valeraz
Copy link
Contributor

valeraz commented Mar 7, 2019

A couple more small questions

  1. since we're now merging the xml from the rerun directories (e.g. results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait-rerun_1/test_result_1.xml), do we still want to leave them in the results directory or delete them to avoid confusion?
  2. do we tweak the name of the aggregate shard result xml file to imply that it was merged? (e.g. results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait/test_result_merged.xml)

@bootstraponline
Copy link
Contributor Author

I thought the whole idea of repeatTests was to measure flakiness of a test.

That's my use case for it. The idea is simply repeating tests though. If I wanted to check for something else, like performance measurement to average the metrics, we may want both repeat and flaky test attempts.

  1. since we're now merging the xml from the rerun directories (e.g. results/2019-03-07_00-15-58.110000_IbGW/shard_0/NexusLowRes-28-en-portrait-rerun_1/test_result_1.xml), do we still want to leave them in the results directory or delete them to avoid confusion?

We should probably leave them since the folder structure is intended to mirror what's on GCS. We're not deleting them on GCS so deleting them locally would be surprising I think.

2. do we tweak the name of the aggregate shard result xml file to imply that it was merged?

Yeah if we are generating the file by merging XML then naming it appropriately makes sense.

@gtroshin
Copy link

the reruns are executed in parallel

how it will consider the last rerun then?

@bootstraponline
Copy link
Contributor Author

I think we'll download the results from all runs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants