
Surface test regressions in PRs #7475

Closed

foolip opened this issue Sep 25, 2017 · 25 comments


@foolip
Member

foolip commented Sep 25, 2017

Currently, Travis will check whether tests are flaky and, if so, fail the build. Otherwise, the build is considered a pass and the results are surfaced as a comment linking to details. Example: #7472

In order to allow test writers to more easily understand the impact of their changes, we should also clearly surface when a new test is failing, or when an existing test goes from passing to failing.

To do this, Travis also has to run the tests once without the changes applied.
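
To make the shape of that before/after comparison concrete, here is a minimal sketch, assuming each run has been reduced to a dict mapping test paths to statuses; how those dicts are collected on Travis is not shown, and all names are illustrative:

```python
# Minimal sketch of the comparison step, assuming each run (with and without
# the PR's changes) has been reduced to a dict of test path -> status such as
# "PASS", "FAIL" or "TIMEOUT". The collection mechanism is not shown.

def diff_results(before, after):
    """Return regressions, fixes and new failing tests between two runs."""
    regressions, fixes, new_failures = [], [], []
    for test, status in after.items():
        old = before.get(test)
        if old is None:
            if status != "PASS":
                new_failures.append((test, status))      # new test that doesn't pass
        elif old == "PASS" and status != "PASS":
            regressions.append((test, old, status))      # passing -> failing
        elif old != "PASS" and status == "PASS":
            fixes.append((test, old, status))            # failing -> passing
    return regressions, fixes, new_failures


if __name__ == "__main__":
    before = {"/dom/historical.html": "PASS",
              "/dom/nodes/Node-cloneNode.html": "PASS"}
    after = {"/dom/historical.html": "PASS",
             "/dom/nodes/Node-cloneNode.html": "FAIL",
             "/dom/nodes/new-test.html": "TIMEOUT"}
    print(diff_results(before, after))
```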

@RByers
Contributor

RByers commented Sep 25, 2017

+1, this has always been a key piece of what we were looking for from the PR tooling. @bobholt discusses it a little here, and we discussed it in the original PR validation brainstorming.

When a test goes from failing to passing, do we also want to surface it in the same way? E.g. I may make some test changes specifically to get a test passing in a number of browsers; it should be easy to tell at a glance whether I succeeded.

@foolip
Member Author

foolip commented Sep 27, 2017

I think that we want to easily see the old and new status of any test that has been touched, including if the old or new status is "doesn't exist".

We will probably run into the GitHub comment size limit, so the full view will have to be on pulls.web-platform-tests.org. That should always be linked from a comment like #7506 (comment).

Then, we should decide which kinds of changes warrant another comment pointing out that something unusual is happening. That includes at least going from passing to failing in some browser, and of course going from non-flaky to flaky.

@foolip
Member Author

foolip commented Oct 1, 2017

I'm preparing a presentation for https://webengineshackfest.org/ and made this mockup:
[screenshot of the mockup]

@jgraham
Contributor

jgraham commented Oct 17, 2017

I have some concerns about this.

The easiest solution would just be to run all the tests twice on Travis. But we already have capacity issues, and I don't fancy adding an extra job or additional work to the existing jobs. I'm also not sure that this actually helps in most cases. In particular, who will see these results? For Gecko developers we want something like this, but on the bug we generate when the PR is imported. In theory, if the data is generated upstream it might be possible to use that, but certainly having it as a GH comment alone isn't going to get visibility in the right places.

I can see three possible parts to this:

  • Highlight new tests that don't pass (in one browser? In any browser?). This can be done with the existing data plus git status.

  • Highlight changes in status of tests. This could be done by combining the data from the wpt.fyi dashboard with the existing data (a rough sketch follows after this list). It does have some edge-case issues if a test is changed multiple times, but that approach at least doesn't generate new Travis load.

  • Find out if a test was unstable on master before it was updated, and not fail the build for unstable => unstable transitions. This probably requires running unstable tests again on master in the same Travis job as the current stability check (so we can set the job status; anything else is a much larger change and probably involves the dashboard setting the job status and Travis always returning success). But this adds more load, and we already have frequent issues with the jobs timing out when there are larger changes, so I'm reluctant to add this until we have some solution for longer-running jobs, or splitting the jobs up, or something.
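
For the second bullet, a rough sketch of what combining the two data sources could look like; the wpt.fyi endpoint and response shape below are assumptions for illustration, not the real API:

```python
# Rough sketch of option 2: use wpt.fyi results as the baseline instead of a
# second Travis run. The endpoint and JSON shape below are assumptions for
# illustration, not the actual wpt.fyi API.
import json
import urllib.request

WPT_FYI_RESULTS = "https://wpt.fyi/api/results"  # hypothetical endpoint


def baseline_statuses(product, tests):
    """Fetch the latest master statuses for the affected tests (assumed API)."""
    query = "?product={}&tests={}".format(product, ",".join(tests))
    with urllib.request.urlopen(WPT_FYI_RESULTS + query) as response:
        data = json.load(response)
    return {entry["test"]: entry["status"] for entry in data}  # assumed shape


def status_changes(product, pr_results):
    """Compare the PR's results for the affected tests against the baseline."""
    baseline = baseline_statuses(product, sorted(pr_results))
    return {test: (baseline.get(test), status)
            for test, status in pr_results.items()
            if baseline.get(test) != status}
```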

@foolip
Member Author

foolip commented Oct 18, 2017

In particular who will see these results?

In the first instance, on any PR on GitHub. Then, if people don't rebel, we'd try reflecting that same information into Chromium code review, so that if an in-flight Chromium change is going to regress the test for Firefox, we'd know about it, and fix it if it's not intentional.

As for presentation, I think it'd be nice if we could have the results more in the shape of wpt.fyi both before and after the changes, and then some kind of diff summary view. @mdittmer, FYI, this is a bit similar to what we discussed with diffing whole runs today. (@jgraham, no concrete plans around that.)

@foolip
Member Author

foolip commented Oct 18, 2017

This could be done by combining the data from the wpt.fyi dashboard with the existing data.

This will usually work, and I think we might be able to end up there with web-platform-tests/results-collection#164, but in the short term I wonder how noisy it would be, with wpt.fyi being slightly out of date, and not run in exactly the same way.

Elsewhere we talked about teaching the stability checker about the timeout so that it would run fewer times if needed, and I think that might be a mitigation for the single extra run this change would introduce. (Or just reduce stability runs from 10 to 9.)

Find out if a test was unstable on master before it was updated

I think it's fine to just not do this for now; certainly, changes in stability have never been what I've been looking for when wanting to understand what my wpt change did to the results.

@jgraham
Contributor

jgraham commented Oct 18, 2017

This will usually work, and I think we might be able to end up there with web-platform-tests/results-collection#164, but in the short term I wonder how noisy it would be, with wpt.fyi being slightly out of date, and not run in exactly the same way.

I think I would like to try this and see how often it fails before trying a different approach.

Elsewhere we talked about teaching the stability checker about the timeout so that it would run fewer times if needed, and I think that might be a mitigation for the single extra run this change would introduce. (Or just reduce stability runs from 10 to 9.)

10 is already pretty low and we are still getting some unstable tests on import, so I'm reluctant to reduce this further. (FWIW, there is work at Mozilla to add a "verify" mode to harnesses that runs each test 10x without a browser restart, then 5x with a restart, and then the same again in "chaos mode", which randomises some internals around scheduling, network, etc. to try to increase the chance of hitting race conditions.)

@mdittmer
Contributor

mdittmer commented Oct 19, 2017 via email

@foolip
Member Author

foolip commented Oct 19, 2017

I think that's the right long-term trajectory: to share the running infrastructure and to make the data available in a way that can be used to produce GitHub comments, Gerrit comments, and a web UI that they'd link to.

But while tests are run on Travis, what we need is just the current results, and we'll have to filter and process them to produce the GitHub comment.

@mdittmer
Contributor

mdittmer commented Oct 19, 2017 via email

@jgraham
Contributor

jgraham commented Oct 19, 2017

I think you might each be using "current" and "previous" to mean the same thing.

@foolip
Member Author

foolip commented Oct 19, 2017

@mdittmer, yes, we do need the results without the changes; the question is whether we get them by running the tests once on Travis without the changes, or whether we get them from wpt.fyi. The former will be correct, while the latter will have some amount of hard-to-understand noise but a setup more like how we eventually want it to be.

I think that if we do compare with results from wpt.fyi, then we should try to apply the changes to the same commit that was run on wpt.fyi. But that means recovering from something broken in the wpt repo will take longer, as wpt.fyi first has to catch up.
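
A sketch of what that pinning could look like in practice, assuming we can ask wpt.fyi for the SHA of its newest complete run (the helper name here is hypothetical):

```python
# Sketch of pinning the comparison to the commit wpt.fyi last ran, so the
# "before" results come from exactly the code the PR is diffed against.
# latest_wpt_fyi_sha() is a hypothetical helper; in practice the SHA would
# come from the wpt.fyi runs API.
import subprocess


def latest_wpt_fyi_sha():
    raise NotImplementedError("query wpt.fyi for its newest complete master run")


def rebase_pr_onto_baseline(pr_branch, upstream="origin/master"):
    base_sha = latest_wpt_fyi_sha()
    # Replay only the PR's own commits (upstream..pr_branch) onto the commit
    # that wpt.fyi actually ran, leaving newer master changes out of the diff.
    subprocess.check_call(["git", "rebase", "--onto", base_sha, upstream, pr_branch])
```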

@foolip
Member Author

foolip commented Jan 3, 2018

Adding @mattl as assignee as this work will be owned by Bocoup going forward.

@foolip
Member Author

foolip commented Mar 13, 2018

@mariestaver, does it still make sense to have this assigned to @mattl, or should we revisit later?

@mariestaver

@foolip we do look at this regularly, but it's been repeatedly deprioritized in favor of having stable, frequent runs. If things go well with the current changes, we might be able to look at this by the end of the month...but given the issues that have been cropping up, I don't want to make promises. However, I assume that even if it can't be addressed in March, we still want it on our roadmap, so as long as the title & description are still accurate and useful, we can leave this issue where it is for now. Let me know if you have any questions or suggestions about any of that, and thanks!

@foolip
Member Author

foolip commented May 9, 2018

@lukebjerring, what do you think is currently the most likely way this will be achieved? Does it depend on #10503, or just on #9874, or some other path to the same goal?

@lukebjerring
Contributor

Neither; depends on web-platform-tests/wpt.fyi#118

@foolip
Member Author

foolip commented May 14, 2018

I see. Asked on #9874 (comment) if there's some necessary work to get it into a form suitable for submitting to wpt.fyi.

@foolip
Member Author

foolip commented Jul 16, 2018

@lukebjerring, this came up in priority:roadmap triage, and detecting regressions is an important current focus for us. Can you outline the issues standing in the way of resolving this at this point?

@lukebjerring
Contributor

Trivializing the work involved, it's essentially:

  • Upload results summaries from Travis to wpt.fyi, with labels etc
  • Comment on the PR linking to a view of said upload
  • Compute a diff and alert when there's any non-OK harness state, or any increase in failure count.
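
For that last step, a minimal sketch of the alerting decision, assuming per-test summaries with a harness status and a failure count (the field names are illustrative):

```python
# Sketch of the alerting decision, given per-test summaries for the baseline
# run and the PR run. The summary shape (harness status plus a failure count)
# is an assumption for illustration.

def should_alert(before, after):
    """Alert on any non-OK harness status, or any increase in failure count."""
    for test, summary in after.items():
        if summary["harness_status"] != "OK":
            return True
        old_failures = before.get(test, {}).get("failures", 0)
        if summary["failures"] > old_failures:
            return True
    return False
```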

@foolip
Member Author

foolip commented Jul 20, 2018

Upload results summaries from Travis to wpt.fyi, with labels etc

In our current planning, at least plan A, this step has turned into:

  • Get Taskcluster runs on stable master reliable (@jugglinmike)
  • Get Taskcluster results from master builds into wpt.fyi (@Hexcles)
  • Run stability tests for affected tests on Taskcluster instead of Travis (@jugglinmike)
  • Get Taskcluster results from PRs into wpt.fyi (@Hexcles)

If that doesn't work out, then perhaps we'll go back to extracting results from Travis. @lukebjerring, does that sound right?

@gsnedders
Member

There's the case with infra changes (esp. to the runner, but also with things like testharness.js) where we want to run all the tests and compare results; we should at least make that possible somehow.

@lukebjerring
Contributor

This is now a wpt.fyi Project (https://github.com/web-platform-tests/wpt.fyi/projects/6)

@mdittmer
Contributor

Ping from your friendly neighbourhood ecosystem infra rotation. @lukebjerring, can you comment on this issue with a short update on what's done and what's still to be done? I know following the link to the project achieves that, but if we're not going to close this issue and say "look at the project instead", maybe posting a quick snapshot of the work here is in order.

@lukebjerring
Contributor

Closing this as fixed. Details of future work/improvements can be found in the project linked above.
