Surface test regressions in PRs #7475
Comments
+1, this has always been a key piece of what we were looking for from the PR tooling. @bobholt discusses it a little here, and we discussed it in the original PR validation brainstorming. When a test goes from failing to passing, do we also want to surface it in the same way? E.g., I may make some test changes specifically to get a test passing on a number of browsers; it should be easy to tell at a glance whether I succeeded at that. |
I think that we want to easily see the old and new status of any test that has been touched, including when the old or new status is "doesn't exist". We will probably run into the GitHub comment size limit, and the full view will have to be on pulls.web-platform-tests.org. That should always be linked in a comment like #7506 (comment). Then, we should decide which kinds of changes warrant another comment pointing out that something unusual is happening. That includes at least going from passing to failing in some browser, and of course going from non-flaky to flaky. |
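A minimal sketch of the status diff such a comment would need, assuming results are available as plain test-name → status mappings; the test names and statuses below are illustrative, not the actual wptrunner or wpt.fyi schema:

```python
# Sketch: compare two result sets, treating a missing entry as "doesn't exist".
# The result format (test name -> status string) is an assumption, not the
# real report schema.

MISSING = "doesn't exist"

def diff_statuses(old_results, new_results):
    """Yield (test, old_status, new_status) for every test whose status changed."""
    for test in sorted(set(old_results) | set(new_results)):
        old = old_results.get(test, MISSING)
        new = new_results.get(test, MISSING)
        if old != new:
            yield test, old, new

if __name__ == "__main__":
    before = {"/dom/historical.html": "PASS", "/fetch/api/basic.html": "FAIL"}
    after = {"/fetch/api/basic.html": "PASS", "/fetch/api/cors.html": "TIMEOUT"}
    for test, old, new in diff_statuses(before, after):
        print(f"{test}: {old} -> {new}")
```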
I'm preparing a presentation for https://webengineshackfest.org/ and made this mockup: |
I have some concerns about this. The easiest solution would just be to run all the tests twice on Travis, but we already have capacity issues and I don't fancy adding an extra job, or additional work to the existing jobs. I'm also not sure that this actually helps in most cases. In particular, who will see these results? For Gecko developers we want something like this, but on the bug we generate when the PR is imported. In theory, if the data is generated upstream it might be possible to use that, but certainly having it as a GH comment alone isn't going to get visibility in the right places. I can see three possible parts to this:
|
In the first instance, on any PR on GitHub. Then, if people don't rebel, we'd try reflecting that same information into Chromium code review, so that if an in-flight Chromium change is going to regress the test for Firefox, we'd know about it, and fix it if it's not intentional. As for presentation, I think it'd be nice if we could have the results more in the shape of wpt.fyi both before and after the changes, and then some kind of diff summary view. @mdittmer, FYI, this is a bit similar to what we discussed with diffing whole runs today. (@jgraham, no concrete plans around that.) |
This will usually work, and I think we might be able to end up there with web-platform-tests/results-collection#164, but in the short term I wonder how noisy it would be, with wpt.fyi being slightly out of date, and not run in exactly the same way. Elsewhere we talked about teaching the stability checker about the timeout so that it would run fewer times if needed, and I think that might be a mitigation for the single extra run this change would introduce. (Or just reduce stability runs from 10 to 9.)
I think it's fine to just not do this for now; certainly, changes in stability have never been what I've been looking for when trying to understand what my wpt change did to the results. |
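For concreteness, a rough sketch of what teaching the stability checker about the timeout could mean; the default of 10 repeats matches the discussion above, but the minimum, the time estimates, and the function name are made-up assumptions:

```python
# Sketch: pick how many stability repeats fit in the remaining CI time budget.
# DEFAULT_REPEATS matches the 10 runs discussed here; everything else is an
# illustrative assumption, not existing wpt behaviour.

DEFAULT_REPEATS = 10
MIN_REPEATS = 3

def repeats_for_budget(seconds_remaining, seconds_per_run, extra_runs=1):
    """Reduce the repeat count so the stability runs plus any extra baseline
    run (the "single extra run" above) still finish before the job times out."""
    affordable = int(seconds_remaining // seconds_per_run) - extra_runs
    return max(MIN_REPEATS, min(DEFAULT_REPEATS, affordable))

print(repeats_for_budget(seconds_remaining=1800, seconds_per_run=200))  # -> 8
```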
I think I would like to try this and see how often it fails before trying a different approach.
10 is already pretty low and we are still getting some unstable tests on import, so I'm reluctant to reduce this further. (FWIW, there is work at Mozilla to add a "verify" mode to harnesses that runs each test 10x without a browser restart, then 5x with a restart, and then the same again in "chaos mode", which randomises some internals around scheduling, network, etc. to try to increase the chance of hitting race conditions.) |
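As a toy illustration of that "verify" schedule (not the actual Mozilla implementation; run_once, restart_browser, and enable_chaos_mode are hypothetical callables standing in for whatever the real harness provides):

```python
# Toy sketch of the "verify" schedule described above: 10 runs without a
# browser restart, 5 runs with a restart, then both again in chaos mode.
# run_once is assumed to return a status string for the test.

def verify(test, run_once, restart_browser, enable_chaos_mode):
    statuses = []
    for chaos in (False, True):
        if chaos:
            enable_chaos_mode()  # randomise scheduling, network timing, etc.
        for _ in range(10):
            statuses.append(run_once(test))
        for _ in range(5):
            restart_browser()
            statuses.append(run_once(test))
    # The test is stable if every run agreed on the outcome.
    return len(set(statuses)) == 1
```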
IMHO wpt.fyi should host a service that can produce a report like this. We may need the service to be authenticated; an attacker shouldn't be able to overwhelm bots with requests to diff two arbitrary (but legitimate) web-platform-tests revisions. |
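To make that concrete, a sketch of the shape such an endpoint could take; Flask is used purely for illustration (wpt.fyi itself is not a Python service), and the bearer-token check and compute_diff helper are placeholders:

```python
# Illustration only: an authenticated endpoint that diffs results between two
# wpt revisions. wpt.fyi's real API will differ; the point is the shape.
import os
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

def compute_diff(before, after):
    # Placeholder: look up stored results for both revisions and diff them.
    return {"before": before, "after": after, "changes": []}

@app.route("/api/diff")
def diff_runs():
    # Only trusted callers may trigger an expensive diff of two revisions.
    if request.headers.get("Authorization") != "Bearer " + os.environ["DIFF_TOKEN"]:
        abort(401)
    before = request.args.get("before")  # e.g. a wpt commit SHA
    after = request.args.get("after")
    if not (before and after):
        abort(400)
    return jsonify(compute_diff(before, after))

if __name__ == "__main__":
    app.run()
```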
I think that's the right long-term trajectory: to share the running infrastructure and to make the data available in a way that can be used to produce GitHub comments, Gerrit comments, and a web UI that they'd link to. But while tests are run on Travis, what we need is just the current results, and we'll have to filter+process them to produce the GitHub comment. |
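A sketch of that filter+process step, building on the diff shape sketched earlier in the thread; the table layout, size limit constant, and truncation behaviour are assumptions, and the link target just reuses the pulls.web-platform-tests.org view mentioned above:

```python
# Sketch: turn a list of (test, old, new) status changes into a GitHub comment
# body, truncating to stay under the comment size limit and always linking to
# the full view. The limit value is an assumption.

FULL_RESULTS_URL = "https://pulls.web-platform-tests.org/"
MAX_COMMENT_CHARS = 65000

def format_comment(changes):
    lines = ["| Test | Before | After |", "| --- | --- | --- |"]
    lines += [f"| {test} | {old} | {new} |" for test, old, new in changes]
    lines.append(f"\n[Full results]({FULL_RESULTS_URL})")
    body = "\n".join(lines)
    if len(body) > MAX_COMMENT_CHARS:
        body = (f"{len(changes)} tests changed status; too many to list here. "
                f"[Full results]({FULL_RESULTS_URL})")
    return body

print(format_comment([("/fetch/api/basic.html", "FAIL", "PASS")]))
```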
Don't we also need the previous results, i.e., without the change applied? |
I think you might each be using "current" and "previous" to mean the same thing. |
@mdittmer, yes, we do need the results without the changes; the question is whether we get them by running the tests once without the changes on Travis, or whether we get them from wpt.fyi. The former will be correct, while the latter will have some amount of hard-to-understand noise, but its setup will be more like how we want things to eventually be. I think that if we do compare with results from wpt.fyi, then we should try to apply the changes to the same commit that was run on wpt.fyi. But that means that recovering from something broken in the wpt repo will take longer, as wpt.fyi first has to catch up. |
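A sketch of what applying the PR's changes on top of the commit that wpt.fyi last ran could look like in CI; the function is hypothetical and assumes origin/master is the PR's upstream branch:

```python
# Sketch: replay a PR's own commits onto the revision that wpt.fyi last ran,
# so the with-changes and without-changes runs share the same baseline.
import subprocess

def apply_pr_onto(wptfyi_revision, pr_branch, upstream="origin/master"):
    subprocess.run(["git", "fetch", "origin"], check=True)
    # --onto takes only the commits unique to the PR branch and replays them
    # on top of the wpt.fyi revision.
    subprocess.run(
        ["git", "rebase", "--onto", wptfyi_revision, upstream, pr_branch],
        check=True)
```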
Adding @mattl as assignee, as this work will be owned by Bocoup going forward. |
@mariestaver, does it still make sense to have this assigned to @mattl, or should we revisit later? |
@foolip we do look at this regularly, but it's been repeatedly deprioritized in favor of having stable, frequent runs. If things go well with the current changes, we might be able to look at this by the end of the month...but given the issues that have been cropping up, I don't want to make promises. However, I assume that even if it can't be addressed in March, we still want it on our roadmap, so as long as the title & description are still accurate and useful, we can leave this issue where it is for now. Let me know if you have any questions or suggestions about any of that, and thanks! |
@lukebjerring, what do you think is currently the most likely way this will be achieved? Does it depend on #10503, or just on #9874, or some other path to the same goal? |
Neither; it depends on web-platform-tests/wpt.fyi#118 |
I see. I asked on #9874 (comment) whether there's some work needed to get it into a form suitable for submitting to wpt.fyi. |
@lukebjerring, this came up in priority:roadmap triage, and detecting regressions is an important current focus for us. Can you outline the issues standing in the way of resolving this at this point? |
Trivializing the work efforts, it's essentially:
|
In our current planning, at least in plan A, this step has turned into:
If that doesn't work out, then perhaps we'll return to extracting results from Travis again. @lukebjerring, does that sound right? |
There's the case with infra changes (esp. to the runner, but also with things like testharness.js) where we want to run all the tests and compare results; we should at least make that possible somehow. |
This is now a wpt.fyi Project (https://github.com/web-platform-tests/wpt.fyi/projects/6) |
Ping from your friendly neighbourhood ecosystem infra rotation. @lukebjerring, can you comment on this issue with a short update on what's done and what's still to be done here? I know following the link to the project achieves that, but if we're not going to close this issue and say "look at the project instead", maybe posting a quick snapshot of the work here is in order. |
Closing this as fixed. Details of future work/improvements can be found in the project linked above. |
Currently, Travis checks whether tests are flaky and, if so, fails the build. Otherwise, the run is considered a pass and is surfaced as a comment linking to details. Example: #7472
In order to allow test writers to more easily understand the impact of their changes, we should also clearly surface when a new test is failing, or when an existing test goes from passing to failing.
To do this, Travis also has to run the tests once without the changes applied.
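A sketch of that two-run flow; `wpt run` and `--log-wptreport` are real pieces of the wpt CLI, but the exact flags, product, report handling, and orchestration here are illustrative rather than a tested invocation:

```python
# Sketch: run the affected tests twice on Travis (once at the PR's base
# commit, once with the PR applied) and compare the two result sets.
import json
import subprocess

def run_tests(report_path, tests):
    # A wptreport is a JSON document with a top-level "results" list; treat
    # this invocation as an approximation, not a verified command line.
    subprocess.run(["./wpt", "run", "--log-wptreport", report_path, "chrome"] + tests,
                   check=False)  # test failures shouldn't abort the script
    with open(report_path) as f:
        report = json.load(f)
    return {entry["test"]: entry["status"] for entry in report["results"]}

# Outline: check out the PR's base commit and call run_tests(), check out the
# PR head and call run_tests() again, then diff the two mappings (e.g. with a
# helper like diff_statuses sketched earlier) and surface the changes.
```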