Surface test regressions in PRs #7475
Comments
+1, this has always been a key piece of what we were looking for from the PR tooling. @bobholt discusses it a little here, and we discussed it in the original PR validation brainstorming. When a test goes from failing to passing, do we also want to surface it in the same way? E.g., I may make some test changes specifically to get a test passing on a number of browsers; it should be easy to tell at a glance whether I succeeded at that. |
I think that we want to easily see the old and new status of any test that has been touched, including when the old or new status is "doesn't exist". We will probably run into the GitHub comment size limit, and the full view will have to be on pulls.web-platform-tests.org. That should always be linked in a comment like #7506 (comment). Then, we should decide which kinds of changes warrant another comment pointing out that something unusual is happening. That includes at least going from passing to failing in some browser, and of course going from non-flaky to flaky. |
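A minimal sketch of the status diff such a comment would need, assuming results are available as plain test-name → status mappings; the test names and statuses below are illustrative, not the actual wptrunner or wpt.fyi schema:

```python
# Sketch: compare two result sets, treating a missing entry as "doesn't exist".
# The result format (test name -> status string) is an assumption, not the
# real report schema.

MISSING = "doesn't exist"

def diff_statuses(old_results, new_results):
    """Yield (test, old_status, new_status) for every test whose status changed."""
    for test in sorted(set(old_results) | set(new_results)):
        old = old_results.get(test, MISSING)
        new = new_results.get(test, MISSING)
        if old != new:
            yield test, old, new

if __name__ == "__main__":
    before = {"/dom/historical.html": "PASS", "/fetch/api/basic.html": "FAIL"}
    after = {"/fetch/api/basic.html": "PASS", "/fetch/api/cors.html": "TIMEOUT"}
    for test, old, new in diff_statuses(before, after):
        print(f"{test}: {old} -> {new}")
```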
I'm preparing a presentation for https://webengineshackfest.org/ and made this mockup: |
I have some concerns about this. The easiest solution would just be to run all the tests twice on Travis, but we already have capacity issues and I don't fancy adding an extra job, or additional work to the existing jobs. I'm also not sure that this actually helps in most cases. In particular, who will see these results? For Gecko developers we want something like this, but on the bug we generate when the PR is imported. In theory, if the data is generated upstream it might be possible to use that, but certainly having it as a GH comment alone isn't going to get visibility in the right places. I can see three possible parts to this:
|
In the first instance, on any PR on GitHub. Then, if people don't rebel, we'd try reflecting that same information into Chromium code review, so that if an in-flight Chromium change is going to regress the test for Firefox, we'd know about it, and fix it if it's not intentional. As for presentation, I think it'd be nice if we could have the results more in the shape of wpt.fyi both before and after the changes, and then some kind of diff summary view. @mdittmer, FYI, this is a bit similar to what we discussed with diffing whole runs today. (@jgraham, no concrete plans around that.) |
This will usually work, and I think we might be able to end up there with web-platform-tests/results-collection#164, but in the short term I wonder how noisy it would be, with wpt.fyi being slightly out of date, and not run in exactly the same way. Elsewhere we talked about teaching the stability checker about the timeout so that it would run fewer times if needed, and I think that might be a mitigation for the single extra run this change would introduce. (Or just reduce stability runs from 10 to 9.)
I think it's fine to just not do this for now; certainly, changes in stability have never been what I've been looking for when trying to understand what my wpt change did to the results. |
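For concreteness, a rough sketch of what teaching the stability checker about the timeout could mean; the default of 10 repeats matches the discussion above, but the minimum, the time estimates, and the function name are made-up assumptions:

```python
# Sketch: pick how many stability repeats fit in the remaining CI time budget.
# DEFAULT_REPEATS matches the 10 runs discussed here; everything else is an
# illustrative assumption, not existing wpt behaviour.

DEFAULT_REPEATS = 10
MIN_REPEATS = 3

def repeats_for_budget(seconds_remaining, seconds_per_run, extra_runs=1):
    """Reduce the repeat count so the stability runs plus any extra baseline
    run (the "single extra run" above) still finish before the job times out."""
    affordable = int(seconds_remaining // seconds_per_run) - extra_runs
    return max(MIN_REPEATS, min(DEFAULT_REPEATS, affordable))

print(repeats_for_budget(seconds_remaining=1800, seconds_per_run=200))  # -> 8
```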
I think I would like to try this and see how often it fails before trying a different approach.
10 is already pretty low and we are still getting some unstable tests on import, so I'm reluctant to reduce this further. (FWIW, there is work at Mozilla to add a "verify" mode to harnesses that runs each test 10x without a browser restart, then 5x with a restart, and then the same again in "chaos mode", which randomises some internals around scheduling, network, etc. to try to increase the chance of hitting race conditions.) |
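As a toy illustration of that "verify" schedule (not the actual Mozilla implementation; run_once, restart_browser, and enable_chaos_mode are hypothetical callables standing in for whatever the real harness provides):

```python
# Toy sketch of the "verify" schedule described above: 10 runs without a
# browser restart, 5 runs with a restart, then both again in chaos mode.
# run_once is assumed to return a status string for the test.

def verify(test, run_once, restart_browser, enable_chaos_mode):
    statuses = []
    for chaos in (False, True):
        if chaos:
            enable_chaos_mode()  # randomise scheduling, network timing, etc.
        for _ in range(10):
            statuses.append(run_once(test))
        for _ in range(5):
            restart_browser()
            statuses.append(run_once(test))
    # The test is stable if every run agreed on the outcome.
    return len(set(statuses)) == 1
```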
IMHO wpt.fyi should host a service that can produce a report like this. We may need the service to be authenticated; an attacker shouldn't be able to overwhelm bots with requests to diff two arbitrary (but legitimate) web-platform-tests revisions. |
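To make that concrete, a sketch of the shape such an endpoint could take; Flask is used purely for illustration (wpt.fyi itself is not a Python service), and the bearer-token check and compute_diff helper are placeholders:

```python
# Illustration only: an authenticated endpoint that diffs results between two
# wpt revisions. wpt.fyi's real API will differ; the point is the shape.
import os
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

def compute_diff(before, after):
    # Placeholder: look up stored results for both revisions and diff them.
    return {"before": before, "after": after, "changes": []}

@app.route("/api/diff")
def diff_runs():
    # Only trusted callers may trigger an expensive diff of two revisions.
    if request.headers.get("Authorization") != "Bearer " + os.environ["DIFF_TOKEN"]:
        abort(401)
    before = request.args.get("before")  # e.g. a wpt commit SHA
    after = request.args.get("after")
    if not (before and after):
        abort(400)
    return jsonify(compute_diff(before, after))

if __name__ == "__main__":
    app.run()
```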
I think that's the right long-term trajectory: to share the running infrastructure and to make the data available in a way that can be used to produce GitHub comments, Gerrit comments, and a web UI that they'd link to. But while tests are run on Travis, what we need is just the current results, and we'll have to filter+process them to produce the GitHub comment. |
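A sketch of that filter+process step, building on the diff shape sketched earlier in the thread; the table layout, size limit constant, and truncation behaviour are assumptions, and the link target just reuses the pulls.web-platform-tests.org view mentioned above:

```python
# Sketch: turn a list of (test, old, new) status changes into a GitHub comment
# body, truncating to stay under the comment size limit and always linking to
# the full view. The limit value is an assumption.

FULL_RESULTS_URL = "https://pulls.web-platform-tests.org/"
MAX_COMMENT_CHARS = 65000

def format_comment(changes):
    lines = ["| Test | Before | After |", "| --- | --- | --- |"]
    lines += [f"| {test} | {old} | {new} |" for test, old, new in changes]
    lines.append(f"\n[Full results]({FULL_RESULTS_URL})")
    body = "\n".join(lines)
    if len(body) > MAX_COMMENT_CHARS:
        body = (f"{len(changes)} tests changed status; too many to list here. "
                f"[Full results]({FULL_RESULTS_URL})")
    return body

print(format_comment([("/fetch/api/basic.html", "FAIL", "PASS")]))
```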
Don't we also need the previous results, i.e., without the change applied? |
I think you might each be using "current" and "previous" to mean the same thing. |
@mdittmer, yes, we do need the results without the changes; the question is whether we get them by running the tests once without the changes on Travis, or whether we get them from wpt.fyi. The former will be correct, while the latter will have some amount of hard-to-understand noise, but its setup will be more like how we want things to eventually be. I think that if we do compare with results from wpt.fyi, then we should try to apply the changes to the same commit that was run on wpt.fyi. But that means that recovering from something broken in the wpt repo will take longer, as wpt.fyi first has to catch up. |
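A sketch of what applying the PR's changes on top of the commit that wpt.fyi last ran could look like in CI; the function is hypothetical and assumes origin/master is the PR's upstream branch:

```python
# Sketch: replay a PR's own commits onto the revision that wpt.fyi last ran,
# so the with-changes and without-changes runs share the same baseline.
import subprocess

def apply_pr_onto(wptfyi_revision, pr_branch, upstream="origin/master"):
    subprocess.run(["git", "fetch", "origin"], check=True)
    # --onto takes only the commits unique to the PR branch and replays them
    # on top of the wpt.fyi revision.
    subprocess.run(
        ["git", "rebase", "--onto", wptfyi_revision, upstream, pr_branch],
        check=True)
```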
Adding @mattl as assignee, as this work will be owned by Bocoup going forward. |
@mariestaver, does it still make sense to have this assigned to @mattl, or should we revisit later? |
@foolip we do look at this regularly, but it's been repeatedly deprioritized in favor of having stable, frequent runs. If things go well with the current changes, we might be able to look at this by the end of the month...but given the issues that have been cropping up, I don't want to make promises. However, I assume that even if it can't be addressed in March, we still want it on our roadmap, so as long as the title & description are still accurate and useful, we can leave this issue where it is for now. Let me know if you have any questions or suggestions about any of that, and thanks! |
@lukebjerring, what do you think is currently the most likely way this will be achieved? Does it depend on #10503, or just on #9874, or some other path to the same goal? |
Neither; it depends on web-platform-tests/wpt.fyi#118 |
I see. I asked on #9874 (comment) whether there's some work needed to get it into a form suitable for submitting to wpt.fyi. |
@lukebjerring, this came up in priority:roadmap triage, and detecting regressions is an important current focus for us. Can you outline the issues standing in the way of resolving this at this point? |
Trivializing the work efforts, it's essentially:
|
In our current planning, at least in plan A, this step has turned into:
If that doesn't work out, then perhaps we'll return to extracting results from Travis again. @lukebjerring, does that sound right? |
There's the case with infra changes (esp. to the runner, but also with things like testharness.js) where we want to run all the tests and compare results; we should at least make that possible somehow. |
This is now a wpt.fyi Project (https://github.com/web-platform-tests/wpt.fyi/projects/6) |
Ping from your friendly neighbourhood ecosystem infra rotation. @lukebjerring, can you comment on this issue with a short update on what's done and what's still to be done here? I know following the link to the project achieves that, but if we're not going to close this issue and say "look at the project instead", maybe posting a quick snapshot of the work here is in order. |
Closing this as fixed. Details of future work/improvements can be found in the project linked above. |
Currently, Travis checks whether tests are flaky and, if so, fails the build. Otherwise, the run is considered a pass and is surfaced as a comment linking to details. Example: #7472
In order to allow test writers to more easily understand the impact of their changes, we should also clearly surface when a new test is failing, or when an existing test goes from passing to failing.
To do this, Travis also has to run the tests once without the changes applied.
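A sketch of that two-run flow; `wpt run` and `--log-wptreport` are real pieces of the wpt CLI, but the exact flags, product, report handling, and orchestration here are illustrative rather than a tested invocation:

```python
# Sketch: run the affected tests twice on Travis (once at the PR's base
# commit, once with the PR applied) and compare the two result sets.
import json
import subprocess

def run_tests(report_path, tests):
    # A wptreport is a JSON document with a top-level "results" list; treat
    # this invocation as an approximation, not a verified command line.
    subprocess.run(["./wpt", "run", "--log-wptreport", report_path, "chrome"] + tests,
                   check=False)  # test failures shouldn't abort the script
    with open(report_path) as f:
        report = json.load(f)
    return {entry["test"]: entry["status"] for entry in report["results"]}

# Outline: check out the PR's base commit and call run_tests(), check out the
# PR head and call run_tests() again, then diff the two mappings (e.g. with a
# helper like diff_statuses sketched earlier) and surface the changes.
```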