
Meta: use tests and implementation bugs to shorten time to interoperability #1849

Open · 1 of 4 tasks
foolip opened this issue Oct 3, 2016 · 42 comments

foolip commented Oct 3, 2016

One year ago, I removed navigator.appCodeName and similar from workers, while ignoring @domenic's suggestion to write web-platform-tests. Today, we still don't have interop, and it even looks like things have moved in the direction of exposing more things.

This is not an unusual situation: small changes are easily overlooked by those who don't participate and easily forgotten by those who do. When the changes are (textually) minor changes in prose, they're even more likely to go unnoticed than IDL changes, which can in principle be tracked.

We can do better, and sometimes do: #1233 (comment)

The main goal of a standard is interop between implementations, so to help reach this goal more consistently and possibly faster, I propose a number of changes over time:

  • Require normative changes to come with tests
  • Learn which implementations fail those tests
  • File issues for those implementations
  • Track those issues over time, revisiting spec issues if necessary

For all but the first step, tools and infrastructure could help immensely.

foolip commented Oct 3, 2016

@domenic @annevk @zcorpan @sideshowbarker, what do you think of this overall direction? I'd like to prepare a PR for the first step, which I think will be useful in itself. There would be no requirement that the same people write/review the spec and test change, just that there are in fact tests. I think we should not require pass/fail status in implementations before we have some infrastructure for it, which is something @RByers and I have talked about.

annevk commented Oct 3, 2016

I think making sure issues are filed against implementations is actually more important. Implementations typically have the resources to provide tests for any changes they make too and are already moving in the direction of putting those in WPT.

There are also a couple of areas around navigation, document lifecycles, browsing contexts, etc. that are rather hard to write tests for. I'm not entirely sure what to do there.

annevk commented Oct 3, 2016

I'd also like to see that automated in some manner, though. My main concern is that we already lack the resources to define everything properly; putting more constraints on editors is not going to help with that.

@sideshowbarker

As somebody who’s spent time trying to help make sure WPT is as useful as it can be, I’d love to see more high-value tests going into it—and to see those tests helping to get better interoperability among engines. This is related to something that @RByers brought up during the WPT discussion we had at TPAC—about ensuring that for WPT we’re making it a priority to focus the effort on real interoperability problems.

So the proposal in the issue description sounds like something that could help with getting some good stuff into WPT and in focusing efforts on improving interoperability.

Actually I think the last three items/checkboxes listed in the issue description are things that anybody should already be doing when they write WPT for existing features with known interop problems. The one thing that’s not is the “Require normative changes to come with tests” item.

On that I agree with @annevk about it maybe being not so helpful to put further constraints on editors (and other contributors too). But I hope there's some way to try it out with at least a few issues and see whether it ends up being too much of an additional PITA or not.

foolip commented Oct 3, 2016

I had begun to prepare #1850 before any of the above comments, submitting it as a reference for discussion.

domenic commented Oct 3, 2016

In general I support this direction strongly. This seems like the single biggest thing we can do to boost our effectiveness as editors, as well as increasing the credibility of the specification.

I think we should discuss softening the requirement in two potential ways:

  • Features that are hard to write tests for, such as navigation etc., as Anne mentioned. Often these can be managed, but sometimes it's just extremely hard or requires manual tests or similar. Maybe we want to allow exceptions for that sort of thing.
  • Large features or integrations, such as custom elements. Although these are becoming increasingly rare in HTML, I think we should recognize that when they occur, requiring the editor to write a comprehensive test suite is too much of a burden in addition to their job of writing a comprehensive spec, and we should instead look toward building a test suite as a collaborative effort between the editors and the engineering teams of browser vendors.

I'd envision these as ways in which you could petition for an exception, reminiscent of the Blink launch process's various "we would prefer if you had X, but if there's a good reason, explain it and the API owners may grant you your wish". (Where X is something like "positive signals from other vendors" or "sufficient devtools support" or similar.) The baseline would still be requiring tests, so it's either "tests or a good explanation as to why it is unreasonable to provide them." And in this case you'd of course be expected to file a bug on WPT at least.

RByers commented Oct 3, 2016

This sounds great, thank you! FWIW the Pointer Events working group has been mostly operating in such a mode for the past 6 months and it's been working alright there - although that's obviously a much simpler space than all of HTML. In practice it has ended up being the browser engineer who writes the spec pull requests, the test pull requests, and the browser implementation changes, and the editors primarily do the reviews.

I agree with Domenic that there's no need for absolutism here. I see this as moving to a culture that's exactly like the testing culture in any healthy software project: there's an expectation that changes either show how they are adequately tested or explain why testing isn't worth the cost. Most changes tend to land with tests, and a reviewer who forgets to ask where the tests are is seen as not upholding their responsibility to product quality. But nobody is absolutist about it - it's a judgement call. It's often tempting to skip writing good tests because it feels like a hassle and often takes much longer than making the product change. But that just accrues technical debt, which leads in the long run to severe product quality issues (which I'd argue the web is definitely suffering from). If somebody wants to do some rapid prototyping or other iteration that's not production-quality, they can do that in a branch. But when the focus is actually on building a great product (not just on shipping your features as fast as possible), then what's in master is expected to be production-quality, and production-quality product changes are always thoughtful about appropriate testing.

I disagree that impl bugs are more valuable than tests. How will someone even know which browser to file the bugs on if they haven't tested the current behavior? How will engine developers decide how to prioritize these bugs? How do we reduce the risk of browser engineers failing to understand what was really intended by long, complex spec text? Instead, all the engines are already prioritizing investigation of web-platform-tests which other browsers pass and they fail. Fundamentally I see this as the age-old spec question of what matters more: a beautiful specification document or running code? High-quality tests are the number one protection against the lure of spec fiction.

To help ease the burden this new process places on editors/reviewers, I can commit to funding some tooling improvements. In particular, I'm exploring a tool for web-platform-tests PRs which automatically runs the changed tests (with and without the patch) on a number of browsers and surfaces the results directly in the PR. I'm open to other suggestions for what tools Google should fund.
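To make that concrete, here's a rough sketch (not the actual tool) of what such a check could do, assuming a clean web-platform-tests checkout and the `./wpt run` command with its `--log-wptreport` option; the browser list, base branch, and report file names are placeholders:

```python
#!/usr/bin/env python3
"""Rough sketch of the kind of PR check described above: run the tests a patch
touches, with and without that patch, per browser, and save wptreport JSON
files for a bot to diff and post. Names and options here are assumptions."""

import os
import subprocess

BROWSERS = ["chrome", "firefox"]   # hypothetical selection
BASE = "origin/master"             # branch the PR targets

def changed_tests():
    # Test files added or modified by the patch, relative to the base branch.
    out = subprocess.check_output(
        ["git", "diff", "--name-only", "--diff-filter=AM", f"{BASE}...HEAD"],
        text=True)
    return [p for p in out.splitlines() if p.endswith((".html", ".js"))]

def run(browser, tests, label):
    # --log-wptreport writes machine-readable results that can be diffed later.
    report = f"report-{browser}-{label}.json"
    subprocess.run(["./wpt", "run", f"--log-wptreport={report}", browser] + tests,
                   check=False)  # failing tests are expected; don't abort
    return report

def main():
    tests = changed_tests()
    if not tests:
        return
    for browser in BROWSERS:
        run(browser, tests, "patched")                  # results with the patch
        subprocess.run(["git", "checkout", BASE], check=True)
        try:
            # Newly added tests don't exist without the patch; skip them here.
            existing = [t for t in tests if os.path.exists(t)]
            if existing:
                run(browser, existing, "base")          # results without it
        finally:
            subprocess.run(["git", "checkout", "-"], check=True)

if __name__ == "__main__":
    main()
```

A PR bot would then diff the base and patched reports per browser and post a summary comment.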

foolip commented Oct 3, 2016

To expand a bit on the possible tools/infrastructure:

  • Learn which implementations fail those tests: #1849 (comment)
  • File issues for those implementations: based on the results of the above, buttons to directly file implementation issues, with the option to edit the title/description.
  • Track those issues over time, revisiting spec issues if necessary: the simplest might be a cron job to check the status of open issues filed using the above method, highlighting the oldest issues (a rough sketch follows below).
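As a strawman for the last bullet, the cron job could look something like this; the registry file and its format are invented for the example, the 90-day threshold is arbitrary, and only bugzilla.mozilla.org lookups (via Bugzilla's REST API) are sketched; other trackers would need their own lookup:

```python
#!/usr/bin/env python3
"""Sketch of the cron-job idea: given a small registry of implementation bugs
that were filed for failing tests (a file we would maintain ourselves; the
format is invented), re-check their status and print the oldest still open."""

import json
import urllib.request
from datetime import datetime, timezone

REGISTRY = "tracked-bugs.json"   # e.g. [{"test": "...", "bug": 123456, "filed": "2016-10-03"}]
STALE_DAYS = 90

def bugzilla_status(bug_id):
    # Bugzilla's REST API; only bugzilla.mozilla.org is handled in this sketch.
    url = f"https://bugzilla.mozilla.org/rest/bug/{bug_id}?include_fields=status"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data["bugs"][0]["status"]

def main():
    with open(REGISTRY) as f:
        tracked = json.load(f)
    today = datetime.now(timezone.utc).date()
    stale = []
    for entry in tracked:
        age = (today - datetime.fromisoformat(entry["filed"]).date()).days
        if age >= STALE_DAYS and bugzilla_status(entry["bug"]) not in ("RESOLVED", "VERIFIED"):
            stale.append((age, entry))
    for age, entry in sorted(stale, key=lambda item: item[0], reverse=True):
        print(f"{age:4d} days open: bug {entry['bug']} for {entry['test']}")

if __name__ == "__main__":
    main()
```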

foolip commented Oct 3, 2016

Regarding "we already lack the resources to define everything properly," I agree there's some risk of shifting the burden of testing onto editors.

Bad scenario: an editor notices a 3 vs. 1 split in behavior where the shortest path to interop is clearly the majority behavior. No implementer is interested, so the editor has to write the test and file an issue, but it's never fixed.

Good scenario: an implementer finds a bug in an algorithm and submits a PR. Everyone agrees it makes sense, but since it's most urgent for the implementer in question, she writes the web-platform-tests PR, which is also used to test the code change. Another engine also fails the test, an issue is filed, and it's fixed before long.

I also agree that we should be pragmatic: for any case where up-front testing is not worth the effort, we can just make a note of it and move on. There will still be some bad scenarios, but I suspect they will be outweighed by the benefits of understanding existing behavior before making a change.

Another thing that might help is if implementers prioritize web-platform-tests failures where they are the single failing engine.

foolip added a commit that referenced this issue Oct 3, 2016
cdumez commented Oct 3, 2016

I also strongly support @foolip's proposal, including the part about making sure that normative changes to the specification are covered by web-platform-tests. Web platform tests have been extremely useful for improving WebKit's compliance with the specifications (and interoperability, by making it easier to test in other browsers). WebKit gets a lot of bug reports and triaging them is not always easy. It is also sometimes not easy to understand a specification change, and tests help a lot with this.

I am not saying that this applies to all features (some are hard to write tests for), but hopefully this would be the general rule, with exceptions. I think writing the tests does not necessarily have to be the editor's responsibility, especially in cases where the editor is specifying a feature suggested by developers of a particular engine. What matters is that changes to the specification get timely coverage in web-platform-tests.

Note that while working on running web-platform-tests, I have noticed that it is not uncommon for some of them to be out of date with the latest specification. Hopefully, this would be less likely to happen with the proper processes in place.

@foolip foolip changed the title Process: use tests and implementation bugs to shorten time to interoperability Meta: use tests and implementation bugs to shorten time to interoperability Oct 3, 2016
foolip commented Oct 3, 2016

I have softened the language in #1850 somewhat, to "For normative changes, you will typically be asked for a web-platform-tests PR testing the new behavior. If testing is not practical, please explain why and if appropriate file an issue to follow up later."

@annevk, is there anything you'd add to cover who is expected to write the tests? I think this is mostly a matter of getting the culture right, and that it's hard to spell out in much detail, but would like your thoughts.

annevk commented Oct 4, 2016

"A web-platform-tests PR is typically required before a normative PR can land." would be a little more neutral as to who is responsible.

annevk commented Oct 4, 2016

> In practice it has ended up being the browser engineer who writes the spec pull requests, the test pull requests, and the browser implementation changes, and the editors primarily do the reviews.

This is the nightmare scenario by the way.

foolip added a commit that referenced this issue Oct 4, 2016
foolip added a commit that referenced this issue Oct 4, 2016
foolip commented Oct 4, 2016

Latest wording is "For normative changes, a corresponding web-platform-tests PR is typically required. The author and reviewer can be different than for the spec change. If testing is not practical, please explain why and if appropriate file an issue to follow up later."

For cases where only one person is motivated enough to write the spec and test PRs, if it's non-trivial, I think it's very reasonable to ask someone else to review the tests.

annevk commented Oct 4, 2016

The other thing we need for this is to have generated copies of the standard post-PR. Otherwise reviewing the tests requires reviewing the change to the standard too. That would be good to have in general, also for reviewing the change to the standard, but this seems to make that more of a necessity.

foolip added a commit that referenced this issue Oct 4, 2016
foolip commented Oct 4, 2016

Disclosure: @annevk and I had a VC earlier today. There are some cases where requiring tests up front could be unhelpful:

  • An entirely new area, where writing a lot of tests without any implementation will scarcely result in quality tests.
  • Under-tested old areas, where testing the single thing you changed would require you to build the whole test infrastructure for the area – which might be too demanding in some cases.
  • Changes that are hard to understand before they are merged into the spec, where testing and iterating after merge is more productive.

As long as we have some kind of tracking, it's fine to make exceptions for cases like this, and perhaps others.

So that we can learn as we go, I'm going to downgrade the language from "typically required" to "highly appreciated". Over the coming months, it'll be interesting to see what works and what does not. If the only outcome is that things that are trivial to test are tested, that's still an improvement.

inikulin commented Oct 5, 2016

@domenic I mean, shouldn't we add a reference to them like it's done for WPT in 6408a8f?

zcorpan commented Oct 5, 2016

Yeah... But would it not be better to document that in web-platform-tests?

foolip commented Oct 5, 2016

How are browsers importing the html5lib-tests, if at all?

@gsnedders

@foolip WebKit and Blink both run html5lib-tests via mine and Philip's ancient runner (which is terrible), as well as the imported compiled copies in web-platform-tests.

domenic commented Oct 5, 2016

To me it sounds like the solution is to keep the advice in the HTML Standard as-is, but add a README to web platform tests (possibly a subdirectory of it) explaining that to get parsing tests into WPT, you need to go do something in some other repo?

annevk commented Oct 6, 2016

Yeah, and possibly explaining what needs to be solved in WPT before the tests can be moved there. (So that if someone wants to tackle the issue @gsnedders mentioned they have a starting point.)

RByers commented Oct 7, 2016

> In practice it has ended up being the browser engineer who writes the spec pull requests, the test pull requests, and the browser implementation changes, and the editors primarily do the reviews.

> This is the nightmare scenario by the way.

We haven't seen it be much of a problem in practice on the Chrome team. Certainly a better situation than having specs that don't match any implementation (which happens ALL the time) or not having any interop tests (and so implementations deviating from the specs in many ways people don't realize).

RByers commented Oct 7, 2016

> There are some cases where requiring tests up front could be unhelpful:
> An entirely new area, where writing a lot of tests without any implementation will scarcely result in quality tests.

IMHO having text in the spec without any implementation anywhere indefinitely is unhelpful too. When no implementation exists for something in the spec, I'd argue that's a spec bug which should be tracked by an open issue. It's fine to land the spec changes prior to implementation, but our work as standards engineers isn't done until the spec actually reflects reality. So, to resolve the issue, either tests should land that pass on some implementation, or (if that doesn't happen in some reasonable timeframe) the spec text should be reverted to match implementations.

[Edited to clarify I'm not opposed to spec changes preceding implementation / testing]

foolip commented Oct 7, 2016

> having text in the spec without any implementation anywhere indefinitely is unhelpful too

Happily, this is no longer on the menu for WHATWG specs, or at least we've been removing cases that have come to light. I don't know what the optimal (eventual-interop-wise) process for entirely new areas is, but probably some kind of parallel progression of implementation, spec and tests.

Big picture, I think we need to somehow track the progression of new features on the platform. "Track those issues over time, revisiting spec issues if necessary" is in that vein, but I think we need to experiment a bit. If we require either tests or bugs for missing tests, maybe working backwards from long-standing test failures and long-open test bugs would suffice.

annevk commented Oct 10, 2016

> We haven't seen it be much of a problem in practice on the Chrome team.

Right, if one person/browser does all the things, you don't notice the problems either. That was the point.

In any event, it seems we largely agree here.

foolip commented Jan 11, 2017

Survey time! It's now been just over 3 months since 6408a8f where we began "appreciating" tests with spec changes.

Some other WHATWG specs have been handled in the same way, so @annevk suggested on IRC that maybe the time has come to elevate this to a generic WHATWG process.

But first, how has this been working in practice, and what most needs improvement?

The most active mergers of PRs have been @domenic, @zcorpan, @annevk, and @sideshowbarker. The most active PR authors, in addition, have been @yuyokk, @jungkees, @junov, and @mikewest. I'd be very interested to hear how it's been on both sides.

For my own part, I've found it pretty much entirely positive. Knowing that others will now expect tests has prompted me to write some tests for older changes that I otherwise wouldn't have. aaf2435 and ad778cb were the two PRs I merged, and I wrote one of them; it worked well in both cases.

As for needed improvements, "Learn which implementations fail those tests" is definitely at the top of my list, and it's tooling that Blink's platform predictability effort will prioritize this quarter.

zcorpan commented Jan 14, 2017

I have been on both sides and think that it has worked really well. Tests and bugs speed up the turnaround time a lot. Without this it could take years before a browser vendor picked up or objected to a change (even if someone from that browser had given an OK for the change). Now some changes have patches in browsers before the change has landed in the spec.

Writing tests also increases the quality of the spec, since problems become clear in the tests. It also seems reasonable to assume that the tests help get interoperable implementations, which is the goal of the spec.

wpt-stability-checker is very useful for seeing actual results (and whether the tests are stable); it would be nice to have Safari TP and Edge tested as well.

Some kind of bug-filer to file the same bug in several browsers could be useful, but we typically need to search for an existing bug anyway, and it's not that hard to just copy/paste. Also, typically one wants to pick a good component and maybe cc someone, so I'm not sure how much it would help.

All in all, I think the extra work is well worth it, and it is probably a smarter use of time. When you write a spec change you understand the problem, so you might as well write some tests before it's swapped out and forgotten about.

domenic commented Jan 17, 2017

I also feel that this is working really really well. In addition to seconding everything @zcorpan wrote, I want to say that writing tests and filing bugs has increased my feeling of participation in the whole process. That is, I feel much more sure that when I make a spec change, I am doing all I can to get it implemented everywhere, in precisely the manner I intended.

One interesting thing that's come up is the work in URL and Streams where we maintain a "reference implementation" alongside the spec. The reference implementation is more or less a line-by-line transcription of the spec text. This allows us to update the reference implementation for any spec change, and make sure it passes the new tests. For URL in particular this has caught several incorrect spec changes (where the spec text was wrong, even if the test updates were right). See e.g. whatwg/url#196 (comment) and web-platform-tests/wpt#4405 (comment) . And for streams it has led to very high test coverage.

This kind of reference implementation approach requires a higher up-front cost and is only applicable in limited situations involving self-contained systems that are relatively easy to write from scratch. But I could see it extending to e.g. Encoding or MIME Sniffing in the future, or to specific self-contained parts of larger specs (e.g. mini parsers).
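As a toy illustration of the pattern (not taken from any actual spec, nor from the URL/Streams code): each spec step becomes roughly one line of code annotated with the step number, and the same JSON test vectors that back the web-platform-tests are run against it, so an incorrect spec change fails immediately. The mini algorithm and file name below are invented:

```python
"""Toy illustration of the "reference implementation" pattern: spec steps
transcribed line by line, checked against shared JSON test vectors."""

import json

def split_on_commas_and_strip(value):          # a made-up mini spec algorithm
    # 1. Let tokens be the result of splitting value on U+002C (,).
    tokens = value.split(",")
    # 2. Remove leading and trailing ASCII whitespace from each token.
    tokens = [t.strip(" \t\n\r\f") for t in tokens]
    # 3. Remove any token that is the empty string.
    tokens = [t for t in tokens if t != ""]
    # 4. Return tokens.
    return tokens

def run_shared_tests(path="splitting-testdata.json"):
    # The same vectors would back the testharness.js tests in wpt, so a spec
    # change that breaks them is caught before it lands, not years later.
    with open(path) as f:
        cases = json.load(f)                   # [{"input": "...", "output": [...]}]
    for case in cases:
        actual = split_on_commas_and_strip(case["input"])
        assert actual == case["output"], (case, actual)

if __name__ == "__main__":
    run_shared_tests()
```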

Regarding bug-filing improvements, I think the best thing we could do there is have some kind of "hub" page/wiki entry/markdown file that provides quick-links to all the relevant trackers. It could also include advice on how to get yourself edit-bugs access for different trackers, or how to label things to get them maximum attention. Also perhaps a link to https://browser-issue-tracker-search.appspot.com/ to help you search for dupes. This kind of info is especially important for spreading the process beyond those of us who know how to do this already. For Web IDL @tobie put together https://github.com/heycam/webidl/#filing-issues-elsewhere which is a nice start.

annevk commented Jan 18, 2017

I was extremely skeptical and extremely wrong. This is indeed a great system.

I agree that it would be good to have more reference implementations to catch subtle/blatant bugs. (I want a different term for them though since other implementations should not use them as reference. The standard and tests should be used as reference always, ideally.)

@gsnedders

Historically, the opposition to reference implementations was two-fold: firstly, not wanting browser vendors to all just use a single implementation instead of independently implementing the spec; secondly, not wanting to end up with all the issues of disagreement between specification and implementation.

That said, there's been at least one notable example of specifying using executable code (SML in the ES4 drafts), which avoids both the first and second risks (nobody's going to ship SML code!)… given much of what we're touching, is there much risk in using JS in the specification? Is there much risk of browsers just shipping that?

It becomes so much easier to start answering questions about specification interpretation and correctness when we have a specification in a formal language.

foolip commented Sep 16, 2017

Coming back to this issue again, it looks like we're still stuck at step 1 of the master plan. Step 2 is "Learn which implementations fail those tests". @bobholt, this would have to involve an enhanced PR results bot; do you have an open issue tracking this?

bobholt commented Sep 21, 2017

@foolip In my mind, this is a function of http://results.web-platform-tests.org/. For the PR bot to add each browser's failures to the PR comment would make that comment miles long and potentially bump up against GitHub's content limits again (even if we use details elements). That information is available by browsing the WPT PR Status link and drilling down into specific browsers (e.g. https://pulls.web-platform-tests.org/job/18732.12). If there's a more useful way to display that information within the app, we could take a look at that.

foolip commented Sep 22, 2017

The information that I think currently isn't in http://results.web-platform-tests.org/ is what the status was without the changes, and one cannot assume that wpt.fyi has the same result as the immediately preceding commit. So I think running the affected tests once without the changes is a prerequisite; would you agree?
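Once both runs exist, the comparison itself is cheap. A sketch, assuming the wptreport JSON layout (a top-level "results" list with per-test "status" and "subtests") and placeholder file names:

```python
"""Sketch: given two wptreport JSON files for the same browser, one produced
without a PR's changes and one with them, list every (test, subtest) whose
status changed. Layout assumptions and file names are placeholders."""

import json

def statuses(report_path):
    with open(report_path) as f:
        report = json.load(f)
    table = {}
    for result in report["results"]:
        table[(result["test"], None)] = result["status"]          # harness status
        for sub in result.get("subtests", []):
            table[(result["test"], sub["name"])] = sub["status"]  # subtest status
    return table

def diff(base_path, patched_path):
    base, patched = statuses(base_path), statuses(patched_path)
    for key in sorted(set(base) | set(patched), key=str):
        before, after = base.get(key, "MISSING"), patched.get(key, "MISSING")
        if before != after:
            test, subtest = key
            print(f"{test} [{subtest or 'harness'}]: {before} -> {after}")

if __name__ == "__main__":
    diff("report-chrome-base.json", "report-chrome-patched.json")
```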

foolip commented Dec 13, 2017

We're planning Ecosystem Infra 2017 Q4 OKRs and I'd like to take the next steps on this. Breakdown of original 4 steps:

Appreciate web-platform-tests for normative changes

Done; this is the new normal. Special thanks to @plehegar and @rakuco on this front. There are diminishing returns on policy changes, so now I'd like to focus on the day-to-day reality, to see how often it happens in practice and why. My idea is to start from https://foolip.github.io/day-to-day/ and look for commits and PRs with normative changes but no mention of tests. I don't expect that HTML will come up here much :)

Learn which implementations fail those tests

@lukebjerring is working on "Surface test regressions from Travis as review comments", and will be supported by @mattl to make the CI infrastructure highly reliable. Before long, for any wpt PR, we will know with confidence how that test change will affect the different browsers.

File issues for those implementations

I don't think we need shared infra for this. The above will support the WHATWG working mode, where browser bugs are already being filed when it's most likely to help.

@Hexcles is working on (opt-in) WPT Import Notification which will get failures into the triage queues of Chromium teams. This is experimental, but I think automatic bugs must be based on tip-of-tree status to have high accuracy. Too many invalid bugs will cause all of them to be ignored, in practice.

In Q1 we'll be keeping a lookout to see how frequent duplicate manual+automated bugs are, and which are of higher quality. (I'd bet manual.) Then we'll see if we need any heuristics to avoid filing duplicate bugs.

Track those issues over time, revisiting spec issues if necessary

@domenic brought up #2710 on IRC yesterday as an example of this. I think that tooling here would be very helpful; it's just too much work to keep track of implementation and test status over time. I think there are two ways to identify such cases:

  • Monitor the bugs that were filed and revisit if no movement
  • Monitor the tests that began failing and revisit if no movement

We could do both, but I think the second is more robust and would like to explore that first. In Q1 I'll be looking for tests added some time ago that are still failing everywhere, and go poking at one every week to understand why that is. Very old tests that got stuck in between completely failing and completely passing might also be interesting.
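A sketch of how that search could start, assuming one wptreport JSON per browser (the file list is a placeholder) and using git to find when each still-failing test file was added; the URL-to-file mapping is approximate (multi-global .any.js variants are ignored):

```python
"""Sketch: find tests that pass nothing in any of several browsers' wptreport
files, and print when each test file was added to the repository."""

import json
import subprocess

REPORTS = ["report-chrome.json", "report-firefox.json", "report-safari.json"]

def failing_tests(report_path):
    # Tests where neither the harness result nor any subtest is a PASS.
    with open(report_path) as f:
        report = json.load(f)
    failing = set()
    for result in report["results"]:
        subtests = result.get("subtests", [])
        passes = result["status"] == "PASS" or any(
            sub["status"] == "PASS" for sub in subtests)
        if not passes:
            failing.add(result["test"])
    return failing

def added_date(test_path):
    # wptreport "test" values are URL paths like "/dom/foo.html"; the file
    # usually lives at the same relative path in the wpt checkout.
    out = subprocess.check_output(
        ["git", "log", "--diff-filter=A", "--format=%ad", "--date=short",
         "--", test_path.lstrip("/")],
        text=True).strip().splitlines()
    return out[-1] if out else "unknown"

def main():
    everywhere = set.intersection(*(failing_tests(path) for path in REPORTS))
    for test in sorted(everywhere):
        print(added_date(test), test)

if __name__ == "__main__":
    main()
```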

Everyone, thoughts on this overall direction, or any of the specifics?

foolip commented Dec 13, 2017

A few other things that we are doing or might be doing in the testing space:

Hexcles commented Dec 13, 2017

Regarding "notifying new WPT failures", there is a true difficulty that we glossed over: what are "new failures"?

In each testharness.js-based test file, there can be many tests with results like PASS, FAIL, etc. The current heuristic I use is to count the number of "FAIL"s, which is of course inaccurate. An extreme option would be to notify people whenever the output text changes, but that's too noisy at the moment.

This problem boils down to the lack of identifiers for tests in WPT (the names of the tests are informative and not intended to be identifiers, IIUC), which makes all time-based analysis (comparing multiple WPT commits) hard, including time-series metrics. (I actually had the exact same discussion with @mdittmer.)
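One refinement over counting FAILs would be to key results by (test, subtest name) across two runs and only report subtests that regressed from PASS; renamed subtests still show up as having no history, which is exactly the identifier problem above. A sketch, with the same wptreport layout assumptions and placeholder file names as in the earlier sketches:

```python
"""Sketch: classify non-passing subtests in a new run as regressions
(previously PASS) or unknown (no history under that name)."""

import json

def subtest_statuses(report_path):
    # Key every subtest result by (test URL, subtest name).
    with open(report_path) as f:
        report = json.load(f)
    return {(result["test"], sub["name"]): sub["status"]
            for result in report["results"]
            for sub in result.get("subtests", [])}

def new_failures(old_report, new_report):
    old, new = subtest_statuses(old_report), subtest_statuses(new_report)
    regressions, unknown = [], []
    for key, status in new.items():
        if status == "PASS":
            continue
        if key not in old:
            unknown.append(key)        # brand-new subtest, or an old one renamed
        elif old[key] == "PASS":
            regressions.append(key)    # went from PASS to something else
    return regressions, unknown

if __name__ == "__main__":
    regressions, unknown = new_failures("old-report.json", "new-report.json")
    print(f"{len(regressions)} regressed subtests, {len(unknown)} with no history")
```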
