Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define speculative HTML parsing #5959

Merged
merged 6 commits into from
Sep 14, 2021
Merged

Define speculative HTML parsing #5959

merged 6 commits into from
Sep 14, 2021

Conversation

@zcorpan zcorpan marked this pull request as ready for review October 29, 2020 16:27
@zcorpan zcorpan changed the title Define speculative HTML parsing (WIP) Define speculative HTML parsing Oct 29, 2020
@chrishtr
Copy link
Contributor

@domenic could you review, and @mfreed7 also?

Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall quite solid, and I'd be pretty happy with merging it as-is. I tried to find ways to polish it and commented on those.

More implementer weigh-in would be ideal (specifically @hsivonen would be great), since I don't feel confident about that aspect of the review.

source Outdated Show resolved Hide resolved
source Outdated Show resolved Hide resolved
source Outdated

<p class="note">It is possible that the same markup is seen multiple times from the
<span>speculative HTML parser</span> and then the normal HTML parser. It is expected that
duplicated fetches will be prevented by normal caching rules.</p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"normal caching rules" sounds like it's referring to well-specified caching rules. But I think in reality they're prevented by unspecified memory caches.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better or worse, Gecko at present tries a speculative fetch at most once for each unique URL seen during a page load regardless of "normal caching rules".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Belated +1 to what @domenic said - at the very least, we need to acknowledge that the rules are currently unspecified.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hsivonen do you think the spec should suggest that strategy for speculative fetches?

source Outdated Show resolved Hide resolved
source Outdated
<var>speculativeParser</var>.</p></li>

<li><p><span>In parallel</span>, run <var>speculativeParser</var> until it is stopped or until it
reaches the end of its <span>input stream</span>.</p></li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do speculative parsers stop? I thought they'd just kind of try to keep going, ignoring scripts or similar...

Or, can speculative parsers have their own speculative parsers? That might be worth calling out explicitly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when the normal parser starts parsing again, the speculative parser is stopped. At least I think this is true for Gecko, less sure about Chromium and WebKit. cc @hsivonen @mfreed7

In Gecko, I think a speculative parser can have a speculative parser, but I didn't do that in the spec to make the model a bit simpler. If a parser-blocking script document.writes another parser-blocking script, the spec starts over speculative parsing altogether. The main "win" for the spec is that it doesn't need to check whether the document.write did something that invalidates existing speculations. This can still be added in in the future if implementers would like to have that specified.

source Outdated Show resolved Hide resolved
source Outdated Show resolved Hide resolved
source Outdated Show resolved Hide resolved
source Outdated Show resolved Hide resolved
source Outdated

<p class="note">It is possible that the same markup is seen multiple times from the
<span>speculative HTML parser</span> and then the normal HTML parser. It is expected that
duplicated fetches will be prevented by normal caching rules.</p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Belated +1 to what @domenic said - at the very least, we need to acknowledge that the rules are currently unspecified.

source Outdated Show resolved Hide resolved
source Outdated

<li>
<p>If the <span>speculative HTML parser</span> encounters one of the following elements, then
act as if that element is processed for the purpose of its effect of speculative fetches for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Chromium and WebKit's implementations deal with tags (and their tokens), not elements. Dunno if it matters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tokens or tags seems slightly more correct from the spec perspective too; WDYT @zcorpan?

Base automatically changed from master to main January 15, 2021 07:58
@zcorpan
Copy link
Member Author

zcorpan commented Jan 21, 2021

@domenic @yoavweiss I've tried to address the feedback by being a bit more specific and inventing a concept of "speculative mock elements", which can't cause things to happen.

Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is pretty inventive. After staring at it for a bit, I think it works, and I like it!!

Still some other minor code review comments to resolve, but I think this is a very clean way of semi-rigorously building up a tree structure, while making it clear none of the normal mechanisms happen.

source Outdated Show resolved Hide resolved
@zcorpan
Copy link
Member Author

zcorpan commented Jan 22, 2021

@domenic thanks!

Another reason for this approach is that e.g. the Adoption Agency Algorithm can mutate parts of the DOM tree before the parser-blocking script, which would be bad to do speculatively on the real DOM tree.

Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a couple of minor things!

source Outdated Show resolved Hide resolved
source Outdated Show resolved Hide resolved
source Outdated Show resolved Hide resolved
@zcorpan
Copy link
Member Author

zcorpan commented Jan 25, 2021

@domenic thanks, fixed!

Copy link
Member

@domenic domenic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! It seems you've got a couple open threads left with @mfreed7 and @hsivonen; I'll let you decide whether you want to wait on resolving those or merge and potentially revisit later if they come back with feedback.

@zcorpan
Copy link
Member Author

zcorpan commented Jan 25, 2021

I'd ideally like a LGTM from at least two implementers here.

source Outdated

<ul>
<li>
<p>The state of the normal HTML parser and the document itself must not be affected.</p>
Copy link
Member

@andreubotella andreubotella Feb 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of the input stream and the input byte stream probably needs to be specified with more detail – since tokens pushed into the input (byte) stream must also be pushed into the speculative parser's stream, but tokens read from the streams should be independent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 6a20e19

This list is hand-wavy, certainly, but the feedback from implementers is to not specify the speculative parser in precise detail as the approaches used in different engines are different and currently we're mostly concerned about observable differences and that this optimization is specified at all.

@mfreed7
Copy link
Contributor

mfreed7 commented Feb 13, 2021

So overall I like this spec change. In essence, the preload scanner shouldn't load something that the "real" parser wouldn't eventually load also, assuming no document.write()'s change things in the meantime. As a spec, I think that makes sense, and due to Gecko's use of the actual parser / tree builder, most of the tests pass there. On the Chromium side, the implementation is quite separate, with the preload scanner being a very thin approximation of the real tree builder. As such, more of the tests fail, and some might be tough to fix. One alternative is to re-architect the preload scanner to use the real tree builder, but I don't think that's worth the effort. Which means the "fix" for Chromium will amount to trying to patch the holes in the approximations being made in the existing preload scanner. I'm ok with that. There's already a nice bug summarizing those errors for us.

TL;DR, I'm supportive of landing this PR along with the test PR.

source Outdated Show resolved Hide resolved
@hsivonen
Copy link
Member

Thanks for working on this.

The spec seems to allow speculative fetches only when a script is blocking the parser. In Firefox, speculative fetches can happen even when the parser is not blocked by script but DOM building actions are being accumulated into a larger batch of work and speculative fetches are started during that accumulation.

I think this is relevant to whether Firefox complies to the spec text in the as-if sense in case DOM changes come from other sources than parser-blocking scripts (e.g. async scripts or timeouts set by a same-origin parent). Specifically, if something else disconnects a node such that the parser keeps inserting nodes that are not in the document and this causes some non-speculative fetches not to occur, Firefox still performs speculative fetches. (Images are fetched even if disconnected, but I think there are other fetch types where the non-speculative case requires the node to be in the document.)

I think it should be conforming to perform speculative fetches at any time on the assumption that no scripted action, regardless whether from a parser-blocking script or not, does anything.

In Firefox, duplicate request avoidance doesn't go all the way to the cache but the speculative load machinery refuses to speculatively fetch the same URL twice.

I don't see spec text connecting the creation of speculative mock elements to speculative fetches.

@zcorpan
Copy link
Member Author

zcorpan commented Apr 20, 2021

@hsivonen

In Firefox, duplicate request avoidance doesn't go all the way to the cache but the speculative load machinery refuses to speculatively fetch the same URL twice.

Is it keyed off of only the URL, or a tuple of URL, crossorigin attribute's state, referrerpolicy attribute's state, or some such?

@zcorpan
Copy link
Member Author

zcorpan commented Apr 22, 2021

The spec seems to allow speculative fetches only when a script is blocking the parser. In Firefox, speculative fetches can happen even when the parser is not blocked by script but DOM building actions are being accumulated into a larger batch of work and speculative fetches are started during that accumulation.

I think this is relevant to whether Firefox complies to the spec text in the as-if sense in case DOM changes come from other sources than parser-blocking scripts (e.g. async scripts or timeouts set by a same-origin parent). Specifically, if something else disconnects a node such that the parser keeps inserting nodes that are not in the document and this causes some non-speculative fetches not to occur, Firefox still performs speculative fetches. [...]

I think it should be conforming to perform speculative fetches at any time on the assumption that no scripted action, regardless whether from a parser-blocking script or not, does anything.

I'm happy to allow it. I think we need to separate speculative parsing and speculative fetching a bit more in the spec, so normal parsing can also cause speculative fetches...

(Images are fetched even if disconnected, but I think there are other fetch types where the non-speculative case requires the node to be in the document.)

Yes, for example <link rel=stylesheet>, <link rel=modulepreload>.

<video poster> seems to not require the element to be browsing-context connected per spec: https://html.spec.whatwg.org/multipage/media.html#poster-frame

The spec allows fetching <script src> when the src attribute is set, before the element is connected:

"For performance reasons, user agents may start fetching the classic script or module graph (as defined above) as soon as the src attribute is set, instead, in the hope that the element will be inserted into the document (and that the crossorigin attribute won't change value in the meantime)."
https://html.spec.whatwg.org/multipage/scripting.html#prepare-a-script

I can't tell for <link rel=preload>, SVG <image>.

I don't see spec text connecting the creation of speculative mock elements to speculative fetches.

Yeah, this is maybe too handwavily implied currently. I'll try to address it along with allowing regular-parsing speculative fetches.

@zcorpan
Copy link
Member Author

zcorpan commented May 20, 2021

I've rebased on current main to resolve merge conflicts.

@hsivonen
Copy link
Member

In Firefox, duplicate request avoidance doesn't go all the way to the cache but the speculative load machinery refuses to speculatively fetch the same URL twice.

Is it keyed off of only the URL, or a tuple of URL, crossorigin attribute's state, referrerpolicy attribute's state, or some such?

URL only with the twist that if there is a media query that doesn't apply, the URL is ignored instead of listed as already loaded.

This might not be a good idea. This code predates the introduction of the crossorigin attribute and the referrerpolicy attribute.

@zcorpan
Copy link
Member Author

zcorpan commented Aug 18, 2021

@hsivonen I've addressed your comments, except I haven't specified the media query twist. I specified the duplicate fetch prevention to be URL only for now.

@zcorpan
Copy link
Member Author

zcorpan commented Aug 31, 2021

Arguably, "Let url be the URL that element would fetch if it was processed normally" covers the media query aspect (as well as type attribute, etc), though it's not very explicit.

Copy link
Member

@hsivonen hsivonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think the spec should allow speculation while looking for meta charset and the spec should allow fetches that the normal parser will see in the future to start speculatively. This could be explained by starting a parallel speculative parser at the start of the document with the HTTP-layer encoding, if there is one, or, otherwise, the inherited encoding, if there is one, or, otherwise, UTF-8.

source Show resolved Hide resolved
source Show resolved Hide resolved
Copy link
Contributor

@mfreed7 mfreed7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good to me. I like the "speculative mock element" concept - makes the spec pretty straightforward here. I added a few small comments, but overall LGTM.

source Show resolved Hide resolved
source Show resolved Hide resolved
source Show resolved Hide resolved
source Show resolved Hide resolved
@zcorpan zcorpan merged commit 92a152c into main Sep 14, 2021
@zcorpan zcorpan deleted the bocoup/speculative-parsing branch September 14, 2021 10:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Specify speculative HTML parsing (preload scanner)
8 participants