HTML API: Add `create_full_parser()` for non-fragment parsing. #6977
a74a5f5 to 730ba45
Since the HTML Processor started visiting all nodes in a document, both real and virtual, the breadcrumb accounting became a bit complicated and it's not entirely clear that it is fully reliable. In this patch the breadcrumbs are rebuilt separately from the stack of open elements in order to eliminate the problem of the stateful stack interactions and the post-hoc event queue. Breadcrumbs are greatly simplified as a result, and more verifiably correct, in this construction.
The HTML Processor internally throws an exception when it reaches HTML that it knows it cannot process, but this exception is not made available to calling code. It can be useful to extract more knowledge about why it gave up, especially for debugging purposes. In this patch, more context is added to the WP_HTML_Unsupported_Exception and the last exception is made available to calling code, if it asks.
The HTML Processor has only supported a specific kind of parsing mode called _the fragment parsing mode_, where it behaves in the same way that `node.innerHTML = html` does in the DOM. This mode assumes a context node and doesn't support parsing an entire document. This patch introduces the full parsing mode interface but leaves the implementation incomplete, preparing the way for further work to add that additional support. See Core-61576
730ba45 to 58e91ea
As part of work to add more spec support to the HTML API, this patch adds support for the insertion modes from the initial start of a full document parse until IN BODY. Modes after IN BODY are left to future work, but this change opens up the ability to start performing full document parses. See #61576.
58e91ea to 339f701
Nice, there's a lot in this one! I've left many comments and questions.
One thing I noticed is that it seems like any document proceeds through a number of insertion modes that ultimately create HTML > HEAD > BODY (even if empty). This PR does not seem to do that.
I would like to figure out the html5lib tests to get more coverage here, although it's fine to do that after this lands if you'd prefer. We'd need to disable this flag and change the processor (fragment or full) depending on each test case. It will require more changes to fix the tree representation.
     */
    case 'html':
        $contents = $this->get_modifiable_text();
        if ( ' html' !== $contents ) {
I see we have a todo comment following this, but I think this can be relaxed. I haven't found it in the specification, but browsers seem to match this case-insensitively. `HTML`, `HtMl` and `htmL` all seem to use no-quirks mode.
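A minimal sketch of what relaxing the comparison could look like. The helper name `is_html_doctype_contents()` is hypothetical and not part of the patch. It keeps the `' html'` shape used in the quoted hunk (leading space included) and only ignores ASCII case. It does not attempt real DOCTYPE parsing.

```php
<?php
/**
 * Hypothetical helper (not from the patch): reports whether the text that
 * follows "DOCTYPE" is " html", compared without regard to ASCII case.
 */
function is_html_doctype_contents( string $contents ): bool {
	// strcasecmp() returns 0 when both strings are equal ignoring case.
	return 0 === strcasecmp( ' html', $contents );
}

var_dump( is_html_doctype_contents( ' HTML' ) ); // bool(true)
var_dump( is_html_doctype_contents( ' HtMl' ) ); // bool(true)
var_dump( is_html_doctype_contents( ' svg' ) );  // bool(false)
```

Anything beyond the bare name, such as public or system identifiers, would still be rejected by this check, which matches the intentionally narrow scope described in the reply.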
this is fine, but I don't see it as a stumbling block at this point. the reason I arbitrarily limited it is that we don't parse DOCTYPE declarations properly. I'd kind of rather wait until that happens to start doing more here. that is, I don't want to write half a parser here and then a full parser in the Tag Processor.
maybe casing isn't that big of a deal. I'll think about it.
     * > Then, switch the insertion mode to "before html".
     */
    $this->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_BEFORE_HTML;
    return true;
`next_token` does not seem to stop here, I'm unable to reach the doctype:
    $p = WP_HTML_Processor::create_full_parser( '<!DOCTYPE html><body>' );
    $p->next_token();
    echo $p->get_token_type() . ' ' . $p->get_tag();
    // #tag HTML
thanks. interestingly enough this is because we don't insert a node for the DOCTYPE declaration.
I'm not sure what to do either, because it's not accessible and it's not a child or parent of the root HTML node.
What do you mean by "it's not accessible?"
My initial thoughts are that this is an important part of the document and it should be reached via an interface like `next_token`.
    /*
     * > A character token that is one of U+0009 CHARACTER TABULATION,
     * > U+000A LINE FEED (LF), U+000C FORM FEED (FF),
     * > U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
     *
     * Parse error: ignore the token.
     */
    if ( '#text' === $op ) {
        $text = $this->get_modifiable_text();
        if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
            return $this->step();
        }
    }
In the spec this comes between comments and "html" start tag, any reason to move it here or could we respect the spec's order?
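The whitespace test in the quoted hunk, `strlen( $text ) === strspn( $text, " \t\n\f\r" )`, can be tried in isolation. The sketch below wraps it in a hypothetical helper named `is_html_whitespace_only()`, which is an assumption for illustration and not a function from the patch.

```php
<?php
/**
 * Hypothetical helper (not from the patch): true when every byte of $text is
 * one of the five whitespace characters listed in the quoted spec comment.
 */
function is_html_whitespace_only( string $text ): bool {
	// strspn() returns the length of the leading run made up only of the
	// listed characters; if that run covers the whole string, nothing else
	// is present.
	return strlen( $text ) === strspn( $text, " \t\n\f\r" );
}

var_dump( is_html_whitespace_only( " \t\n\f\r" ) ); // bool(true)
var_dump( is_html_whitespace_only( " text " ) );    // bool(false)
var_dump( is_html_whitespace_only( '' ) );          // bool(true)
```

Note that an empty string also passes, since both lengths are zero.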
    /*
     * > Act as described in the "anything else" entry below.
     */
    $is_excluded_closing_tag = true;
Rather than using flags, I like the goto approach we've used in other places to move directly to the referenced condition (this is also done in #7046).
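As a rough illustration of the `goto` style being suggested, here is a simplified sketch loosely modeled on the "after head" rules in the HTML specification. The function name, the `'+BODY'` opener notation, and the returned strings are all assumptions made for this example. The real step methods mutate parser state rather than returning descriptions.

```php
<?php
/**
 * Hypothetical, simplified step function (not from the patch). It returns a
 * description of the action instead of changing any parser state.
 */
function after_head_step_sketch( string $op ): string {
	switch ( $op ) {
		case '+BODY':
			return 'insert BODY';

		case '-BODY':
		case '-HTML':
		case '-BR':
			// > Act as described in the "anything else" entry below.
			goto anything_else;
	}

	// > Any other end tag: parse error, ignore the token.
	if ( 0 === strpos( $op, '-' ) ) {
		return 'ignore';
	}

	anything_else:
	return 'insert virtual BODY and reprocess';
}

echo after_head_step_sketch( '-BR' ), "\n";  // insert virtual BODY and reprocess
echo after_head_step_sketch( '-DIV' ), "\n"; // ignore
```

Jumping out of a `switch` to a label in the same function is allowed in PHP, so the excluded end tags reach the shared branch without an extra boolean being carried through the method.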
    case '-BODY':
    case '-HTML':
    case '-BR':
        $is_excluded_closing_tag = true;
Same comment here regarding flags vs. goto.
    $processor->state->encoding            = $known_definite_encoding;
    $processor->state->encoding_confidence = 'certain';
I think we need to set these in the fragment parser as well.
right now the fragment parser only supports `UTF-8`, which would be `certain` confidence. because that's inferred from the parent document, I don't feel like we need to add more there yet. it's still my intention only to support UTF-8, as I would prefer we get in the habit of converting before processing.
thoughts?
UTF-8 only support seems like the right choice at this time.
It's good that we set the encoding and encoding confidence in the fragment parser. Before that change it was resulting in different behavior for `<meta charset="anything">` tags between the full and fragment parsers. That's fixed by always setting `certain` encoding confidence.
    /*
     * > A character token that is one of U+0009 CHARACTER TABULATION,
     * > U+000A LINE FEED (LF), U+000C FORM FEED (FF),
     * > U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
     */
    if ( '#text' === $op ) {
        $text = $this->get_modifiable_text();
        if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
            // Insert the character.
            $this->insert_html_element( $this->state->current_token );
            return true;
        }
    }

    $is_excluded_closing_tag = false;
Same - I'd prefer goto over flag and could we handle `#text` in the switch?
    /*
     * > An end tag whose tag name is "br"
     *
     * This should never happen, as the Tag Processor prevents showing a BR closing tag.
     */
    case '-BR':
        /*
         * > Act as described in the "anything else" entry below.
         */
        $is_excluded_closing_tag = true;
        break;
I didn't understand why `</br>` would get special handling here, but it's to prevent the end tag from being ignored. `</br>` has special handling to be treated as an opener.
It is handled correctly in this PR 👍
    $currently_at = $this->bookmarks[ $this->state->current_token->bookmark_name ];
    $new_bookmark = $this->bookmark_token();
    $this->bookmarks[] = new WP_HTML_Span( $currently_at->start + $currently_at->length, 0 );
    $this->state->stack_of_open_elements->push(
        new WP_HTML_Token(
            $new_bookmark,
            'BODY',
            false
        )
    );
Should this use the `insert_virtual_node` method?
     * @param string $label A string which may specify a known encoding.
     * @return string|null Known encoding if matched, otherwise null.
     */
    protected static function get_encoding( string $label ): ?string {
This looks correct but seems to be unused in this PR, was that intentional?
it was intentional, but maybe it's not necessary. if we support inferring the character encoding it would be useful.
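For readers wondering what such a label lookup might involve, here is a sketch under stated assumptions. It is not the body of `get_encoding()` from the patch, which is not shown in this thread. The function name `sketch_get_encoding()` is hypothetical. The label list is the set of UTF-8 labels from the WHATWG Encoding Standard, and only UTF-8 is recognized, in line with the UTF-8-only stance discussed earlier in the review.

```php
<?php
/**
 * Hypothetical sketch (not the patch's implementation): maps an encoding
 * label to a known encoding name, recognizing UTF-8 labels only.
 */
function sketch_get_encoding( string $label ): ?string {
	// Labels are matched after stripping ASCII whitespace and lowercasing.
	switch ( strtolower( trim( $label, " \t\n\f\r" ) ) ) {
		case 'unicode-1-1-utf-8':
		case 'unicode11utf8':
		case 'unicode20utf8':
		case 'utf-8':
		case 'utf8':
		case 'x-unicode20utf8':
			return 'UTF-8';

		default:
			return null;
	}
}

var_dump( sketch_get_encoding( ' UTF8 ' ) );     // string(5) "UTF-8"
var_dump( sketch_get_encoding( 'iso-8859-1' ) ); // NULL
```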
The spec handles DOCTYPE, comment, and text tokens in that order; adjust the implementation to match.
This seems ready to land.
- I would like to expose the doctype token somehow.
- I'd also like to figure out the full html processor and html5lib-tests.
These are related and can be addressed in a follow-up.
The HTML Processor has only supported a specific kind of parsing mode called _the fragment parsing mode_, where it behaves in the same way that `node.innerHTML = html` does in the DOM. This mode assumes a context node and doesn't support parsing an entire document.

As part of work to add more spec support to the HTML API, this patch introduces a full parsing mode, which can parse a full HTML document from start to end, including the doctype declaration and head tags.

Developed in #6977
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.

git-svn-id: https://develop.svn.wordpress.org/trunk@58836 602fd350-edb4-49c9-b593-d223f7449a82
Trac ticket: Core-61576
Description
As part of work to add more spec support to the HTML API, this patch adds support for the insertion modes from the initial start of a full document parse until IN BODY.
Modes after IN BODY are left to future work, but this change opens up the ability to start performing full document parses.
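A hedged usage sketch of the new entry point, built only from calls that appear in the review thread above (`create_full_parser()`, `next_token()`, `get_token_type()`, `get_tag()`). The wrapper function `list_full_parse_tokens()` is hypothetical. It assumes WordPress is loaded and returns an empty array when the class is unavailable or the parser cannot be created. No expected output is shown, since which tokens are reported (the DOCTYPE in particular) was still under discussion in this PR.

```php
<?php
/**
 * Hypothetical wrapper (not from the patch): walks a full document parse and
 * collects "<token type> <tag name>" strings for each visited token.
 */
function list_full_parse_tokens( string $html ): array {
	// Outside of WordPress the class does not exist; bail out quietly.
	if ( ! class_exists( 'WP_HTML_Processor' ) ) {
		return array();
	}

	$processor = WP_HTML_Processor::create_full_parser( $html );
	if ( null === $processor ) {
		return array();
	}

	$tokens = array();
	while ( $processor->next_token() ) {
		// get_tag() may return null for non-tag tokens; it concatenates as ''.
		$tokens[] = $processor->get_token_type() . ' ' . $processor->get_tag();
	}

	return $tokens;
}

print_r( list_full_parse_tokens( '<!DOCTYPE html><body>' ) );
```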