HTML API: Add create_full_parser() for non-fragment parsing. #6977

Closed
wants to merge 15 commits into from

@dmsnell dmsnell commented Jul 6, 2024

Trac ticket: Core-61576

Description

As part of work to add more spec support to the HTML API, this patch adds support for the insertion modes from the initial start of a full document parse until IN BODY.

Modes after IN BODY are left to future work, but this change opens up the ability to start performing full document parses.

- Tests: 1525, Assertions: 3277, Skipped: 233.
+ Tests: 1519, Assertions: 3278, Skipped: 233.

@dmsnell dmsnell force-pushed the html-api/support-initial-mode branch from a74a5f5 to 730ba45 Compare July 6, 2024 23:47
Since the HTML Processor started visiting all nodes in a document, both
real and virtual, the breadcrumb accounting became a bit complicated
and it's not entirely clear that it is fully reliable.

In this patch the breadcrumbs are rebuilt separately from the stack of
open elements in order to eliminate the problem of the stateful stack
interactions and the post-hoc event queue.

Breadcrumbs are greatly simplified as a result, and more verifiably
correct, in this construction.

The HTML Processor internally throws an exception when it reaches HTML
that it knows it cannot process, but this exception is not made
available to calling code. It can be useful to extract more knowledge
about why it gave up, especially for debugging purposes.

In this patch, more context is added to the WP_HTML_Unsupported_Exception
and the last exception is made available to calling code, if it asks.

The HTML Processor has only supported a specific kind of parsing mode
called _the fragment parsing mode_, where it behaves in the same way
that `node.innerHTML = html` does in the DOM. This mode assumes a
context node and doesn't support parsing an entire document.

This patch introduces the full parsing mode interface but leaves the
implementation incomplete, preparing the way for further work to add
that additional support.

See Core-61576
@dmsnell dmsnell force-pushed the html-api/support-initial-mode branch from 730ba45 to 58e91ea Compare July 6, 2024 23:57
As part of work to add more spec support to the HTML API, this patch adds
support for the insertion modes from the initial start of a full document
parse until IN BODY.

Modes after IN BODY are left to future work, but this change opens up the
ability to start performing full document parses.

See #61576.
@dmsnell dmsnell force-pushed the html-api/support-initial-mode branch from 58e91ea to 339f701 Compare July 7, 2024 00:26
@dmsnell dmsnell marked this pull request as ready for review July 22, 2024 23:35
github-actions bot commented Jul 22, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell dmsnell changed the title from "HTML API: Add support for insertion modes before IN BODY." to "HTML API: Add create_full_parser() for non-fragment parsing." Jul 22, 2024
dmsnell added a commit to sirreal/wordpress-develop that referenced this pull request Jul 29, 2024
Member

@sirreal sirreal left a comment

Nice, there's a lot in this one! I've left many comments and questions.

One thing I noticed is that any document seems to proceed through a number of insertion modes that ultimately create HTML > HEAD > BODY (even if empty). This PR does not seem to do that.


I would like to figure out the html5lib tests to get more coverage here, although it's fine to do that after this lands if you'd prefer. We'd need to disable this flag and change the processor (fragment or full) depending on each test case. It will require more changes to fix the tree representation.

 */
case 'html':
    $contents = $this->get_modifiable_text();
    if ( ' html' !== $contents ) {
Member

I see we have a todo comment following this, but I think this can be relaxed. I haven't found it in the specification, but browsers seem to match this case-insensitively. HTML, HtMl and htmL all seem to use no-quirks mode.

Member Author

this is fine, but I don't see it as a stumbling block at this point. the reason I arbitrarily limited it is that we don't parse DOCTYPE declarations properly. I'd kind of rather wait until that happens to start doing more here. that is, I don't want to write half a parser here and then a full parser in the Tag Processor.

maybe casing isn't that big of a deal. I'll think about it.
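
As a rough illustration of the relaxed comparison suggested above, here is a minimal sketch. The helper name `is_plain_html_doctype` is hypothetical and is not part of this patch. It only checks that the text after `<!DOCTYPE` is the single name `html`, ignoring ASCII case and surrounding whitespace; it does not attempt to parse public or system identifiers, which a proper DOCTYPE parser would need to do.

```php
<?php
/**
 * Hypothetical helper, not part of the patch: reports whether the text
 * following `<!DOCTYPE` is just the name `html`, ignoring ASCII case and
 * surrounding HTML whitespace.
 */
function is_plain_html_doctype( string $contents ): bool {
    // Strip the whitespace characters the HTML spec treats as whitespace.
    $name = trim( $contents, " \t\n\f\r" );

    // strcasecmp() returns 0 when the strings match case-insensitively.
    return 0 === strcasecmp( $name, 'html' );
}

var_dump( is_plain_html_doctype( ' html' ) ); // bool(true)
var_dump( is_plain_html_doctype( ' HtMl' ) ); // bool(true)
var_dump( is_plain_html_doctype( ' html PUBLIC "-//W3C//DTD HTML 4.01//EN"' ) ); // bool(false)
```

Anything with extra identifiers is rejected here, which is stricter than a browser would be; that is a simplification of this sketch, not a claim about the spec.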

* > Then, switch the insertion mode to "before html".
*/
$this->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_BEFORE_HTML;
return true;
Member

next_token does not seem to stop here, I'm unable to reach the doctype:

$p = WP_HTML_Processor::create_full_parser( '<!DOCTYPE html><body>' );
$p->next_token();
echo $p->get_token_type() . ' ' . $p->get_tag();
// #tag HTML

Member Author

thanks. interestingly enough this is because we don't insert a node for the DOCTYPE declaration.

I'm not sure what to do either, because it's not accessible and it's not a child or parent of the root HTML node.

Member

What do you mean by "it's not accessible"?

My initial thoughts are that this is an important part of the document and it should be reached via an interface like next_token.

Comment on lines 1102 to 1114
/*
 * > A character token that is one of U+0009 CHARACTER TABULATION,
 * > U+000A LINE FEED (LF), U+000C FORM FEED (FF),
 * > U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
 *
 * Parse error: ignore the token.
 */
if ( '#text' === $op ) {
    $text = $this->get_modifiable_text();
    if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
        return $this->step();
    }
}
Member

In the spec this comes between comments and "html" start tag, any reason to move it here or could we respect the spec's order?
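
For readers unfamiliar with the idiom in the snippet under review: `strspn()` returns the length of the leading run of bytes drawn from the given set, so comparing it with `strlen()` asks whether the whole string is made of those bytes. A standalone sketch follows; the function name `is_html_whitespace_only` is hypothetical and only wraps the same comparison.

```php
<?php
/**
 * Hypothetical wrapper around the comparison used in the patch: true when
 * every byte of $text is TAB, LF, FF, CR, or SPACE.
 */
function is_html_whitespace_only( string $text ): bool {
    // strspn() counts how many leading bytes belong to the mask; if that
    // count covers the entire string, nothing else is present.
    return strlen( $text ) === strspn( $text, " \t\n\f\r" );
}

var_dump( is_html_whitespace_only( " \t\n\f\r" ) ); // bool(true)
var_dump( is_html_whitespace_only( "  a  " ) );     // bool(false)
var_dump( is_html_whitespace_only( '' ) );          // bool(true)
```

Note that the empty string also passes, since both lengths are zero. Whether an empty text token can reach this code path is not something this sketch addresses.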

/*
* > Act as described in the "anything else" entry below.
*/
$is_excluded_closing_tag = true;
Member

Rather than using flags, I like the goto approach we've used in other places to move directly to the referenced condition (this is also done in #7046).
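
To make the suggestion concrete, here is a minimal, self-contained sketch of the `goto` style. The function `step_sketch`, its token names, and its return strings are all hypothetical and chosen only for illustration; the real step functions do far more. The point is that the cases which the spec sends to "anything else" jump straight to a label placed after the switch, so no boolean flag needs to be carried around.

```php
<?php
/**
 * Hypothetical dispatcher, loosely shaped like an insertion-mode step.
 * Returns a string naming the branch that handled the token.
 */
function step_sketch( string $op ): string {
    switch ( $op ) {
        case '+HEAD':
            return 'inserted HEAD';

        case '-HEAD':
        case '-BODY':
        case '-HTML':
        case '-BR':
            // > Act as described in the "anything else" entry below.
            goto anything_else;
    }

    // > Any other end tag: parse error, ignore the token.
    if ( 0 === strpos( $op, '-' ) ) {
        return 'ignored';
    }

    anything_else:
    return 'anything else';
}

echo step_sketch( '-BR' ), "\n";  // anything else
echo step_sketch( '-DIV' ), "\n"; // ignored
echo step_sketch( '+DIV' ), "\n"; // anything else
```

PHP does not allow jumping into a loop or switch, which is why the label sits after the switch rather than inside a `default` case.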

case '-BODY':
case '-HTML':
case '-BR':
    $is_excluded_closing_tag = true;
Member

Same comment here regarding flags vs. goto.

Comment on lines +365 to +366
$processor->state->encoding = $known_definite_encoding;
$processor->state->encoding_confidence = 'certain';
Member

I think we need to set these in the fragment parser as well.

Member Author

right now the fragment parser only supports UTF-8, which would be certain confidence. because that's inferred from the parent document, I don't feel like we need to add more there yet. it's still my intention only to support UTF-8, as I would prefer we get in the habit of converting before processing.

thoughts?

Member

UTF-8 only support seems like the right choice at this time.

It's good that we set the encoding and encoding confidence in the fragment parser. Before that change it was resulting in different behavior for <meta charset="anything"> tags between the full and fragment parsers. That's fixed by always setting certain encoding confidence.

Comment on lines 1605 to 1619
/*
 * > A character token that is one of U+0009 CHARACTER TABULATION,
 * > U+000A LINE FEED (LF), U+000C FORM FEED (FF),
 * > U+000D CARRIAGE RETURN (CR), or U+0020 SPACE
 */
if ( '#text' === $op ) {
    $text = $this->get_modifiable_text();
    if ( strlen( $text ) === strspn( $text, " \t\n\f\r" ) ) {
        // Insert the character.
        $this->insert_html_element( $this->state->current_token );
        return true;
    }
}

$is_excluded_closing_tag = false;
Member

Same - I'd prefer goto over flag and could we handle #text in the switch?

Comment on lines 1551 to 1561
/*
 * > An end tag whose tag name is "br"
 *
 * This should never happen, as the Tag Processor prevents showing a BR closing tag.
 */
case '-BR':
    /*
     * > Act as described in the "anything else" entry below.
     */
    $is_excluded_closing_tag = true;
    break;
Member

I didn't understand why </br> would get special handling here, but it's to prevent the end tag from being ignored. </br> has special handling to be treated as an opener.

It is handled correctly in this PR 👍

Comment on lines 1722 to 1731
$currently_at = $this->bookmarks[ $this->state->current_token->bookmark_name ];
$new_bookmark = $this->bookmark_token();
$this->bookmarks[] = new WP_HTML_Span( $currently_at->start + $currently_at->length, 0 );
$this->state->stack_of_open_elements->push(
    new WP_HTML_Token(
        $new_bookmark,
        'BODY',
        false
    )
);
Member

Should this use the insert_virtual_node method?

* @param string $label A string which may specify a known encoding.
* @return string|null Known encoding if matched, otherwise null.
*/
protected static function get_encoding( string $label ): ?string {
Member

This looks correct but seems to be unused in this PR, was that intentional?

Member Author

it was intentional, but maybe it's not necessary. if we support inferring the character encoding it would be useful.

Member

@sirreal sirreal left a comment

This seems ready to land.

  • I would like to expose the doctype token somehow.
  • I'd also like to figure out the full html processor and html5lib-tests.

These are related and can be addressed in a follow-up.

pento pushed a commit that referenced this pull request Jul 31, 2024
The HTML Processor has only supported a specific kind of parsing mode
called _the fragment parsing mode_, where it behaves in the same way
that `node.innerHTML = html` does in the DOM. This mode assumes a
context node and doesn't support parsing an entire document.

As part of work to add more spec support to the HTML API, this patch
introduces a full parsing mode, which can parse a full HTML document
from start to end, including the doctype declaration and head tags.

Developed in #6977
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.


git-svn-id: https://develop.svn.wordpress.org/trunk@58836 602fd350-edb4-49c9-b593-d223f7449a82
Member Author

dmsnell commented Jul 31, 2024

Merged in [58836]
883146e

@dmsnell dmsnell closed this Jul 31, 2024
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Jul 31, 2024
Built from https://develop.svn.wordpress.org/trunk@58836


git-svn-id: http://core.svn.wordpress.org/trunk@58232 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Jul 31, 2024
Built from https://develop.svn.wordpress.org/trunk@58836


git-svn-id: https://core.svn.wordpress.org/trunk@58232 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell dmsnell deleted the html-api/support-initial-mode branch August 6, 2024 21:52