Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML API: Allow additional fragment contexts. #7141

Open
wants to merge 14 commits into
base: trunk
Choose a base branch
from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Aug 5, 2024

Trac-ticket: Core-61576.

Status

  • Add unit tests covering significant contexts, e.g. SCRIPT, TEXTAREA, SVG.
    • It shouldn't be possible to create fragment parsers on void or self-contained elements. There's no content inside of them to parse. For self-contained elements it's appropriate to rely on the eventual set_inner_html() which will be an alias in those cases for set_inner_text().
  • Review all documentation looking for places stating that <body> is the only supported context.

Description

Previously, the fragment parser in WP_HTML_Processor has only allowed creating a fragment with the <body> context. In this patch, any context node is allowed.

Previously, the fragment parser in WP_HTML_Processor has only allowed
creating a fragment with the `<body>` context. In this patch, any
context node is allowed.
@dmsnell dmsnell force-pushed the html-api/allow-any-fragment-context branch from 41dc4aa to 691d39e Compare August 5, 2024 22:41
Copy link

github-actions bot commented Aug 5, 2024

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@dmsnell dmsnell marked this pull request as ready for review August 5, 2024 23:21
Copy link

github-actions bot commented Aug 5, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, apermo, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell
Copy link
Member Author

dmsnell commented Aug 6, 2024

@sirreal it looks like we may have a lot of work to do before we can open this up to allowing self-contained elements as the fragment context. there are multiple challenges:

  • we have to enter within a certain tokenizer state, but that's not how our parser is designed.
  • the DOM is able to store things like the text </textarea> inside a TEXTAREA node, but HTML is not able to do this. we have to determine what to do in these cases.
  • if the closing tag for the self-contained element isn't provided, we cannot parse it to figure out which context node it is. e.g. <script> bails with the tag processor.

@sirreal
Copy link
Member

sirreal commented Aug 12, 2024

The fragment parser takes an Element. An HTML string seems like a poor representation for that 🙂

I wonder if the fragment parser could take a token (to create a fragment from an existing document) or some other representation that's not just a string in order to create fragments without needing to create a full document first.

The following steps form the HTML fragment parsing algorithm. The algorithm takes as input an Element node, referred to as the context element, which gives the context for the parser, input, a string to parse, and an optional boolean allowDeclarativeShadowRoots (default false). It returns a list of zero or more nodes.

Copy link

@apermo apermo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you asked me, here are my thoughts.

One additional one. Do you want to add a blacklist of forbidden subnodes? Or do you think this is the responsibility of the developer?

Like preventing to add <html>, <body> to body, or pretty much everything to title.

Not sure wether this would be the right place to do so, but I had that thought and wanted to share it.

return null;
}

$context_processor = new WP_HTML_Tag_Processor( $context );
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how deep you want to comment, but a comment here could help understanding what you were up to.


$context_tag = $context_processor->get_tag();
$context_attributes = array();
foreach ( $context_processor->get_attribute_names_with_prefix( '' ) as $name ) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an empty prefix here looks strange.
If it was my code I'd either have the prefix optional/default to empty or add a get_attributes_names() which would do exactly the same as get_attributes_names_with_prefix( '' ), might be redundant in a way, but for me it would look a little more readable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @apermo. if this looks strange, it's supposed to 🙃
when we added this method we wanted it to be a bit awkward in the hopes of communicating that it involves additional costs. we wanted the default behavior to be looking for a subset of attributes (such as get_attribute_names_with_prefix( 'data-' )) so that the API itself guides people to learn in the safest most performant manner.

Comment on lines 293 to 294
* @param string $context Context element for the fragment, must be default of `<body>`.
* @param string $encoding Text encoding of the document; must be default of 'UTF-8'.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"must be default of ..."?
My internal autocomplete expected an "or" here. Which likely is equivalent, but I was expecting it.

And I think at least for $context you need to update it, since your change was about allowing other than body.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for noticing.

must be default of UTF-8 and must be default or UTF-8 are very different, whereas only the default value has been allowed (and the default is UTF-8). some day we might open it up to other values, but this is there to communicate intentionally that this is a UTF-8-only interface at the moment.

return null;
}

$processor = new static( $html, self::CONSTRUCTOR_UNLOCK_CODE );
$processor->state->context_node = array( 'BODY', array() );
$processor->state->context_node = array( $context_tag, $context_attributes );
$processor->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_IN_BODY;
Copy link

@apermo apermo Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood your PR correctly, it's about allowing insertion to anything other than body, or is body in this case ambivalent?

So content body vs <body>?

Anyways, I'm uncertain wether this is intentional or if you forgot to touch this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fragment parser is synonymous with node.innerHTML = newHtml in JavaScript or with a DOM. in this case, <body> has been the default as a reasonably safe default for existing code, which likely is getting only a small chunk of the actual site HTML and then trying to process it.

so if you knew you were inside a <li> you would create the fragment parser in the <li> context and then the parse would change based on that. I probably don't fully understand this, because it's hard for me to come up with situations that lead to different parses, but it comes into play when inside SVGs or MathML elements, and when resetting some internals (the insertion mode).

basically this is not something most people will need to use, but it will be used by set_inner_html() to ensure appropriate parsing.

@dmsnell dmsnell force-pushed the html-api/allow-any-fragment-context branch from 97288b3 to 4c652ad Compare August 21, 2024 01:41
@dmsnell dmsnell changed the title HTML API: Allow any fragment context. HTML API: Allow additional fragment contexts. Aug 21, 2024
@dmsnell
Copy link
Member Author

dmsnell commented Aug 21, 2024

@sirreal: upon further review I feel like this is not appropriate for self-contained and void elements, so I am rejecting them. there's never a reason, I think, to create a fragment parser on an element which can contain no child nodes (other than text, potentially).

One case I was specifically thinking through is a <script> context. The idea is simple: parse the inner HTML of a SCRIPT element. The problem is when </script> appears and there is more content. In the DOM you can actually create a text node containing </script> but this is not possible in the HTML. The spec even notes situations like this where the DOM can create trees that cannot be expressed in HTML.

So I think that long-term the only use for the fragment parser outside of the usual is when setting inner HTML, and in that case, it seems more fitting for that function to call out something special for elements like SCRIPT so that it falls into the equivalent of ->skip_script_data().

Apart from this I don't like exposing a token as the context node because that's cumbersome for people to write, and I don't expect people to want to go through the hassle of what that implies. Our tokens don't even contain attribute references, so this is not currently workable.

However, it does potentially solve some issues around namespacing. There's no way in this function's argument to specify a context node in the svg or math namespace.

@sirreal
Copy link
Member

sirreal commented Aug 22, 2024

I'll share some context from a conversation.


As mentioned, it likely doesn't make sense to create a fragment parser in order to modify these "self-contained" tags and better interfaces can be provided.


The specification for fragment parsers are always created from another document. This should be easy to do in the HTML API. However, most of the time WordPress is inspecting or modifying HTML snippets without considering their context. Rarely are full HTML documents considered. In order to continue supporting this behavior, we need to be able to create fragments without a parent document. This is where most of the difficulty arises. Most of the time, <body> provides a suitable context, but there could be other cases like <head>. Again, this seems relatively straightforward to support. However, it gets more complicated with things like <svg> as a context element, or especially <rect>, or even trickier a script tag in the SVG namespace!

The context element will be used to establish the correct insertion mode and the context element's namespace and possibly attributes determine how tokens should be handled (using rules for foreign content or one of the HTML insertion modes).

It may make sense to maintain this interface as-is. <body> is likely a good context for most use-cases but in order to properly support fragments some advanced interface likely needs to support:

  • Passing a context element along with its tag name, namespace, and attributes.
  • The context element's document determines the quirks mode of the fragment parser.

@sirreal
Copy link
Member

sirreal commented Aug 26, 2024

in order to properly support fragments some advanced interface likely needs to support…

I had a thought that may be helpful.

I'd like to add an instance method to the HTML processor like create_fragment_parser_at_node( string $html_fragment ). This method would return a new HTML processor, created as a fragment parser with the context element created from the current tag opener (or null). This would allow use to set the document compat type and the context element and its attributes correctly according to the specification.

Once we have this method, advanced fragment parser creation could be handled by that method:

$full_parser = WP_HTML_Processor::create_full_parser( '<!DOCTYPE html><math><annotation-xml encoding="text/html">' );
$full_parser->next_tag( 'ANNOTATION-XML' );
$fragment_parser = $full_parser->create_fragment_parser_at_node( `<h1>Who knows what happens here?' );
$fragment_parser->next_tag( 'H1' );
// …

I think internally this method would be used for things like set_inner_html that require a fragment parser.

@sirreal
Copy link
Member

sirreal commented Sep 13, 2024

See #7348

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants