HTML API: Allow additional fragment contexts. #7141

dmsnell · 2024-08-05T22:39:07Z

Status

~~Add unit tests covering significant contexts, e.g. SCRIPT, TEXTAREA, SVG.~~
- It shouldn't be possible to create fragment parsers on void or self-contained elements. There's no content inside of them to parse. For self-contained elements it's appropriate to rely on the eventual set_inner_html() which will be an alias in those cases for set_inner_text().
Review all documentation looking for places stating that <body> is the only supported context.

Description

Previously, the fragment parser in WP_HTML_Processor has only allowed creating a fragment with the <body> context. In this patch, any context node is allowed.

Previously, the fragment parser in WP_HTML_Processor has only allowed creating a fragment with the `<body>` context. In this patch, any context node is allowed.

github-actions · 2024-08-05T22:55:18Z

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

github-actions · 2024-08-05T23:21:44Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, apermo, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

dmsnell · 2024-08-06T05:43:25Z

@sirreal it looks like we may have a lot of work to do before we can open this up to allowing self-contained elements as the fragment context. there are multiple challenges:

we have to enter within a certain tokenizer state, but that's not how our parser is designed.
the DOM is able to store things like the text </textarea> inside a TEXTAREA node, but HTML is not able to do this. we have to determine what to do in these cases.
if the closing tag for the self-contained element isn't provided, we cannot parse it to figure out which context node it is. e.g. <script> bails with the tag processor.

sirreal · 2024-08-12T12:39:07Z

The fragment parser takes an Element. An HTML string seems like a poor representation for that 🙂

I wonder if the fragment parser could take a token (to create a fragment from an existing document) or some other representation that's not just a string in order to create fragments without needing to create a full document first.

The following steps form the HTML fragment parsing algorithm. The algorithm takes as input an Element node, referred to as the context element, which gives the context for the parser, input, a string to parse, and an optional boolean allowDeclarativeShadowRoots (default false). It returns a list of zero or more nodes.

apermo

As you asked me, here are my thoughts.

One additional one. Do you want to add a blacklist of forbidden subnodes? Or do you think this is the responsibility of the developer?

Like preventing to add <html>, <body> to body, or pretty much everything to title.

Not sure wether this would be the right place to do so, but I had that thought and wanted to share it.

apermo · 2024-08-20T07:03:33Z

src/wp-includes/html-api/class-wp-html-processor.php

+			return null;
+		}
+
+		$context_processor = new WP_HTML_Tag_Processor( $context );


Not sure how deep you want to comment, but a comment here could help understanding what you were up to.

apermo · 2024-08-20T07:06:18Z

src/wp-includes/html-api/class-wp-html-processor.php

+
+		$context_tag        = $context_processor->get_tag();
+		$context_attributes = array();
+		foreach ( $context_processor->get_attribute_names_with_prefix( '' ) as $name ) {


Having an empty prefix here looks strange.
If it was my code I'd either have the prefix optional/default to empty or add a get_attributes_names() which would do exactly the same as get_attributes_names_with_prefix( '' ), might be redundant in a way, but for me it would look a little more readable.

thanks @apermo. if this looks strange, it's supposed to 🙃
when we added this method we wanted it to be a bit awkward in the hopes of communicating that it involves additional costs. we wanted the default behavior to be looking for a subset of attributes (such as get_attribute_names_with_prefix( 'data-' )) so that the API itself guides people to learn in the safest most performant manner.

apermo · 2024-08-20T07:08:54Z

src/wp-includes/html-api/class-wp-html-processor.php

 	 * @param string $context  Context element for the fragment, must be default of `<body>`.
 	 * @param string $encoding Text encoding of the document; must be default of 'UTF-8'.


"must be default of ..."?
My internal autocomplete expected an "or" here. Which likely is equivalent, but I was expecting it.

And I think at least for $context you need to update it, since your change was about allowing other than body.

thanks for noticing.

must be default of UTF-8 and must be default or UTF-8 are very different, whereas only the default value has been allowed (and the default is UTF-8). some day we might open it up to other values, but this is there to communicate intentionally that this is a UTF-8-only interface at the moment.

apermo · 2024-08-20T07:14:34Z

src/wp-includes/html-api/class-wp-html-processor.php

 			return null;
 		}

 		$processor                             = new static( $html, self::CONSTRUCTOR_UNLOCK_CODE );
-		$processor->state->context_node        = array( 'BODY', array() );
+		$processor->state->context_node        = array( $context_tag, $context_attributes );
 		$processor->state->insertion_mode      = WP_HTML_Processor_State::INSERTION_MODE_IN_BODY;


If I understood your PR correctly, it's about allowing insertion to anything other than body, or is body in this case ambivalent?

So content body vs <body>?

Anyways, I'm uncertain wether this is intentional or if you forgot to touch this.

the fragment parser is synonymous with node.innerHTML = newHtml in JavaScript or with a DOM. in this case, <body> has been the default as a reasonably safe default for existing code, which likely is getting only a small chunk of the actual site HTML and then trying to process it.

so if you knew you were inside a <li> you would create the fragment parser in the <li> context and then the parse would change based on that. I probably don't fully understand this, because it's hard for me to come up with situations that lead to different parses, but it comes into play when inside SVGs or MathML elements, and when resetting some internals (the insertion mode).

basically this is not something most people will need to use, but it will be used by set_inner_html() to ensure appropriate parsing.

This cleans up the actual parsing and makes it easier to accomodate self-contained context elements.

dmsnell · 2024-08-21T02:40:19Z

@sirreal: upon further review I feel like this is not appropriate for self-contained and void elements, so I am rejecting them. there's never a reason, I think, to create a fragment parser on an element which can contain no child nodes (other than text, potentially).

One case I was specifically thinking through is a <script> context. The idea is simple: parse the inner HTML of a SCRIPT element. The problem is when </script> appears and there is more content. In the DOM you can actually create a text node containing </script> but this is not possible in the HTML. The spec even notes situations like this where the DOM can create trees that cannot be expressed in HTML.

So I think that long-term the only use for the fragment parser outside of the usual is when setting inner HTML, and in that case, it seems more fitting for that function to call out something special for elements like SCRIPT so that it falls into the equivalent of ->skip_script_data().

Apart from this I don't like exposing a token as the context node because that's cumbersome for people to write, and I don't expect people to want to go through the hassle of what that implies. Our tokens don't even contain attribute references, so this is not currently workable.

However, it does potentially solve some issues around namespacing. There's no way in this function's argument to specify a context node in the svg or math namespace.

sirreal · 2024-08-22T13:55:22Z

I'll share some context from a conversation.

As mentioned, it likely doesn't make sense to create a fragment parser in order to modify these "self-contained" tags and better interfaces can be provided.

The specification for fragment parsers are always created from another document. This should be easy to do in the HTML API. However, most of the time WordPress is inspecting or modifying HTML snippets without considering their context. Rarely are full HTML documents considered. In order to continue supporting this behavior, we need to be able to create fragments without a parent document. This is where most of the difficulty arises. Most of the time, <body> provides a suitable context, but there could be other cases like <head>. Again, this seems relatively straightforward to support. However, it gets more complicated with things like <svg> as a context element, or especially <rect>, or even trickier a script tag in the SVG namespace!

The context element will be used to establish the correct insertion mode and the context element's namespace and possibly attributes determine how tokens should be handled (using rules for foreign content or one of the HTML insertion modes).

It may make sense to maintain this interface as-is. <body> is likely a good context for most use-cases but in order to properly support fragments some advanced interface likely needs to support:

Passing a context element along with its tag name, namespace, and attributes.
The context element's document determines the quirks mode of the fragment parser.

sirreal · 2024-08-26T19:22:54Z

in order to properly support fragments some advanced interface likely needs to support…

I had a thought that may be helpful.

I'd like to add an instance method to the HTML processor like create_fragment_parser_at_node( string $html_fragment ). This method would return a new HTML processor, created as a fragment parser with the context element created from the current tag opener (or null). This would allow use to set the document compat type and the context element and its attributes correctly according to the specification.

Once we have this method, advanced fragment parser creation could be handled by that method:

$full_parser = WP_HTML_Processor::create_full_parser( '<!DOCTYPE html><math><annotation-xml encoding="text/html">' );
$full_parser->next_tag( 'ANNOTATION-XML' );
$fragment_parser = $full_parser->create_fragment_parser_at_node( `<h1>Who knows what happens here?' );
$fragment_parser->next_tag( 'H1' );
// …

I think internally this method would be used for things like set_inner_html that require a fragment parser.

sirreal · 2024-09-13T11:17:37Z

See #7348

dmsnell mentioned this pull request Aug 5, 2024

HTML API: Use full parser for html5lib tests #7117

Closed

dmsnell added 2 commits August 5, 2024 15:41

HTML API: Allow any fragment context.

704ab8e

Previously, the fragment parser in WP_HTML_Processor has only allowed creating a fragment with the `<body>` context. In this patch, any context node is allowed.

Update docs

691d39e

dmsnell force-pushed the html-api/allow-any-fragment-context branch from 41dc4aa to 691d39e Compare August 5, 2024 22:41

dmsnell marked this pull request as ready for review August 5, 2024 23:21

dmsnell added 2 commits August 5, 2024 22:33

Start adding tests for fragment parser

7005b6c

Add basic unit tests for SCRIPT fragment parsing.

101d345

WPCS

5583818

dmsnell mentioned this pull request Aug 7, 2024

HTML API: Plans for 6.7 WordPress/gutenberg#60396

Open

19 tasks

sirreal mentioned this pull request Aug 13, 2024

HTML API: CSS class name methods should behave according to quirks mode #7169

Closed

apermo reviewed Aug 20, 2024

View reviewed changes

dmsnell added 9 commits August 20, 2024 09:41

Merge branch 'trunk' into html-api/allow-any-fragment-context

4291703

Move context parsing into a new private static method.

d47eadd

This cleans up the actual parsing and makes it easier to accomodate self-contained context elements.

Reject creating a fragment parser for self-contained elements.

d146461

Forbit void elements.

a175677

Add TODO about foreign content.

77f1a38

Update unit tests for invalid contexts.

640f458

Remove normalization function from tests

c3f1a81

Testing: Print stuff for test failures

40f4967

Fix tests, !== ! == ===

4c652ad

dmsnell force-pushed the html-api/allow-any-fragment-context branch from 97288b3 to 4c652ad Compare August 21, 2024 01:41

dmsnell changed the title ~~HTML API: Allow any fragment context.~~ HTML API: Allow additional fragment contexts. Aug 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTML API: Allow additional fragment contexts. #7141

HTML API: Allow additional fragment contexts. #7141

dmsnell commented Aug 5, 2024 •

edited

Loading

github-actions bot commented Aug 5, 2024

github-actions bot commented Aug 5, 2024 •

edited

Loading

dmsnell commented Aug 6, 2024 •

edited

Loading

sirreal commented Aug 12, 2024

apermo left a comment

apermo Aug 20, 2024

apermo Aug 20, 2024

dmsnell Aug 20, 2024

apermo Aug 20, 2024

dmsnell Aug 20, 2024

apermo Aug 20, 2024 •

edited

Loading

dmsnell Aug 20, 2024

dmsnell commented Aug 21, 2024

sirreal commented Aug 22, 2024

sirreal commented Aug 26, 2024 •

edited

Loading

sirreal commented Sep 13, 2024 •

edited

Loading

		* @param string $context Context element for the fragment, must be default of `<body>`.
		* @param string $encoding Text encoding of the document; must be default of 'UTF-8'.

HTML API: Allow additional fragment contexts. #7141

Are you sure you want to change the base?

HTML API: Allow additional fragment contexts. #7141

Conversation

dmsnell commented Aug 5, 2024 • edited Loading

Status

Description

github-actions bot commented Aug 5, 2024

Test using WordPress Playground

Some things to be aware of

github-actions bot commented Aug 5, 2024 • edited Loading

dmsnell commented Aug 6, 2024 • edited Loading

sirreal commented Aug 12, 2024

apermo left a comment

Choose a reason for hiding this comment

apermo Aug 20, 2024

Choose a reason for hiding this comment

apermo Aug 20, 2024

Choose a reason for hiding this comment

dmsnell Aug 20, 2024

Choose a reason for hiding this comment

apermo Aug 20, 2024

Choose a reason for hiding this comment

dmsnell Aug 20, 2024

Choose a reason for hiding this comment

apermo Aug 20, 2024 • edited Loading

Choose a reason for hiding this comment

dmsnell Aug 20, 2024

Choose a reason for hiding this comment

dmsnell commented Aug 21, 2024

sirreal commented Aug 22, 2024

sirreal commented Aug 26, 2024 • edited Loading

sirreal commented Sep 13, 2024 • edited Loading

dmsnell commented Aug 5, 2024 •

edited

Loading

github-actions bot commented Aug 5, 2024 •

edited

Loading

dmsnell commented Aug 6, 2024 •

edited

Loading

apermo Aug 20, 2024 •

edited

Loading

sirreal commented Aug 26, 2024 •

edited

Loading

sirreal commented Sep 13, 2024 •

edited

Loading