Add method to get inner/outer HTML in WP_HTML_Tag_Processor #60046

bfintal · 2024-03-20T15:49:09Z

What problem does this address?

With WP_HTML_Tag_Processor, you can get an attribute or the tag name, but there is no way to get the inner HTML or outer HTML. The class is great for traversing HTML, and it would be great if it could be used as an alternative to regex for grabbing HTML content.

Scenario: right now I'm using the render_block to grab some contents of some <style>...</style> tags via regex.

What is your proposed solution?

Add get_inner_html and get_outer_html methods that return the inner and outer HTML at the position of the current "pointer".

If added, I should now be able to do:

function render_block( $html ) {
    $processor = new WP_HTML_Tag_Processor( $html );
    if ( $processor->next_tag( 'style' ) ) {
        $style = $processor->get_inner_html();
        // Do something with $style, not necessarily updating it
    }
    return $html;
}


nextgenthemes · 2024-03-20T21:57:01Z

Weirdly enough, the Interactivity API now has a class built on top of the Tag Processor with a method that, going by its description, seems to be used to extract the HTML. I have not looked deeply into this, but when I saw it, I thought: why not build a general-purpose method right into the Tag Processor? There might be reasons, and I think there is a plan for bringing more functionality into the HTML API.

get_content_between_balanced_template_tags

Not sure what balanced means.

gziolo · 2024-03-26T06:44:11Z

@dmsnell, can you provide the technical feedback?

dmsnell · 2024-03-26T11:24:41Z

Thanks for the inquiry @bfintal.

If you follow the broad roadmap for the HTML API, you will note that functions like inner_html are in the plans, but we're not entirely ready for those as we don't know what interface they need, exactly.

The Interactivity API is a kind of test-bed for this work, even though hopefully in the 6.6 release cycle the custom parser will be replaced with the HTML Processor.

"Balanced" is a common idea for matching tag content. The idea is that if we assume that an HTML document always has an opening and closing tag for each element, then we can parse with a simple stack. This works reasonably well in practice, but still fails in a number of common edge cases. For example, among the web's highest-ranked pages, many closing </p> tags also implicitly close opened formatting tags like <b> and <em>. The balanced method doesn't work here.
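To make that failure mode concrete, here is a minimal sketch of the stack-based "balanced" approach in plain PHP. The function name example_is_balanced and the regex tokenizer are assumptions made up for illustration; this is not how the Interactivity API or the HTML API is implemented, and the tokenizer deliberately ignores comments, void elements, and other details.

```php
<?php
/**
 * Naive "balanced" check: assumes every opening tag has a matching closing tag.
 * Hypothetical helper for illustration only; not part of WordPress.
 */
function example_is_balanced( string $html ): bool {
	$stack = array();

	// Crude tokenizer: finds `<tag ...>` and `</tag>` and nothing else.
	preg_match_all( '~<(/?)([a-z][a-z0-9]*)[^>]*>~i', $html, $matches, PREG_SET_ORDER );

	foreach ( $matches as $match ) {
		$is_closer = '/' === $match[1];
		$tag_name  = strtoupper( $match[2] );

		if ( ! $is_closer ) {
			$stack[] = $tag_name;
			continue;
		}

		// A closer must match the most recently opened tag.
		if ( array_pop( $stack ) !== $tag_name ) {
			return false;
		}
	}

	// Anything left on the stack was never closed.
	return 0 === count( $stack );
}

var_dump( example_is_balanced( '<p><b>bold</b></p>' ) ); // bool(true)
var_dump( example_is_balanced( '<p><b>bold</p>' ) );     // bool(false)
```

The second input is the kind of markup described above: the </p> arrives while <b> is still open, so the simple stack gives up even though a browser parses it without trouble.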

The HTML Processor incorporates the rules in the HTML5 specification so that nobody will need to worry about when an element is opened and closed. The funny thing is that its logic ends up being much simpler than all the over-simplified attempts:

while ( $processor->next_token() && $processor->still_open( $opening_tag ) ) {
	continue;
}

This aside, there still remain open questions about how to represent inner and outer HTML relating to escaping, decoding, and composition. I encourage people to explore the existing interfaces and to share feedback in #core-html-api, but please be warned against building structural parsers for production: it's almost impossible to know what is and isn't inner HTML without implementing the semantic rules of HTML5.

Scenario: right now I'm using the render_block to grab some contents of some <style>...</style> tags via regex.

Good news! In WordPress 6.5 this is even easier, because the introduction of the $processor->next_token() function makes it easier and safer to read the contents of a SCRIPT element. Both SCRIPT and STYLE (and TITLE and TEXTAREA) are special elements that can contain only plaintext; they cannot contain markup. That means if you find <img> inside one of them, a browser would treat it as the text <img> and not as a tag. To guard against accidentally treating those contents as HTML, the Tag Processor exposes $processor->get_modifiable_text() and properly decodes the contents (because some are supposed to decode HTML character references like &colon; while others aren't).

while ( $processor->next_tag( 'STYLE' ) ) {
	$contents = $processor->get_modifiable_text();
	analyze_style( $contents );
}
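Tying this back to the original render_block scenario, a read-only version might look like the sketch below. The helper name example_collect_style_contents is hypothetical, and the method_exists() and function_exists() guards are assumptions added so the snippet degrades to a no-op on WordPress versions before 6.5 or outside WordPress entirely.

```php
<?php
/**
 * Collects the text of every STYLE element in a chunk of HTML.
 * Hypothetical helper; the name is illustrative only.
 */
function example_collect_style_contents( string $html ): array {
	// get_modifiable_text() arrived in WordPress 6.5; bail out where it is missing.
	if ( ! method_exists( 'WP_HTML_Tag_Processor', 'get_modifiable_text' ) ) {
		return array();
	}

	$styles    = array();
	$processor = new WP_HTML_Tag_Processor( $html );
	while ( $processor->next_tag( 'STYLE' ) ) {
		$styles[] = $processor->get_modifiable_text();
	}

	return $styles;
}

// Read-only use inside a render_block filter: the block markup is returned unchanged.
if ( function_exists( 'add_filter' ) ) {
	add_filter(
		'render_block',
		function ( $block_content ) {
			$styles = example_collect_style_contents( (string) $block_content );
			// Inspect $styles here; nothing is written back to the markup.
			return $block_content;
		}
	);
}
```

Keeping the collection step in its own function means it can be called on any HTML string, not only from the filter.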

Unfortunately there's no support yet for modifying the modifiable text. If you want to do that, come join us in Slack and we can discuss how to do it, or link to a PR in your project and I'd be happy to review.

I'm going to close this issue because: we already plan on adding inner/outer HTML support, but not yet; and HTML API development is being tracked in the linked discussion and on Core Trac. Feel free to continue responding.

bfintal added the [Type] Enhancement label Mar 20, 2024

jordesign added the [Feature] Block API label Mar 20, 2024

gziolo added the [Feature] HTML API label Mar 26, 2024

gziolo added the Needs Decision label Mar 26, 2024

dmsnell closed this as completed Mar 26, 2024

dmsnell mentioned this issue Apr 2, 2024

HTML API: Roadmap #60397

Open

10 tasks
