Add method to get inner/outer HTML in WP_HTML_Tag_Processor #60046

Closed
bfintal opened this issue Mar 20, 2024 · 3 comments
Labels
[Feature] Block API API that allows to express the block paradigm. [Feature] HTML API An API for updating HTML attributes in markup Needs Decision Needs a decision to be actionable or relevant [Type] Enhancement A suggestion for improvement.

Comments

@bfintal
Contributor

bfintal commented Mar 20, 2024

What problem does this address?

With the WP_HTML_Tag_Processor, you can get an attribute or the tag name, but there is no way to get the innerHTML or outerHTML. The class is great for traversing HTML, and it would be great if it could be used as an alternative to regex for grabbing HTML content.

Scenario: right now I'm using the render_block to grab some contents of some <style>...</style> tags via regex.

What is your proposed solution?

Add methods get_inner_html and get_outer_html that return the inner and outer HTML of the element the current "pointer" is at.

If added, I should now be able to do:

function render_block( $html ) {
    $processor = new WP_HTML_Tag_Processor( $html );
    if ( $processor->next_tag( 'style' ) ) {
        // get_inner_html() is the proposed method; it does not exist yet.
        $style = $processor->get_inner_html();
        // Do something with $style, not necessarily updating it.
    }
    return $html;
}
@bfintal bfintal added the [Type] Enhancement A suggestion for improvement. label Mar 20, 2024
@nextgenthemes

nextgenthemes commented Mar 20, 2024

Weirdly enough, the Interactivity API now has a class built on top of the Tag Processor with a method that, going by its description, seems to be used to extract the HTML. I have not looked deeply into this, but when I saw it, I thought: why not build a general-purpose method right into the Tag Processor? There might be reasons, and I think there is a plan for bringing more functionality into the HTML API.

get_content_between_balanced_template_tags

Not sure what balanced means.

@jordesign jordesign added the [Feature] Block API API that allows to express the block paradigm. label Mar 20, 2024
@gziolo gziolo added the [Feature] HTML API An API for updating HTML attributes in markup label Mar 26, 2024
@gziolo
Member

gziolo commented Mar 26, 2024

@dmsnell, can you provide the technical feedback?

@gziolo gziolo added the Needs Decision Needs a decision to be actionable or relevant label Mar 26, 2024
@dmsnell
Member

dmsnell commented Mar 26, 2024

Thanks for the inquiry @bfintal.

If you follow the broad roadmap for the HTML API, you will note that functions like inner_html are in the plans, but we're not entirely ready for those as we don't know what interface they need, exactly.

The Interactivity API is a kind of test-bed for this work, even though hopefully in the 6.6 release cycle the custom parser will be replaced with the HTML Processor.

"Balanced" is a common idea for matching tag content. The idea is that if we assume that an HTML document always has an opening and closing tag for each element, then we can parse with a simple stack. This works reasonably well in practice, but still fails in a number of common edge cases. For example, among the web's highest-ranked pages, many closing </p> tags also implicitly close opened formatting tags like <b> and <em>. The balanced method doesn't work here.
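To make the failure mode concrete, here is a minimal sketch of the "balanced" approach in plain PHP. The helper name `naive_balanced_inner_html` and its regex are hypothetical and written only for illustration; this is not the Interactivity API's code and not part of WordPress. It pushes every opening tag on a stack and expects each closing tag to match the top of that stack.

```php
<?php
/**
 * Naive "balanced" scanner (hypothetical, for illustration only).
 * Assumes every opening tag has an explicit, correctly nested closing tag.
 * Returns the inner HTML of the first $tag element, or null when tags do not balance.
 */
function naive_balanced_inner_html( string $html, string $tag ): ?string {
	$tag = strtolower( $tag );

	// Matches simple opening and closing tags only. It ignores comments,
	// void elements, and ">" inside attribute values.
	preg_match_all(
		'~<(/?)([a-z][a-z0-9]*)[^>]*>~i',
		$html,
		$matches,
		PREG_SET_ORDER | PREG_OFFSET_CAPTURE
	);

	$stack = array();
	$start = null; // Byte offset where the inner HTML begins.
	$depth = 0;    // Stack depth right after the target element was opened.

	foreach ( $matches as $m ) {
		$is_closer = '/' === $m[1][0];
		$name      = strtolower( $m[2][0] );
		$offset    = $m[0][1];
		$length    = strlen( $m[0][0] );

		if ( ! $is_closer ) {
			$stack[] = $name;
			if ( null === $start && $name === $tag ) {
				$start = $offset + $length;
				$depth = count( $stack );
			}
			continue;
		}

		// A closing tag must match the most recently opened element.
		if ( array_pop( $stack ) !== $name ) {
			return null;
		}

		if ( null !== $start && $name === $tag && count( $stack ) === $depth - 1 ) {
			return substr( $html, $start, $offset - $start );
		}
	}

	return null;
}

// Balanced input: the stack approach works.
var_dump( naive_balanced_inner_html( '<div><p>Hello <b>world</b></p></div>', 'p' ) );

// </p> arrives while <b> is still open: the stack approach gives up.
var_dump( naive_balanced_inner_html( '<p>Some <b>bold text</p><p>next</p>', 'p' ) );
```

The second input is the case described above: a browser still ends the paragraph at `</p>` even though `<b>` was never explicitly closed, but the stack sees a mismatch and has no rule for recovering from it.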

The HTML Processor incorporates the rules in the HTML5 specification so that nobody will need to worry about when an element is opened and closed. The funny thing is that its logic ends up being much simpler than all the over-simplified attempts:

while ( $processor->next_token() && $processor->still_open( $opening_tag ) ) {
	continue;
}

This aside, there still remain open questions about how to represent inner and outer HTML relating to escaping, decoding, and composition. I encourage people to explore the existing interfaces and to share feedback in #core-html-api, but please be warned against building structural parsers for production: it's almost impossible to know what is and isn't inner HTML without implementing the semantic rules of HTML5.

Scenario: right now I'm using the render_block to grab some contents of some <style>...</style> tags via regex.

Good news! In WordPress 6.5 this is even easier, because the introduction of the $processor->next_token() function makes it easier and safer to read the contents of a SCRIPT element. SCRIPT and STYLE (and also TITLE and TEXTAREA) are special elements in that they contain only plaintext; they cannot contain markup. That means if you find <img> inside one of them, a browser would treat that as the text <img> and display it as text, not as a tag. To guard against accidentally treating those contents as HTML, the Tag Processor exposes $processor->get_modifiable_text(), which properly decodes the contents (some of these elements are supposed to decode HTML character references like &colon; while others aren't).

while ( $processor->next_tag( 'STYLE' ) ) {
	$contents = $processor->get_modifiable_text();
	analyze_style( $contents );
}

Unfortunately there's no support yet for modifying the modifiable text. If you want to do that, come join us in Slack and we can discuss how to do it, or link to a PR in your project and I'd be happy to review.

I'm going to close this issue because we already plan on adding inner/outer HTML support, just not yet, and HTML API development is being tracked in the linked discussion and on Core Trac. Feel free to continue responding.

@dmsnell dmsnell closed this as completed Mar 26, 2024
@dmsnell dmsnell mentioned this issue Apr 2, 2024
10 tasks

5 participants