HTML API: Allow additional fragment contexts. #7141

Open
wants to merge 14 commits into
base: trunk
23 changes: 20 additions & 3 deletions src/wp-includes/html-api/class-wp-html-processor.php
@@ -281,24 +281,41 @@ class WP_HTML_Processor extends WP_HTML_Tag_Processor {
*
* ## Current HTML Support
*
* - The only supported context is `<body>`, which is the default value.
* - The only supported document encoding is `UTF-8`, which is the default value.
*
* @todo Verify that creating a fragment in self-contained elements works.
*
* @since 6.4.0
* @since 6.6.0 Returns `static` instead of `self` so it can create subclass instances.
* @since 6.7.0 Can create fragment in any context.
*
* @param string $html Input HTML fragment to process.
* @param string $context Context element for the fragment, must be default of `<body>`.
* @param string $encoding Text encoding of the document; must be default of 'UTF-8'.
Review comment:

"must be default of ..."?
My internal autocomplete expected an "or" here. The two are probably equivalent, but "or" is what I expected to read.

Also, at least the `$context` description needs updating, since this change is about allowing contexts other than `<body>`.

Member Author reply:

thanks for noticing.

"must be default of `UTF-8`" and "must be default or `UTF-8`" are very different: only the default value has ever been allowed (and the default is `UTF-8`). some day we might open it up to other values, but this wording is there to communicate intentionally that this is a UTF-8-only interface at the moment.

* @return static|null The created processor if successful, otherwise null.
*/
public static function create_fragment( $html, $context = '<body>', $encoding = 'UTF-8' ) {
- if ( '<body>' !== $context || 'UTF-8' !== $encoding ) {
+ if ( 'UTF-8' !== $encoding ) {
return null;
}

$context_processor = new WP_HTML_Tag_Processor( $context );
Review comment:

Not sure how deeply you want to comment, but a comment here would help readers understand what this block is doing.

if ( ! $context_processor->next_token() || '#tag' !== $context_processor->get_token_type() ) {
return null;
}

$context_tag = $context_processor->get_tag();
$context_attributes = array();
foreach ( $context_processor->get_attribute_names_with_prefix( '' ) as $name ) {
Review comment:

Having an empty prefix here looks strange.
If it were my code, I'd either make the prefix optional (defaulting to empty) or add a `get_attribute_names()` method that does exactly the same as `get_attribute_names_with_prefix( '' )`. That might be redundant in a way, but to me it would read a little better.

Member Author reply:

thanks @apermo. if this looks strange, it's supposed to 🙃
when we added this method we wanted it to be a bit awkward, in the hopes of communicating that it involves additional costs. we wanted the default behavior to be looking for a subset of attributes (such as `get_attribute_names_with_prefix( 'data-' )`) so that the API itself guides people toward the safest, most performant usage.
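
To make the prefix trade-off concrete, here is a minimal sketch of prefix matching over a plain list of attribute names. `names_with_prefix()` is a hypothetical stand-in written for this illustration, not the WordPress implementation; it only shows why an empty prefix selects every attribute while `'data-'` selects a subset.

```php
<?php
/**
 * Hypothetical stand-in for prefix-based attribute lookup.
 * It works on a plain array of names rather than on a parsed tag.
 *
 * @param string[] $attribute_names Attribute names found on a tag.
 * @param string   $prefix          Prefix to match; '' matches every name.
 * @return string[] Lowercased names that start with the prefix.
 */
function names_with_prefix( array $attribute_names, string $prefix ): array {
	$needle  = strtolower( $prefix );
	$matches = array();

	foreach ( $attribute_names as $name ) {
		$lower = strtolower( $name );

		// Comparing zero bytes always succeeds, so '' selects everything.
		if ( 0 === strncmp( $lower, $needle, strlen( $needle ) ) ) {
			$matches[] = $lower;
		}
	}

	return $matches;
}

$names = array( 'class', 'data-id', 'DATA-Role', 'href' );

$data_only = names_with_prefix( $names, 'data-' ); // Narrow, cheaper lookup.
$all_names = names_with_prefix( $names, '' );      // Every attribute.

echo implode( ', ', $data_only ), PHP_EOL;
echo implode( ', ', $all_names ), PHP_EOL;
```

The empty-prefix call is the one that touches every attribute, which is the cost the deliberately awkward call signature is meant to signal.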

$context_attributes[ $name ] = $context_processor->get_attribute( $name );
}

if ( $context_processor->next_token() ) {
return null;
}

$processor = new static( $html, self::CONSTRUCTOR_UNLOCK_CODE );
- $processor->state->context_node = array( 'BODY', array() );
+ $processor->state->context_node = array( $context_tag, $context_attributes );
$processor->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_IN_BODY;
Review comment (@apermo, Aug 20, 2024):

If I understood your PR correctly, it's about allowing insertion into anything other than body. Or is "body" ambiguous in this case?

So content body vs. `<body>`?

Anyway, I'm uncertain whether this is intentional or whether you forgot to touch this line.

Member Author reply:

the fragment parser is synonymous with `node.innerHTML = newHtml` in JavaScript or with a DOM. in this case, `<body>` has been the default because it's a reasonably safe choice for existing code, which likely is getting only a small chunk of the actual site HTML and then trying to process it.

so if you knew you were inside a `<li>` you would create the fragment parser in the `<li>` context, and the parse would change based on that. I probably don't fully understand this, because it's hard for me to come up with situations that lead to different parses, but it comes into play inside SVG or MathML elements, and when resetting some internals (the insertion mode).

basically this is not something most people will need to use, but it will be used by `set_inner_html()` to ensure appropriate parsing.
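
As a rough illustration of the `innerHTML` analogy, the sketch below lists the tags found when a chunk of HTML is parsed as the contents of a given context element. It assumes WordPress is loaded and that `create_fragment()` accepts contexts other than `<body>`, as this PR proposes; `list_fragment_tags()` is a hypothetical helper, and it returns `null` when either assumption does not hold.

```php
<?php
/**
 * Hypothetical helper: lists the opening tags seen when $html is parsed
 * as the inner HTML of the $context element.
 *
 * @param string $html    HTML fragment to parse.
 * @param string $context Context element, e.g. '<li>'.
 * @return string[]|null Tag names, or null if the fragment parser is unavailable.
 */
function list_fragment_tags( string $html, string $context = '<body>' ): ?array {
	// Assumption: running inside WordPress. Outside of it there is nothing to show.
	if ( ! class_exists( 'WP_HTML_Processor' ) ) {
		return null;
	}

	// Assumption: non-BODY contexts are accepted (the behavior proposed here).
	$processor = WP_HTML_Processor::create_fragment( $html, $context );
	if ( null === $processor ) {
		return null;
	}

	$tags = array();
	while ( $processor->next_tag() ) {
		$tags[] = $processor->get_tag();
	}

	return $tags;
}

// Loosely comparable to `li.innerHTML = '<p>One</p><p>Two</p>'` in a browser.
$tags = list_fragment_tags( '<p>One</p><p>Two</p>', '<li>' );
var_dump( $tags );
```

Returning `null` instead of throwing mirrors how `create_fragment()` itself reports an unsupported context or encoding.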

$processor->state->encoding = $encoding;
$processor->state->encoding_confidence = 'certain';
114 changes: 114 additions & 0 deletions tests/phpunit/tests/html-api/wpHtmlProcessorFragmentParsing.php
@@ -0,0 +1,114 @@
<?php
/**
* Unit tests covering WP_HTML_Processor fragment parsing functionality.
*
* @package WordPress
* @subpackage HTML-API
*
* @since 6.7.0
*
* @group html-api
*
* @coversDefaultClass WP_HTML_Processor
*/
class Tests_HtmlApi_WpHtmlProcessorFragmentParsing extends WP_UnitTestCase {
/**
* Verifies that SCRIPT fragment parses behave as they should.
*
* @dataProvider data_script_fragments
*
* @param string $inner_html HTML to parse in SCRIPT fragment.
* @param string|null $expected_html Expected output of the parse, or `null` if unsupported.
*/
public function test_script_tag( string $inner_html, ?string $expected_html ) {
$processor = WP_HTML_Processor::create_fragment( $inner_html, '<script></script>' );
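// create_fragment() returns null on failure; avoid passing null to the type-hinted helper.
$normalized = null === $processor ? null : static::normalize_html( $processor );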

if ( isset( $expected_html ) ) {
$this->assertSame(
$expected_html,
$normalized,
'Failed to properly parse SCRIPT fragment.'
);
} else {
$this->assertNull(
$normalized,
"Should have bailed when parsing but didn't."
);
}
}

/**
* Data provider.
*
* @ticket 61576
*
* @return array[]
*/
public static function data_script_fragments() {
return array(
'Basic SCRIPT' => array( 'const x = 5 < y;', 'const x = 5 < y;' ),
'Text after SCRIPT' => array( 'const x = 5 < y;</script>test', null ),
'Tag after SCRIPT' => array( 'end</script><img>', null ),
'Double escape' => array( "<!--<script>\nconsole.log('</script>');\n-->\nconsole.log('<img>');", "<!--<script>\nconsole.log('</script>');\n-->\nconsole.log('<img>');" ),
);
}

/**
* Produces normalized HTML output given a processor as input, which has not
* yet started to proceed through its document.
*
* This can be used with a full or a fragment parser.
*
* @param WP_HTML_Processor $processor HTML Processor in READY state at the beginning of its input.
* @return string|null Normalized HTML from input processor.
*/
private static function normalize_html( WP_HTML_Processor $processor ): ?string {
$html = '';

while ( $processor->next_token() ) {
$token_name = $processor->get_token_name();
$token_type = $processor->get_token_type();
$is_closer = $processor->is_tag_closer();

switch ( $token_type ) {
case '#text':
$html .= $processor->get_modifiable_text();
break;

case '#tag':
if ( $is_closer ) {
$html .= "</{$token_name}>";
} else {
$names = $processor->get_attribute_names_with_prefix( '' );
if ( ! isset( $names ) ) {
$html .= "<{$token_name}>";
} else {
$html .= "<{$token_name}";
foreach ( $names as $name ) {
$value = $processor->get_attribute( $name );
if ( true === $value ) {
$html .= " {$name}";
} else {
$value = str_replace( '"', '&quot;', $value );
$html .= " {$name}=\"{$value}\"";
}
}
}

$text = $processor->get_modifiable_text();
if ( '' !== $text ) {
$html .= "{$text}</{$token_name}>";
}
}
break;
}
}

if ( null !== $processor->get_last_error() ) {
return null;
}

return $html;
}
}
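
A note on escaping attribute values when serializing, as `normalize_html()` does: PHP's three-argument `strtr()` translates single bytes one-to-one and ignores the extra characters in the longer argument, so it cannot expand `"` into the multi-character `&quot;`. The standalone sketch below (plain PHP, no WordPress required) compares it with `str_replace()` and the array form of `strtr()`.

```php
<?php
$value = 'say "hi"';

// Three-argument form: '"' maps to '&' only; 'quot;' is ignored.
$mapped = strtr( $value, '"', '&quot;' );

// Substring replacement: each '"' becomes the full '&quot;' reference.
$escaped = str_replace( '"', '&quot;', $value );

// Array form of strtr() also replaces whole substrings.
$escaped_too = strtr( $value, array( '"' => '&quot;' ) );

echo $mapped, PHP_EOL;      // say &hi&
echo $escaped, PHP_EOL;     // say &quot;hi&quot;
echo $escaped_too, PHP_EOL; // say &quot;hi&quot;
```

Either of the last two forms produces a value that is safe to wrap in double quotes.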