HTML API: Fix splitting single text node #5976

sirreal · 2024-01-30T17:28:45Z

Fix an issue where a `<` character always breaks a text node.

A `<` should break a text node only when it starts another node type, i.e. when it is followed by `!`, `/`, `?`, or an ASCII alpha character. In every other case it is plain text and the text node should continue.
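The rule in the description can be sketched as a standalone scan. This is only an illustration, not the code in the patch: `find_text_node_end()` is a hypothetical helper, it uses a strict ASCII letter check, and it treats a trailing `<` at the end of the input as text, which is a simplifying assumption.

```php
<?php
/**
 * Hypothetical helper (not part of the HTML API): returns the byte offset
 * at which the text node beginning at $start ends.
 */
function find_text_node_end( string $html, int $start = 0 ): int {
	$length = strlen( $html );
	$at     = $start;

	while ( $at < $length ) {
		$lt = strpos( $html, '<', $at );

		// No further "<", or a "<" that is the last byte: the text runs to the end.
		if ( false === $lt || $lt + 1 >= $length ) {
			return $length;
		}

		$next           = $html[ $lt + 1 ];
		$is_ascii_alpha = ( 'A' <= $next && $next <= 'Z' ) || ( 'a' <= $next && $next <= 'z' );

		// Only these followers can begin another node, so the text node ends here.
		if ( '!' === $next || '/' === $next || '?' === $next || $is_ascii_alpha ) {
			return $lt;
		}

		// Otherwise the "<" is plain text; keep scanning after it.
		$at = $lt + 1;
	}

	return $length;
}

$html = '1 < 2 and <b>bold</b>';
echo substr( $html, 0, find_text_node_end( $html ) ), "\n"; // prints "1 < 2 and "
```

With this rule the first `<` in `1 < 2` stays inside the text, and the node ends only at `<b>`.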

Trac ticket: https://core.trac.wordpress.org/ticket/60385

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

github-actions · 2024-01-30T17:48:18Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

- The Plugin and Theme Directories cannot be accessed within Playground.
- All changes will be lost when closing a tab with a Playground instance.
- All changes will be lost when refreshing the page.
- A fresh instance is created each time the link below is clicked.
- Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

sirreal · 2024-01-30T18:02:43Z

@dmsnell Please take a look at this.

sirreal · 2024-01-30T18:09:00Z

src/wp-includes/html-api/class-wp-html-tag-processor.php

+						'!' === $next_character ||
+						'/' === $next_character ||
+						'?' === $next_character ||
+						( 'A' <= $next_character && $next_character <= 'z' );


I don't love this. I think it's doing character code comparison, but I couldn't find conclusive docs.

I considered ctype_alpha, but apparently that's locale-dependent, so it's best avoided. We want ASCII alpha.

We could also apply ord and compare (65 <= ord( $char ) && ord( $char ) <= 122) and comment the magic numbers.

Based on the PHP docs for the comparison operators, I think this is valid as-is, as long as we're comparing strings to strings.
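A quick way to check the comparison question is to run both forms over the ASCII range. The sketch below is illustrative only: `in_patch_range()` and `is_ascii_alpha()` are hypothetical names, not functions from the patch. It relies on PHP comparing two non-numeric strings byte by byte, so single-character comparisons follow ASCII order. It also shows that the range `'A'` to `'z'` (65 to 122) covers six characters that are not letters. Whether that matters depends on how later parsing treats them, which this sketch does not cover.

```php
<?php
// Mirrors the range check as written in the diff.
function in_patch_range( string $c ): bool {
	return 'A' <= $c && $c <= 'z';
}

// Strict ASCII letter check using ord(), with the magic numbers named:
// 65-90 are "A"-"Z", 97-122 are "a"-"z".
function is_ascii_alpha( string $c ): bool {
	$code = ord( $c );
	return ( 65 <= $code && $code <= 90 ) || ( 97 <= $code && $code <= 122 );
}

// Collect every ASCII character the range check accepts that is not a letter.
$extra = array();
for ( $code = 0; $code < 128; $code++ ) {
	$c = chr( $code );
	if ( in_patch_range( $c ) && ! is_ascii_alpha( $c ) ) {
		$extra[] = $c;
	}
}

echo implode( ' ', $extra ), "\n"; // prints: [ \ ] ^ _ `
```

The string comparison and the `ord()` comparison agree on every ASCII letter. They differ only on the characters between `Z` and `a`.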

dmsnell

Good catch @sirreal. While this shouldn't have caused any problems in an application, it's more correct combining these. I wanted to re-examine the rules and fall back to the existing "failure to find a tag" but I think having this up-front has its merit.

Did you examine trapping this inside of next_tag() in the case that parse_next_tag() returns false? I'm a bit surprised that this wasn't behaving differently, but I guess that loop runs to the end.

My one big remaining question is whether we expect this to be a normal case or a rare case. That maybe would suggest whether to eagerly re-compute the next character, as we do in this patch, or put it at the end of the loop in parse_next_tag(). We can adjust as we see fit.

dmsnell · 2024-01-30T22:09:44Z

Merged in [57489]
0b800d7

sirreal · 2024-01-31T09:06:23Z

> I wanted to re-examine the rules and fall back to the existing "failure to find a tag" but I think having this up-front has its merit.

My first thought was that we should find the text node once we've found the start of the next node, but that seemed like a much larger refactor. What I don't like about this approach is that we're effectively identifying the start of the next node in multiple places.

I think it's fine for now: this is a smaller change with the 6.5 release on the horizon, and as test coverage increases we can explore larger changes before the next release.

> Did you examine trapping this inside of next_tag() in the case that parse_next_tag() returns false?

I did not.

> My one big remaining question is whether we expect this to be a normal case or a rare case.

I don't have data to answer that. My gut suggests this is unusual, but I have no idea.

> That maybe would suggest whether to eagerly re-compute the next character, as we do in this patch, or put it at the end of the loop in parse_next_tag(). We can adjust as we see fit.

I think you're describing the approach I'd like, but again I think that would be a larger refactor and perhaps best left for the next release cycle.

Updates from WordPress/wordpress-develop:

- From: WordPress/wordpress-develop@54a09a7
- To: WordPress/wordpress-develop@7a71339

- Coding style changes.
- WordPress/wordpress-develop#5762 Adds support for the "any other tag" sections in the HTML Processor.
- WordPress/wordpress-develop#5539 Adds support for list elements in the HTML Processor.
- WordPress/wordpress-develop#5897 Adds support for HR elements in the HTML Processor.
- WordPress/wordpress-develop#5895 Adds support for the AREA, BR, EMBED, KEYGEN, and WBR elements in the HTML Processor.
- WordPress/wordpress-develop#5903 Adds support for the PRE and LISTING elements in the HTML Processor.
- WordPress/wordpress-develop#5913 Updates "all other tags" support in HTML Processor and updates list of void elements.
- WordPress/wordpress-develop#5906 Adds support for the PARAM, SOURCE, and TRACK elements in the HTML Processor.
- WordPress/wordpress-develop#5907 Adds support for the INPUT element in the HTML Processor
- WordPress/wordpress-develop#5683 Provides mechanism to scan all tokens in an HTML document in the Tag Processor.
- WordPress/wordpress-develop#5976 Avoids splitting text nodes on "<" character.
- WordPress/wordpress-develop#5992 Only recognize true CDATA-lookalike nodes.
- WordPress/wordpress-develop#5975 Prevent void tag nesting when calling `next_token()`
- WordPress/wordpress-develop#6021 Reset parser state after seeking.
- https://core.trac.wordpress.org/changeset/57528 Fix typo in setting token flag.
- WordPress/wordpress-develop#6041 Ensure consecutive text is all joined into one text node.

The PHP files in the compatibility layer are merged and maintained in the Core repo and all changes or updates need to happen first in Core and then be brought over to Gutenberg as built files.

Co-authored-by: sergeybiryukov <sergeybiryukov@git.wordpress.org>
Co-authored-by: sirreal <jonsurrell@git.wordpress.org>
Co-authored-by: dmsnell <dmsnell@git.wordpress.org>

Add failing test

74651ec

sirreal added 2 commits January 30, 2024 18:58

Propose fix

8116756

Lints

3cda07b

sirreal marked this pull request as ready for review January 30, 2024 18:02

A yoda condition is satisfied

ef25fff

sirreal commented Jan 30, 2024

dmsnell approved these changes Jan 30, 2024

Add short description for test behavior.

986cc39

dmsnell closed this Jan 30, 2024

dmsnell deleted the html-api/fix-breaking-adjacent-text-nodes branch January 30, 2024 22:17

sirreal mentioned this pull request Jan 31, 2024

HTML API: Add test suite from html5lib #5794

Closed

3 tasks

dmsnell mentioned this pull request Feb 6, 2024

HTML API: Backport updates from Core WordPress/gutenberg#58107

Merged
