Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML API: Complete missing tags in the IN BODY insertion mode. #6972

Closed

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Jul 5, 2024

Trac ticket: Core-61576

Summary

  • There are some failing tests that need to pass.
  • This needs a full and careful audit.
  • Full docblock/example review (probably nothing more is needed since everything should be internal or already-documented)

Description

As part of work to add more spec support to the HTML API, this patch adds support for the remaining missing tags in the IN BODY insertion mode. Not all of the added tags are supported, because in some cases they reset the insertion mode and are reprocessed where they will be rejected.

Included in these changes are some shuffling of the tag rules in IN BODY. This is to make it easier to compare the final code against the HTML specification, so that they appear in the same order as they exist within the spec.

Testing

This change unlocks 54 new test assertions that have previously been skipped.

html5lib tests

- Tests: 614, Assertions: 191, Skipped: 423.
+ Tests: 610, Assertions: 275, Skipped: 335.

Copy link

github-actions bot commented Jul 5, 2024

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

? ( WP_HTML_Stack_Event::POP === $this->current_element->operation && '#tag' === $this->get_token_type() )
: parent::is_tag_closer();

// A closing BR tag is treated as if it were a opening tag.
return $is_closer_token && 'BR' !== $this->get_tag();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this belong in the Tag Processor? I don't know. It in some ways is more accurate there, but then it's not possible to distinguish an opening from closing BR tag. Probably it should move into the Tag Processor since it's never possible in HTML to have a closing BR tag.

Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving notes from a partial review.

*
* @var string
*/
const QUIRKS_MODE = 'quirks-mode';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. There are 3 document modes:

  • "no-quirks"
  • "quirks"
  • "limited-quirks"

We're missing "limited-quirks".

However, the compatMode of a document is "BackCompat" if the document is in quirks mode or "CSS1Compat" if the document is in "no-quirks" or "limited-quirks" modes.

Do you know whether we ignore the difference between no- and limited- quirks modes? Is the plan to add the additional mode when and if it becomes necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far I haven't seen any implications for limited-quirks mode. Most language reads like "if document is in quirks mode".

I'll double-check, but I think limited quirks comes specifically in the fragment case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after another scan through the spec, it does seem that limited-quirks mode behaves just like no-quirks mode, and the only impact this has is on case-sensitivity in matching IDs and class names.

I don't see a reason to differentiate them, especially since browsers don't. the name CSS1Compat makes more sense to me now too, since it's really about how CSS selectors behave

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm satisfied 👌

*
* This parser does not currently support this behavior.
*
* @todo The spec is unclear on whether this should be ignored. Should it?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems clear that the only thing to do with this is to add the attributes that aren't already present. Is your question more about whether we should pause here or skip it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's about pausing or skipping. I think the only acceptable answer is to skip, but I don't want to assume that without questioning it. if we don't skip it, we'll reach a new BODY opening without a closer, I think.

Comment on lines 104 to 129
/**
* Returns the name of the node at the nth position on the stack
* of open elements, or `null` if no such position exists.
*
* Note that this uses a 1-based index, which represents the
* "nth item" on the stack, counting from the top, where the
* top-most element is the 1st, the second is the 2nd, etc...
*
* @todo Skip over "marker" entries on the stack, if appropriate.
*
* @param int $nth Retrieve the nth item on the stack, with 1 being
* the top element, 2 being the second, etc...
* @return string|null Name of the node on the stack at the given location,
* or `null` if the location isn't on the stack.
*/
public function at( int $nth ) {
foreach ( $this->walk_down() as $item ) {
if ( 0 === --$nth ) {
return $item->node_name;
}
}

return null;
}
Copy link
Member

@sirreal sirreal Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice method, I'd like to be able to use it in both directions by providing a negative index to walk_up and a positive index to walk_down.

The negative index would make this case much easier:

$current_node = $this->state->stack_of_open_elements->current_node();
if ( $current_node && 'OPTION' === $current_node->node_name ) {
foreach ( $this->state->stack_of_open_elements->walk_up( $current_node ) as $parent ) {
break;
}
if ( $parent && 'OPTGROUP' === $parent->node_name ) {
$this->state->stack_of_open_elements->pop();
}
}

if (
  $this->state->stack_of_open_elements->current_node_is( 'OPTION' ) &&
  'OPTGROUP' === $this->state->stack_of_open_elements->at( -2 )
) {
  // pop
}

This is how Array.prototype.at works in JS.

Suggested change
/**
* Returns the name of the node at the nth position on the stack
* of open elements, or `null` if no such position exists.
*
* Note that this uses a 1-based index, which represents the
* "nth item" on the stack, counting from the top, where the
* top-most element is the 1st, the second is the 2nd, etc...
*
* @todo Skip over "marker" entries on the stack, if appropriate.
*
* @param int $nth Retrieve the nth item on the stack, with 1 being
* the top element, 2 being the second, etc...
* @return string|null Name of the node on the stack at the given location,
* or `null` if the location isn't on the stack.
*/
public function at( int $nth ) {
foreach ( $this->walk_down() as $item ) {
if ( 0 === --$nth ) {
return $item->node_name;
}
}
return null;
}
/**
* Returns the name of the node at the nth position on the stack
* of open elements, or `null` if no such position exists.
*
* Note that this uses a 1-based index, which represents the
* "nth item" on the stack, counting from the top, where the
* top-most element is the 1st, the second is the 2nd, etc...
*
* A negative index can be used to access the element beginning
* at the bottom of the stack, where the bottom-most element is
* `at( -1 )`.
*
* @todo Skip over "marker" entries on the stack, if appropriate.
*
* @since 6.7.0
*
* @param int $nth Retrieve the nth item on the stack, with 1 being
* the top element, 2 being the second, etc...
* If a negative index is provided, the lookup will
* start from the bottom of the stack.
* @return string|null Name of the node on the stack at the given location,
* or `null` if the location isn't on the stack.
*/
public function at( int $nth ) {
if ( 0 === $nth ) {
return null; // _doing_it_wrong here?
}
if ( 0 > $nth ) {
foreach ( $this->walk_up() as $item ) {
if ( 0 === ++$nth ) {
return $item->node_name;
}
}
return null;
}
foreach ( $this->walk_down() as $item ) {
if ( 0 === --$nth ) {
return $item->node_name;
}
}
return null;
}

I also added handling for $nth === 0 which seems good. There shouldn't be risk of an infinite loop, but there's no reason to do any iteration with 0.


Aside observation

I spent a bit of time reviewing some language which sounded off to me. It seems like the HTML standard inverts conventional stack terminology. The stack of open elements is a LIFO stack, but it starts at the top and pushes to and pops from the bottom, so the terminology used in the HTML API aligns with that:

The html node, however it is created, is the topmost node of the stack. It only gets popped off the stack when the parser finishes.

The current node is the bottommost node in this stack of open elements.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do want to be careful about conflating the language from other programming languages and their array data structures and the stack in the HTML spec. maybe at() is the wrong name. I originally added this before when I needed to implement another behavior.

perhaps we call it stack_position( $nth )

I like your idea for negative numbers, though maybe we could make it even more explicit.

'OPTGROUP' === $this->state->stack_of_open_elements->at_position( 'from-bottom', 2 );

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true this could be confusing to use a single method and negative or positive indexes to move from either end, it may not be clear which end is which.

Instead of a single method with some better-named flag, let's match the walk_up (start at bottom and go towards top) and walk_down (start at top and go towards bottom) which already exist and match the specification stack:

'DIV' === $this->state->stack_of_open_elements->nth_from_bottom( 1 ); // analogous to current node on the stack
'HTML' === $this->state->stack_of_open_elements->nth_from_top( 1 ); // always true

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great, but I think it would be best on its own, so maybe we can follow-up with a change that modifies things in a unified commit. with the other major open PRs, perhaps we can address this once we are basically at full support, or at least until the open PRs are closed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scratch that. I was playing with it and I'm going to incorporate it here. let's see how we go. there's a part of me that doesn't like adding more index-based methods (as I write it out), but given the spec uses this language, that's fine I guess.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still feels odd, so I'm going to leave this out for now and ask you your thoughts. For most cases of walk_up() and walk_down() we're referencing another node and not an index, which seems far less likely to be broken by updates to the code. For example, setting $parent = $->nth_from_bottom(2) depends on the current node being at position nth = 1. With ->walk_up( $node ) as $parent we don't have that same fragility by exporting the internals of the stack to the HTML Processor.

I'm open to this, but the gains didn't feel as big as I expected when I went through what you proposed and did my own search to find all the cases.

diff --git a/src/wp-includes/html-api/class-wp-html-open-elements.php b/src/wp-includes/html-api/class-wp-html-open-elements.php
index 0cd223657a..6f76265b09 100644
--- a/src/wp-includes/html-api/class-wp-html-open-elements.php
+++ b/src/wp-includes/html-api/class-wp-html-open-elements.php
@@ -109,8 +109,6 @@ class WP_HTML_Open_Elements {
 	 * "nth item" on the stack, counting from the top, where the
 	 * top-most element is the 1st, the second is the 2nd, etc...
 	 *
-	 * @todo Skip over "marker" entries on the stack, if appropriate.
-	 *
 	 * @since 6.7.0
 	 *
 	 * @param int $nth Retrieve the nth item on the stack, with 1 being
@@ -128,6 +126,53 @@ class WP_HTML_Open_Elements {
 		return null;
 	}
 
+	/**
+	 * Returns the name of the node at the nth position from the bottom on
+	 * the stack of open elements, or `null` if no such position exists.
+	 *
+	 * Note that this uses a 1-based index, which represents the
+	 * "nth item" on the stack, counting from the bottom, where the
+	 * bottom-most (or deepest) element is the 1st, the second is
+	 * the 2nd, etc...
+	 *
+	 * @since 6.7.0
+	 *
+	 * @param int $nth Retrieve the nth item on the stack, with 1 being
+	 *                 the bottom (depeest) element, 2 being the second, etc...
+	 * @return string|null Name of the node on the stack at the given location,
+	 *                     or `null` if the location isn't on the stack.
+	 */
+	public function nth_from_bottom( int $nth ): ?string {
+		if ( $nth <= 0 || $nth > count( $this->stack ) ) {
+			return null;
+		}
+
+		return $this->stack[ count( $this->stack ) - $nth ]->node_name;
+	}
+
+	/**
+	 * Returns the name of the node at the nth position from the top on
+	 * the stack of open elements, or `null` if no such position exists.
+	 *
+	 * Note that this uses a 1-based index, which represents the
+	 * "nth item" on the stack, counting from the top, where the
+	 * top-most element is the 1st, the second is the 2nd, etc...
+	 *
+	 * @since 6.7.0
+	 *
+	 * @param int $nth Retrieve the nth item on the stack, with 1 being
+	 *                 the top element, 2 being the second, etc...
+	 * @return string|null Name of the node on the stack at the given location,
+	 *                     or `null` if the location isn't on the stack.
+	 */
+	public function nth_from_top( int $nth ): ?string {
+		if ( $nth <= 0 || $nth > count( $this->stack ) ) {
+			return null;
+		}
+
+		return $this->stack[ $nth - 1 ]->node_name;
+	}
+
 	/**
 	 * Reports if a node of a given name is in the stack of open elements.
 	 *
diff --git a/src/wp-includes/html-api/class-wp-html-processor.php b/src/wp-includes/html-api/class-wp-html-processor.php
index f66498b19f..0f2135387e 100644
--- a/src/wp-includes/html-api/class-wp-html-processor.php
+++ b/src/wp-includes/html-api/class-wp-html-processor.php
@@ -1211,7 +1211,7 @@ class WP_HTML_Processor extends WP_HTML_Tag_Processor {
 			case '+BODY':
 				if (
 					1 === $this->state->stack_of_open_elements->count() ||
-					'BODY' !== $this->state->stack_of_open_elements->at( 2 ) ||
+					'BODY' !== $this->state->stack_of_open_elements->nth_from_top( 2 ) ||
 					$this->state->stack_of_open_elements->contains( 'TEMPLATE' )
 				) {
 					// Ignore the token.
@@ -1237,7 +1237,7 @@ class WP_HTML_Processor extends WP_HTML_Tag_Processor {
 			case '+FRAMESET':
 				if (
 					1 === $this->state->stack_of_open_elements->count() ||
-					'BODY' !== $this->state->stack_of_open_elements->at( 2 ) ||
+					'BODY' !== $this->state->stack_of_open_elements->nth_from_top( 2 ) ||
 					false === $this->state->frameset_ok
 				) {
 					// Ignore the token.
@@ -2358,10 +2358,8 @@ class WP_HTML_Processor extends WP_HTML_Tag_Processor {
 			case '-OPTGROUP':
 				$current_node = $this->state->stack_of_open_elements->current_node();
 				if ( $current_node && 'OPTION' === $current_node->node_name ) {
-					foreach ( $this->state->stack_of_open_elements->walk_up( $current_node ) as $parent ) {
-						break;
-					}
-					if ( $parent && 'OPTGROUP' === $parent->node_name ) {
+					$parent = $this->state->stack_of_open_elements->nth_from_bottom( 2 );
+					if ( 'OPTGROUP' === $parent ) {
 						$this->state->stack_of_open_elements->pop();
 					}
 				}
@@ -3338,10 +3336,7 @@ class WP_HTML_Processor extends WP_HTML_Tag_Processor {
 	 */
 	public function reset_insertion_mode(): void {
 		// Set the first node.
-		$first_node = null;
-		foreach ( $this->state->stack_of_open_elements->walk_down() as $first_node ) {
-			break;
-		}
+		$first_node = $this->state->stack_of_open_elements->nth_from_top( 1 );
 
 		/*
 		 * > 1. Let _last_ be false.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strangely, with that patch applied I get one more skipped test/one fewer assertions. I'm not sure why, but I'm thinking it must be a logic error, which only adds to my skepticism of replacing semantic relations with array indices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave it out for now and revisit after this lands, that's fine.

*
* This parser does not currently support this behavior.
*
* @todo The spec is unclear on whether this should be ignored. Should it?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever we do we should be consistent with how we handle HTML above.

My inclination is that we should not stop. It would be good to handle the attributes, but in the interest of providing developers an interface similar to what they'd expect in a DOM tree, it doesn't make sense for these nodes to appear where they're written in HTML because they'll never actually exist there. There's a possible side-effect of them being present, but side-effects are not a reason to stop at next_tag or next_token.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think this is the only viable option.

originally some of this was held up by the need to communicate these changes. someone could have been looking for a BODY element with data-is-dark on it, and skipped the BODY because at the time it didn't have that attribute, but only once parsing the full document do we realize it had it.

I'd still like to handle this by communicating the retroactive document change (a need for adoption and fostering as well), but for now it's probably good enough simply to list it as a limit of this parser.

Comment on lines +1263 to +1445
* @todo This may need to be handled in the Tag Processor and turn into
* a single self-contained tag like TEXTAREA, whose modifiable text
* is the rest of the input document as plaintext.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a good idea, I was wondering how to handle this. We can figure that out in a follow-up.

Comment on lines +1326 to +1507
if ( ! $this->state->stack_of_open_elements->current_node_is( $token_name ) ) {
// @todo Record parse error: this error doesn't impact parsing.
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could turn this into a comment like we've done elsewhere, there's no reason to check if we're not reporting errors.

src/wp-includes/html-api/class-wp-html-processor.php Outdated Show resolved Hide resolved
Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review, I'll continue.

Comment on lines +1375 to +1557
if ( ! $this->state->stack_of_open_elements->current_node_is( 'FORM' ) ) {
// @todo Indicate a parse error once it's possible. This error does not impact the logic here.
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No-op parse error, this can be a comment.

Comment on lines 1459 to 1636
if ( ! $this->state->stack_of_open_elements->current_node_is( $token_name ) ) {
// @todo Record parse error: this error doesn't impact parsing.
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noop parse error, comment instead.

src/wp-includes/html-api/class-wp-html-processor.php Outdated Show resolved Hide resolved
tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php Outdated Show resolved Hide resolved
return $text;
}

if ( '#comment' === $this->get_token_name() ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funky comments probably need this handling too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe? I want to audit this because the rules are all different for various contexts and insertion modes, mainly whether null bytes transform into U+FFFD or whether it's stripped away. I'll follow-up with this, but I don't feel like it has to be in this patch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to handle all situations, which I think is right now.

I also moved the logic into the Tag Processor, because it seems like it's capable enough to know whether to skip the newline. Turns out that if it's anything other than a text node immediately following a PRE or LISTING (since TEXTAREA is self-contained, the logic there is clear), then there's no skipping. A simple flag WP_HTML_Tag_Processor::skip_next_line_feed suffices.

Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a lot of comments, the majority of this is ready to go once you've considered them. I won't be able to review again soon, so I'm leaving a preemptive approval.

@dmsnell dmsnell force-pushed the html-api/finish-in-body-insertion-mode branch 2 times, most recently from 1e13e07 to 8c591bc Compare July 6, 2024 23:45
src/wp-includes/html-api/class-wp-html-open-elements.php Outdated Show resolved Hide resolved
src/wp-includes/html-api/class-wp-html-open-elements.php Outdated Show resolved Hide resolved
src/wp-includes/html-api/class-wp-html-open-elements.php Outdated Show resolved Hide resolved
@@ -63,7 +63,7 @@ class WP_HTML_Token {
/**
* Called when token is garbage-collected or otherwise destroyed.
*
* @var callable|null
* @var ?callable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently PHPDoc doesn't actually support this nullable syntax, although PHP itself does.

https://stackoverflow.com/a/50704179/93579

Same goes for below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for flagging, [58742] was landed to fix PHPDoc annotations in many places.

sirreal added a commit to sirreal/wordpress-develop that referenced this pull request Jul 17, 2024
sirreal added a commit to sirreal/wordpress-develop that referenced this pull request Jul 17, 2024
sirreal added a commit to sirreal/wordpress-develop that referenced this pull request Jul 17, 2024
This includes template support in body
@dmsnell dmsnell marked this pull request as ready for review July 19, 2024 17:35
Copy link

github-actions bot commented Jul 19, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell, westonruter.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell dmsnell force-pushed the html-api/finish-in-body-insertion-mode branch 3 times, most recently from 85dd560 to 90abdee Compare July 20, 2024 00:07
dmsnell and others added 5 commits July 19, 2024 17:09
As part of work to add more spec support to the HTML API, this patch adds
support for the remaining missing tags in the IN BODY insertion mode. Not
all of the added tags are supported, because in some cases they reset the
insertion mode and are reprocessed where they will be rejected.

See Core-61576.
Co-authored-by: Weston Ruter <westonruter@google.com>
@dmsnell
Copy link
Member Author

dmsnell commented Jul 21, 2024

@westonruter @sirreal I think all of the feedback has been addressed. additionally:

  • get_modifiable_text() in the Tag Processor has been updated to properly handle leading newlines and NULL bytes.
  • an additional test asserts this behavior for get_modifiable_text().
  • I've pulled in clear_up_to_last_marker() from HTML API: Add table support #6040 (thanks @sirreal).
  • Fixed some mistakes/oversights after additional review.

I might merge this tomorrow, but I'd appreciate any reviews in post. This is so exciting: we're there - at the end of IN BODY.

Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about the get_modifiable_text fixes for null characters and omitted newlines.

If we pull that out, I think everything else is ready in this PR and that change can be done in a follow-up.

src/wp-includes/html-api/class-wp-html-open-elements.php Outdated Show resolved Hide resolved
src/wp-includes/html-api/class-wp-html-open-elements.php Outdated Show resolved Hide resolved
src/wp-includes/html-api/class-wp-html-open-elements.php Outdated Show resolved Hide resolved
*
* @var string
*/
const QUIRKS_MODE = 'quirks-mode';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm satisfied 👌

*
* @var string
*/
public $compat_mode = self::NO_QUIRKS_MODE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mode has potential to be confusing. I suspect the processor will almost always be used in "no-quirks" mode. I do think there's value in making sure these things are clear. I'm a bit concerned we're mixing mode/compatMode and their respective values.


The HTML standard refers to a mode associated with every document:

Each document has an associated… mode ("no-quirks", "quirks", or "limited-quirks")…

Then it follows with this section "for web developers" (which is how you detect the mode, roughly, via DOM APIs):

document . compatMode
Returns the string "BackCompat" if document’s mode is "quirks"; otherwise "CSS1Compat".


The most correct thing may be to rename this property to mode and continue using the quirks values because this is from an HTML processing perspective.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my first inclination was to use the term "mode" from the spec, but that term is so generic and conflates with "insertion mode" and I was nervous using it.

this isn't currently exposed to calling code, and the terms come from the DOM spec, as you linked, and not from the HTML spec.

I can change it to document_mode, but I'm still nervous that adopting mode is going to be confusing, since it plays such a small role overall and stands against insertion_mode

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have similar concerns about mode being a very generic term. I do like document_mode, that would be my preference if you agree 👍

}
$this->skip_next_linefeed = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worries me. Is this line necessary or can we just make sure it's reset when advancing to another token?

It's important that this method be idempotent. We call this from the processor and it can't change the state so that when consuming code calls it it's broken. And consuming code should not need to know that it can't call this more than once, this should be fine but wouldn't work correctly:

if ( '' !== $this->get_modifiable_text() ) {
  return "<h1>" . $this->get_modifiable_text() . "</h1>";
}

I noticed this both in the tests when debugging and in the html-api-debugger plugin. I always see the incorrect behavior (with this line) but I'm not sure where the method is being called twice in the plugin.

When this line is commented, things seem to work as expected and tests pass.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's hard to change when advancing to the next token, because that wipes out the flag when moving from LISTING and PRE to #text.

I'll explore setting last_node_name or something like this. it's a good catch, thank you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in 1323448 I've changed this to $this->skip_newline_at as a way of declaratively detecting where to ignore the newline. this is more resilient, but I still have concerns about seeking and updating the document.

I'll try to figure out seeking, but as it stands we don't expose a way to change a PRE or LISTING tag into something else, so I will not worry too much about fixing those cases until we allow getting into them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now I've simply listed the behavior after seeking as a limitation and warned to seek to the PRE or LISTING, not to the #text node. I don't know how to fix this without introducing unseen overhead in the Tag Processor.

do you think this is reasonable enough?

Comment on lines 104 to 129
/**
* Returns the name of the node at the nth position on the stack
* of open elements, or `null` if no such position exists.
*
* Note that this uses a 1-based index, which represents the
* "nth item" on the stack, counting from the top, where the
* top-most element is the 1st, the second is the 2nd, etc...
*
* @todo Skip over "marker" entries on the stack, if appropriate.
*
* @param int $nth Retrieve the nth item on the stack, with 1 being
* the top element, 2 being the second, etc...
* @return string|null Name of the node on the stack at the given location,
* or `null` if the location isn't on the stack.
*/
public function at( int $nth ) {
foreach ( $this->walk_down() as $item ) {
if ( 0 === --$nth ) {
return $item->node_name;
}
}

return null;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave it out for now and revisit after this lands, that's fine.

dmsnell and others added 2 commits July 22, 2024 11:17
Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com>
@dmsnell dmsnell force-pushed the html-api/finish-in-body-insertion-mode branch from d4bb5fe to 8c38159 Compare July 22, 2024 18:44
Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this.

This is a remaining issue where a newline is removed from a text node that is only a newline, resulting in a text node with no text. The text node should not be visited at all in this case:

<pre>&#x0A;</pre>

This PR is big enough and in a good place, I don't mind landing now and handling that known issue in a follow-up.

@dmsnell
Copy link
Member Author

dmsnell commented Jul 22, 2024

Thanks for the work on this.

This is a remaining issue where a newline is removed from a text node that is only a newline, resulting in a text node with no text. The text node should not be visited at all in this case:

<pre>&#x0A;</pre>

This PR is big enough and in a good place, I don't mind landing now and handling that known issue in a follow-up.

Thanks @sirreal. I'm puzzled on this one. While there's no child node in the DOM, I find it strange that anyone would consider there to be no text node there. Perhaps I would feel differently about the actual newline byte 0x0A, but intentionally encoding it makes it seem that from a syntactic perspective, someone wanted it to be there.

Let's continue to explore this in the follow-up.

pento pushed a commit that referenced this pull request Jul 22, 2024
As part of work to add more spec support to the HTML API, this patch adds
support for the remaining missing tags in the IN BODY insertion mode. Not
all of the added tags are supported, because in some cases they reset the
insertion mode and are reprocessed where they will be rejected.

This patch also improves the support of `get_modifiable_text()`, removing
a leading newline inside a LISTING, PRE, or TEXTAREA element.

Developed in #6972
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell, westonruter.
See #61576.


git-svn-id: https://develop.svn.wordpress.org/trunk@58779 602fd350-edb4-49c9-b593-d223f7449a82
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Jul 22, 2024
As part of work to add more spec support to the HTML API, this patch adds
support for the remaining missing tags in the IN BODY insertion mode. Not
all of the added tags are supported, because in some cases they reset the
insertion mode and are reprocessed where they will be rejected.

This patch also improves the support of `get_modifiable_text()`, removing
a leading newline inside a LISTING, PRE, or TEXTAREA element.

Developed in WordPress/wordpress-develop#6972
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell, westonruter.
See #61576.

Built from https://develop.svn.wordpress.org/trunk@58779


git-svn-id: http://core.svn.wordpress.org/trunk@58181 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to gilzow/wordpress-performance that referenced this pull request Jul 22, 2024
As part of work to add more spec support to the HTML API, this patch adds
support for the remaining missing tags in the IN BODY insertion mode. Not
all of the added tags are supported, because in some cases they reset the
insertion mode and are reprocessed where they will be rejected.

This patch also improves the support of `get_modifiable_text()`, removing
a leading newline inside a LISTING, PRE, or TEXTAREA element.

Developed in WordPress/wordpress-develop#6972
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell, westonruter.
See #61576.

Built from https://develop.svn.wordpress.org/trunk@58779


git-svn-id: https://core.svn.wordpress.org/trunk@58181 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell
Copy link
Member Author

dmsnell commented Jul 22, 2024

Merged in [58779]
db30ce9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants