HTML API: Expose raw tag markup to support existing filters #5143

dmsnell · 2023-09-05T21:25:08Z

Current status:

Probably this will never merge and never should merge. It's an exploration of one way to preserve backwards compatability.

Summary

Many existing filters in Core pass segments of the raw HTML markup for matched tags where code implementing those hooks perform futher analyis on those segments. The HTML API attempts to hide the raw inner markup as much as possible, making it hard to provide full backwards compatability with these existing filters.

In this patch a new function is added which extracts the raw markup to maintain those existing behaviors. It requires a disclaimer to operate to try and encourage folks to use the structural and semantic methods provided by the HTML API.

Where possible, existing code and filters should be updated so that they no longer depend on the raw HTML.

To Consider

We could canonicalize the output here using the HTML5 serialization algorithm. This implies writing all attributes as double-quoted strings with proper escaping. Whitespace between attributes and the tag name should be normalized. Characters are properly escaped and quoted. This is already trivial to reconstruct through the existing methods, but incurs a runtime cost.

In this patch I've opted to propose a RAW method to avoid that cost, particularly since this is designed to be used in places in Core that are running frequently and on all data (kses, formatting, etc…). Using the raw contents risks leaving "broken" markup broken, leaving existing breakages in place, but should not break existing code relying on those breakages.

If, on the other hand, we normalize the HTML, we could potentially close existing bugs that fail because of the typical kind of parsing failures, regexps looking for double-quoted attributes, etc…

We can introduce another function to generate the normalized output, but I am hesitant to make things look too easy and accidentally encourage people to work around the semantic methods in the HTML API in favor of "easier" regexp parsing.

… existing filters) Many existing filters in Core pass segments of the raw HTML markup for matched tags where code implementing those hooks perform futher analyis on those segments. The HTML API attempts to hide the raw inner markup as much as possible, making it hard to provide full backwards compatability with these existing filters. In this patch a new function is added which extracts the raw markup to maintain those existing behaviors. It requires a disclaimer to operate to try and encourage folks to use the structural and semantic methods provided by the HTML API. Where possible, existing code and filters should be updated so that they no longer depend on the raw HTML.

ockham · 2023-09-12T15:15:57Z

src/wp-includes/html-api/class-wp-html-tag-processor.php

+		$disclaimer = <<<INTERNAL_ONLY
+			I understand that this is only for internal WordPress usage and something
+			will likely break if used elsewhere. This function comes with no warranty.
+INTERNAL_ONLY;
+
+		if ( $internal_use_only !== $disclaimer || null === $this->tag_name_starts_at ) {
+			return null;
+		}


I'm not following the meaning of your emoji.

The point I'm trying to make here is that I really want the API to discourage use of this function. It's dangerous because it carries the legacy of interacting with HTML as a string that led to all the issues that brought us here.

In fact the only reason I'm considering this is a purely pragmatic backwards-compatibility issue. Core (and Jetpack and many other popular plugins) expose full or near-full HTML syntax through filters so that extending code can do further processing. For example, a full link HTML is passed into wp_targeted_link_rel and some plugins examine the URL to decide what additional rel values to add.

We could reconstruct normative HTML for these filters, but that would involve iterating over the attributes to create strings when we have one already. Maybe it's a worthwhile tradeoff, but I think most of those filters are unused and so if we can pass along a substring of an existing string and not touch it, that will have far less impact.

I'd prefer we find ways to update existing code to work structurally instead stringily. We don't need this function today; it's merely an aid to refactors that need the HTML API. If we can deprecate all those filters and leave time to update, then we can avoid this entire thing.

Thanks a lot for the added context and explanation, much appreciated!

The emoji meant something to the effect, "Wowzers, this one blows the one we use for the constructor out of the water. I guess that's the only precedent we have, but it's probably warranted."

Well I did realize that I can't make a class const private, so the one in the constructor could be easily foiled by calling WP_HTML_Processor::CONSTRUCTOR_UNLOCK_CODE.

In that case I'm less worried anyway because it's there to alert someone why something broke. In this case it's to try and put pressure against dangerous programming habits that will work if attempted, at least to a point.

Thanks for the detailed explanation, @dmsnell. Could you provide an example snippet used in Jetpack (or other popular plugin), and how that code should look like after this patch? Just to get a better sense of the changes introduced here.

@bor0 there is no code change required by this PR for other plugins and code authors. the point is to preserve backwards compatibility. however, here is one of many examples of how existing filters rely on the raw markup.

// formatting.php $rel = apply_filters( 'wp_targeted_link_rel', 'noopener', $link_html );

here, the $link_html is supposed to represent the full LINK element's HTML, such as <link rel="stylesheet" href="some/url">

lots of existing code then parses this with a regex or string-find, and breaks down in the same cases that normal HTML parsing fails: boolean attribute values, different quoting than expected, duplicate or differently-ordered attributes, escaping issues, etc…

ideally we would move Core towards a future where instead of passing around these strings we could pass around a kind of read-only or otherwise limited HTML Tag Processor, so that filters could reliably read attribute values, but changing the filter would mean breaking those existing plugins

tests/phpunit/tests/html-api/wpHtmlTagProcessor-internals.php

ockham · 2023-09-13T17:48:01Z

Sadly needs a rebase now 🙈 Would you mind taking care of that? I'll land it tomorrow morning 😊

dmsnell · 2023-09-13T18:59:35Z

Thanks @ockham - I should have created this as a draft as it needs much more consideration before merge.

See WordPress#5143 This patch refactors `wp_targeted_link_rel` to be more reliable and clear by using the HTML API to replace existing PCRE code. It provides backwards compatability for the `rel` filter by creating a quick HTML string representing the opening A tag for the matched element and providing the `href`, `rel`, and `target` attributes on that tag.

bor0 · 2023-11-17T12:05:59Z

src/wp-includes/html-api/class-wp-html-tag-processor.php

+			will likely break if used elsewhere. This function comes with no warranty.
+INTERNAL_ONLY;
+
+		if ( $internal_use_only !== $disclaimer || null === $this->tag_name_starts_at ) {


Is it only necessary to check for tag_name_starts_at? If that property is set, does it follow that tag_ends_at is set too? If not, we could add the check here, but if it does, adding a clarifying (assertion) comment would improve code readability.

that's correct, and as internal values this isn't exposed. it's an idiom used throughout the Tag Processor. I've thought about different means to check what we're doing, but I don't think adding the same comment to twenty different places within the class would be that useful.

I do agree this is a bit indirect; more or less though I'm waiting until a much clearer and better solution appears.

bor0 · 2023-11-17T12:06:50Z

src/wp-includes/html-api/class-wp-html-tag-processor.php

+		$disclaimer = <<<INTERNAL_ONLY
+			I understand that this is only for internal WordPress usage and something
+			will likely break if used elsewhere. This function comes with no warranty.
+INTERNAL_ONLY;
+
+		if ( $internal_use_only !== $disclaimer || null === $this->tag_name_starts_at ) {


This is an interesting approach. I haven't seen how we do things like this in WP - is this the first function of such nature?

Another potential idea is to prefix such functions with unsafe or something, like React does it :)

Gutenberg uses this in some places. we considered unsafe_ prefixes but I don't like those because of how normative they can become and how they can unintentionally lesson the impact of seeing unsafe_ in the code.

this is a method I think I would prefer we never merge, and instead of supporting these old filters we move to new ones entirely. if we do put these in I want to discourage people as much as possible to call them separately, but we need to make the methods public for Core's internal consumption.

dmsnell changed the title ~~HTML API: Expose internal function to extract raw tag markup (support existing filters)~~ HTML API: Expose raw tag markup to support existing filters Sep 5, 2023

dmsnell added 2 commits September 5, 2023 15:09

Remove comma from same line as HEREDOC

2719cfe

Do what could be done automatically, or left alone

7d43d42

ockham reviewed Sep 12, 2023

View reviewed changes