Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML API: Expose raw tag markup to support existing filters #5143

Draft
wants to merge 4 commits into
base: trunk
Choose a base branch
from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Sep 5, 2023

Trac ticket: #59291-trac

Current status:

Probably this will never merge and never should merge. It's an exploration of one way to preserve backwards compatability.

Summary

Many existing filters in Core pass segments of the raw HTML markup for matched tags where code implementing those hooks perform futher analyis on those segments. The HTML API attempts to hide the raw inner markup as much as possible, making it hard to provide full backwards compatability with these existing filters.

In this patch a new function is added which extracts the raw markup to maintain those existing behaviors. It requires a disclaimer to operate to try and encourage folks to use the structural and semantic methods provided by the HTML API.

Where possible, existing code and filters should be updated so that they no longer depend on the raw HTML.

To Consider

We could canonicalize the output here using the HTML5 serialization algorithm. This implies writing all attributes as double-quoted strings with proper escaping. Whitespace between attributes and the tag name should be normalized. Characters are properly escaped and quoted. This is already trivial to reconstruct through the existing methods, but incurs a runtime cost.

In this patch I've opted to propose a RAW method to avoid that cost, particularly since this is designed to be used in places in Core that are running frequently and on all data (kses, formatting, etc…). Using the raw contents risks leaving "broken" markup broken, leaving existing breakages in place, but should not break existing code relying on those breakages.

If, on the other hand, we normalize the HTML, we could potentially close existing bugs that fail because of the typical kind of parsing failures, regexps looking for double-quoted attributes, etc…

We can introduce another function to generate the normalized output, but I am hesitant to make things look too easy and accidentally encourage people to work around the semantic methods in the HTML API in favor of "easier" regexp parsing.

… existing filters)

Many existing filters in Core pass segments of the raw HTML markup for matched tags
where code implementing those hooks perform futher analyis on those segments. The
HTML API attempts to hide the raw inner markup as much as possible, making it hard
to provide full backwards compatability with these existing filters.

In this patch a new function is added which extracts the raw markup to maintain
those existing behaviors. It requires a disclaimer to operate to try and encourage
folks to use the structural and semantic methods provided by the HTML API.

Where possible, existing code and filters should be updated so that they no longer
depend on the raw HTML.
@dmsnell dmsnell changed the title HTML API: Expose internal function to extract raw tag markup (support existing filters) HTML API: Expose raw tag markup to support existing filters Sep 5, 2023
Comment on lines +730 to +737
$disclaimer = <<<INTERNAL_ONLY
I understand that this is only for internal WordPress usage and something
will likely break if used elsewhere. This function comes with no warranty.
INTERNAL_ONLY;

if ( $internal_use_only !== $disclaimer || null === $this->tag_name_starts_at ) {
return null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following the meaning of your emoji.

The point I'm trying to make here is that I really want the API to discourage use of this function. It's dangerous because it carries the legacy of interacting with HTML as a string that led to all the issues that brought us here.

In fact the only reason I'm considering this is a purely pragmatic backwards-compatibility issue. Core (and Jetpack and many other popular plugins) expose full or near-full HTML syntax through filters so that extending code can do further processing. For example, a full link HTML is passed into wp_targeted_link_rel and some plugins examine the URL to decide what additional rel values to add.

We could reconstruct normative HTML for these filters, but that would involve iterating over the attributes to create strings when we have one already. Maybe it's a worthwhile tradeoff, but I think most of those filters are unused and so if we can pass along a substring of an existing string and not touch it, that will have far less impact.

I'd prefer we find ways to update existing code to work structurally instead stringily. We don't need this function today; it's merely an aid to refactors that need the HTML API. If we can deprecate all those filters and leave time to update, then we can avoid this entire thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the added context and explanation, much appreciated!


The emoji meant something to the effect, "Wowzers, this one blows the one we use for the constructor out of the water. I guess that's the only precedent we have, but it's probably warranted."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I did realize that I can't make a class const private, so the one in the constructor could be easily foiled by calling WP_HTML_Processor::CONSTRUCTOR_UNLOCK_CODE.

In that case I'm less worried anyway because it's there to alert someone why something broke. In this case it's to try and put pressure against dangerous programming habits that will work if attempted, at least to a point.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation, @dmsnell. Could you provide an example snippet used in Jetpack (or other popular plugin), and how that code should look like after this patch? Just to get a better sense of the changes introduced here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bor0 there is no code change required by this PR for other plugins and code authors. the point is to preserve backwards compatibility. however, here is one of many examples of how existing filters rely on the raw markup.

// formatting.php
$rel = apply_filters( 'wp_targeted_link_rel', 'noopener', $link_html );

here, the $link_html is supposed to represent the full LINK element's HTML, such as <link rel="stylesheet" href="some/url">

lots of existing code then parses this with a regex or string-find, and breaks down in the same cases that normal HTML parsing fails: boolean attribute values, different quoting than expected, duplicate or differently-ordered attributes, escaping issues, etc…

ideally we would move Core towards a future where instead of passing around these strings we could pass around a kind of read-only or otherwise limited HTML Tag Processor, so that filters could reliably read attribute values, but changing the filter would mean breaking those existing plugins

@ockham
Copy link
Contributor

ockham commented Sep 13, 2023

Sadly needs a rebase now 🙈 Would you mind taking care of that? I'll land it tomorrow morning 😊

@dmsnell dmsnell marked this pull request as draft September 13, 2023 18:58
@dmsnell
Copy link
Member Author

dmsnell commented Sep 13, 2023

Thanks @ockham - I should have created this as a draft as it needs much more consideration before merge.

dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Oct 30, 2023
See WordPress#5143

This patch refactors `wp_targeted_link_rel` to be more reliable and clear by
using the HTML API to replace existing PCRE code. It provides backwards
compatability for the `rel` filter by creating a quick HTML string representing
the opening A tag for the matched element and providing the `href`, `rel`, and
`target` attributes on that tag.
dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Oct 30, 2023
See WordPress#5143

This patch refactors `wp_targeted_link_rel` to be more reliable and clear by
using the HTML API to replace existing PCRE code. It provides backwards
compatability for the `rel` filter by creating a quick HTML string representing
the opening A tag for the matched element and providing the `href`, `rel`, and
`target` attributes on that tag.
dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Oct 30, 2023
See WordPress#5143

This patch refactors `wp_targeted_link_rel` to be more reliable and clear by
using the HTML API to replace existing PCRE code. It provides backwards
compatability for the `rel` filter by creating a quick HTML string representing
the opening A tag for the matched element and providing the `href`, `rel`, and
`target` attributes on that tag.
dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Oct 30, 2023
See WordPress#5143

This patch refactors `wp_targeted_link_rel` to be more reliable and clear by
using the HTML API to replace existing PCRE code. It provides backwards
compatability for the `rel` filter by creating a quick HTML string representing
the opening A tag for the matched element and providing the `href`, `rel`, and
`target` attributes on that tag.
will likely break if used elsewhere. This function comes with no warranty.
INTERNAL_ONLY;

if ( $internal_use_only !== $disclaimer || null === $this->tag_name_starts_at ) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it only necessary to check for tag_name_starts_at? If that property is set, does it follow that tag_ends_at is set too? If not, we could add the check here, but if it does, adding a clarifying (assertion) comment would improve code readability.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's correct, and as internal values this isn't exposed. it's an idiom used throughout the Tag Processor. I've thought about different means to check what we're doing, but I don't think adding the same comment to twenty different places within the class would be that useful.

I do agree this is a bit indirect; more or less though I'm waiting until a much clearer and better solution appears.

Comment on lines +730 to +735
$disclaimer = <<<INTERNAL_ONLY
I understand that this is only for internal WordPress usage and something
will likely break if used elsewhere. This function comes with no warranty.
INTERNAL_ONLY;

if ( $internal_use_only !== $disclaimer || null === $this->tag_name_starts_at ) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting approach. I haven't seen how we do things like this in WP - is this the first function of such nature?

Another potential idea is to prefix such functions with unsafe or something, like React does it :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gutenberg uses this in some places. we considered unsafe_ prefixes but I don't like those because of how normative they can become and how they can unintentionally lesson the impact of seeing unsafe_ in the code.

this is a method I think I would prefer we never merge, and instead of supporting these old filters we move to new ones entirely. if we do put these in I want to discourage people as much as possible to call them separately, but we need to make the methods public for Core's internal consumption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants