Wp html processor text based #2
base: trunk
…king a tag closer

This commit marks the start of a bookmark one byte before the tag name start for tag openers, and two bytes before the tag name for tag closers.

Setting a bookmark on a tag should set its "start" position before the opening "<", e.g.:

```
<div> Testing a <b>Bookmark</b>
----------------^
```

The current calculation assumes this is always one byte to the left from $tag_name_starts_at. However, in tag closers that index points to a solidus symbol "/":

```
<div> Testing a <b>Bookmark</b>
----------------------------^
```

The bookmark should therefore start two bytes before the tag name:

```
<div> Testing a <b>Bookmark</b>
---------------------------^
```
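The offset rule in this commit message can be sketched as a small standalone function. This is only an illustration: `bookmark_start_for_tag()` and its parameters are hypothetical names, not code from the commit, and the sketch assumes the tag name offset is already known.

```php
<?php
/**
 * Hypothetical helper (not part of the PR): returns the byte offset at
 * which a bookmark on a tag should start, given where the tag name starts.
 */
function bookmark_start_for_tag( int $tag_name_starts_at, bool $is_closer ): int {
	// Openers look like "<b": the "<" is one byte before the tag name.
	// Closers look like "</b": the "<" is two bytes before the tag name.
	return $tag_name_starts_at - ( $is_closer ? 2 : 1 );
}

$html = '<div> Testing a <b>Bookmark</b>';

// Locate the tag name "b" in the opener and in the closer.
$opener_name_at = strpos( $html, 'b>' );
$closer_name_at = strrpos( $html, 'b>' );

$opener_start = bookmark_start_for_tag( $opener_name_at, false );
$closer_start = bookmark_start_for_tag( $closer_name_at, true );

// Both offsets should land on a "<" character.
var_dump( $html[ $opener_start ], $html[ $closer_start ] );
```

Subtracting one byte for the closer as well would land on the "/" rather than the "<", which is the bug the commit message describes.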
…closers' into wp_html_processor
9a29748 to 37659fb
```php
dbg( "Found {$this->current_token->tag} tag opener" );
switch ( $this->current_token->tag ) {
	case 'HTML':
		$this->drop_current_tag_token();
```
what's the purpose here? it seems like we're modifying a document that nobody asked to modify
This specific case is a hack I added to prevent a failure upon finding an `HTML` tag. It shouldn't be needed anymore once we look into other insertion modes. Now that I think about it, perhaps a `return` statement to ignore it would be a better hack.
As for the general `drop_current_tag_token` function, it used to be `ignore_token`. Let's make it `ignore_token` again – there's no need to make such uninvited updates.
```php
if ( $this->is_element_in_button_scope( 'P' ) ) {
	$this->close_p_element();
}
if ( in_array( $this->current_node()->tag, array( 'H1', 'H2', 'H3', 'H4', 'H5', 'H6' ) ) ) {
```
isn't this implied by the case waterfall?
The switch checks `$this->current_token->tag`, while this check looks into `$this->current_node()`, the most recent entry in the open elements stack. In other words, this ensures that `h2` in markup like `<div><h1>Primary heading <h2> Secondary heading</div>` does not become a child of `h1`.
```php
if($token !== $this->current_token) {
	// Aesthetic choice for now.
	// @TODO: discuss it with the team
	$tag = strtolower($token->tag);
```
I'm in favor of respecting what someone sends. if they send `dIv` then that's fine.
sounds good to me 👍
```php
	)
);
}
array_push($this->open_elements, $token);
```
I think this doesn't matter right now, but I'm strongly opposed to offering a function which would allow people to add a partial tag, an opener without a closer. adding a void or self-closing foreign element seems fine, but leaving an open element on the stack seems to be an invitation for trouble
I agree! I didn't mean to offer this method to people. `insert_element` is a private API that implements the "insert an HTML element for a token" part of the parsing spec.
In the DOM PR, it was quite a bit more complex, but in this stream processing version it all collapses to inserting a tag opener and pushing an element to the stack.
a thing I'm most worried about with partial HTML support is properly indicating that we bailed. if we return `false` I wonder if that's enough to communicate that we didn't change anything. same goes if we return the original un-modified string.
I don't have an answer here; I'm just raising the question
I've also been wondering what would be the ideal API for that. Maybe throwing an error:

```php
try {
	$p = new WP_HTML_Processor( $html_which_may_be_broken );
	// process...
} catch (Exception $e) {
	// The HTML was broken and it couldn't be processed, do whatever you need
	// to do in this situation.
}
```
I have a related worry – what if we need to bail out after applying some updates already? For example:

```php
$p = new WP_HTML_Processor( '<div></div><b><section></b></section>' );
$p->find('div');
$p->set_inner_html('Yay!');
$p->get_updated_html();
// <div>Yay!</div><b><section></b></section>
$p->find('section');
$p->set_inner_html('Yay!'); // throws? or returns false?
$p->get_updated_html(); // <div>Yay!</div><b><section></b></section>
$p->find('div'); // does this return true? or are we in unrecoverable state now?
```
I can only see two ways to proceed:
I can't easily tell if 1 is safe. Intuitively it should be since it's dealing only with the markup that's well-formed. If it's not safe, though, we're stuck with parsing the entire markup at least once.
My idea of how this was supposed to work was that And my throw-an-error suggestion was based on the assumption that it can fail at any time (even on the first But based on your comment, I guess the throw-an-error idea only works if you can't get any modified HTML until the very end. I mean, getting the modified HTML finalizes the processing.
In Tag Processor The big reason is performance. Tag Processor won't process any markup it doesn't have to process. This means you can update the first ten tags in a very long document and never bear the cost of parsing all of it.

Performance is even more of a concern in HTML Processor but correctness matters even more. The big question is whether it's okay to make a partial update. My intuition says it should be fine, but I need to think about it more.

If we need to go through the entire thing, throwing an exception sounds nice. I actually wonder – can we
We try really hard to avoid throwing in Core because we don't want to crash people's sites. As convenient as it can be at times, a corrupted site is generally preferable to a white screen.
For the HTML processor as we talk through these things I think the only viable way to approach this at least as an initial run is to scan through the entire document on It doesn't mean we have to give up performance for this aspect. On that initial modification operation we can simply finish scanning and verify the document structure. We might even be able to set an internal flag indicating that the rest of the document is fine, letting us proceed after that on the same processor instance, without re-scanning.

Finally we can iterate to look at the impact of finding un-opened tags - I can't remember off the top of my head - because if we're at a point in the document where tags are balanced I don't see a problem making modifications there and then bailing only when the problems appear.

There is a can of worms here and I think we're closer to settling on an interface, but I bet that's still three and a half weeks away.

This HTML processor operates on HTML that exists within a broader HTML document so there is the likelihood of partial input or unexpected context which would flavor what we're doing within the domain of the processor itself. i.e. Some things that look fine to us won't be inside the page just as some things that look broken to us might be fine when stitched back into the page.

As for errors, if it's not enough to return
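The "scan once, remember the verdict, return `false` instead of throwing" idea could look roughly like the sketch below. Everything here is hypothetical: `Lazy_Verified_Processor`, `can_modify()`, and `$scan_count` are not part of this PR, and the regex-based balance check is a deliberately naive stand-in that ignores void elements, implied closers, comments, and attributes containing `>`.

```php
<?php
/**
 * Hypothetical sketch: verify the document structure on the first
 * modification attempt, cache the result, and never throw.
 */
class Lazy_Verified_Processor {
	private $html;
	private $is_verified = null; // null means "not scanned yet".
	public $scan_count   = 0;    // Exposed only to show the scan runs once.

	public function __construct( string $html ) {
		$this->html = $html;
	}

	/** Returns true when modifications may proceed, false otherwise. */
	public function can_modify(): bool {
		if ( null !== $this->is_verified ) {
			return $this->is_verified; // Cached verdict, no re-scan.
		}

		$this->scan_count++;
		$stack = array();
		preg_match_all( '~<(/?)([a-z][a-z0-9]*)[^>]*>~i', $this->html, $tags, PREG_SET_ORDER );

		foreach ( $tags as $tag ) {
			$name = strtoupper( $tag[2] );
			if ( '' === $tag[1] ) {
				$stack[] = $name; // Tag opener.
			} elseif ( array_pop( $stack ) !== $name ) {
				return $this->is_verified = false; // Mismatched or un-opened closer.
			}
		}

		return $this->is_verified = ( 0 === count( $stack ) );
	}
}

$p = new Lazy_Verified_Processor( '<div></div><b><section></b></section>' );
var_dump( $p->can_modify() );
```

A caller would check the boolean before each update and leave the markup untouched on `false`, which keeps the "corrupted site over white screen" preference intact.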
```php
$active_formatting_elements = $this->active_formatting_elements;

/**
 * seek() will rewing before the current tag
```
typo: rewing > rewind
@dmsnell Why would you say this is the only viable approach? I’m still on the fence about this. Is there anything wrong with modifying the well-formed part of the markup? Maybe let’s just allow it and only refuse to process things once we run into unsupported markup?
Most of the time they're ignored or, IIRC, assumed to close the current tag.
I'd rather stick to
yeah I always forget about that one (probably because I don't like it 🙃)
it was based on some things you said, which we started discussing outside of this issue. if we don't have to look out for tags later on that would change tags beforehand then what I said is wrong.
I think it's fine to not have this, or if we add it, to only add it in late once we have that in another. class. I think we can build it purely from one of the things I love about this is another part of what I am nervous about: the big switch statement handling tag openings. I'm wondering if we could blend the idea of a I've thought that it might help to grab the reference to a class once and then reference static flags on that class multiple times. still the tradeoff between supporting things we can vs. the simplicity of supporting only a well-defined small subset irks me. my partner PR doesn't support
When saving options from the Settings page, include the `'ping_sites'` option in the allowed "writing" options list only when the `'blog_public'` option is `'1'`.

Fixes a PHP 8.1 and above "null to non-nullable" deprecation notice in `sanitize_option()` ([https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php?annotate=blame#L4952 which happens when here] as part of [22255]):
{{{
Deprecated: explode(): Passing null to parameter #2 ($string) of type string is deprecated in .../wp-includes/formatting.php
}}}

**Explanation**

[https://developer.wordpress.org/apis/options/#writing Per the documentation], the `ping_sites` option requires the `'blog_public'` option to have a value of `'1'` and must be a `string` data type. `null` is not valid for this option.

The relationship between the 2 options shows itself in the `options-writing.php` code ([https://core.trac.wordpress.org/browser/tags/6.5.4/src/wp-admin/options-writing.php#L233 shown here] and in [4326]), as the `textarea#ping_sites` only renders when `'1' === get_option( 'blog_public' )`.

**What happens if `'blog_public'` is not `'1'`?**

The `'ping_sites'` option will not be a field on the page. Upon saving:

* HTTP POST (`$_POST`) does not include `'ping_sites'`.
* Before this commit:
  * The [https://core.trac.wordpress.org/browser/trunk/src/wp-admin/options.php#L333 option's value was set to] `null` before being passed to `update_option()`.
  * `update_option()` invokes `sanitize_option()`.
  * A `null` value for the `'ping_sites'` case was passed to `explode()`, which threw a deprecation notice on PHP 8.1 and above.
* With this commit, the `'ping_sites'` option is no longer included in the allow list and thus will not be passed to `update_options()` > `sanitize_option()` > `explode()`.

Follow-up to [22255], [12825], [4326], [949].

Props kitchin, SergeyBiryukov, swissspidy, devmuhib, rajinsharwar, hellofromTonya.
Fixes #59818.

git-svn-id: https://develop.svn.wordpress.org/trunk@58425 602fd350-edb4-49c9-b593-d223f7449a82
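The conditional allow list described in this commit message can be sketched in isolation. This is an assumption-laden illustration, not the committed code: `writing_options_allow_list()` is a hypothetical function, the `'blog_public'` value is passed in as a parameter instead of being read with `get_option()`, and the other option names are only there to give the list some content.

```php
<?php
/**
 * Hypothetical stand-in for building the "writing" allow list.
 * The real logic lives in wp-admin/options.php and reads the option
 * through get_option(); here the value is passed in directly.
 */
function writing_options_allow_list( $blog_public ): array {
	// Illustrative subset of the writing options.
	$allowed = array( 'default_category', 'default_post_format' );

	// The ping_sites textarea only renders when the site is public,
	// so only then can the form submit a value for it.
	if ( '1' === $blog_public ) {
		$allowed[] = 'ping_sites';
	}

	return $allowed;
}

$public_list  = writing_options_allow_list( '1' );
$private_list = writing_options_allow_list( '0' );
print_r( $private_list );
```

Keeping `'ping_sites'` out of the list when the field is absent means no `null` is ever handed to the sanitizer for that option.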
…t_mime_types().

Fixes a PHP 8.1 and above "null to non-nullable" deprecation notice in `get_available_post_mime_types()`:
{{{
Deprecated: preg_match(): Passing null to parameter #2 ($subject) of type string is deprecated in ./wp-includes/post.php on line 3395
}}}

[https://developer.wordpress.org/reference/functions/get_available_post_mime_types/#return This function is documented] to:
* Return `An array of MIME types.`
* as an array of `string`s, i.e. `string[]`.

A `null` or empty element within the returned array is not a valid MIME type. If a `null` exists in the returned array, it is the root cause of PHP throwing the deprecation notice.

This commit removes the `null` and empty elements from the returned array of MIME types. It also adds a unit test.

Follow-up to [56623], [56452].

Props nosilver4u, jrf, ironprogrammer, antpb, antonvlasenko, rajinsharwar, hellofromTonya.
Fixes #59195.

git-svn-id: https://develop.svn.wordpress.org/trunk@58437 602fd350-edb4-49c9-b593-d223f7449a82
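Removing `null` and empty entries from a list of MIME types might look like the sketch below. It is only an illustration of the idea: `remove_invalid_mime_types()` is a hypothetical helper, and the commit itself may implement the filtering differently.

```php
<?php
/**
 * Hypothetical helper: keep only non-empty strings and re-index the
 * result so callers receive a plain list (string[]).
 */
function remove_invalid_mime_types( array $types ): array {
	$valid = array_filter(
		$types,
		static function ( $type ) {
			return is_string( $type ) && '' !== $type;
		}
	);

	// array_filter() preserves keys; array_values() closes the gaps.
	return array_values( $valid );
}

$clean = remove_invalid_mime_types( array( 'image/jpeg', null, '', 'video/mp4' ) );
print_r( $clean );
```

With the invalid entries gone, later calls such as `preg_match()` only ever receive strings, which avoids the deprecation notice.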
Attempt at WordPress#4125 but without expanding the DOM tree in memory
Supported features
* `nth_child`
* `next_sibling`
* `inner_html` and `outer_html`
* `<ul><li>1<li>2<li>3</ul>`

Missing features
* `<table><tr><td><tr><td></table>`
* `<p><b>First<p>Second`. Currently, the `b` tag is forcibly removed from the list of active formatting elements when the second `<p>` is encountered
* `find( $css_selector )`
Stats
Current stats for parsing the 13MB single page HTML parsing spec document:
That's pretty good!
cc @dmsnell @ockham