Explore HTML parsing and Adoption Agency Algorithm #1

Closed
wants to merge 15 commits into from

adamziel commented Feb 21, 2023

Closing in favor of more visible WordPress#4125

…king a tag closer

This commit marks the start of a bookmark one byte before
the tag name start for tag openers, and two bytes before
the tag name for tag closers.

Setting a bookmark on a tag should set its "start" position
before the opening "<", e.g.:

```
<div> Testing a <b>Bookmark</b>
----------------^
```

The current calculation assumes this is always one byte
to the left from $tag_name_starts_at.

However, in tag closers the byte to the left of
$tag_name_starts_at is a solidus symbol "/":

```
<div> Testing a <b>Bookmark</b>
----------------------------^
```

The bookmark should therefore start two bytes before the tag name:

```
<div> Testing a <b>Bookmark</b>
---------------------------^
```
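
The offset arithmetic described in this commit message can be sketched as follows. This is only an illustration: `bookmark_start()` is a hypothetical helper and not code from this PR, and the tag name offsets are located with `strpos()`/`strrpos()` purely for the demo.

```php
<?php
/**
 * Hypothetical helper, not part of WP_HTML_Tag_Processor: returns the byte
 * offset of the "<" that begins the tag whose name starts at the given offset.
 */
function bookmark_start( int $tag_name_starts_at, bool $is_closing_tag ): int {
	// "<" precedes an opener's name by one byte; "</" precedes a closer's by two.
	return $tag_name_starts_at - ( $is_closing_tag ? 2 : 1 );
}

$html = '<div> Testing a <b>Bookmark</b>';

$opener_name_at = strpos( $html, 'b>' );  // Name of the <b> opener.
$closer_name_at = strrpos( $html, 'b>' ); // Name of the </b> closer.

$opener_start = bookmark_start( $opener_name_at, false );
$closer_start = bookmark_start( $closer_name_at, true );

// Print the byte found at each computed bookmark start.
echo $html[ $opener_start ], ' ', $html[ $closer_start ], "\n";
```

Both computed offsets land on a "<", whereas subtracting one byte in the closer case would land on the "/".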

adamziel commented Feb 22, 2023

This implementation works!

I benchmarked it on the HTML parsing spec itself, which is a 12MB HTML document:

Mem peak usage: 499MB
Time: 30.60s

That's pretty terrible! It's also not surprising. This PR builds an actual document tree and uses inefficient operations such as array_splice.
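
For reference, numbers like these can be collected with a small harness along the following lines. This is a sketch under assumptions, not the script used for the benchmark above: `measure()` is a hypothetical helper, and the workload is a stand-in that repeatedly inserts at the front of an array with `array_splice()`.

```php
<?php
/**
 * Hypothetical harness: runs $task once and reports wall time and peak memory.
 */
function measure( callable $task ): array {
	$started_at = microtime( true );
	$task();

	return array(
		'seconds' => microtime( true ) - $started_at,
		// Peak for the whole script so far, not only for $task.
		'peak_mb' => memory_get_peak_usage( true ) / 1024 / 1024,
	);
}

// Stand-in workload; swap in the parser call being measured.
$result = measure(
	function () {
		$nodes = array();
		for ( $i = 0; $i < 5000; $i++ ) {
			array_splice( $nodes, 0, 0, array( $i ) ); // Insert at the front.
		}
	}
);

printf( "Mem peak usage: %.0fMB\nTime: %.2fs\n", $result['peak_mb'], $result['seconds'] );
```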

A text-based version similar to WP_HTML_Tag_Processor should be much faster and more memory-efficient. Let's explore one!


adamziel commented Feb 23, 2023

Adoption Agency Algorithm requires a full pass through the HTML document

In the worst-case scenario, the entire document must be parsed to know even the second node.

Consider this markup:

```
<b>
  <div>
    <div><!-- 100k tags amounting to 2 MB of normative HTML --></div>
    </b> <!-- suddenly, a rogue </b> -->
  </div>
</b>
```

The correct DOM would be:

```
B
DIV
└─ B
   └─ DIV (with 100k tags)
```

The adoption agency algorithm makes the <div> a direct child of <html> only once we process the misnested </b>.
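
One way to check that tree is to run a scaled-down version of the markup through a spec-compliant parser. The sketch below assumes PHP 8.4 or newer, where `Dom\HTMLDocument` is available; it does not use this PR's parser, `outline_elements()` is a hypothetical helper, and the snippet does nothing on older PHP versions.

```php
<?php
/**
 * Hypothetical helper: lists descendant element names, indented by depth.
 */
function outline_elements( $node, int $depth = 0 ): array {
	$lines = array();
	foreach ( $node->childNodes as $child ) {
		if ( XML_ELEMENT_NODE !== $child->nodeType ) {
			continue; // Skip text and comment nodes.
		}
		$lines[] = str_repeat( '  ', $depth ) . strtoupper( $child->localName );
		$lines   = array_merge( $lines, outline_elements( $child, $depth + 1 ) );
	}
	return $lines;
}

// Scaled-down version of the markup above.
$html = "<b>\n <div>\n  <div><!-- many tags --></div>\n  </b> <!-- rogue -->\n </div>\n</b>";

$outline = null;
if ( class_exists( \Dom\HTMLDocument::class ) ) {
	// LIBXML_NOERROR silences the warnings for the misnested tags.
	$dom     = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR );
	$body    = $dom->getElementsByTagName( 'body' )->item( 0 );
	$outline = outline_elements( $body );
	echo implode( "\n", $outline ), "\n";
}
```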

What if we built an HTML normalizer instead?

Since the entire markup must be processed upfront, this could work just as well:

```
class WP_HTML_Processor {

	private $html;

	public function __construct( $html, $options = array() ) {
		// Apply HTML parsing rules first, unless explicitly asked not to.
		if ( true !== ( $options['is_normative'] ?? false ) ) {
			$html = WP_HTML_Normalizer::normalize( $html );
		}

		// From now on, we assume normative markup.
		$this->html = $html;
	}

	public function next_by_css( $selector ) { /* ... */ }
	public function set_inner_html( $html ) { /* ... */ }

	// ...
}
```

cc @dmsnell @ockham

adamziel closed this Feb 24, 2023
adamziel pushed a commit that referenced this pull request Mar 2, 2023
…air screen.

The table is no longer created by core as of WordPress 3.0, and support for global terms was removed in WordPress 6.1, so `$wpdb->sitecategories` is unset by default.

This commit resolves a "passing null to non-nullable" deprecation notice on PHP 8.1:
{{{
Deprecated: addcslashes(): Passing null to parameter #1 ($string) of type string is deprecated in wp-includes/class-wpdb.php on line 1804
}}}

The `tables_to_repair` filter is available for plugins to re-add the table or include any additional tables to repair.

Follow-up to [14854], [14880], [54240].

Props ipajen, chiragrathod103, SergeyBiryukov.
Fixes #57762.

git-svn-id: https://develop.svn.wordpress.org/trunk@55421 602fd350-edb4-49c9-b593-d223f7449a82
adamziel pushed a commit that referenced this pull request Oct 13, 2023
…om next_posts().

The `esc_url()` function expects a string for the `$url` parameter. There is no input validation within that function. The function contains a `ltrim()` which also expects a string. Passing `null` to this parameter results in a `Deprecated: ltrim(): Passing null to parameter #1 ($string) of type string is deprecated` notice on PHP 8.1+.

Tracing the stack back, a `null` is being passed to it within `next_posts()` when `get_next_posts_page_link()` returns `null` (it can return a string or `null`).

On PHP 7.0 to PHP 8.x, an empty string is returned from `esc_url()` when `null` is passed to it. The change in this changeset avoids the deprecation notice by not invoking `esc_url()` when `get_next_posts_page_link()` returns `null` and instead sets the `$output` to an empty string, thus maintaining the same behavior as before (minus the deprecation notice).

Adds a test to validate an empty string is returned and the absence of the deprecation (when running on PHP 8.1+).

Follow-up to [11383], [9632].

Props codersantosh, nihar007, hellofromTonya, mukesh27, oglekler, rajinsharwar.
Fixes #59154.

git-svn-id: https://develop.svn.wordpress.org/trunk@56740 602fd350-edb4-49c9-b593-d223f7449a82
adamziel pushed a commit that referenced this pull request Aug 16, 2024
…Info screen.

This resolves a fatal error if the `strict_types` PHP setting is enabled:
{{{
Argument #1 ($num) must be of type float, string given
}}}

Since the goal of the Site Health Info screen is to display raw values where possible, the `number_format()` call here does not seem to provide any benefit.

Props krishneup, sabernhardt, audrasjb, SergeyBiryukov.
Fixes #60364.

git-svn-id: https://develop.svn.wordpress.org/trunk@58847 602fd350-edb4-49c9-b593-d223f7449a82