Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML API: avoid processing incomplete tokens #5725

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Dec 1, 2023

Trac ticket: Core-60122

Depends on #5721

Currently the Tag Processor assumes that an input document is a full HTML document. Because of this, if there's lingering content after the last tag match it will treat that content as plaintext and skip over it. This is fine for the Tag Processor because if there is lingering content that isn't a valid tag then there's nothing for next_tag() to match.

However, in order to support a number of feature expansions it is important to recognize that the remaining content may involve partial syntax elements, such as incomplete tags, attributes, or comments.

In this patch we're adding a mode inside the Tag Processor which will flip when we start parsing HTML syntax but the document finishes before the token does. This will provide the ability to:

  • extend the input document
  • avoid misinterpreting syntax as text
  • guess if we have a complete document, know if we have an incomplete document

In the process of building this patch a few fixes were identified and fixed in the Tag Processor, namely in the handling of incomplete syntax elements.

@dmsnell dmsnell marked this pull request as ready for review December 7, 2023 12:05
@dmsnell dmsnell force-pushed the html-api/avoid-processing-incomplete-tokens branch 2 times, most recently from 603db00 to 2393b92 Compare December 14, 2023 18:28
The HTML Tag Processor is able to know if it starts parsing a syntax element
and reaches the end of the document before it reaches the end of the element.
In these cases, after this patch, the processor will indicate this condition.

For example, when processing `<div><input type="te` there is an incomplete INPUT
element. The processor will fail to find the INPUT, it will pause right after
the DIV, and `paused_at_incomplete_token()` will return `true`.

This patch doesn't change any existing behaviors, but it adds the new method
to report on the final failure condition. It provides a mechanism for later
use to add chunked parsing to the class, wherein it will be possible to process
a document without having the entire document loaded in memory, for example
when processing unbuffered output.

This is also a necessary change for adding the ability to scan every token in
the document. Currently the Tag Processor only exposes tags as tokens, but it
will need to process `#text` nodes, HTML comments, and other markup in order
to enable behaviors in the HTML Processor and in refactors of existing HTML
processing in Core.
@dmsnell dmsnell force-pushed the html-api/avoid-processing-incomplete-tokens branch from 8056376 to 35502ad Compare December 16, 2023 19:31
@dmsnell dmsnell changed the title WIP: HTML API: avoid processing incomplete tokens HTML API: avoid processing incomplete tokens Dec 18, 2023
dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Dec 20, 2023
Rebuild of 8905faf after WordPress#5725.

Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com>
Co-authored-by: Bernie Reiter <bernhard-reiter@git.wordpress.org>
@sirreal
Copy link
Member

sirreal commented Dec 20, 2023

I ran the tests from #5794 against this branch and it skips fewer tests, but all of those tests seem to fail. This may be a limitation with #5794 and not this PR. All of the new failures seem to be with test input that add <script> or <title> to the <head>. Were those tags unsupported before this PR? That would explain the problem.

I pulled the test data and the test suite from that branch (no fixes) and ran the tests with these results:

./vendor/bin/phpunit -c tests/phpunit/tests/html-api/phpunit.xml --filter Html5lib

Trunk:

Tests: 1693, Assertions: 306, Errors: 30, Failures: 2, Skipped: 1357.

This branch:

Tests: 1693, Assertions: 423, Errors: 30, Failures: 117, Skipped: 1240.

@sirreal
Copy link
Member

sirreal commented Dec 20, 2023

I think the problem is that the tag processor always assumes we're in body, but there are tests that add tags to head. So it's a limitation of the way we're using the tests now.

It's not clear to me why those tags were unsupported before this PR but are supported now… was that intentional?

@dmsnell
Copy link
Member Author

dmsnell commented Dec 20, 2023

@sirreal those tests need to be setup differently, because the HTML Processor currently only supports the <body> context element inside a fragment parser. that is, if it encounters <html> it's not the document HTML tag but a parse error and we're free to bail.

it's not that it assumes we're IN BODY but rather the default parameter to create_fragment() is <body> and any other context node would cause it to abort. we could setup those tests to handle that.

Rebuild of 8905faf after WordPress#5725.

Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com>
Co-authored-by: Bernie Reiter <bernhard-reiter@git.wordpress.org>
Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@dmsnell dmsnell force-pushed the html-api/avoid-processing-incomplete-tokens branch from da5f9d9 to 9e7167a Compare December 20, 2023 16:46
@dmsnell
Copy link
Member Author

dmsnell commented Dec 20, 2023

With the merge of 1270194 everything is now passing or skipped in the html5lib test suite, thanks again @sirreal.

There were a couple of fixes I incorporated from your work there, and then I updated that branch with some specific test exclusions based on either a design question or a test that tests behaviors we don't currently support (but will probably do so once we support scanning all tokens).

The necessary change to the html5lib tests was to check for $p->paused_at_incomplete_token()

Co-authored-by: Jon Surrell <sirreal@users.noreply.github.com>
Copy link
Contributor

@ockham ockham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :shipit:

Thank you for taking the time to walk me through these changes!
As always, the test coverage also really helps with confidence in these changes 🙂

@ockham
Copy link
Contributor

ockham commented Dec 20, 2023

Committed to Core in https://core.trac.wordpress.org/changeset/57211.

@ockham ockham closed this Dec 20, 2023
@dmsnell dmsnell deleted the html-api/avoid-processing-incomplete-tokens branch December 20, 2023 19:49
dmsnell added a commit to WordPress/gutenberg that referenced this pull request Dec 23, 2023
Updates from WordPress/wordpress-develop at 54a09a7ec99c59b8e640bd5aacebfdbf03bb02cc

 - WordPress/wordpress-develop#5535
   Adds support for H1 - H6 elements in the HTML Processor.

 - WordPress/wordpress-develop#5725
   Pause the Tag Processor when reaching the end of the document
   inside an incomplete syntax element.

 - Incorporates linting changes from "TODO:" to "todo"

This patch adds the blanket exclusion from the HTML API compatability
layer after they were removed and started blocking updates from Core.

The PHP files in the compatability layer are merged and maintained in
the Core repo and all changes or updates need to happen first in Core
and then be brought over to Gutenberg as built files. From Gutenberg's
perspective they are no different than NPM packages.

Co-authored-by: Anton Vlasenko <43744263+anton-vlasenko@users.noreply.github.com>
Co-authored-by: Bernie Reiter <96308+ockham@users.noreply.github.com>
dmsnell added a commit to WordPress/gutenberg that referenced this pull request Jan 1, 2024
Updates from WordPress/wordpress-develop at 54a09a7ec99c59b8e640bd5aacebfdbf03bb02cc

 - WordPress/wordpress-develop#5535
   Adds support for H1 - H6 elements in the HTML Processor.

 - WordPress/wordpress-develop#5725
   Pause the Tag Processor when reaching the end of the document
   inside an incomplete syntax element.

 - Incorporates linting changes from "TODO:" to "todo"

This patch adds the blanket exclusion from the HTML API compatability
layer after they were removed and started blocking updates from Core.

The PHP files in the compatability layer are merged and maintained in
the Core repo and all changes or updates need to happen first in Core
and then be brought over to Gutenberg as built files. From Gutenberg's
perspective they are no different than NPM packages.

Co-authored-by: Anton Vlasenko <43744263+anton-vlasenko@users.noreply.github.com>
Co-authored-by: Bernie Reiter <96308+ockham@users.noreply.github.com>
dmsnell added a commit to WordPress/gutenberg that referenced this pull request Jan 13, 2024
Updates from WordPress/wordpress-develop at 54a09a7ec99c59b8e640bd5aacebfdbf03bb02cc

 - WordPress/wordpress-develop#5535
   Adds support for H1 - H6 elements in the HTML Processor.

 - WordPress/wordpress-develop#5725
   Pause the Tag Processor when reaching the end of the document
   inside an incomplete syntax element.

 - Incorporates linting changes from "TODO:" to "todo".

 - WordPress/wordpress-develop#5762
   Handles the "all other tags" rules in IN BODY insertion mode.

 - WordPress/wordpress-develop#5539
   Adds support for list elements in the HTML Processor.

This patch adds the blanket exclusion from the HTML API compatability
layer after they were removed and started blocking updates from Core.

The PHP files in the compatability layer are merged and maintained in
the Core repo and all changes or updates need to happen first in Core
and then be brought over to Gutenberg as built files. From Gutenberg's
perspective they are no different than NPM packages.

Co-authored-by: Anton Vlasenko <43744263+anton-vlasenko@users.noreply.github.com>
Co-authored-by: Bernie Reiter <96308+ockham@users.noreply.github.com>
dmsnell added a commit to WordPress/gutenberg that referenced this pull request Jan 17, 2024
Updates from WordPress/wordpress-develop at 54a09a7ec99c59b8e640bd5aacebfdbf03bb02cc

 - WordPress/wordpress-develop#5535
   Adds support for H1 - H6 elements in the HTML Processor.

 - WordPress/wordpress-develop#5725
   Pause the Tag Processor when reaching the end of the document
   inside an incomplete syntax element.

 - Incorporates linting changes from "TODO:" to "todo".

 - WordPress/wordpress-develop#5762
   Handles the "all other tags" rules in IN BODY insertion mode.

 - WordPress/wordpress-develop#5539
   Adds support for list elements in the HTML Processor.

This patch adds the blanket exclusion from the HTML API compatability
layer after they were removed and started blocking updates from Core.

The PHP files in the compatability layer are merged and maintained in
the Core repo and all changes or updates need to happen first in Core
and then be brought over to Gutenberg as built files. From Gutenberg's
perspective they are no different than NPM packages.

Co-authored-by: Anton Vlasenko <43744263+anton-vlasenko@users.noreply.github.com>
Co-authored-by: Bernie Reiter <96308+ockham@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants