HTML API: Parse doctypes and set full parser quirks mode correctly #7195

@sirreal sirreal commented Aug 14, 2024

  • Add complete parsing of DOCTYPE tokens. A new class, WP_HTML_Doctype_Info, parses DOCTYPE tokens and exposes information about them.
  • Pause on DOCTYPE tokens in initial mode from the HTML processor. Unlike most DOCTYPE tokens, these are not ignored, and pausing allows the doctype to be visited and inspected by consumers of the HTML API.
  • Update full parser document mode (quirks mode) according to doctype tokens as described in the specification.
  • When the full parser advances from initial mode in the "any other" condition, it should set quirks mode. This has been added.
  • Update the html5lib tests to handle doctypes.
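
The initial-mode behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the token arrays, the `sketch_initial_mode()` name, and the precomputed `mode` field are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical sketch of the "initial" insertion mode.
 * Tokens are plain arrays invented for this example; the real
 * HTML Processor does not represent tokens this way.
 */
function sketch_initial_mode( array $tokens ): string {
	foreach ( $tokens as $token ) {
		switch ( $token['type'] ) {
			case '#whitespace':
			case '#comment':
				// Whitespace is ignored and comments are kept; neither leaves initial mode.
				continue 2;

			case '#doctype':
				// A DOCTYPE seen in initial mode decides the document mode.
				// The 'mode' field is assumed to have been computed elsewhere.
				return $token['mode'];

			default:
				// "Anything else": no DOCTYPE came first, so set quirks mode.
				return 'quirks';
		}
	}

	// Reaching the end of input without a DOCTYPE is also "anything else".
	return 'quirks';
}
```

Under these assumptions, a document that starts with a tag or text instead of a DOCTYPE ends up in quirks mode, which is the "any other" case this PR adds.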

46 skipped tests from HTML5lib are now run. 1 test was disabled due to the way some whitespace is handled in the full parser.

 OK, but incomplete, skipped, or risky tests!
-Tests: 1498, Assertions: 930, Skipped: 568.
+Tests: 1497, Assertions: 975, Skipped: 522.

This change adds a new class to handle DOCTYPE token information according to the specification. The class is exposed from Tag and HTML processors when a DOCTYPE token is reached. DOCTYPE token information can be retrieved for inspection by calling $processor->get_doctype_info();. See this example from the HTML5lib-tests:

switch ( $token_type ) {
	case '#doctype':
		$doctype = $processor->get_doctype_info();
		$output .= "<!DOCTYPE {$doctype->get_name()}";
		if ( $doctype->get_public_identifier() || $doctype->get_system_identifier() ) {
			$output .= " \"{$doctype->get_public_identifier()}\" \"{$doctype->get_system_identifier()}\"";
		}
		$output .= ">\n";
		break;

The new class parses DOCTYPE tokens in greater detail. This is useful because DOCTYPE tokens may appear in many places in HTML but are ignored in most situations. Detailed parsing can therefore be handled on demand, only when a DOCTYPE token is reached under the appropriate circumstances.

The WP_HTML_Doctype_Info class also handles the complex rules for determining quirks mode which involve inspecting the DOCTYPE token name, public identifier, system identifier, and force_quirks_flag.
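
To give a feel for the kind of rules involved, here is a heavily simplified sketch of the quirks-mode decision. It is not the implementation in this PR: the function name is hypothetical, the list of legacy public identifier prefixes is deliberately abbreviated (the HTML specification lists many more), and a `null` identifier is assumed to mean "missing".

```php
<?php
/**
 * Simplified, hypothetical sketch of the quirks-mode decision for a DOCTYPE.
 * $name is assumed to be already lowercased, as the tokenizer would produce it.
 * Only a few of the specification's legacy identifiers are checked here.
 */
function sketch_compat_mode( ?string $name, ?string $public_id, ?string $system_id, bool $force_quirks ): string {
	if ( $force_quirks || 'html' !== $name ) {
		return 'quirks';
	}

	// Identifiers are compared ASCII case-insensitively.
	$public = null === $public_id ? '' : strtolower( $public_id );
	$starts = static function ( string $prefix ) use ( $public ): bool {
		return 0 === strncmp( $public, $prefix, strlen( $prefix ) );
	};

	if ( 'html' === $public ) {
		return 'quirks';
	}

	// Abbreviated list of legacy prefixes that always indicate quirks mode.
	foreach ( array( '-//ietf//dtd html 2.0//', '-//w3c//dtd html 3.2 final//', '-//w3c//dtd html 3.2//' ) as $prefix ) {
		if ( $starts( $prefix ) ) {
			return 'quirks';
		}
	}

	// HTML 4.01 Frameset/Transitional: the result depends on the system identifier.
	if ( $starts( '-//w3c//dtd html 4.01 frameset//' ) || $starts( '-//w3c//dtd html 4.01 transitional//' ) ) {
		return null === $system_id ? 'quirks' : 'limited-quirks';
	}

	if ( $starts( '-//w3c//dtd xhtml 1.0 frameset//' ) || $starts( '-//w3c//dtd xhtml 1.0 transitional//' ) ) {
		return 'limited-quirks';
	}

	return 'no-quirks';
}
```

For example, under this sketch `<!DOCTYPE html>` yields no-quirks mode, while an HTML 4.01 Transitional DOCTYPE yields quirks mode without a system identifier and limited-quirks mode with one.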

Trac ticket: https://core.trac.wordpress.org/ticket/61576


Survey of existing DOCTYPE declarations

Download the DOCTYPE report and cat report-doctypes.txt to see the color output.

Here is a preview:

[Two screenshots previewing the DOCTYPE report output]

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

Similar to text nodes, this change adds DOCTYPE tokens to the stack of
open elements so they can be reached when stepping through the document
via `next_token`.

This method handles parsing the doctype name from a doctype declaration.
This is important for the full HTML processor to be able to correctly
determine whether it is in quirks mode.

return $doctype[0];
}

public function parse_doctype(): ?array {

I think this should be protected, not public. We can set doctype details on processor state and add getters for them. Then we can do this parsing just once when the doctype is reached.

Maybe there's a better place for this parsing to happen.
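
The "parse once, expose through getters" idea from this comment could look something like the sketch below. Everything here is hypothetical (class, property, and method names are invented, and only the DOCTYPE name is extracted, with a simplified pattern); it illustrates the caching pattern only, not the design that was eventually merged.

```php
<?php
/**
 * Hypothetical processor fragment: parse DOCTYPE details once and cache them.
 * All names are invented for illustration.
 */
class Sketch_Doctype_Processor {
	/** @var string Raw DOCTYPE token text. */
	private $doctype_html;

	/** @var array|null Cached details; null until first requested. */
	private $doctype_details = null;

	/** @var int Counts parser runs, only to demonstrate the caching. */
	public $parse_count = 0;

	public function __construct( string $doctype_html ) {
		$this->doctype_html = $doctype_html;
	}

	/** Public getter: triggers parsing on first use, then reuses the result. */
	public function get_doctype_name(): ?string {
		if ( null === $this->doctype_details ) {
			$this->doctype_details = $this->parse_doctype();
		}
		return $this->doctype_details['name'];
	}

	/** Protected parser, kept out of the public surface. Extracts only the name. */
	protected function parse_doctype(): array {
		++$this->parse_count;
		$name = null;
		// Simplified: skip ASCII whitespace, then read up to whitespace or ">".
		if ( preg_match( '/^<!DOCTYPE[ \t\n\f\r]*([^ \t\n\f\r>]+)/i', $this->doctype_html, $m ) ) {
			$name = strtolower( $m[1] );
		}
		return array( 'name' => $name );
	}
}
```

Calling `get_doctype_name()` repeatedly on the same instance runs the parser a single time; `$parse_count` exists only to make that visible.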

> Anything else
> … set the Document to quirks mode.
Change the DOCTYPE status to suggest get_doctype_info over modifiable text.
It's a bit confusing because doctypes cannot set their "modifiable" text, which
makes the name "modifiable" awkward. It's unlikely this will be supported because
most doctypes are skipped, while other doctypes change how a document is parsed.
@sirreal sirreal requested a review from dmsnell August 21, 2024 13:02
@sirreal sirreal marked this pull request as ready for review August 21, 2024 13:02
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell dmsnell left a comment

@sirreal this is quality work of the level I've come to expect from you. thank you so much!

in my latest push I've made some renames and documentation updates, plus added a special-case for the normative <!DOCTYPE html>.

if you have any issues with these we can address them in a follow-up, but as we discussed out-of-band, I intend to merge this when the tests pass and when I'm able.

in follow-up work we can examine how the document compatibility mode interacts with the Tag Processor and CSS functions.

sirreal commented Aug 23, 2024

Thank you! I've reviewed your changes and I'm happy with them.

pento pushed a commit that referenced this pull request Aug 23, 2024
This patch adds until-now missing code to parse the structure of HTML DOCTYPE declarations. The DOCTYPE is mostly unused but can dictate the document compatibility mode, which governs whether CSS class names match in an ASCII-case-insensitive way or not, and whether TABLE elements close an open P element.

The DOCTYPE information is made available through a new method on the Tag Processor, `get_doctype_info()`.

Developed in #7195
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.


git-svn-id: https://develop.svn.wordpress.org/trunk@58925 602fd350-edb4-49c9-b593-d223f7449a82
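
The commit message above notes that the compatibility mode governs how CSS class names match. As a rough illustration of that one effect, the sketch below compares class names ASCII case-insensitively only in quirks mode. The helper name and its string-based `$compat_mode` parameter are assumptions for this example and are not part of the HTML API.

```php
<?php
/**
 * Hypothetical helper: does a class attribute value contain a given class name?
 * In quirks mode names are compared case-insensitively; otherwise exactly.
 */
function sketch_has_class( string $class_attribute, string $wanted, string $compat_mode ): bool {
	$is_quirks = 'quirks' === $compat_mode;

	// Class names are separated by ASCII whitespace.
	$names = preg_split( '/[ \t\n\f\r]+/', $class_attribute, -1, PREG_SPLIT_NO_EMPTY );

	foreach ( $names as $name ) {
		$matches = $is_quirks
			? 0 === strcasecmp( $name, $wanted )
			: $name === $wanted;

		if ( $matches ) {
			return true;
		}
	}

	return false;
}
```

With `class="Is-Active nav"`, looking for `is-active` would succeed in quirks mode but fail in no-quirks mode under this sketch.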
dmsnell commented Aug 23, 2024

Merged in [58925]
1139a51

@dmsnell dmsnell closed this Aug 23, 2024
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Aug 23, 2024

Built from https://develop.svn.wordpress.org/trunk@58925


git-svn-id: http://core.svn.wordpress.org/trunk@58321 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Aug 23, 2024

Built from https://develop.svn.wordpress.org/trunk@58925


git-svn-id: https://core.svn.wordpress.org/trunk@58321 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@sirreal sirreal deleted the html-api/full-parser-doctype-quirks-mode-handling branch August 23, 2024 17:54