HTML API: Parse doctypes and set full parser quirks mode correctly #7195

Closed
52 commits
42be2f6
Stop on DOCTYPE tokens in next_token
sirreal Aug 14, 2024
d41d9b3
Handle doctype tokens in html5lib-tests
sirreal Aug 14, 2024
17aff1d
Add get_doctype_name method to tag processor
sirreal Aug 14, 2024
6d73db8
Add missing-whitespace-before-doctype-name test
sirreal Aug 14, 2024
7550d10
Allow gotos in tag processor
sirreal Aug 14, 2024
2003f1d
Add get_compat_mode method
sirreal Aug 14, 2024
a36ceeb
WIP parsing doctypes
sirreal Aug 14, 2024
150d18c
Handle system doctype
sirreal Aug 14, 2024
16df2cd
Handle public-id only + whitespace
sirreal Aug 14, 2024
7217cc7
lint
sirreal Aug 14, 2024
b5c5cfa
Update html5lib-tests to get doctype name, publicid, systemid
sirreal Aug 14, 2024
deafee4
Parsing doctypes and handling quirks mode correctly
sirreal Aug 16, 2024
d359997
Fix logic error when parsing public and system identifiers
sirreal Aug 16, 2024
8d22b60
Disable pre-body failing whitespace text test
sirreal Aug 16, 2024
cda6f58
Fix multiline quote
sirreal Aug 16, 2024
5aeadf2
Scaffold doctype info class
sirreal Aug 16, 2024
601f53b
Return DOCTYPE info class from get_doctype_info method
sirreal Aug 16, 2024
edba9e1
Move quirks detect to get_compatibility_mode method
sirreal Aug 16, 2024
8250bcb
Remove get_compat_mode method from processor class
sirreal Aug 16, 2024
fee4b70
Always return string on doctype attribute lookups
sirreal Aug 16, 2024
8ea4451
Update tests to use get_doctype_info function
sirreal Aug 16, 2024
2ffff8d
Update tests to use doctype info
sirreal Aug 16, 2024
edac23f
Update test ticket number
sirreal Aug 16, 2024
f38fe1c
Add "quirks mode" to "anything else" initial mode
sirreal Aug 16, 2024
65cca88
Move doctype contents parsing into doctype_info
sirreal Aug 19, 2024
621afad
Better comments and naming
sirreal Aug 19, 2024
364d348
Improve more documentation in comments
sirreal Aug 19, 2024
7578082
Add more information to the class doc block
sirreal Aug 19, 2024
a68d5ca
Determine compat mode on initial doctype parse
sirreal Aug 19, 2024
78e8c64
Add more info about compatibility mode property strings
sirreal Aug 19, 2024
ef734be
Refactor doctype info class to use from_html factory
sirreal Aug 20, 2024
2a4807e
Fix equals alignment lint
sirreal Aug 20, 2024
995129b
Make doctype info properties public and add more documentation
sirreal Aug 20, 2024
8e68dd6
Update full parser compat mode from doctype handling
sirreal Aug 20, 2024
c59993a
Add readonly notes to doctype info properties
sirreal Aug 20, 2024
3bda3f5
Add newline normalization and null byte replacement
sirreal Aug 20, 2024
dd9cb57
Update tests to use direct property access
sirreal Aug 20, 2024
122e393
Fix off-by-one error on minimum length
sirreal Aug 20, 2024
bab37e5
Update missing doctype name test to use null
sirreal Aug 20, 2024
aa1912f
Move DOCTYPE tests to specific file
sirreal Aug 20, 2024
271681f
Fix lint
sirreal Aug 21, 2024
2a424e9
Merge branch 'trunk' into html-api/full-parser-doctype-quirks-mode-ha…
sirreal Aug 21, 2024
f76e4eb
Remove redundant extra argument in html5lib test helper
sirreal Aug 21, 2024
907987e
Check for undefined doctype identifiers in html5lib test trees
sirreal Aug 21, 2024
452a98c
Remove test default argument that can't be used
sirreal Aug 21, 2024
c44a0b6
Remove test arguments from removed dataProvider
sirreal Aug 21, 2024
4ab8c64
Update doctype comments
sirreal Aug 21, 2024
0cb32e9
Final pass on documentation comments
sirreal Aug 21, 2024
84c1faa
Documentation and naming updates.
dmsnell Aug 22, 2024
3c4aa1d
Add optimization for normative HTML DOCTYPE declaration.
dmsnell Aug 22, 2024
fe6cac9
Merge remote-tracking branch 'upstream/trunk' into html-api/full-pars…
dmsnell Aug 22, 2024
3db7230
Merge remote-tracking branch 'upstream/trunk' into html-api/full-pars…
dmsnell Aug 23, 2024
1 change: 1 addition & 0 deletions phpcs.xml.dist
@@ -262,6 +262,7 @@
in the parsing, and distance the code from its standard. -->
<rule ref="Generic.PHP.DiscourageGoto.Found">
<exclude-pattern>/wp-includes/html-api/class-wp-html-processor\.php</exclude-pattern>
<exclude-pattern>/wp-includes/html-api/class-wp-html-doctype-info\.php</exclude-pattern>
</rule>

<!-- Exclude sample config from modernization to prevent breaking CI workflows based on WP-CLI scaffold.
616 changes: 616 additions & 0 deletions src/wp-includes/html-api/class-wp-html-doctype-info.php

Large diffs are not rendered by default.
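
Because the new class file is not rendered on this page, the sketch below illustrates the kind of decision it has to make, based only on the expectations in the accompanying tests and on the quirks-mode rules in the HTML specification. The function name `sketch_compat_mode()`, its parameters, and the abbreviated prefix lists are assumptions for illustration. This is not the code from the patch, and the specification lists many more quirks-mode identifiers than appear here.

```php
<?php
/**
 * Hypothetical helper, not part of this patch: infers a compatibility mode
 * from already-parsed DOCTYPE fields using a small subset of the rules in
 * the HTML specification.
 *
 * @param string|null $name         Lowercased DOCTYPE name, or null if missing.
 * @param string|null $public_id    Public identifier, or null if missing.
 * @param string|null $system_id    System identifier, or null if missing.
 * @param bool        $force_quirks Whether a parse error forced quirks mode.
 * @return string One of 'quirks', 'limited-quirks', or 'no-quirks'.
 */
function sketch_compat_mode( ?string $name, ?string $public_id, ?string $system_id, bool $force_quirks = false ): string {
	// A forced-quirks flag or any name other than "html" means quirks mode.
	if ( $force_quirks || 'html' !== $name ) {
		return 'quirks';
	}

	// Identifiers are compared case-insensitively; a missing one matches nothing.
	$public = null === $public_id ? '' : strtolower( $public_id );

	// For these prefixes the result depends on whether a system identifier exists.
	$system_dependent = array(
		'-//w3c//dtd html 4.01 frameset//',
		'-//w3c//dtd html 4.01 transitional//',
	);
	foreach ( $system_dependent as $prefix ) {
		if ( 0 === strpos( $public, $prefix ) ) {
			return null === $system_id ? 'quirks' : 'limited-quirks';
		}
	}

	// These prefixes indicate limited-quirks mode regardless of the system identifier.
	$always_limited = array(
		'-//w3c//dtd xhtml 1.0 frameset//',
		'-//w3c//dtd xhtml 1.0 transitional//',
	);
	foreach ( $always_limited as $prefix ) {
		if ( 0 === strpos( $public, $prefix ) ) {
			return 'limited-quirks';
		}
	}

	// The full list of quirks-mode public and system identifiers is omitted here.
	return 'no-quirks';
}
```

Under these assumptions, the two "Special ... mode" cases in the test data provider differ only in whether the system identifier is `null` or an empty string, which is why the sketch distinguishes a missing identifier from an empty one.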

12 changes: 5 additions & 7 deletions src/wp-includes/html-api/class-wp-html-processor.php
@@ -1076,26 +1076,24 @@ private function step_initial(): bool {
* > A DOCTYPE token
*/
case 'html':
$contents = $this->get_modifiable_text();
if ( ' html' !== $contents ) {
/*
* @todo When the HTML Tag Processor fully parses the DOCTYPE declaration,
* this code should examine the contents to set the compatability mode.
*/
$this->bail( 'Cannot process any DOCTYPE other than a normative HTML5 doctype.' );
$doctype = $this->get_doctype_info();
if ( null !== $doctype && 'quirks' === $doctype->indicated_compatability_mode ) {
$this->state->document_mode = WP_HTML_Processor_State::QUIRKS_MODE;
}

/*
* > Then, switch the insertion mode to "before html".
*/
$this->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_BEFORE_HTML;
$this->insert_html_element( $this->state->current_token );
return true;
}

/*
* > Anything else
*/
initial_anything_else:
$this->state->document_mode = WP_HTML_Processor_State::QUIRKS_MODE;
$this->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_BEFORE_HTML;
return $this->step( self::REPROCESS_CURRENT_NODE );
}
23 changes: 22 additions & 1 deletion src/wp-includes/html-api/class-wp-html-tag-processor.php
@@ -4026,6 +4026,27 @@ private function matches(): bool {
return true;
}

/**
* Gets DOCTYPE declaration info from a DOCTYPE token.
*
* DOCTYPE tokens may appear in many places in an HTML document. In most places, they are
* simply ignored. The main parsing functions find the basic shape of DOCTYPE tokens but
* do not perform detailed parsing.
*
* This method can be called to perform a full parse of the DOCTYPE token and retrieve
* its information.
*
* @return WP_HTML_Doctype_Info|null The DOCTYPE declaration information or `null` if not
* currently at a DOCTYPE node.
*/
public function get_doctype_info(): ?WP_HTML_Doctype_Info {
if ( self::STATE_DOCTYPE !== $this->parser_state ) {
return null;
}

return WP_HTML_Doctype_Info::from_doctype_token( substr( $this->html, $this->token_starts_at, $this->token_length ) );
}

/**
* Parser Ready State.
*
@@ -4117,7 +4138,7 @@ private function matches(): bool {

/**
* Indicates that the parser has found a DOCTYPE node and it's
* possible to read and modify its modifiable text.
* possible to read its DOCTYPE information via `get_doctype_info()`.
*
* @since 6.5.0
*
1 change: 1 addition & 0 deletions src/wp-settings.php
@@ -252,6 +252,7 @@
require ABSPATH . WPINC . '/html-api/html5-named-character-references.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-attribute-token.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-span.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-doctype-info.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-text-replacement.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-decoder.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-tag-processor.php';
118 changes: 118 additions & 0 deletions tests/phpunit/tests/html-api/wpHtmlDoctypeInfo.php
@@ -0,0 +1,118 @@
<?php
/**
* Unit tests covering WP_HTML_Doctype_Info functionality.
*
* @package WordPress
* @subpackage HTML-API
*/

/**
* @group html-api
*
* @coversDefaultClass WP_HTML_Doctype_Info
*/
class Tests_HtmlApi_WpHtmlDoctypeInfo extends WP_UnitTestCase {
/**
* Test DOCTYPE handling.
*
* @ticket 61576
*
* @dataProvider data_parseable_raw_doctypes
*/
public function test_doctype_doc_info(
string $html,
string $expected_compat_mode,
?string $expected_name = null,
?string $expected_public_id = null,
?string $expected_system_id = null
) {
$doctype = WP_HTML_Doctype_Info::from_doctype_token( $html );
$this->assertNotNull(
$doctype,
"Should have parsed the following doctype declaration: {$html}"
);

$this->assertSame(
$expected_compat_mode,
$doctype->indicated_compatability_mode,
'Failed to infer the expected document compatability mode.'
);

$this->assertSame(
$expected_name,
$doctype->name,
'Failed to parse the expected DOCTYPE name.'
);

$this->assertSame(
$expected_public_id,
$doctype->public_identifier,
'Failed to parse the expected DOCTYPE public identifier.'
);

$this->assertSame(
$expected_system_id,
$doctype->system_identifier,
'Failed to parse the expected DOCTYPE system identifier.'
);
}

/**
* Data provider.
*
* @return array[]
*/
public static function data_parseable_raw_doctypes(): array {
return array(
'Missing doctype name' => array( '<!DOCTYPE>', 'quirks' ),
'HTML5 doctype' => array( '<!DOCTYPE html>', 'no-quirks', 'html' ),
'HTML5 doctype no whitespace before name' => array( '<!DOCTYPEhtml>', 'no-quirks', 'html' ),
'HTML 4.01 Strict doctype' => array( '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">', 'no-quirks', 'html', '-//W3C//DTD HTML 4.01//EN', 'http://www.w3.org/TR/html4/strict.dtd' ),
'SVG doctype' => array( '<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">', 'quirks', 'svg', '-//W3C//DTD SVG 1.1//EN', 'http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd' ),
'MathML doctype' => array( '<!DOCTYPE math PUBLIC "-//W3C//DTD MathML 2.0//EN" "http://www.w3.org/Math/DTD/mathml2/mathml2.dtd">', 'quirks', 'math', '-//W3C//DTD MathML 2.0//EN', 'http://www.w3.org/Math/DTD/mathml2/mathml2.dtd' ),
'Doctype with null byte replacement' => array( "<!DOCTYPE null-\0 PUBLIC '\0' '\0\0'>", 'quirks', "null-\u{FFFD}", "\u{FFFD}", "\u{FFFD}\u{FFFD}" ),
'Uppercase doctype' => array( '<!DOCTYPE UPPERCASE>', 'quirks', 'uppercase' ),
'Lowercase doctype' => array( '<!doctype lowercase>', 'quirks', 'lowercase' ),
'Doctype with whitespace' => array( "<!DOCTYPE\n\thtml\f\rPUBLIC\r\n''\t''>", 'no-quirks', 'html', '', '' ),
'Doctype trailing characters' => array( "<!DOCTYPE html PUBLIC '' '' Anything (except closing angle bracket) is just fine here !!!>", 'no-quirks', 'html', '', '' ),
'An ugly no-quirks doctype' => array( "<!dOcTyPehtml\tPublIC\"pub-id\"'sysid'>", 'no-quirks', 'html', 'pub-id', 'sysid' ),
'Missing public ID' => array( '<!DOCTYPE html PUBLIC>', 'quirks', 'html' ),
'Missing system ID' => array( '<!DOCTYPE html SYSTEM>', 'quirks', 'html' ),
'Missing close quote public ID' => array( "<!DOCTYPE html PUBLIC 'xyz>", 'quirks', 'html', 'xyz' ),
'Missing close quote system ID' => array( "<!DOCTYPE html SYSTEM 'xyz>", 'quirks', 'html', null, 'xyz' ),
'Missing close quote system ID with public' => array( "<!DOCTYPE html PUBLIC 'abc' 'xyz>", 'quirks', 'html', 'abc', 'xyz' ),
'Bogus characters instead of system/public' => array( '<!DOCTYPE html FOOBAR>', 'quirks', 'html' ),
'Bogus characters instead of PUBLIC quote' => array( "<!DOCTYPE html PUBLIC x ''''>", 'quirks', 'html' ),
'Bogus characters instead of SYSTEM quote ' => array( "<!DOCTYPE html SYSTEM x ''>", 'quirks', 'html' ),
'Emoji' => array( '<!DOCTYPE 🏴󠁧󠁢󠁥󠁮󠁧󠁿 PUBLIC "🔥" "😈">', 'quirks', "\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}", '🔥', '😈' ),
'Bogus characters instead of SYSTEM quote after public' => array( "<!DOCTYPE html PUBLIC ''x''>", 'quirks', 'html', '' ),
'Special quirks mode if system unset' => array( '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//">', 'quirks', 'html', '-//W3C//DTD HTML 4.01 Frameset//' ),
'Special limited-quirks mode if system set' => array( '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//" "">', 'limited-quirks', 'html', '-//W3C//DTD HTML 4.01 Frameset//', '' ),
);
}

/**
* @dataProvider invalid_inputs
*
* @ticket 61576
*/
public function test_invalid_inputs_return_null( string $html ) {
$this->assertNull( WP_HTML_Doctype_Info::from_doctype_token( $html ) );
}

/**
* Data provider.
*
* @return array[]
*/
public static function invalid_inputs(): array {
return array(
'Empty string' => array( '' ),
'Other HTML' => array( '<div>' ),
'DOCTYPE after HTML' => array( 'x<!DOCTYPE>' ),
'DOCTYPE before HTML' => array( '<!DOCTYPE>x' ),
'Incomplete DOCTYPE' => array( '<!DOCTYPE' ),
'Pseudo DOCTYPE containing ">"' => array( '<!DOCTYPE html PUBLIC ">">' ),
);
}
}
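
The invalid-input cases above suggest that the factory only accepts a string that is exactly one complete DOCTYPE token. A minimal sketch of such a shape check follows. The function name `sketch_is_whole_doctype_token()` and the specific rules are assumptions inferred from the test data, not the logic in `WP_HTML_Doctype_Info::from_doctype_token()`.

```php
<?php
/**
 * Hypothetical helper, not part of this patch: reports whether a string
 * looks like exactly one complete DOCTYPE token.
 *
 * @param string $html Candidate token text.
 * @return bool Whether the whole string has the basic shape of a DOCTYPE token.
 */
function sketch_is_whole_doctype_token( string $html ): bool {
	// The shortest possible token is "<!DOCTYPE>", which is ten bytes long.
	if ( strlen( $html ) < 10 ) {
		return false;
	}

	// The opener is matched case-insensitively, so "<!doctype" also qualifies.
	if ( 0 !== strncasecmp( $html, '<!DOCTYPE', 9 ) ) {
		return false;
	}

	// The first ">" must also be the final byte: a DOCTYPE token ends at the
	// first ">", so an earlier one means extra content follows the token.
	return strpos( $html, '>' ) === strlen( $html ) - 1;
}
```

With this rule, `<!DOCTYPE html PUBLIC ">">` is rejected because the token would end at the first `>`, leaving trailing text after it.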
24 changes: 13 additions & 11 deletions tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php
@@ -27,6 +27,7 @@ class Tests_HtmlApi_Html5lib extends WP_UnitTestCase {
const SKIP_TESTS = array(
'comments01/line0155' => 'Unimplemented: Need to access raw comment text on non-normative comments.',
'comments01/line0169' => 'Unimplemented: Need to access raw comment text on non-normative comments.',
'doctype01/line0380' => 'Bug: Mixed whitespace, non-whitespace text in head not split correctly',
'html5test-com/line0129' => 'Unimplemented: Need to access raw comment text on non-normative comments.',
'noscript01/line0014' => 'Unimplemented: This parser does not add missing attributes to existing HTML or BODY tags.',
'tests1/line0692' => 'Bug: Mixed whitespace, non-whitespace text in head not split correctly',
@@ -115,7 +116,7 @@ public function data_external_html5lib_tests() {

$test_context_element = $test[1];

if ( self::should_skip_test( $test_context_element, $test_name, $test[3] ) ) {
if ( self::should_skip_test( $test_context_element, $test_name ) ) {
continue;
}

@@ -133,7 +134,7 @@ public function data_external_html5lib_tests() {
*
* @return bool True if the test case should be skipped. False otherwise.
*/
private static function should_skip_test( ?string $test_context_element, string $test_name, string $expected_tree ): bool {
private static function should_skip_test( ?string $test_context_element, string $test_name ): bool {
if ( null !== $test_context_element && 'body' !== $test_context_element ) {
return true;
}
@@ -189,6 +190,15 @@ private static function build_tree_representation( ?string $fragment_context, st
}

switch ( $token_type ) {
case '#doctype':
$doctype = $processor->get_doctype_info();
$output .= "<!DOCTYPE {$doctype->name}";
if ( null !== $doctype->public_identifier || null !== $doctype->system_identifier ) {
$output .= " \"{$doctype->public_identifier}\" \"{$doctype->system_identifier}\"";
}
$output .= ">\n";
break;

case '#tag':
$namespace = $processor->get_namespace();
$tag_name = 'html' === $namespace
@@ -450,15 +460,7 @@ public static function parse_html5_dat_testfile( $filename ) {
*/
case 'document':
if ( '|' === $line[0] ) {
/*
* The next_token() method these tests rely on do not stop
* at doctype nodes. Strip doctypes from output.
* @todo Restore this line if and when the processor
* exposes doctypes.
*/
if ( '| <!DOCTYPE ' !== substr( $line, 0, 12 ) ) {
$test_dom .= substr( $line, 2 );
}
$test_dom .= substr( $line, 2 );
} else {
// This is a text node that includes unescaped newlines.
// Everything else should be single lines starting with "| ".
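
The `#doctype` case added to `build_tree_representation()` serializes a DOCTYPE in the html5lib tree format: the name alone when neither identifier is present, otherwise the name followed by both identifiers in quotes. The standalone sketch below mirrors that logic so it can be tried without WordPress loaded. The function name `sketch_doctype_tree_line()` is hypothetical, and it takes plain values rather than a `WP_HTML_Doctype_Info` instance.

```php
<?php
/**
 * Hypothetical helper, not part of this patch: builds the html5lib tree
 * line for a DOCTYPE from its parsed fields.
 *
 * @param string|null $name      DOCTYPE name, or null if missing.
 * @param string|null $public_id Public identifier, or null if missing.
 * @param string|null $system_id System identifier, or null if missing.
 * @return string The serialized line, including its trailing newline.
 */
function sketch_doctype_tree_line( ?string $name, ?string $public_id, ?string $system_id ): string {
	// A missing name is written as an empty string.
	$output = '<!DOCTYPE ' . ( $name ?? '' );

	// Both identifiers are written as soon as either one is present;
	// the missing one appears as an empty quoted string.
	if ( null !== $public_id || null !== $system_id ) {
		$output .= ' "' . ( $public_id ?? '' ) . '" "' . ( $system_id ?? '' ) . '"';
	}

	return $output . ">\n";
}
```

Writing a missing identifier as an empty quoted string matches what PHP string interpolation of `null` produces in the test helper.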
16 changes: 16 additions & 0 deletions tests/phpunit/tests/html-api/wpHtmlTagProcessor.php
Original file line number Diff line number Diff line change
@@ -2939,4 +2939,20 @@ public function test_unclosed_funky_comment_input_too_short() {
$this->assertFalse( $processor->next_tag() );
$this->assertTrue( $processor->paused_at_incomplete_token() );
}

/**
* Test basic DOCTYPE handling.
*
* @ticket 61576
*/
public function test_doctype_doc_name() {
$processor = new WP_HTML_Tag_Processor( '<!DOCTYPE html>' );
$this->assertTrue( $processor->next_token() );
$doctype = $processor->get_doctype_info();
$this->assertNotNull( $doctype );
$this->assertSame( 'html', $doctype->name );
$this->assertSame( 'no-quirks', $doctype->indicated_compatability_mode );
$this->assertNull( $doctype->public_identifier );
$this->assertNull( $doctype->system_identifier );
}
}