The tokenization stage of parsing HTML contains error codes. The tree-construction stage does not have separate error codes and instead just has generic errors.
I'd like to fix that by providing error codes for the tree construction.
To that end, I searched the tree construction section of the standard, identified 118 separate parse errors, and made a first pass at naming them. I also split one of the errors in two (unexpected-start-tag and unexpected-end-tag): the standard groups the two cases together only because all they do is emit an error. The list below therefore contains 119 instances of parse error, separated by section and presented in order.
This is just a first attempt and I'd appreciate feedback. Specifically, should some of the generic error codes be replaced with more specific codes? Should some of the specific codes be rolled into a more generic error code? Are there better names for the error codes?
My interest in this is as a maintainer of Nokogiri: I'd like to test that its HTML parser correctly produces each specific error.
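To make the testing goal concrete, here is a minimal sketch of a table-driven check. It is written in Python purely for illustration and does not use Nokogiri's API. The `collect_error_codes` callable is a hypothetical adapter that parses a document and returns the tree-construction error codes it saw; the codes are the names proposed in this issue, not names from the standard. Each case only asserts that the expected code is among the reported ones, since a given input may trigger other errors as well.

```python
from typing import Callable, Iterable, List, Tuple

# (input document, proposed error code that the input should trigger).
# The codes are the names proposed in this issue, not names from the standard.
CASES: List[Tuple[str, str]] = [
    ("<!DOCTYPE html><a><a>", "nested-a-elements"),
    ("<!DOCTYPE html><form><form>", "nested-form-elements"),
    ("<!DOCTYPE html></br>", "br-end-tag"),
    ("<!DOCTYPE html><image>", "image-start-tag"),
]


def missing_codes(
    collect_error_codes: Callable[[str], Iterable[str]],
    cases: List[Tuple[str, str]] = CASES,
) -> List[Tuple[str, str]]:
    """Return the (input, expected code) pairs the parser did not report.

    `collect_error_codes` is a hypothetical adapter around whatever parser
    is under test; it takes an HTML string and yields error code strings.
    """
    failures = []
    for html, expected in cases:
        if expected not in set(collect_error_codes(html)):
            failures.append((html, expected))
    return failures


def fake_collector(html: str) -> List[str]:
    # Stand-in for a real parser binding: it only knows about nested <a>.
    return ["nested-a-elements"] if "<a><a>" in html else []


if __name__ == "__main__":
    for html, expected in missing_codes(fake_collector):
        print(f"missing {expected} for {html!r}")
```

A real adapter would replace `fake_collector`; the table itself would grow to one or more inputs per entry in the list below.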
unexpected-end-tag An end tag whose tag name is template
unexpected-open-elements An end tag whose tag name is template
unexpected-start-tag A start tag whose tag name is head
unexpected-end-tag Any other end tag
13.2.6.4.5 The "in head noscript" insertion mode
unexpected-doctype DOCTYPE token
unexpected-end-tag Any other end tag
unexpected-token Anything else
13.2.6.4.6 The "after head" insertion mode
unexpected-doctype DOCTYPE token
unexpected-start-tag A start tag whose tag name is one of: base, basefont,
bgsound, link, meta, noframes, script, style, template, title
unexpected-end-tag Any other end tag
13.2.6.4.7 The "in body" insertion mode
unexpected-null-character A character token that is U+0000 NULL (name
clashes with the tokenizer error)
unexpected-doctype DOCTYPE token
unexpected-start-tag A start tag whose tag name is html
unexpected-start-tag A start tag whose tag name is body
unexpected-start-tag A start tag whose tag name is frameset
unexpected-eof An end-of-file token
body-element-not-in-scope An end tag whose tag name is body
unexpected-open-elements An end tag whose tag name is body
body-element-not-in-scope An end tag whose tag name is html
unexpected-open-elements An end tag whose tag name is html
nested-heading-elements A start tag whose tag name is one of: h1, h2, h3, h4, h5, h6
nested-form-elements A start tag whose tag name is form
unexpected-open-elements A start tag whose tag name is li
unexpected-open-elements A start tag whose tag name is one of: dd, dt
unexpected-open-elements A start tag whose tag name is one of: dd, dt
nested-button-elements A start tag whose tag name is button
unexpected-end-tag An end tag whose tag name is one of address, article,
aside, blockquote, button, center, details, dialog, dir, div, dl, fieldset,
figcaption, figure, footer, header, hgroup, listing, main, menu, nav, ol,
pre, search, section, summary, ul
unexpected-open-elements An end tag whose tag name is one of address,
article, ...
unexpected-end-tag An end tag whose tag name is form
unexpected-open-elements An end tag whose tag name is form
unexpected-end-tag An end tag whose tag name is form
unexpected-open-elements An end tag whose tag name is form
no-matching-element-in-scope An end tag whose tag name is p
no-matching-element-in-scope An end tag whose tag name is li
unexpected-open-elements An end tag whose tag name is li
no-matching-element-in-scope An end tag whose tag name is one of: dd, dt
unexpected-open-elements An end tag whose tag name is one of: dd, dt
no-heading-element-in-scope An end tag whose tag name is one of: h1, h2, h3,
h4, h5, h6
unexpected-open-elements An end tag whose tag name is one of: h1, h2, h3, h4,
h5, h6
nested-a-elements A start tag whose tag name is a
nobr-element-in-scope A start tag whose tag name is nobr
no-matching-element-in-scope An end tag whose tag name is one of: applet,
marquee, object
unexpected-open-elements An end tag whose tag name is one of: applet,
marquee, object
br-end-tag An end tag whose tag name is br
image-start-tag A start tag whose tag name is image
parent-not-ruby A start tag whose tag name is one of rb, rtc
parent-not-ruby A start tag whose tag name is one of rp, rt
unexpected-start-tag A start tag whose tag name is one of: caption, col,
colgroup, frame, head, tbody, td, tfoot, th, thead, tr
no-caption-in-scope An end tag whose tag name is caption
unexpected-open-elements An end tag whose tag name is caption
no-caption-in-scope A start tag whose tag name is one of: caption, col,
colgroup, tbody, td, tfoot, th, thead, tr; An end tag whose tag name is
table
unexpected-open-elements A start tag whose tag name is one of: caption, col,
colgroup, tbody, td, tfoot, th, thead, tr; An end tag whose tag name is
table
unexpected-end-tag An end tag whose tag name is one of body, col, colgroup,
html, tbody, td, tfoot, th, thead, tr
13.2.6.4.12 The "in column group" insertion mode
unexpected-doctype DOCTYPE token
not-colgroup An end tag whose tag name is colgroup
unexpected-end-tag An end tag whose tag name is col
not-colgroup Anything else
13.2.6.4.13 The "in table body" insertion mode
unexpected-start-tag A start tag whose tag name is one of th, td
no-matching-element-in-scope An end tag whose tag name is one of: tbody,
tfoot, thead
no-table-sectioning-element-in-scope A start tag whose tag name is one of:
caption, col, colgroup, tbody, tfoot, thead; An end tag whose tag name is
table
unexpected-end-tag An end tag whose tag name is one of: body, caption, col,
colgroup, html, td, th, tr
13.2.6.4.14 The "in row" insertion mode
no-tr-in-scope An end tag whose tag name is tr
no-tr-in-scope A start tag whose tag name is one of: caption, col, colgroup,
tbody, tfoot, thead, tr; An end tag whose tag name is table
no-tr-in-scope An end tag whose tag name is one of: tbody, tfoot, thead
unexpected-end-tag An end tag whose tag name is one of: body, caption, col,
colgroup, html, td, th
13.2.6.4.15 The "in cell" insertion mode
no-matching-element-in-scope An end tag whose tag name is one of: td, th
unexpected-open-elements An end tag whose tag name is one of: td, th
unexpected-end-tag An end tag whose tag name is one of: body, caption, col,
colgroup, html
no-matching-element-in-scope An end tag whose tag name is one of: table,
tbody, tfoot, thead, tr
unexpected-open-elements Step 2 of the "close the cell" algorithm
13.2.6.4.16 The "in select" insertion mode
unexpected-null-character A character token that is U+0000 NULL
unexpected-doctype DOCTYPE token
not-optgroup An end tag whose tag name is optgroup
not-option An end tag whose tag name is option
no-matching-element-in-scope An end tag whose tag name is select
unexpected-start-tag A start tag whose tag name is select
unexpected-start-tag A start tag whose tag name is one of: input, keygen,
textarea
unexpected-content-in-select Anything else
13.2.6.4.17 The "in select in table" insertion mode
unexpected-start-tag A start tag whose tag name is one of: caption, table,
tbody, tfoot, thead, tr, td, th
unexpected-end-tag An end tag whose tag name is one of: caption, table,
tbody, tfoot, thead, tr, td, th
13.2.6.4.18 The "in template" insertion mode
unexpected-eof An end-of-file token
13.2.6.4.19 The "after body" insertion mode
unexpected-doctype DOCTYPE token
unexpected-end-tag-in-fragment An end tag whose tag name is html
unexpected-content-after-body Anything else
13.2.6.4.20 The "in frameset" insertion mode
unexpected-doctype DOCTYPE token
unexpected-end-tag-in-fragment An end tag whose tag name is frameset
unexpected-eof An end-of-file token
unexpected-content-in-frameset Anything else
13.2.6.4.21 The "after frameset" insertion mode
unexpected-doctype DOCTYPE token
unexpected-content-after-frameset Anything else
13.2.6.4.22 The "after after body" insertion mode
unexpected-content-after-after-body Anything else
13.2.6.4.23 The "after after frameset" insertion mode
13.2.6.5 The rules for parsing tokens in foreign content
unexpected-null-character A character token that is U+0000 NULL
unexpected-doctype DOCTYPE token
html-in-foreign-content A start tag whose tag name is one of: b, big,
blockquote, ...; A start tag whose tag name is font, if the token has any
attributes named color, face, or size; An end tag whose tag name is br, p
foreign-end-tag-does-not-match-start-tag Any other end tag
For reference, here's how many times each error code appears in the above list
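A tally like this can be regenerated mechanically from the list. The following is a minimal sketch, assuming the layout used above: each entry starts with a hyphenated error code followed by its description, while section headings start with a number and wrapped continuation lines do not start with a hyphenated word. The regular expression is a heuristic for this list only, and `EXCERPT` is a short sample copied from two of the sections above rather than the full list.

```python
import re
from collections import Counter

# Sample copied from the "in column group" and "in table body" sections above.
EXCERPT = """\
13.2.6.4.12 The "in column group" insertion mode
unexpected-doctype DOCTYPE token
not-colgroup An end tag whose tag name is colgroup
unexpected-end-tag An end tag whose tag name is col
not-colgroup Anything else
13.2.6.4.13 The "in table body" insertion mode
unexpected-start-tag A start tag whose tag name is one of th, td
no-matching-element-in-scope An end tag whose tag name is one of: tbody,
tfoot, thead
"""

# Heuristic: an entry line begins with a lowercase, hyphenated code and is
# followed by a description. Headings and wrapped lines do not match.
CODE_LINE = re.compile(r"^([a-z]+(?:-[a-z]+)+)\s+\S")


def tally_error_codes(text: str) -> Counter:
    """Count how often each proposed error code opens a line in `text`."""
    counts: Counter = Counter()
    for line in text.splitlines():
        match = CODE_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for code, n in tally_error_codes(EXCERPT).most_common():
        print(f"{n:3d} {code}")
```

Running the same function over the full list would give the per-code counts; a continuation line that happened to start with a hyphenated word would be miscounted, so the result is worth a quick manual check.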
With my editor hat on, I don't know the parser spec well enough to comment in detail or answer the specific questions you pose to us. (Hopefully the above group can be more helpful.) But I do want to enthusiastically support this work.
I think this would be the final part of #1339, which is worth reviewing if you haven't seen it already.
13.2.6.1 Creating and inserting nodes
13.2.6.4.1 The "initial" insertion mode
13.2.6.4.2 The "before html" insertion mode
13.2.6.4.3 The "before head" insertion mode
13.2.6.4.4 The "in head" insertion mode
13.2.6.4.8 The "text" insertion mode
unexpected-eof An end-of-file token
13.2.6.4.9 The "in table" insertion mode
colgroup, html, tbody, td, tfoot, th, thead, tr
13.2.6.4.10 The "in table text" insertion mode
13.2.6.4.11 The "in caption" insertion mode