Error codes for parse errors during tree construction #10397

Open
stevecheckoway opened this issue Jun 5, 2024 · 2 comments


stevecheckoway commented Jun 5, 2024

What is the issue with the HTML Standard?

The tokenization stage of parsing HTML contains error codes. The tree-construction stage does not have separate error codes and instead just has generic errors.

I'd like to fix that by providing error codes for the tree construction.

To that end, I've searched through the tree construction section of the standard, identified 118 separate parse errors, and made a first pass at naming them. I also split one of the errors into two (unexpected-start-tag and unexpected-end-tag): the standard groups the two cases together only because all they do is emit an error. Therefore, the list below contains 119 instances of parse error, separated by section and presented in order.

This is just a first attempt and I'd appreciate feedback. Specifically, should some of the generic error codes be replaced with more specific codes? Should some of the specific codes be rolled into a more generic error code? Are there better names for the error codes?

My interest in this is as a maintainer of Nokogiri and I'd like to test that its HTML parser is correctly producing each specific error.
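As a rough illustration of the kind of check this would enable, here is a minimal Python sketch that compares the error codes a test expects against the codes a parser reported. The helper name `diff_error_codes` and the sample lists are hypothetical; this is not Nokogiri's API, and it assumes only that a parser can report its errors as a list of code strings.

```python
from collections import Counter

def diff_error_codes(expected, actual):
    """Return (missing, unexpected) codes, taking repeated codes into account."""
    exp, act = Counter(expected), Counter(actual)
    # Counter subtraction keeps only positive counts.
    missing = sorted((exp - act).elements())
    unexpected = sorted((act - exp).elements())
    return missing, unexpected

# Hypothetical data: what a test expects vs. what a parser reported.
expected = ["unexpected-doctype", "unexpected-end-tag"]
actual = ["unexpected-doctype", "unexpected-start-tag"]

missing, unexpected = diff_error_codes(expected, actual)
print("missing:", missing)
print("unexpected:", unexpected)
```

Comparing multisets rather than sets matters here because the same code (for example unexpected-open-elements) can legitimately be reported more than once for a single input.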

13.2.6.1 Creating and inserting nodes

  • xmlns-value-does-not-match-element-namespace step 12
  • xmlns-xlink-value-not-xlink-namespace step 12

13.2.6.4.1 The "initial" insertion mode

  • doctype-invalid DOCTYPE token
  • document-not-iframe-srcdoc-document Anything else

13.2.6.4.2 The "before html" insertion mode

  • unexpected-doctype DOCTYPE token
  • unexpected-end-tag Any other end tag

13.2.6.4.3 The "before head" insertion mode

  • unexpected-doctype DOCTYPE token
  • unexpected-end-tag Any other end tag

13.2.6.4.4 The "in head" insertion mode

  • unexpected-doctype DOCTYPE token
  • unexpected-end-tag An end tag whose tag name is template
  • unexpected-open-elements An end tag whose tag name is template
  • unexpected-start-tag A start tag whose tag name is head
  • unexpected-end-tag Any other end tag

13.2.6.4.5 The "in head noscript" insertion mode

  • unexpected-doctype DOCTYPE token
  • unexpected-end-tag Any other end tag
  • unexpected-token Anything else

13.2.6.4.6 The "after head" insertion mode

  • unexpected-doctype DOCTYPE token
  • unexpected-start-tag A start tag whose tag name is one of: base, basefont,
    bgsound, link, meta, noframes, script, style, template, title
  • unexpected-end-tag Any other end tag

13.2.6.4.7 The "in body" insertion mode

  • unexpected-null-character A character token that is U+0000 NULL (name
    clashes with the tokenizer error)
  • unexpected-doctype DOCTYPE token
  • unexpected-start-tag A start tag whose tag name is html
  • unexpected-start-tag A start tag whose tag name is body
  • unexpected-start-tag A start tag whose tag name is frameset
  • unexpected-eof An end-of-file token
  • body-element-not-in-scope An end tag whose tag name is body
  • unexpected-open-elements An end tag whose tag name is body
  • body-element-not-in-scope An end tag whose tag name is html
  • unexpected-open-elements An end tag whose tag name is html
  • nested-heading-elements A start tag whose tag name is one of: h1, h2, h3, h4, h5, h6
  • nested-form-elements A start tag whose tag name is form
  • unexpected-open-elements A start tag whose tag name is li
  • unexpected-open-elements A start tag whose tag name is one of: dd, dt
  • unexpected-open-elements A start tag whose tag name is one of: dd, dt
  • nested-button-elements A start tag whose tag name is button
  • unexpected-end-tag An end tag whose tag name is one of: address, article,
    aside, blockquote, button, center, details, dialog, dir, div, dl, fieldset,
    figcaption, figure, footer, header, hgroup, listing, main, menu, nav, ol,
    pre, search, section, summary, ul
  • unexpected-open-elements An end tag whose tag name is one of: address,
    article, ...
  • unexpected-end-tag An end tag whose tag name is form
  • unexpected-open-elements An end tag whose tag name is form
  • unexpected-end-tag An end tag whose tag name is form
  • unexpected-open-elements An end tag whose tag name is form
  • no-matching-element-in-scope An end tag whose tag name is p
  • no-matching-element-in-scope An end tag whose tag name is li
  • unexpected-open-elements An end tag whose tag name is li
  • no-matching-element-in-scope An end tag whose tag name is one of: dd, dt
  • unexpected-open-elements An end tag whose tag name is one of: dd, dt
  • no-heading-element-in-scope An end tag whose tag name is one of: h1, h2, h3,
    h4, h5, h6
  • unexpected-open-elements An end tag whose tag name is one of: h1, h2, h3, h4,
    h5, h6
  • nested-a-elements A start tag whose tag name is a
  • nobr-element-in-scope A start tag whose tag name is nobr
  • no-matching-element-in-scope An end tag whose tag name is one of: applet,
    marquee, object
  • unexpected-open-elements An end tag whose tag name is one of: applet,
    marquee, object
  • br-end-tag An end tag whose tag name is br
  • image-start-tag A start tag whose tag name is image
  • parent-not-ruby A start tag whose tag name is one of: rb, rtc
  • parent-not-ruby A start tag whose tag name is one of: rp, rt
  • unexpected-start-tag A start tag whose tag name is one of: caption, col,
    colgroup, frame, head, tbody, td, tfoot, th, thead, tr
  • unexpected-end-tag Any other end tag
  • special-element Any other end tag
  • no-matching-open-element adoption agency algorithm step 4
  • no-matching-element-in-scope adoption agency algorithm step 5
  • unexpected-open-elements adoption agency algorithm step 6
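To show how one token can map to two different codes, here is a simplified Python sketch of the body end tag case from the list above (body-element-not-in-scope, then unexpected-open-elements). It is a sketch under stated assumptions, not the standard's algorithm: the stack of open elements is modelled as a list of tag names, the scope check only knows the HTML-namespace scope markers, and the function names are made up for illustration.

```python
# Elements that may still be open at </body> without a second parse error,
# as I read the "in body" rule for the body end tag.
ALLOWED_OPEN = {
    "dd", "dt", "li", "optgroup", "option", "p", "rb", "rp", "rt", "rtc",
    "tbody", "td", "tfoot", "th", "thead", "tr", "body", "html",
}

# Assumption: HTML-namespace scope markers only (MathML/SVG ones omitted).
SCOPE_MARKERS = {
    "applet", "caption", "html", "table", "td", "th",
    "marquee", "object", "template",
}

def has_in_scope(stack, target):
    """Walk from the current node down; stop at the first scope marker."""
    for name in reversed(stack):
        if name == target:
            return True
        if name in SCOPE_MARKERS:
            return False
    return False

def errors_for_body_end_tag(stack):
    """Return the proposed error codes for </body> given the open elements."""
    if not has_in_scope(stack, "body"):
        return ["body-element-not-in-scope"]
    if any(name not in ALLOWED_OPEN for name in stack):
        return ["unexpected-open-elements"]
    return []

print(errors_for_body_end_tag(["html", "body", "div"]))
print(errors_for_body_end_tag(["html", "body", "p"]))
print(errors_for_body_end_tag(["html"]))
```

The two branches are mutually exclusive for a single token, which is why the list gives each of them its own code.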

13.2.6.4.8 The "text" insertion mode

  • unexpected-eof An end-of-file token

13.2.6.4.9 The "in table" insertion mode

  • unexpected-doctype DOCTYPE token
  • unexpected-start-tag A start tag whose tag name is table
  • no-matching-element-in-scope An end tag whose tag name is table
  • unexpected-end-tag An end tag whose tag name is one of: body, caption, col,
    colgroup, html, tbody, td, tfoot, th, thead, tr
  • non-hidden-input-element A start tag whose tag name is input
  • unexpected-start-tag A start tag whose tag name is form
  • unexpected-content-in-table Anything else

13.2.6.4.10 The "in table text" insertion mode

  • unexpected-null-character A character token that is U+0000 NULL
  • unexpected-non-space-character-in-table Anything else

13.2.6.4.11 The "in caption" insertion mode

  • no-caption-in-scope An end tag whose tag name is caption
  • unexpected-open-elements An end tag whose tag name is caption
  • no-caption-in-scope A start tag whose tag name is one of: caption, col,
    colgroup, tbody, td, tfoot, th, thead, tr; An end tag whose tag name is
    table
  • unexpected-open-elements A start tag whose tag name is one of: caption, col,
    colgroup, tbody, td, tfoot, th, thead, tr; An end tag whose tag name is
    table
  • unexpected-end-tag An end tag whose tag name is one of: body, col, colgroup,
    html, tbody, td, tfoot, th, thead, tr

13.2.6.4.12 The "in column group" insertion mode

  • unexpected-doctype DOCTYPE token
  • not-colgroup An end tag whose tag name is colgroup
  • unexpected-end-tag An end tag whose tag name is col
  • not-colgroup Anything else

13.2.6.4.13 The "in table body" insertion mode

  • unexpected-start-tag A start tag whose tag name is one of: th, td
  • no-matching-element-in-scope An end tag whose tag name is one of: tbody,
    tfoot, thead
  • no-table-sectioning-element-in-scope A start tag whose tag name is one of:
    caption, col, colgroup, tbody, tfoot, thead; An end tag whose tag name is
    table
  • unexpected-end-tag An end tag whose tag name is one of: body, caption, col,
    colgroup, html, td, th, tr

13.2.6.4.14 The "in row" insertion mode

  • no-tr-in-scope An end tag whose tag name is tr
  • no-tr-in-scope A start tag whose tag name is one of: caption, col, colgroup,
    tbody, tfoot, thead, tr; An end tag whose tag name is table
  • no-tr-in-scope An end tag whose tag name is one of: tbody, tfoot, thead
  • unexpected-end-tag An end tag whose tag name is one of: body, caption, col,
    colgroup, html, td, th

13.2.6.4.15 The "in cell" insertion mode

  • no-matching-element-in-scope An end tag whose tag name is one of: td, th
  • unexpected-open-elements An end tag whose tag name is one of: td, th
  • unexpected-end-tag An end tag whose tag name is one of: body, caption, col,
    colgroup, html
  • no-matching-element-in-scope An end tag whose tag name is one of: table,
    tbody, tfoot, thead, tr
  • unexpected-open-elements close the cell step 2

13.2.6.4.16 The "in select" insertion mode

  • unexpected-null-character A character token that is U+0000 NULL
  • unexpected-doctype DOCTYPE token
  • not-optgroup An end tag whose tag name is optgroup
  • not-option An end tag whose tag name is option
  • no-matching-element-in-scope An end tag whose tag name is select
  • unexpected-start-tag A start tag whose tag name is select
  • unexpected-start-tag A start tag whose tag name is one of: input, keygen,
    textarea
  • unexpected-content-in-select Anything else

13.2.6.4.17 The "in select in table" insertion mode

  • unexpected-start-tag A start tag whose tag name is one of: caption, table,
    tbody, tfoot, thead, tr, td, th
  • unexpected-end-tag An end tag whose tag name is one of: caption, table,
    tbody, tfoot, thead, tr, td, th

13.2.6.4.18 The "in template" insertion mode

  • unexpected-eof An end-of-file token

13.2.6.4.19 The "after body" insertion mode

  • unexpected-doctype DOCTYPE token
  • unexpected-end-tag-in-fragment An end tag whose tag name is html
  • unexpected-content-after-body Anything else

13.2.6.4.20 The "in frameset" insertion mode

  • unexpected-doctype DOCTYPE token
  • unexpected-end-tag-in-fragment An end tag whose tag name is frameset
  • unexpected-eof An end-of-file token
  • unexpected-content-in-frameset Anything else

13.2.6.4.21 The "after frameset" insertion mode

  • unexpected-doctype DOCTYPE token
  • unexpected-content-after-frameset Anything else

13.2.6.4.22 The "after after body" insertion mode

  • unexpected-content-after-after-body Anything else

13.2.6.4.23 The "after after frameset" insertion mode

  • unexpected-content-after-after-frameset Anything else

13.2.6.5 The rules for parsing tokens in foreign content

  • unexpected-null-character A character token that is U+0000 NULL
  • unexpected-doctype DOCTYPE token
  • html-in-foreign-content A start tag whose tag name is one of: b, big,
    blockquote, ...; A start tag whose tag name is font, if the token has any
    attributes named color, face, or size; An end tag whose tag name is one
    of: br, p
  • foreign-end-tag-does-not-match-start-tag Any other end tag

For reference, here's how many times each error code appears in the above list:

     18 unexpected-open-elements
     17 unexpected-end-tag
     13 unexpected-doctype
     12 unexpected-start-tag
     10 no-matching-element-in-scope
      4 unexpected-null-character
      3 unexpected-eof
      3 no-tr-in-scope
      2 unexpected-end-tag-in-fragment
      2 parent-not-ruby
      2 not-colgroup
      2 no-caption-in-scope
      2 body-element-not-in-scope
      1 xmlns-xlink-value-not-xlink-namespace
      1 xmlns-value-does-not-match-element-namespace
      1 unexpected-token
      1 unexpected-non-space-character-in-table
      1 unexpected-content-in-table
      1 unexpected-content-in-select
      1 unexpected-content-in-frameset
      1 unexpected-content-after-frameset
      1 unexpected-content-after-body
      1 unexpected-content-after-after-frameset
      1 unexpected-content-after-after-body
      1 special-element
      1 not-option
      1 not-optgroup
      1 non-hidden-input-element
      1 nobr-element-in-scope
      1 no-table-sectioning-element-in-scope
      1 no-matching-open-element
      1 no-heading-element-in-scope
      1 nested-heading-elements
      1 nested-form-elements
      1 nested-button-elements
      1 nested-a-elements
      1 image-start-tag
      1 html-in-foreign-content
      1 foreign-end-tag-does-not-match-start-tag
      1 document-not-iframe-srcdoc-document
      1 doctype-invalid
      1 br-end-tag
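A tally like the one above can be reproduced mechanically. The following Python sketch assumes each entry starts with the error code followed by a space; the `entries` sample is just the first few bullets from the list, and the `tally` helper is hypothetical.

```python
from collections import Counter

# A few entries copied from the list above; the code is the first word.
entries = [
    "unexpected-doctype DOCTYPE token",
    "unexpected-end-tag Any other end tag",
    "unexpected-doctype DOCTYPE token",
    "unexpected-end-tag Any other end tag",
    "unexpected-start-tag A start tag whose tag name is head",
]

def tally(lines):
    """Count how often each error code (first word of a line) appears."""
    return Counter(line.split(maxsplit=1)[0] for line in lines if line.strip())

# Print in the same "count code" layout, most frequent first.
for code, count in tally(entries).most_common():
    print(f"{count:7d} {code}")
```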

domenic commented Jun 6, 2024

/cc @whatwg/html-parser

With my editor hat on, I don't know the parser spec well enough to comment in detail or answer the specific questions you pose to us. (Hopefully the above group can be more helpful.) But I do want to enthusiastically support this work.

I think this would be the final part of #1339, which is worth reviewing if you haven't seen it already.

stevecheckoway commented

@domenic Thanks for pointing out #1339. I had searched for parse errors in the issues list but missed that one.
