Both start and end tags of html / head / body elements can't be omitted #98

Closed
abotalov opened this issue Apr 27, 2015 · 4 comments

@abotalov
Contributor

The HTML5, HTML5.1, and WHATWG HTML specs say:

An html element's start tag may be omitted if the first thing inside the html element is not a comment.
An html element's end tag may be omitted if the html element is not immediately followed by a comment.

A head element's start tag may be omitted if the element is empty, or if the first thing inside the head element is an element.
A head element's end tag may be omitted if the head element is not immediately followed by a space character or a comment.

A body element's start tag may be omitted if the element is empty, or if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a meta, link, script, style, or template element.
A body element's end tag may be omitted if the body element is not immediately followed by a comment.

However, that "feature" isn't supported. The html / head / body elements don't seem to be inserted into the DOM by Oga if both their start and end tags are omitted.

@abotalov abotalov changed the title Both start and end tags of html / head / body / colgroup / tbody elements can't be omitted Both start and end tags of html / head / body elements can't be omitted Apr 27, 2015
@yorickpeterse
Owner

This is correct. In an initial revision of the HTML handling system the lexer did automatically insert html/head/body tags whenever needed. After thinking about this for a while I decided to remove this, as it ultimately leads to unexpected behaviour. To explain this: when parsing XML/HTML there are two kinds of input:

  • Full-blown documents (e.g. entire web pages)
  • Fragments of data (e.g. just a <form> tag)

Nokogiri supports this distinction in the form of Nokogiri.HTML() and Nokogiri::HTML.fragment(). When using Nokogiri.HTML(), any missing html/body/head tags as well as doctypes are inserted automatically. When using the fragment method this is not the case.

The problem with this is that it complicates using the library. One has to think "am I parsing a document or a fragment?" every time they want to do something with HTML/XML. This distinction also complicates the lexing phase, as the lexer now has to include extra support based on some sort of flag (e.g. :document => true or :fragment => false).

If somebody is not aware of the above distinction (or simply does not expect it), this leads to unexpected behaviour. For example, say somebody is parsing the following snippet and wants to remove the class attribute:

document = Oga.parse_html('<p class="example">Hello</p>')
p = document.children[0]
p.unset('class')

They then serialize the document back to XML and, lo and behold, they get this:

<html>
    <body>
        <p>Hello</p>
    </body>
</html>

This is very different compared to just receiving <p>Hello</p> as output.
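
To make the difference concrete without depending on either library, here is a minimal sketch in plain Ruby. The `serialize` helper and its `document:` flag are hypothetical and are not part of Oga or Nokogiri. They only mimic what a parser with automatic tag insertion would hand back, under the assumption that document mode wraps the input in html/body.

```ruby
# Hypothetical helper (NOT an Oga or Nokogiri API): mimics a parse/serialize
# round trip that either leaves the markup alone (fragment mode) or wraps it
# in html/body tags (document mode).
def serialize(markup, document: false)
  return markup unless document

  "<html><body>#{markup}</body></html>"
end

input = '<p>Hello</p>'

# Fragment mode: the round trip returns exactly what went in.
puts serialize(input)

# Document mode: the caller gets tags back that were never in the input.
puts serialize(input, document: true)
```

The point of the sketch is only that the caller has to know about the flag to predict the output. Dropping the flag, which is what Oga does, removes that decision.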

One of the goals I have is that Oga does not return unexpected output. For example, Oga does not automatically add doctypes (unlike Nokogiri) or XML declarations. For that exact same reason I opted to not automatically add html/body/head tags even if the HTML5 specification says otherwise.

I intend to document this choice, but it seems you beat me to it before I could write it down :)

@abotalov
Contributor Author

Do you think it makes sense to insert start tags if end tags are present (in situations where the HTML spec says it should be done)? For example:

Oga.parse_html('</html>')

@yorickpeterse
Owner

@abotalov This is currently not possible, and I don't think I'll be adding this any time soon. Oga only tracks the names of opening tags (https://github.com/YorickPeterse/oga/blob/7d9604fd932ac9a5f78e68908390f758e12ed543/lib/oga/xml/lexer.rb#L413 vs https://github.com/YorickPeterse/oga/blob/7d9604fd932ac9a5f78e68908390f758e12ed543/lib/oga/xml/lexer.rb#L479). Changing this would introduce a pretty hefty performance penalty (due to extra string allocations) and I'd rather not do that any time soon.

Besides this I can't really think of any use cases where this would be useful.
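
As a rough illustration of the trade-off, here is a toy lexer using only Ruby's standard `strscan` library. It is not Oga's lexer, and the `lex` method and its token names (`:element_start`, `:element_end`, `:text`) are made up for this sketch. It reads the name of each opening tag but skips closing tags without extracting a name.

```ruby
require 'strscan'

# Toy lexer, NOT Oga's implementation. Opening tags yield their name,
# closing tags yield a bare :element_end token with no name attached.
def lex(html)
  scanner = StringScanner.new(html)
  tokens  = []

  until scanner.eos?
    if scanner.skip(%r{</[^>]*>})
      # Closing tag: skip only returns the match length, so the tag name
      # is never extracted into a new string.
      tokens << [:element_end]
    elsif scanner.scan(/<([A-Za-z][A-Za-z0-9]*)[^>]*>/)
      # Opening tag: the name is the first capture group.
      tokens << [:element_start, scanner[1]]
    elsif (text = scanner.scan(/[^<]+/))
      tokens << [:text, text]
    else
      # Stray "<" that starts no tag: emit it as text so the loop advances.
      tokens << [:text, scanner.getch]
    end
  end

  tokens
end

p lex('<p>Hello</p>')
p lex('</html>')
```

With this design, the token stream for `</html>` is a single nameless `:element_end`, so nothing later in the pipeline can tell which start tag would have to be inserted. Supporting that would mean capturing the name of every closing tag.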

@pcasaretto

Nokogiri was driving me crazy by assuming too much: either adding tags when using full docs or removing them when using fragments.
Thanks for this! 🍻
