Both start and end tags of html / head / body elements can't be omitted #98

abotalov · 2015-04-27T20:44:44Z

The HTML5, HTML5.1, WHATWG HTML specs say:

An html element's start tag may be omitted if the first thing inside the html element is not a comment.
An html element's end tag may be omitted if the html element is not immediately followed by a comment.

A head element's start tag may be omitted if the element is empty, or if the first thing inside the head element is an element.
A head element's end tag may be omitted if the head element is not immediately followed by a space character or a comment.

A body element's start tag may be omitted if the element is empty, or if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a meta, link, script, style, or template element.
A body element's end tag may be omitted if the body element is not immediately followed by a comment.

However, that "feature" isn't supported. The html / head / body elements don't seem to be inserted into the DOM by Oga if both start and end tags were omitted.
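To make the quoted omission rules concrete, here is a minimal Ruby sketch that encodes them as predicates. The `Node` struct, the `TagOmission` module, and all method names are hypothetical and exist only for illustration; they are not part of Oga or of any HTML parser. The `:space` type stands for a text node that begins with a space character. A real HTML5 parser decides this with far more context than these one-line checks.

```ruby
# Hypothetical node model for illustration only; this is not Oga's API.
# type is one of :element, :comment, :space or :text; name is set for elements.
Node = Struct.new(:type, :name)

module TagOmission
  # Elements that force an explicit <body> start tag when they come first.
  BODY_START_EXCEPTIONS = %w[meta link script style template].freeze

  module_function

  # Every argument is a Node, or nil when there is nothing at that position.

  def html_start_omissible?(first_child)
    !comment?(first_child)
  end

  def html_end_omissible?(next_sibling)
    !comment?(next_sibling)
  end

  def head_start_omissible?(first_child)
    first_child.nil? || first_child.type == :element
  end

  def head_end_omissible?(next_sibling)
    !comment?(next_sibling) && !space?(next_sibling)
  end

  def body_start_omissible?(first_child)
    return true if first_child.nil?
    return false if comment?(first_child) || space?(first_child)

    !(first_child.type == :element &&
      BODY_START_EXCEPTIONS.include?(first_child.name))
  end

  def body_end_omissible?(next_sibling)
    !comment?(next_sibling)
  end

  def comment?(node)
    !node.nil? && node.type == :comment
  end

  def space?(node)
    !node.nil? && node.type == :space
  end
end
```

For example, `TagOmission.body_start_omissible?(Node.new(:element, 'p'))` is true, while the same call with a `script` element is false because of the exception list in the body rule.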

yorickpeterse · 2015-04-27T21:56:14Z

This is correct. In an initial revision of the HTML handling system the lexer did automatically insert html/head/body tags whenever needed. After thinking about this for a while I decided to remove this as ultimately it leads to unexpected behaviour. To explain this, when parsing XML/HTML there are two kinds of inputs:

1. Full-blown documents (e.g. entire web pages)
2. Fragments of data (e.g. just a <form> tag)

Nokogiri supports this distinction in the form of Nokogiri.HTML() and Nokogiri::HTML.fragment(). When using Nokogiri.HTML(), any missing html/body/head tags as well as doctypes are inserted automatically; when using the fragment method, this is not the case.

The problem with this is that it complicates using the library. One has to think "am I parsing a document or a fragment?" every time they want to do something with HTML/XML. This distinction also complicates the lexing phase, as the lexer now has to include extra support based on some sort of flag (e.g. :document => true or :fragment => false).

If one were not aware of (or simply did not expect) the above distinction, this would lead to unexpected behaviour. For example, say somebody is parsing the following snippet and wants to remove the class attribute:

document = Oga.parse_html('<p class="example">Hello</p>')
p = document.children[0]
p.unset('class')

They then serialize the document back to XML and lo and behold they get this:

<html>
    <body>
        <p>Hello</p>
    </body>
</html>

This is very different compared to just receiving <p>Hello</p> as output.

One of the goals I have is that Oga does not return unexpected output. For example, Oga does not automatically add doctypes (unlike Nokogiri) or XML declarations. For that exact same reason I opted to not automatically add html/body/head tags even if the HTML5 specification says otherwise.

I intend to document this choice, but it seems you beat me to it before I could write it down :)

abotalov · 2015-05-12T19:17:13Z

Do you think it makes sense to insert start tags if end tags are present? (in situations where it should be done according to HTML spec)

Oga.parse_html('</html>')

yorickpeterse · 2015-05-12T19:29:39Z

@abotalov This is currently not possible, and I don't think I'll be adding this any time soon. Oga only tracks the names of opening tags (https://github.com/YorickPeterse/oga/blob/7d9604fd932ac9a5f78e68908390f758e12ed543/lib/oga/xml/lexer.rb#L413 vs https://github.com/YorickPeterse/oga/blob/7d9604fd932ac9a5f78e68908390f758e12ed543/lib/oga/xml/lexer.rb#L479). Changing this would introduce a pretty hefty performance penalty (due to extra string allocations) and I'd rather not do that any time soon.
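As a rough illustration of what "only tracks the names of opening tags" can mean, here is a toy lexer sketch in plain Ruby. It is an assumption-laden stand-in, not Oga's actual lexer: the `ToyLexer` class, its token shapes, and its regular expressions are all made up for this example. Opening tag names are captured and pushed on a stack, while a closing tag is skipped without its name ever being read, so a lone closing tag carries no name that could be used to insert the matching start tag.

```ruby
require 'strscan'

# Toy lexer for illustration only; this is NOT how Oga is implemented.
class ToyLexer
  def initialize(input)
    @scanner    = StringScanner.new(input)
    @open_names = [] # names of the elements that are currently open
    @tokens     = []
  end

  def lex
    until @scanner.eos?
      if @scanner.skip(%r{</[^>]*>})
        # Closing tag: its name is never extracted. The element being closed
        # is assumed to be whichever one was opened last (nil if none was).
        @tokens << [:close, @open_names.pop]
      elsif @scanner.scan(/<([A-Za-z][A-Za-z0-9]*)[^>]*>/)
        # Opening tag: the name is captured and remembered.
        name = @scanner[1]
        @open_names << name
        @tokens << [:open, name]
      elsif (text = @scanner.scan(/[^<]+/))
        @tokens << [:text, text]
      else
        @scanner.getch # skip a stray "<" that does not start a tag
      end
    end

    @tokens
  end
end
```

With this sketch, `ToyLexer.new('</html>').lex` yields a single close token with a nil name: the lexer knows that something was closed, but not what. Making it compare or report closing names would mean extracting a string for every closing tag as well.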

Besides this I can't really think of any use cases where this would be useful.

pcasaretto · 2017-01-03T14:05:53Z

Nokogiri was driving me crazy by assuming too much, either adding tags when using full docs or removing them when using fragments.
Thanks for this! 🍻

abotalov changed the title ~~Both start and end tags of html / head / body / colgroup / tbody elements can't be omitted~~ Both start and end tags of html / head / body elements can't be omitted Apr 27, 2015

yorickpeterse closed this as completed Apr 27, 2015
