Parse wbr element as singleton element #1044

Closed
lulalala opened this issue Feb 7, 2014 · 2 comments

Comments

@lulalala

lulalala commented Feb 7, 2014

Nokogiri treats the wbr element as requiring a closing tag, but the HTML5 spec says it does not need an end tag: http://www.w3.org/TR/html5/text-level-semantics.html#the-wbr-element

We became aware of this when a document used a very large number of wbr elements: Nokogiri hits a nesting limit and discards the content after that point.

https://gist.github.com/lulalala/8857130 is an example of this. The parsed version differs from the original HTML in that the text after the long cluster of wbr elements is omitted, probably because the wbr elements end up nested too deeply.
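
To illustrate why an unclosed wbr can run into a nesting limit, here is a minimal pure-Ruby sketch. It is not Nokogiri or libxml2 code: `max_nesting_depth` is a hypothetical helper that simply tracks the stack of open elements, the sample input is made up, and the void element list is my reading of the HTML standard. It only shows how depth grows when wbr is not treated as a void element.

```ruby
require 'set'

# Hypothetical helper (not part of Nokogiri or libxml2): scans the tags in an
# HTML string and returns the deepest open-element stack it encounters.
def max_nesting_depth(html, void_elements)
  stack = []
  deepest = 0
  html.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |closing, name|
    name = name.downcase
    if closing == '/'
      # Pop back to the matching open element, implicitly closing anything
      # still open inside it.
      index = stack.rindex(name)
      stack.slice!(index..-1) if index
    elsif !void_elements.include?(name)
      # Non-void elements stay open until a matching end tag is seen.
      stack.push(name)
      deepest = [deepest, stack.size].max
    end
  end
  deepest
end

# Void elements as listed in the HTML standard (assumed list).
HTML5_VOID = Set['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
                 'link', 'meta', 'source', 'track', 'wbr']
# The same list without wbr, mimicking a parser that does not know the element.
WITHOUT_WBR = HTML5_VOID - ['wbr']

# Made-up input: one paragraph containing 300 wbr elements.
html = "<p>#{'a<wbr>' * 300}tail</p>"

puts max_nesting_depth(html, HTML5_VOID)
puts max_nesting_depth(html, WITHOUT_WBR)
```

With wbr in the void set the depth stays at 1 (just the p element); without it, every wbr opens a new level and the depth reaches 301, which is the kind of growth that would trip a parser's depth limit.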

@pierpaolofrasa-twt

This seems to affect other tags as well, e.g. source - see rubys/nokogumbo#14

@flavorjones
Member

Nokogiri's underlying HTML parsers (libxml2 for CRuby, nekoHTML for JRuby) are HTML4 parsers, and so for some time there hasn't been much we can do to help with HTML5 other than to recommend that people use the Nokogumbo gem, which extends Nokogiri's API and provides an HTML5 parser.

I'm happy to let you know that #2204 is driving the merger of Nokogumbo and its HTML5 parser into Nokogiri, so Nokogiri v1.12 will support HTML5 once it is released. Please follow that issue for status updates.
